Agentic Adventures - Using llama.cpp Part 4
gemma-4-31B-it-UD-Q4_K_XL
For this demo I am going to use the gemma models
https://huggingface.co/unsloth/ge
mma-4-31B-it-GGUF
Test 7 Mac gemma-4-31B-it-UD-Q4_K_XL
On first impressions this model is quite slow, but even more it is a strange resource hog, It doesn’t seem to use all the GPU but hammers the efficiency cores and makes the machine un-responsive (so can only really use one app).

It took a long time to run but generated code, however it seemed to be unable to write it to main.cpp, in the end I had to stop the server and manually copy the code across to the main.py file. Once this was done the program ran first time and produced this

Again it failed to use the git worktrees, however all other elements work as expected, and the code has docstrings and partial type hints (It ignored the more complex Qt ones!). See AgentChat1.md for details.
I need to see if I can tune the parameters to make it work faster. I discovered from this blog
llama-server --api-key 12345 -m gemma-4-31B-it-UD-Q4_K_XL.gguf -ngl 99 \\
-c 36000 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64
Which seems to work better. I did try a -c 0 but it ran out of memory so I tuned it to just about fit in my context.
Interestingly it is now also using the AGENTS.md rules and creating a worktree, however as I had forgot to add the current files to the repo it sort of went wrong. I am going to add the files and try again, it is however working much better. See AgentChat2.md.
For the third attempt I upped the context size to -c 36000 as I ran out before, this is working well now, seems the param changes really help. The app was re-created and worked first time AgentChat3.md.
So what do the params do?
The –ngl N param determines how much is offloaded ot the GPU so :-
- 0 = CPU only
- 20 = first 20 layers on GPU
- 99 = effectively “put as many layers as possible on the GPU”
For a 31B model, if your GPU has enough VRAM, all transformer layers will be placed on the GPU so can be quite fast.
I have already mentioned the -c for the context window but this table helps to figure out the sizes.
| Tokens | Rough English words |
|---|---|
| 8k | 6,000 |
| 16k | 12,000 |
| 36k | 27,000 |
| 128k | 95,000 |
In general a larger contex uses more RAM (CPU and GPU) and can increase prompt processing time.
The –temp controls randomness in the model
| Value | Behaviour |
|---|---|
| 0.0 | Nearly deterministic |
| 0.2 | Very focused |
| 0.7 | Balanced |
| 1.0 | Default/random |
| 1.5+ | Creative but can become unstable |
The –top-p flag is the nucleus sampling and determines how the model samples the tokens
- Sorts candidate tokens by probability.
- Keeps only enough tokens whose cumulative probability reaches 95%.
- Samples from that subset.
So it determines what to throw away for example
0.8 more focused
0.9 conservative
0.95 common default
1.0 disable top-p
This works in conjunction with –top-k 64 which says how many tokens to concider. So in this case only consider the 64 most likely next tokens.
For more information this article has some good info, and from further reading around the topic (most of this is new to me!) I have found that the following are used
--temp 0.3 \
--top-p 0.9 \
--top-k 40
as they give a balance between creativity and coherence (more conservative). I will use these setting next time when I try under linux.
Test 8 Linux gemma-4-31B-it-UD-Q3_K_XL
For the linux version I decided to use the new parameters discussed above.
llama-server --api-key 12345 -m gemma-4-31B-it-UD-Q4_K_XL.gguf -ngl 99 \
-c 36000 \
--temp 0.3 \
--top-p 0.9 \
--top-k 40
Unfortunatly the 18Gb model would not fit on the linux machine so I had to find a smaller version using the Q3 dataset and I let llama.cpp decide the ngl ammount. Initial impressions is that this is very slow. Perhaps the params need a tweak.
Once this is done, it seems to work ok but slower than the mac version. It has immediatly created a work tree (I think in some of the previous examples I have forgotten to add the repo to git but this time I did!). As the first run was slow and had partial work you will see in AgentChat1.md that there are some issues with it creating a worktree as a partial one already exists, I guess the user needs to improve their git hygiene!
In the end I decided to delete the existing worktree and start again. It’s still slow, but seems to be working. Initial worktree created and now creating the app.
It says it has put it into the worktree, but I can’t find the actual executable, I will ask the agent AgentChat2.md it seems it has just dumped it into the actual main.py in the project and not the worktree! Once I found it, the program ran first time and worked correctly.
The image scaling is a little odd but the basics are there.
Analysis
Both files implement the same core application but with notable differences in quality and approach.
They have the same core features as requested such as drag and drop support, file menu with Open/Exit, scrollable image display and the samesupported formats: .png, .jpg, .jpeg, .bmp, .gif.
Same drag event logic: both implement dragEnterEvent and dropEvent with URL MIME type checking.
Key Differences
| Aspect | linux | mac |
|---|---|---|
| Image widget class | ImageLabel | ImageWidget |
| Image scaling | Fixed 400×300 pixels (setFixedSize) — distorts aspect ratio | Scales to fit within 800×800, preserving aspect ratio (KeepAspectRatio + SmoothTransformation) |
| Error handling | None — silently fails on a bad pixmap | Detects null pixmaps and shows a red error message |
| Batch loading | add_image() takes a single path, called in a loop | add_images() takes a list of paths — cleaner API |
| Drop filtering | No pre-filtering in dropEvent — delegates to add_image | Filters non-image URLs before adding, and calls event.ignore() if nothing valid was dropped |
| Layout margins/spacing | No explicit spacing or margins | 20px spacing and 20px margins on all sides |
| Layout alignment | AlignCenter | AlignTop |
| Qt enum style | Uses bare Qt.AlignCenter (older style) | Uses fully qualified Qt.AlignmentFlag.AlignCenter (PySide6 best practice) |
| Window size | 600×800 | 1000×800 |
| Window title | "Image Drop App" | "ImageDrop" |
| Type annotations | Partial | More complete (e.g. -> None on all methods) |
| Docstrings | Present but minimal | More thorough, includes a module-level docstring |
The mac version is the more polished and correct implementation. Its aspect-ratio-preserving scaling, null pixmap error handling, proper Qt6 enum usage, and cleaner drop event filtering make it notably more robust.
The linux version feels like an earlier draft (most likely due to the smaller quant and different parameters), it is functional but with a fixed image size that will distort non-4:3 images and no error handling if a file fails to load.