Agentic Adventures -Side Quest LlamaLauncher Part 7

Jun 12, 2026 8 min read

Adding more features

Now that I can do larger setups I think I will try to add multiple features in one go. I already have a todo list with different elements in it so lets start with adding the rest of the parametrers

[@TODO.md](file:///Volumes/teaching/Code/LLamaLauncher/TODO.md) add the Phase 2 : Context Parameters for the most common section using the same setup as present. Note some parameters are integer some double and some just boolean so ensure the correct widget is used.

It’s done quite a lot but I needed to be more specific about the panel layout I think

Let’s re-arange

re-arrange the panel so it is in a grid layout with at least 3 items per row

That’s a bit better

Don’t like the justtification tho.

make all the widgets in the grid left justified

This is taking a lot longer and it’s getting quite confused and eventually run out of context.

Every 2.0s: curl -s http://localhost:8787/stats | jq ".recent_requests[-1]"           Mac70048: Mon 15 Jun 12:54:01 2026
                                                                                                           in 0.039s (0)
{
  "request_id": "hr_1781524308_000034",
  "timestamp": "2026-06-15T12:53:05.925957",
  "provider": "openai",
  "model": "LLamaServerCurrentModel",
  "input_tokens_original": 98289,
  "input_tokens_optimized": 94022,
  "output_tokens": 2433,
  "tokens_saved": 4267,
  "savings_percent": 4.341279288628432,
  "optimization_latency_ms": 85.14904975891113,
  "total_latency_ms": 77600.16202926636,
  "tags": {
    "memory_skip_reason": "no_handler",
    "client": "zed"
  },
  "cache_hit": true,
  "transforms_applied": [
    "router:excluded:tool",
    "router:html:0.00",
    "router:html:0.00",
    "router:html:0.00",
    "router:html:0.00"
  ],
  "waste_signals": null,
  "error": null,
  "turn_id": "7faa870822da900d"
}

As you can see from headroom there is very little compression going on here. I will revert the ui file and try again with a better prompt combining all the extras. I am going to try the zed new from summary and add. For the full context see AgentChat22.md

[@TODO.md](file:///Volumes/teaching/Code/LLamaLauncher/TODO.md) add the Phase 2 : Context Parameters for the most common section using the same setup as present. Note some parameters are integer some double and some just boolean so ensure the correct widget is used. the panel should be  a grid layout with at least 3 items per row with the items left justified.

My mac is also getting very hot! Fans runing full. But it is working again, AgentChat23.md

Now looking very good

Whilst this was running I updated the server starting script to allow headroom stats to be shown it now looks like this

#!/usr/bin/env bash
#
# launch-headroom-llama.sh
#
# Opens a tmux session with three panes:
#
#   Top-left     -> Headroom proxy
#   Top-right    -> Live stats watcher (curl + jq on Headroom /stats)
#   Bottom (full width) -> llama-server (Qwen3.6-35B-A3B)
#
# Usage:
#   chmod +x launch-headroom-llama.sh
#   ./launch-headroom-llama.sh
#
# If a session with the same name already exists, this script
# kills it first so you always get a fresh start. Detach with
# Ctrl-b d, and reattach later with:
#   tmux attach -t headroom-llama

SESSION="headroom-llama"

# Kill any existing session with this name (ignore error if none exists)
tmux kill-session -t "$SESSION" 2>/dev/null

# Start a new detached session - this is the top-left pane
tmux new-session -d -s "$SESSION" -n main
PANE_PROXY="${SESSION}:main.0"

# --- Top-left pane: Headroom proxy ---
tmux send-keys -t "$PANE_PROXY" \
'export OPENAI_TARGET_API_URL=http://localhost:8080/v1
export HEADROOM_CODE_AWARE_ENABLED=1
export OPENAI_API_KEY=12345
export HEADROOM_LOG_LEVEL=DEBUG
headroom proxy --port 8787' C-m

# --- Split off the bottom pane (full width) for llama-server ---
PANE_LLAMA=$(tmux split-window -v -t "$PANE_PROXY" -P -F '#{pane_id}')
tmux send-keys -t "$PANE_LLAMA" \
'llama-server --api-key 12345 \
    -m ~/teaching/AI_Models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    -ngl 50 -np 1 --flash-attn on -fit off --no-mmap --jinja --ctx-size 32000 \
    --temp 0.6 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05' C-m

# --- Split the top-left pane horizontally: stats watcher to the right ---
PANE_STATS=$(tmux split-window -h -t "$PANE_PROXY" -P -F '#{pane_id}')
tmux send-keys -t "$PANE_STATS" \
"watch -n 2 'curl -s http://localhost:8787/stats | jq \".recent_requests[-1] | {input_tokens_original,input_tokens_optimized,tokens_saved,savings_percent,optimization_latency_ms,total_latency_ms}\"'" C-m

# --- Attach to the session ---
tmux attach-session -t "$SESSION"

Other parameters

Whilst I will not support all the params, I have AI generated a list of other parameters that could be useful by runing llama-server –help and dumping it into ChatGPT and got it to generate a TODO list for it.

## Performance

- [ ] Configure GPU acceleration

  - [ ] Set GPU layer offload

    | Parameter | Purpose | Typical Value |

    |---|---|---|

    | `-ngl` / `--n-gpu-layers` | Number of model layers offloaded to GPU | `99` (full offload if memory allows) |

  - [ ] Configure CPU threading

    | Parameter | Purpose | Typical Value |

    |---|---|---|

    | `-t` / `--threads` | Number of CPU threads used for generation | Number of performance cores or auto |

    | `--threads-batch` | CPU threads used during prompt processing | Higher than generation threads |

  - [ ] Tune batching

    | Parameter | Purpose | Typical Value |

    |---|---|---|

    | `-b` / `--batch-size` | Number of tokens processed per batch | `512–2048` |

    | `-ub` / `--ubatch-size` | Physical micro-batch size | `128–512` |

  - [ ] Configure context size

    | Parameter | Purpose | Typical Value |

    |---|---|---|

    | `-c` / `--ctx-size` | Maximum context window size | `8192–131072` depending on model |

    | `--n-predict` | Maximum tokens generated | `2048–8192` |

  - [ ] Enable attention optimisations

    | Parameter | Purpose | Typical Value |

    |---|---|---|

    | `--flash-attn` | Enable Flash Attention to reduce memory use and improve speed | `on` |

  - [ ] Optimise KV cache

    | Parameter | Purpose | Typical Value |

    |---|---|---|

    | `--cache-type-k` | KV cache key precision | `f16`, `q8_0`, `q4_0` |

    | `--cache-type-v` | KV cache value precision | `f16`, `q8_0`, `q4_0` |

  - [ ] Configure memory handling

    | Parameter | Purpose | Typical Value |

    |---|---|---|

    | `--mmap` | Memory-map model file | Enabled |

    | `--mlock` | Lock model into RAM to prevent swapping | Enable if enough RAM |

  - [ ] Configure server batching

    | Parameter | Purpose | Typical Value |

    |---|---|---|

    | `--cont-batching` | Enable continuous batching for multiple requests | `on` |

    | `--parallel` | Number of concurrent sequences | `1–8+` |

    | `--defrag-thold` | KV cache defragmentation threshold | `0.1–0.5` |

---

# Advanced Generation

- [ ] Configure speculative decoding / MTP support

  | Parameter | Purpose | Typical Value |

  |---|---|---|

  | `--draft-model` | Small draft model used for speculative decoding | Smaller compatible model |

  | `--draft-max` | Maximum number of draft tokens | `4–8` |

  | `--draft-min` | Minimum number of draft tokens | `1–2` |

- [ ] Configure adaptive sampling

  | Parameter | Purpose | Typical Value |

  |---|---|---|

  | `--mirostat` | Adaptive sampling algorithm | `0` (off), `1` or `2` (enabled) |

  | `--mirostat-lr` | Mirostat learning rate | `0.1` |

  | `--mirostat-ent` | Target entropy level | `5–7` |

- [ ] Configure deterministic generation

  | Parameter | Purpose | Typical Value |

  |---|---|---|

  | `--seed` | Random seed for reproducible output | Fixed integer (e.g. `42`) |

- [ ] Configure structured output constraints

  | Parameter | Purpose | Typical Value |

  |---|---|---|

  | `--grammar` | Apply grammar constraints to generation | JSON / custom grammar file |

  | `--json-schema` | Force JSON schema compliant output | Schema file |

- [ ] Configure model behaviour

  | Parameter | Purpose | Typical Value |

  |---|---|---|

  | `--rope-scaling` | Extend model context length using RoPE scaling | Model dependent |

  | `--rope-freq-base` | Modify RoPE frequency base | Model dependent |

  | `--rope-freq-scale` | Adjust positional scaling | Model dependent |

I will dump this into the TODO and get the agent to run this now.

[@TODO.md](file:///Volumes/teaching/Code/LLamaLauncher/TODO.md) Implement the Performance section of the TODO in a new Tab in the parameter_tab QTabWidget it should be laid out in the same way as the Sampling parameters

Interesting it has added it to the wrong tab

You have added the perfomance tab to the wrong place.

The agent responded with

I added the Performance tab between the Model and Server tabs (in that order). Where would you like it instead?

1. **After the Server tab** (Model → Server → Performance)
2. **Inside the Model tab** alongside Sampling Parameters as a group box
3. **Somewhere else entirely?**

Could you clarify the intended layout?

Oh well I did tell you where but here we go again.

It should be part of the Sampling parmeters tab group box called parameter_tab

Better but it has cut off the bottom of the app now!

I have started a new theread and asked

When running the app the bottom launch button is not visible due to the ammount of parameters in the Sampling parameters and performance tabs. Can they be in a scrollable area or made to fit

Thats better See AgentChat24.md and AgentChat25.md for full details.

Advanced Generation Parameters

Lets see if I can get a better prompt for the next tab.

[@TODO.md](file:///Volumes/teaching/Code/LLamaLauncher/TODO.md) Implement the Advanced Generation of the TODO in a new Tab in the parameter_tab QTabWidget it should be laid out in the same way as the other two tabs using a scroll area for the base of all the items.

That took a while but all working now.

I’m going to manually tweak a few things in QtCreator but this will do for now.

Neet to do some testing to see how it all goes. Full output in AgentChat26.md

Testing

A few parameters were not working.

-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)

Sent the output to the model with

Flash attention is not working -fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)

The thread reached the token limit again so I have compacted and started again.

It has now fixed this but added other errors.

0.00.182.962 W DEPRECATED: --defrag-thold is deprecated and no longer necessary to specify

error while handling argument "--draft-max": the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max

usage:
--draft, --draft-n, --draft-max N       the argument has been removed. use --spec-draft-n-max or
                                        --spec-ngram-mod-n-max
                                        (env: LLAMA_ARG_DRAFT_MAX)


to show complete usage, run with -h


--- Process exited with code 1 ---

Dumped this into the Agent with fix these too!

All now working see AgentChat27.md and AgentChat28.md

Agentic Adventures -Side Quest LlamaLauncher Part 7

Adding more features

Other parameters

Advanced Generation Parameters

Testing

Related