Hello, World
The first post on the redesigned site.
Welcome
First post on the redesigned site.
Typography & Prose
Heading level 3 — section subheading
Heading level 4 — minor label
Short paragraph. One sentence with a link.
Long paragraph with everything inline: bold, italic, a
hyperlink, and inline_code(). It also contains a
deliberately unbreakable path to test wrapping inside the reading measure —
/home/aaddrick/source/llama/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf — and a
bare URL: https://example.com/very/long/path/segment/that/keeps/going/and/going/without/spaces.
The prose should stay at a comfortable measure and never trigger a horizontal scrollbar.
A pull-quote / blockquote at the maximum: italic, accent rule on the left, secondary color. Good for highlighting a single strong line lifted from the body text.
Unordered list (short)
- One item.
Unordered list (long, nested)
- First-level item with a fair amount of text that will wrap onto a second line at this measure to confirm the hanging indent behaves.
- Second item
  - Nested item A
  - Nested item B with code
- Third item
Ordered list
1. Probe hardware.
2. Build from source.
3. Wire up the agent.
Callouts
Minimal — body text only, no heading line:
All six variants, each with a heading line:
Maximal — multi-paragraph callout with nested content:
First paragraph explaining the situation in enough detail that it wraps across multiple lines and carries genuine weight in the layout.
It can include inline code, a link, and a short list:
- Point one
- Point two
Stat Grids
Single card
Two cards
Extremes — tiny number vs. huge number & long label
Six cards (in-column wrap)
Six cards, breakout (.wide) — roomier across the full column
Tables
Tables break out to the full column by default, so wide data tables aren't crammed into the prose measure. This text stays at the reading measure. The eight-column table below uses the room to its right, so every column gets space, Notes included.
| n-cpu-moe | Prompt 512 | Prompt 4096 | Gen | VRAM | KV cache | Status | Notes |
|---|---|---|---|---|---|---|---|
| 5 | — | — | — | OOM | — | OOM | No headroom for KV cache. |
| 6 | 1708 | 1670 | 119 | 13.6 GB | 1.9 GB | Best | Minimum offload that loads. |
| 7 | 1592 | 1540 | 115 | 13.9 GB | 1.9 GB | OK | One more expert layer on CPU. |
| 16 | 1084 | 1010 | 89 | 11.2 GB | 1.9 GB | Server | Extra room for KV cache. |
| 24 | 813 | 770 | 70 | 9.4 GB | 1.9 GB | Slow | Half the experts on CPU. |
Small table — opt back in with .narrow
A 2–3 column table looks sparse at full width. Add .narrow to keep it inline with the
prose measure.
| Flag | Meaning |
|---|---|
| -ngl 99 | Offload all layers to GPU. |
| -fa on | Enable Flash Attention. |
Complex — colspan & long cell text
| Model | Params | Verdict |
|---|---|---|
| Qwen3-Coder-30B-A3B | 30B / 3B active | Chosen. The MoE architecture stays fast even when some experts spill to CPU, and the Coder variant is fine-tuned on multi-turn function-calling traces, which is the thing that breaks first on local agents. |
| GLM-4.5-Air | ~100B+ MoE | Better tool-call reliability, but too large for 16 GB VRAM. |
| Everything below this row was ruled out for needing more than 16 GB at usable quantization. | | |
| Llama 3.3 70B | 70B dense | Possible at Q3, but ~3 tok/s. Unusable for agent loops. |
Figures & Charts
Figure with caption
Wrap an image, diagram, or screenshot in a <figure>. Swap the placeholder below for an
<img> in a real article. Add .wide for large visuals.
CSS bar chart
Good for simple number comparisons when you don't want to pull in a charting library. Set each bar's width
inline with --val:
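As a rough sketch of what that markup could look like, the helper below prints one row per bar and computes each `--val` from the data. The class names (`bar-row`, `bar-label`, `bar`) are hypothetical, and it assumes `--val` is read as a percentage of the largest value; check both against the real stylesheet before borrowing it.

```shell
# Emit one chart row per "label value" pair read from stdin.
# ASSUMPTIONS: the bar-row/bar-label/bar class names are hypothetical, and --val is
# taken to be a percentage of the largest value (passed in as the first argument).
bar_rows() {
  max=$1
  while read -r label value; do
    pct=$(( value * 100 / max ))   # integer percent, rounded down
    printf '<div class="bar-row"><span class="bar-label">%s</span><div class="bar" style="--val: %s%%"></div></div>\n' "$label" "$pct"
  done
}
```

Feed it label/value pairs, for example `printf '%s\n' 'moe-6 119' 'moe-24 70' | bar_rows 119`, and paste the output inside the chart container.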
Wide figure (.wide)
A .wide figure spans the full column. Good for wide charts or architecture diagrams.
Timeline
Single step
A timeline with one node.
Many steps, with badges, code, and long text
Confirmed RTX 5080 Laptop with 16 GB VRAM.
Used sudo dnf install cuda-devel cmake; added cuda-cudart-static separately because the compiler probe links statically and fails without it. This step took the longest because of a sequence of small packaging quirks that each looked fatal until the next one was fixed.
Configured CMake with -DCMAKE_CUDA_ARCHITECTURES=120.
Swept --n-cpu-moe from 6 to 24.
Added a provider entry and set it as default.
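The sweep step above could be scripted roughly like this. It is a sketch, not the exact procedure used: it reuses the llama-bench flags shown in the benchmark methodology on this page and assumes the binary is on PATH as `llama-bench`. The function only prints the commands, so you can review them before running anything.

```shell
# Print one llama-bench command per --n-cpu-moe value.
# ASSUMPTIONS: llama-bench is on PATH and the model path contains no spaces.
# Usage: sweep_cmds MODEL N [N...]
sweep_cmds() {
  model=$1; shift
  for n in "$@"; do
    printf 'llama-bench -m %s -ngl 99 -ncmoe %s -fa 1 -p 512,2048 -n 128 -r 3\n' "$model" "$n"
  done
}
```

To run the sweep for real, pipe the output to a shell: `sweep_cmds model.gguf 6 7 16 24 | sh`.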
Key Terms
Single term
Many terms — short and very long definitions
Breakout (.wide) grid
Glossary (.glossary) — definition list for longer lists
The card grid gets heavy once you have more than a handful of terms. A <dl class="glossary">
reads lighter: a cyan term over an indented definition, with a hairline between entries.
GGUF
  llama.cpp's container format.
Quantization
  Squashing 16-bit weights down to 4–8 bits. Cuts memory roughly 4× and, with a good scheme, most models barely notice the loss in quality, which is why nobody runs a 30B model in full precision at home.
KV Cache
  The model's per-conversation scratchpad for attention. It grows with context length, so a 32k window can quietly consume 1–3 GB of VRAM on its own.
Flash Attention
  A memory-efficient rewrite of the attention computation.
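The quantization and KV cache numbers in the glossary come down to simple arithmetic, sketched below. The formulas are the usual back-of-the-envelope estimates. The model dimensions passed to `kv_cache_mib` are illustrative assumptions, not values read from a model file, and real usage also depends on the cache type and runtime overhead.

```shell
# Approximate weight memory in GB: parameters (billions) x bits per weight / 8.
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

# Approximate KV cache size in MiB:
# 2 (keys and values) x layers x KV heads x head dim x context tokens x bytes per element.
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

weights_gb 30 16   # 60 (30B parameters at 16 bits)
weights_gb 30 4    # 15 (the same model at 4 bits, the 4x saving)
# Illustrative dimensions (assumed): 48 layers, 4 KV heads, head dim 128,
# a 32768-token context, 2-byte (f16) cache entries.
kv_cache_mib 48 4 128 32768 2   # 3072
```

Under those assumed dimensions a 32k window lands at about 3 GiB, the top of the range quoted above.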
Layout Blocks
Two-up comparison (.cols)
Good for pros/cons, before/after, or option A vs. B. It collapses to one column on narrow screens. (Pair it with callouts for color.)
- Near-zero water use
- Fill once at construction
- Higher upfront cost
- 70–80% of water lost to air
- Cheap to build
- Hard to permit in dry regions
Card grid (.card-grid)
Good for profiles, options, or per-item summaries, like vendor or product cards in a research article. Shown
here with .wide.
Crusoe
Closed-loop direct-to-chip cooling.
- ~50,000 gal/yr per building
- $1.375B Series E
Submer
Single-phase immersion cooling.
- Servers bathed in dielectric fluid
- Based in Spain
Corintis
Microfluidic chip-level cooling.
- 3× better heat removal in trials
- Channels modeled on leaf veins
Stacked card grid (.card-grid.stack)
Add .stack to force one card per row. This works better when each card carries a lot of detail
(long bullet lists, multiple paragraphs) that the auto-fit columns would otherwise squeeze.
Crusoe — Abilene, Texas
Closed-loop, non-evaporative direct-to-chip cooling.
- ~50,000 gal/yr per building — roughly 60% of one U.S. household's annual water footprint
- $1.375B Series E
Submer — Barcelona, Spain
Single-phase immersion cooling, servers bathed in dielectric fluid.
- Near-zero on-site water use
- $20M debt facility
Collapsible details
Tuck extra content, FAQs, or long configs behind a toggle so the main flow stays readable.
Show the full benchmark methodology
Measured with llama-bench on commit 19e92c3, CUDA 13.0.88, Blackwell sm_120a build, averaged over three runs.
llama-bench -m model.gguf -ngl 99 -ncmoe 6 -fa 1 -p 512,2048 -n 128 -r 3
Frequently asked: why not just use Ollama?
Ollama wraps llama.cpp but lags on bleeding-edge architecture support and doesn't expose the
--n-cpu-moe tuning that unlocked the throughput here.
Badges
Inline within a sentence: status went from OOM to
Best after dropping --n-cpu-moe to 6, with
slow and server configs in between.
Standalone row: blue green amber red
Code Blocks
Inline
Run llama-server --port 8080 and check curl -s http://127.0.0.1:8080/health.
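In a script you usually want to wait for that health check to pass before sending requests. A minimal sketch follows, using a generic retry helper; `wait_until` is a name made up for this example, not part of llama.cpp or curl.

```shell
# Retry a command until it succeeds: wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  tries=$1; delay=$2; shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    tries=$(( tries - 1 ))
    # Only sleep if another attempt is coming.
    [ "$tries" -gt 0 ] && sleep "$delay"
  done
  return 1
}
```

For the server above that would be `wait_until 30 1 curl -sf http://127.0.0.1:8080/health`. The `-f` flag makes curl exit non-zero on an HTTP error status, so an error response counts as "not ready yet".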
Minimal block (one line)
pkill -f llama-server
Maximal — long lines (horizontal scroll) + syntax spans
# Start the server; note the very long absolute paths that force horizontal scrolling inside the code block rather than stretching the page
/home/aaddrick/source/llama/llama.cpp/build/bin/llama-server \
  -m /home/aaddrick/source/llama/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  --chat-template-file /home/aaddrick/source/llama/llama.cpp/models/templates/Qwen3-Coder.jinja \
  -ngl 99 --n-cpu-moe 16 -fa on --jinja -np 1 -c 32768 \
  --host 127.0.0.1 --port 8080
Block with JSON + comment/string highlighting
// ~/.pi/agent/models.json (excerpt)
{
  "baseUrl": "http://127.0.0.1:8080/v1",
  "api": "openai-completions",
  "contextWindow": 32768
}
Breakout (.wide) code block
# A wide code block uses the full column, leaving fewer lines to wrap/scroll
./llama.cpp/build/bin/llama-bench -m models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -ncmoe 6 -fa 1 -p 512,2048 -n 128 -r 3
Footnotes & Sources
Footnotes
Inline citations[1] link down to a numbered list, and each note
links back[2] to where it was referenced. These are pure anchors,
no JavaScript, and the jump respects scroll-behavior.
1. First footnote, with a source link and supporting detail. ↩
2. Second footnote. ↩