
Hello, World

The first post on the redesigned site.

Welcome

First post on the redesigned site.

What's here

A showcase of the components, so I can see what breaks as I make changes.

Typography & Prose

Heading level 3 — section subheading

Heading level 4 — minor label

Short paragraph. One sentence with a link.

Long paragraph with everything inline: bold, italic, a hyperlink, and inline_code(). It also contains a deliberately unbreakable path to test wrapping inside the reading measure — /home/aaddrick/source/llama/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf — and a bare URL: https://example.com/very/long/path/segment/that/keeps/going/and/going/without/spaces. The prose should stay at a comfortable measure and never trigger a horizontal scrollbar.

A pull-quote / blockquote at the maximum: italic, accent rule on the left, secondary color. Good for highlighting a single strong line lifted from the body text.

Unordered list (short)

  • One item.

Unordered list (long, nested)

  • First-level item with a fair amount of text that will wrap onto a second line at this measure to confirm the hanging indent behaves.
  • Second item
    • Nested item A
    • Nested item B with code
  • Third item

Ordered list

  1. Probe hardware.
  2. Build from source.
  3. Wire up the agent.

Callouts

Minimal — body text only, no heading line:

Just a single line of supporting information.

All six variants, each with a heading line:

Note (blue)
Neutral aside or context.
Aside (teal)
A tangent worth setting apart.
Takeaway (green)
The thing to remember.
Caution (amber)
A caveat or gotcha.
Warning (red)
Something that will bite you.
Emphasis (accent)
Uses the brand accent for the strongest pull.

Maximal — multi-paragraph callout with nested content:

A callout can hold real structure

First paragraph explaining the situation in enough detail that it wraps across multiple lines and carries genuine weight in the layout.

It can include inline code, a link, and a short list:

  • Point one
  • Point two

Stat Grids

Single card

120 t/s
Token generation

Two cards

1708 t/s
Prompt processing
119 t/s
Generation

Extremes — tiny number vs. huge number & long label

3
Layers in the stack
1,708,000
tokens processed per second under the synthetic micro-benchmark with a 512-token prompt
requests, no rate limit

Six cards (in-column wrap)

16 GB
VRAM
30 GB
System RAM
24
CPU threads
48
Layers
128
Experts
8
Active / token

Six cards, breakout (.wide) — roomier across the full column

16 GB
VRAM
30 GB
System RAM
24
CPU threads
48
Layers
128
Experts
8
Active / token

Tables

Tables break out to the full column by default, so wide data tables aren't crammed into the prose measure. This text stays at the reading measure. The eight-column table below uses the room to its right, so every column gets space, Notes included.

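| n-cpu-moe | Prompt 512 | Prompt 4096 | Gen | VRAM | KV cache | Status | Notes |
|---|---|---|---|---|---|---|---|
| 5 | OOM | | | | | OOM | No headroom for KV cache. |
| 6 | 1708 | 1670 | 119 | 13.6 GB | 1.9 GB | Best | Minimum offload that loads. |
| 7 | 1592 | 1540 | 115 | 13.9 GB | 1.9 GB | OK | One more expert layer on CPU. |
| 16 | 1084 | 1010 | 89 | 11.2 GB | 1.9 GB | Server | Extra room for KV cache. |
| 24 | 813 | 770 | 70 | 9.4 GB | 1.9 GB | Slow | Half the experts on CPU. |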

Small table — opt back in with .narrow

A 2–3 column table looks sparse at full width. Add .narrow to keep it inline with the prose measure.

| Flag | Meaning |
|---|---|
| -ngl 99 | Offload all layers to GPU. |
| -fa on | Enable Flash Attention. |

Complex — colspan & long cell text

| Model | Params | Verdict |
|---|---|---|
| Qwen3-Coder-30B-A3B | 30B / 3B active | Chosen. The MoE architecture stays fast even when some experts spill to CPU, and the Coder variant is fine-tuned on multi-turn function-calling traces, which is the thing that breaks first on local agents. |
| GLM-4.5-Air | ~100B+ MoE | Better tool-call reliability, but too large for 16 GB VRAM. |
| Everything below this row was ruled out for needing more than 16 GB at usable quantization. | | |
| Llama 3.3 70B | 70B dense | Possible at Q3, but ~3 tok/s. Unusable for agent loops. |

Figures & Charts

Figure with caption

Wrap an image, diagram, or screenshot in a <figure>. Swap the placeholder below for an <img> in a real article. Add .wide for large visuals.

image / diagram placeholder
Figure 1. Captions sit below the figure in secondary text.

CSS bar chart

Good for simple number comparisons when you don't want to pull in a charting library. Set each bar's width inline with --val:

n-cpu-moe 6: 119 t/s
n-cpu-moe 7: 115 t/s
n-cpu-moe 16: 89 t/s
n-cpu-moe 24: 70 t/s
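
If you would rather compute the inline values than eyeball them, here is a rough sketch that scales the four generation rates from the chart above against the largest one. It assumes --val takes a whole-number percentage of the track width; the stylesheet may expect a different unit, and bar_percent is a made-up helper name.

```shell
# bar_percent VALUE MAX
# Prints VALUE as a whole-number percentage of MAX, rounded to the nearest integer.
# Hypothetical helper; assumes --val is a percentage, which the page does not confirm.
bar_percent() {
  echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

# The four generation rates from the chart, scaled against the fastest (119 t/s).
for v in 119 115 89 70; do
  printf '%s\n' "--val: $(bar_percent "$v" 119)%"
done
```

Scaling against the largest bar keeps the fastest config at full width, so the others read as a fraction of it.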

Wide figure (.wide)

full-column chart / screenshot
Figure 2. A .wide figure spans the full column. Good for wide charts or architecture diagrams.

Timeline

Single step

Done

A timeline with one node.

Many steps, with badges, code, and long text

1. Probed hardware (2 min)

Confirmed RTX 5080 Laptop with 16 GB VRAM.

2. Installed CUDA 13 + cmake (done)

Used sudo dnf install cuda-devel cmake; added cuda-cudart-static separately because the compiler probe links statically and fails without it. This step took the longest because of a sequence of small packaging quirks that each looked fatal until the next one was fixed.

3. Built llama.cpp from source

Configured CMake with -DCMAKE_CUDA_ARCHITECTURES=120.

4. Benchmarked offload configs (slow)

Swept --n-cpu-moe from 6 to 24.

5. Wired pi.dev to llama-server

Added a provider entry and set it as default.
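
The sweep in step 4 could be scripted roughly like this. It is a sketch, not the exact script used: the llama-bench flags are the ones from the benchmark methodology further down the page, the values 6, 7, 16, and 24 are the rows that loaded in the results table, and MODEL, DRY_RUN, and sweep_ncmoe are names invented for illustration.

```shell
# Sketch: run llama-bench once per -ncmoe value.
# MODEL, DRY_RUN and sweep_ncmoe are hypothetical names, not part of llama.cpp.
MODEL="${MODEL:-model.gguf}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

sweep_ncmoe() {
  for n in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      # Print the command line instead of running it.
      echo "llama-bench -m $MODEL -ngl 99 -ncmoe $n -fa 1 -p 512,2048 -n 128 -r 3"
    else
      llama-bench -m "$MODEL" -ngl 99 -ncmoe "$n" -fa 1 -p 512,2048 -n 128 -r 3
    fi
  done
}

sweep_ncmoe 6 7 16 24
```

The dry-run default makes it easy to check the command lines before committing to a sweep that takes a while.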

Key Terms

Single term

VRAM
Memory on the GPU board.

Many terms — short and very long definitions

GGUF
llama.cpp's container format.
Quantization
Squashing 16-bit weights down to 4–8 bits. Cuts memory roughly 4× and, with a good scheme, most models barely notice the loss in quality — which is why nobody runs a 30B model in full precision at home.
MoE
Mixture of Experts.
KV Cache
The model's per-conversation scratchpad for attention. It grows with context length, so a 32k window can quietly consume 1–3 GB of VRAM on its own and dominate your offload math.
Flash Attention
A memory-efficient rewrite of the attention computation.
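
The "roughly 4×" in the Quantization entry is simple arithmetic, sketched below for a 30B-parameter model. This is a back-of-the-envelope estimate: it counts weights only and assumes exactly 16 or 4 bits per weight, while real quantization schemes store scales alongside the weights and land somewhat higher. weight_gb is a made-up helper name.

```shell
# weight_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints the approximate weight size in GB (10^9 bytes).
# Integer division, so fractional results are rounded down.
# Hypothetical helper; ignores quantization scales and other overhead.
weight_gb() {
  echo $(( $1 * $2 / 8 ))
}

weight_gb 30 16   # prints 60
weight_gb 30 4    # prints 15
```

Going from 60 GB to 15 GB is the 4× cut, and it is the difference between a model that cannot load on a 16 GB card and one that nearly fits.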

Breakout (.wide) grid

Tokens/sec
Throughput. Generation is the number you feel.
Context window
How many tokens the model can see at once.
sm_120
NVIDIA's Blackwell compute-capability string.

Glossary (.glossary) — definition list for longer lists

The card grid gets heavy once you have more than a handful of terms. A <dl class="glossary"> reads lighter: a cyan term over an indented definition, with a hairline between entries.

GGUF
llama.cpp's container format.
Quantization
Squashing 16-bit weights down to 4–8 bits. Cuts memory roughly 4× and, with a good scheme, most models barely notice the loss in quality — which is why nobody runs a 30B model in full precision at home.
KV Cache
The model's per-conversation scratchpad for attention. It grows with context length, so a 32k window can quietly consume 1–3 GB of VRAM on its own.
Flash Attention
A memory-efficient rewrite of the attention computation.
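
To see how the KV cache grows with context length, here is a rough estimate using the usual formula: two tensors (K and V) per layer, each holding KV heads × head dimension values per token. The 48 layers come from the stat cards above; the 4 KV heads, 128 head dimension, and bytes per element are assumptions for illustration and are not taken from this page. kv_cache_mib is a made-up helper name.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM BYTES_PER_ELEMENT CONTEXT_TOKENS
# Prints the approximate KV cache size in MiB:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * context_tokens
# Hypothetical helper; KV_HEADS=4 and HEAD_DIM=128 below are assumed values.
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

kv_cache_mib 48 4 128 2 32768   # 16-bit cache, 32k context: prints 3072
kv_cache_mib 48 4 128 1 32768   # 8-bit cache, 32k context: prints 1536
kv_cache_mib 48 4 128 2 4096    # 16-bit cache, 4k context: prints 384
```

The size scales linearly with context, which is why a long window can eat into the room left for offloaded layers.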

Layout Blocks

Two-up comparison (.cols)

Good for pros/cons, before/after, or option A vs. B. It collapses to one column on narrow screens. (Pair it with callouts for color.)

Closed-loop cooling
  • Near-zero water use
  • Fill once at construction
  • Higher upfront cost
Evaporative cooling
  • 70–80% of water lost to air
  • Cheap to build
  • Hard to permit in dry regions

Card grid (.card-grid)

Good for profiles, options, or per-item summaries, like vendor or product cards in a research article. Shown here with .wide.

Crusoe

Closed-loop direct-to-chip cooling.

  • ~50,000 gal/yr per building
  • $1.375B Series E

Submer

Single-phase immersion cooling.

  • Servers bathed in dielectric fluid
  • Based in Spain

Corintis

Microfluidic chip-level cooling.

  • 3× better heat removal in trials
  • Channels modeled on leaf veins

Stacked card grid (.card-grid.stack)

Add .stack to force one card per row. This works better when each card carries a lot of detail (long bullet lists, multiple paragraphs) that the auto-fit columns would otherwise squeeze.

Crusoe — Abilene, Texas

Closed-loop, non-evaporative direct-to-chip cooling.

  • ~50,000 gal/yr per building — roughly 60% of one U.S. household's annual water footprint
  • $1.375B Series E

Submer — Barcelona, Spain

Single-phase immersion cooling, servers bathed in dielectric fluid.

  • Near-zero on-site water use
  • $20M debt facility

Collapsible details

Tuck extra content, FAQs, or long configs behind a toggle so the main flow stays readable.

Show the full benchmark methodology

Measured with llama-bench on commit 19e92c3, CUDA 13.0.88, Blackwell sm_120a build, averaged over three runs.

llama-bench -m model.gguf -ngl 99 -ncmoe 6 -fa 1 -p 512,2048 -n 128 -r 3

Frequently asked: why not just use Ollama?

Ollama wraps llama.cpp but lags on bleeding-edge architecture support and doesn't expose the --n-cpu-moe tuning that unlocked the throughput here.

Badges

Inline within a sentence: status went from OOM to Best after dropping --n-cpu-moe to 6, with slow and server configs in between.

Standalone row: blue green amber red

Code Blocks

Inline

Run llama-server --port 8080 and check curl -s http://127.0.0.1:8080/health.
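
In a script, the health check usually wants a retry loop, since the server takes a while to load the model. A minimal sketch, assuming the endpoint answers with a non-2xx status until the server is ready; wait_for_health and its defaults are invented for illustration.

```shell
# wait_for_health URL [TRIES] [DELAY_SECONDS]
# Polls URL until curl succeeds or TRIES attempts have failed.
# Returns 0 once the endpoint answers with a 2xx status, 1 otherwise.
# Hypothetical helper, not part of llama.cpp.
wait_for_health() {
  url="$1"
  tries="${2:-30}"
  delay="${3:-1}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP error statuses; -s keeps it quiet.
    if curl -s -f -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (not run here):
#   wait_for_health http://127.0.0.1:8080/health 60 1 && echo "server ready"
```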

Minimal block (one line)

pkill -f llama-server

Maximal — long lines (horizontal scroll) + syntax spans

# Start the server — note the very long absolute paths that force horizontal scrolling inside the code block rather than stretching the page
/home/aaddrick/source/llama/llama.cpp/build/bin/llama-server \
  -m /home/aaddrick/source/llama/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  --chat-template-file /home/aaddrick/source/llama/llama.cpp/models/templates/Qwen3-Coder.jinja \
  -ngl 99 --n-cpu-moe 16 -fa on --jinja -np 1 -c 32768 \
  --host 127.0.0.1 --port 8080

Block with JSON + comment/string highlighting

// ~/.pi/agent/models.json (excerpt)
{
  "baseUrl": "http://127.0.0.1:8080/v1",
  "api": "openai-completions",
  "contextWindow": 32768
}

Breakout (.wide) code block

# A wide code block uses the full column, leaving fewer lines to wrap/scroll
./llama.cpp/build/bin/llama-bench -m models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -ncmoe 6 -fa 1 -p 512,2048 -n 128 -r 3

Footnotes & Sources

Footnotes

Inline citations[1] link down to a numbered list, and each note links back[2] to where it was referenced. These are pure anchors, no JavaScript, and the jump respects scroll-behavior.

  1. First footnote, with a source link and supporting detail.
  2. Second footnote.

Short source list

Long source list