
Hello, World

The first post on the redesigned site.

May 26, 2026 Personal Web Development

Welcome

First post on the redesigned site.

What's here

A showcase of the components so I can see what breaks as I make changes.

Typography & Prose

Heading level 3 — section subheading

Heading level 4 — minor label

Short paragraph. One sentence with a link.

Long paragraph with everything inline: bold, italic, a hyperlink, and inline_code(). It also contains a deliberately unbreakable path to test wrapping inside the reading measure — /home/aaddrick/source/llama/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf — and a bare URL: https://example.com/very/long/path/segment/that/keeps/going/and/going/without/spaces. The prose should stay at a comfortable measure and never trigger a horizontal scrollbar.

A pull-quote / blockquote at the maximum: italic, accent rule on the left, secondary color. Good for highlighting a single strong line lifted from the body text.

Unordered list (short)

One item.

Unordered list (long, nested)

First-level item with a fair amount of text that will wrap onto a second line at this measure to confirm the hanging indent behaves.
Second item
  - Nested item A
  - Nested item B with code
Third item

Ordered list

1. Probe hardware.
2. Build from source.
3. Wire up the agent.

Callouts

Minimal — body text only, no heading line:

Just a single line of supporting information.

All six variants, each with a heading line:

Note (blue)
Neutral aside or context.

Aside (teal)
A tangent worth setting apart.

Takeaway (green)
The thing to remember.

Caution (amber)
A caveat or gotcha.

Warning (red)
Something that will bite you.

Emphasis (accent)
Uses the brand accent for the strongest pull.

Maximal — multi-paragraph callout with nested content:

A callout can hold real structure

First paragraph explaining the situation in enough detail that it wraps across multiple lines and carries genuine weight in the layout.

It can include inline code, a link, and a short list:

Point one
Point two

Stat Grids

Single card

120 t/s

Token generation

Two cards

1708 t/s

Prompt processing

119 t/s

Generation

Extremes — tiny number vs. huge number & long label

Layers in the stack

1,708,000

tokens processed per second under the synthetic micro-benchmark with a 512-token prompt

∞

requests, no rate limit

Six cards (in-column wrap)

16 GB

VRAM

30 GB

System RAM

CPU threads

Layers

128

Experts

Active / token

Six cards, breakout (`.wide`) — roomier across the full column

16 GB

VRAM

30 GB

System RAM

CPU threads

Layers

128

Experts

Active / token

Tables

Tables break out to the full column by default, so wide data tables aren't crammed into the prose measure. This text stays at the reading measure. The eight-column table below uses the room to its right, so every column gets space, Notes included.

n-cpu-moe	Prompt 512	Prompt 4096	Gen	VRAM	KV cache	Status	Notes
5	—	—	—	OOM	—	OOM	No headroom for KV cache.
6	1708	1670	119	13.6 GB	1.9 GB	Best	Minimum offload that loads.
7	1592	1540	115	13.9 GB	1.9 GB	OK	One more expert layer on CPU.
16	1084	1010	89	11.2 GB	1.9 GB	Server	Extra room for KV cache.
24	813	770	70	9.4 GB	1.9 GB	Slow	Half the experts on CPU.

Small table — opt back in with `.narrow`

A 2–3 column table looks sparse at full width. Add .narrow to keep it inline with the prose measure.

Flag	Meaning
`-ngl 99`	Offload all layers to GPU.
`-fa on`	Enable Flash Attention.

Complex — colspan & long cell text

Model	Params	Verdict
Qwen3-Coder-30B-A3B	30B / 3B active	Chosen. The MoE architecture stays fast even when some experts spill to CPU, and the Coder variant is fine-tuned on multi-turn function-calling traces, which is the thing that breaks first on local agents.
GLM-4.5-Air	~100B+ MoE	Better tool-call reliability, but too large for 16 GB VRAM.
Everything below this row was ruled out for needing more than 16 GB at usable quantization.
Llama 3.3 70B	70B dense	Possible at Q3, but ~3 tok/s. Unusable for agent loops.

Figures & Charts

Figure with caption

Wrap an image, diagram, or screenshot in a <figure>. Swap the placeholder below for an <img> in a real article. Add .wide for large visuals.

image / diagram placeholder

Figure 1. Captions sit below the figure in secondary text.

CSS bar chart

Good for simple number comparisons when you don't want to pull in a charting library. Set each bar's width inline with --val:

n-cpu-moe 6: 119 t/s
n-cpu-moe 7: 115 t/s
n-cpu-moe 16: 89 t/s
n-cpu-moe 24: 70 t/s
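As a rough sketch of how the inline `--val` values could be derived, the snippet below scales each generation figure against the largest one. It assumes `--val` takes a percentage, and the `bar` class name is hypothetical; only the four throughput numbers come from the benchmark table above.

```python
def bar_widths(values):
    """Map raw values to percentage widths relative to the largest value."""
    peak = max(values.values())
    return {label: round(100 * v / peak, 1) for label, v in values.items()}


# Generation throughput (t/s) per --n-cpu-moe setting, taken from the table.
gen_tps = {
    "n-cpu-moe 6": 119,
    "n-cpu-moe 7": 115,
    "n-cpu-moe 16": 89,
    "n-cpu-moe 24": 70,
}

for label, pct in bar_widths(gen_tps).items():
    # "bar" is a made-up class name; --val as a percentage is an assumption.
    print(f'<div class="bar" style="--val: {pct}%">{label}</div>')
```

Scaling against the maximum keeps the longest bar at full width, so the others read as a fraction of the best config.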

Wide figure (`.wide`)

full-column chart / screenshot

Figure 2. A .wide figure spans the full column. Good for wide charts or architecture diagrams.

Timeline

Single step

Done

A timeline with one node.

Many steps, with badges, code, and long text

1. Probed hardware (2 min)

Confirmed RTX 5080 Laptop with 16 GB VRAM.

2. Installed CUDA 13 + cmake (done)

Used `sudo dnf install cuda-devel cmake`; added `cuda-cudart-static` separately because the compiler probe links statically and fails without it. This step took the longest because of a sequence of small packaging quirks that each looked fatal until the next one was fixed.

3. Built llama.cpp from source

Configured CMake with -DCMAKE_CUDA_ARCHITECTURES=120.

4. Benchmarked offload configs (slow)

Swept --n-cpu-moe from 6 to 24.

5. Wired pi.dev to llama-server

Added a provider entry and set it as default.

Key Terms

Single term

VRAM

Memory on the GPU board.

Many terms — short and very long definitions

GGUF

llama.cpp's container format.

Quantization

Squashing 16-bit weights down to 4–8 bits. Cuts memory roughly 2–4× (4× at 4-bit) and, with a good scheme, most models barely notice the loss in quality, which is why almost nobody runs a 30B model in full precision at home.

MoE

Mixture of Experts.

KV Cache

The model's per-conversation scratchpad for attention. It grows with context length, so a 32k window can quietly consume 1–3 GB of VRAM on its own and dominate your offload math.

Flash Attention

A memory-efficient rewrite of the attention computation.
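The Quantization and KV Cache definitions above both come down to simple arithmetic, which can be sketched as follows. This is a back-of-the-envelope estimate, not how llama.cpp accounts for memory: real quantized files carry extra metadata and mixed bit widths, and the layer count, KV head count, and head dimension used here are hypothetical values chosen for illustration.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the weights alone, ignoring format overhead."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size; the leading 2 covers one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GB = 1e9

# A 30B-parameter model at three bit widths.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_bytes(30e9, bits) / GB:.0f} GB")

# Hypothetical model shape (not read from any real config), 32k context, 16-bit cache.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=4, head_dim=128, context_len=32768)
print(f"KV cache: {cache / GB:.1f} GB")
```

Under these assumptions the weights drop from about 60 GB to about 15 GB at 4-bit, and the cache lands a little over 2 GB, which is in line with the ranges quoted in the definitions.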

Breakout (`.wide`) grid

Tokens/sec

Throughput. Generation is the number you feel.

Context window

How many tokens the model can see at once.

sm_120

NVIDIA's Blackwell compute-capability string.

Glossary (`.glossary`) — definition list for longer lists

The card grid gets heavy once you have more than a handful of terms. A <dl class="glossary"> reads lighter: a cyan term over an indented definition, with a hairline between entries.

GGUF: llama.cpp's container format.
Quantization: Squashing 16-bit weights down to 4–8 bits. Cuts memory roughly 2–4× (4× at 4-bit) and, with a good scheme, most models barely notice the loss in quality, which is why almost nobody runs a 30B model in full precision at home.
KV Cache: The model's per-conversation scratchpad for attention. It grows with context length, so a 32k window can quietly consume 1–3 GB of VRAM on its own.
Flash Attention: A memory-efficient rewrite of the attention computation.

Layout Blocks

Two-up comparison (`.cols`)

Good for pros/cons, before/after, or option A vs. B. It collapses to one column on narrow screens. (Pair it with callouts for color.)

Closed-loop cooling

Near-zero water use
Fill once at construction
Higher upfront cost

Evaporative cooling

70–80% of water lost to air
Cheap to build
Hard to permit in dry regions

Card grid (`.card-grid`)

Good for profiles, options, or per-item summaries, like vendor or product cards in a research article. Shown here with .wide.

Crusoe

Closed-loop direct-to-chip cooling.

~50,000 gal/yr per building
$1.375B Series E

Submer

Single-phase immersion cooling.

Servers bathed in dielectric fluid
Based in Spain

Corintis

Microfluidic chip-level cooling.

3× better heat removal in trials
Channels modeled on leaf veins

Stacked card grid (`.card-grid.stack`)

Add .stack to force one card per row. This works better when each card carries a lot of detail (long bullet lists, multiple paragraphs) that the auto-fit columns would otherwise squeeze.

Crusoe — Abilene, Texas

Closed-loop, non-evaporative direct-to-chip cooling.

~50,000 gal/yr per building — roughly 60% of one U.S. household's annual water footprint
$1.375B Series E

Submer — Barcelona, Spain

Single-phase immersion cooling, servers bathed in dielectric fluid.

Near-zero on-site water use
$20M debt facility

Collapsible details

Tuck extra content, FAQs, or long configs behind a toggle so the main flow stays readable.

Show the full benchmark methodology

Measured with llama-bench on commit 19e92c3, CUDA 13.0.88, Blackwell sm_120a build, averaged over three runs.

llama-bench -m model.gguf -ngl 99 -ncmoe 6 -fa 1 -p 512,2048 -n 128 -r 3
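For a sweep across offload settings, the same command can be generated per value rather than retyped. The sketch below only builds and prints the command lines; it reuses the flags exactly as they appear above and the four `-ncmoe` values that load in the benchmark table. The `bench_command` helper is a made-up name, not part of llama.cpp.

```python
import shlex


def bench_command(n_cpu_moe, model="model.gguf"):
    """Build the llama-bench argument list for one -ncmoe value."""
    return [
        "llama-bench", "-m", model,
        "-ngl", "99",
        "-ncmoe", str(n_cpu_moe),
        "-fa", "1",
        "-p", "512,2048",
        "-n", "128",
        "-r", "3",
    ]


# Offload values from the benchmark table; nothing is executed here.
for n in (6, 7, 16, 24):
    print(shlex.join(bench_command(n)))
```

Keeping the arguments as a list means each entry could be passed to `subprocess.run` unchanged if you wanted to execute the sweep rather than print it.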

Frequently asked: why not just use Ollama?

Ollama wraps llama.cpp but lags on bleeding-edge architecture support and doesn't expose the --n-cpu-moe tuning that unlocked the throughput here.

Badges

Inline within a sentence: status went from OOM to Best after dropping --n-cpu-moe to 6, with slow and server configs in between.

Standalone row: blue green amber red

Code Blocks

Inline

Run `llama-server --port 8080` and check `curl -s http://127.0.0.1:8080/health`.
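If the health check needs to live in a script rather than a terminal, something like the following could stand in for the `curl` call. It is a minimal sketch that assumes the server answers with HTTP 200 once it is ready; the `is_healthy` helper and the two-second timeout are illustrative choices, not anything llama-server prescribes.

```python
import urllib.error
import urllib.request


def is_healthy(url="http://127.0.0.1:8080/health", timeout=2.0):
    """Return True only if the endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "not reachable")
```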

Minimal block (one line)

pkill -f llama-server

Maximal — long lines (horizontal scroll) + syntax spans

# Start the server — note the very long absolute paths that force horizontal scrolling inside the code block rather than stretching the page
/home/aaddrick/source/llama/llama.cpp/build/bin/llama-server \
  -m /home/aaddrick/source/llama/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  --chat-template-file /home/aaddrick/source/llama/llama.cpp/models/templates/Qwen3-Coder.jinja \
  -ngl 99 --n-cpu-moe 16 -fa on --jinja -np 1 -c 32768 \
  --host 127.0.0.1 --port 8080

Block with JSON + comment/string highlighting

// ~/.pi/agent/models.json (excerpt)
{
  "baseUrl": "http://127.0.0.1:8080/v1",
  "api": "openai-completions",
  "contextWindow": 32768
}
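One thing to keep in mind with excerpts like this: the leading `//` line is handy as a label, but a strict JSON parser rejects it. The sketch below drops whole-line `//` comments before parsing. Whether the real file tolerates comments is not something this page establishes, so treat the helper as an assumption-laden convenience for working with the excerpt.

```python
import json

excerpt = """// ~/.pi/agent/models.json (excerpt)
{
  "baseUrl": "http://127.0.0.1:8080/v1",
  "api": "openai-completions",
  "contextWindow": 32768
}"""


def strip_line_comments(text):
    """Drop lines whose first non-blank characters are //."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith("//")]
    return "\n".join(kept)


config = json.loads(strip_line_comments(excerpt))
print(config["contextWindow"])
```

Only whole-line comments are removed, so the `//` inside the `baseUrl` string is left alone.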

Breakout (`.wide`) code block

# A wide code block uses the full column, leaving fewer lines to wrap/scroll
./llama.cpp/build/bin/llama-bench -m models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -ncmoe 6 -fa 1 -p 512,2048 -n 128 -r 3

Footnotes & Sources

Footnotes

Inline citations[1] link down to a numbered list, and each note links back[2] to where it was referenced. These are pure anchors, no JavaScript, and the jump respects scroll-behavior.

1. First footnote, with a source link and supporting detail. ↩
2. Second footnote. ↩

Short source list

Reference: ggml-org/llama.cpp

Long source list

Section sources: EESI: Data Centers and Water Consumption • Lincoln Institute: Data Drain • Microsoft: Zero-Water Cooling • Goldman Sachs: Liquid Cooling Forecast • Data Center Watch: Q2 2025 Opposition Report • WilmerHale: State Regulation Trends • EU: Data Centre Energy Performance
