Part 7 of 10 in a five-day series on the 2026 memory crunch. Part 6 showed why renting hides the bill; this one prices the alternative: running the models yourself.

If you’ve followed the series this far, you already know the punchline the cloud chapter set up: for steady, high-utilization AI work, owning the hardware beats renting it. So the obvious question for anyone who wants to run models locally — to keep prompts private, to cut a cloud bill that now only goes up, to actually own the thing — is what does that cost in 2026, and where does the money go?

The answer is unintuitive, and it’s good news for disciplined buyers. The most expensive local-inference rig is almost never the smartest one. Here’s how the math actually works.

Infographic: The Real Cost of a Local-Inference Rig (AI Dispatch · Reality Check · The Memory Squeeze · Part 7 of 10; thorstenmeyerai.com)

Panel: The one rule, the VRAM cliff. Same card, same model: 40–50 tok/s when the weights fit in VRAM, 1–2 tok/s when they spill to system RAM (labeled a 5–20× collapse).

Panel: Match the model to the memory (Q4)

Model class | VRAM | Hardware | Speed
7–8B | ~6–8GB | RTX 5070 Ti 16GB, used 3090 | 100+ t/s
26–32B | ~20GB | single 24GB (3090 / 4090) | 30–40 t/s
70B | ~43GB | RTX 5090 32GB, dual 3090, M4 Max 64GB | 40–50 t/s
100B+ / 405B | 60–130GB+ | Mac 128GB+ unified, quad 3090 (96GB) | slower

Panel: Build tiers, buy for the model class you actually run

Entry: 7–14B · 5070 Ti 16GB (~$750)
Mid: 26–32B · single 24GB
Pro: 70B · 5090 / dual-3090 / M4 Max
Frontier: 100B+ · Mac 128GB+ / multi-GPU

The one rule: the VRAM cliff

Everything about a local-inference build reduces to a single, unforgiving rule. If the model fits in your GPU’s video memory, it runs fast. If it doesn’t, it falls off a cliff.

This isn’t a gentle slope. A benchmark that recurs across the community: an RTX 5090 running a 70B model entirely in VRAM produces around 40–50 tokens per second, faster than you can read. The same card and model, spilling even partially into system RAM, collapses to 1–2 tokens per second, slower than reading speed and unusable for real work. Community reports put the typical collapse at 5-to-20×; in this example it is 20× or more. That cliff governs every decision you make.

The reason is that LLM inference is memory-bandwidth-bound, not compute-bound. The GPU can do the arithmetic far faster than memory can feed it weights, so the bottleneck is how fast data moves through VRAM. This is why raw compute specs — CUDA core counts, teraflops — are mostly noise for this use, and why VRAM capacity is the hard limit you build around. Fit the model you want in fast memory, and the rest is detail. Miss, and no amount of GPU horsepower saves you.
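To see why spilling is a cliff rather than a slope, here is a minimal back-of-envelope sketch. It assumes each generated token requires reading every weight once, so time per token is gigabytes divided by bandwidth, summed over the two memory pools. The 1,792 GB/s figure is the RTX 5090 bandwidth quoted later in this piece; the 30 GB/s path to system RAM and the 30GB and 43GB model sizes are illustrative assumptions rather than measurements, and the function name is hypothetical.

```python
def decode_tokens_per_second(model_gb, vram_gb, vram_bw_gbs, spill_bw_gbs):
    """Rough upper bound on generation speed for a bandwidth-bound model.

    Assumes every weight is read once per token; ignores KV cache,
    compute time, and any overlap between the two memory pools.
    """
    in_vram_gb = min(model_gb, vram_gb)         # part served from fast VRAM
    spilled_gb = max(model_gb - vram_gb, 0.0)   # part served over the slow path
    seconds_per_token = in_vram_gb / vram_bw_gbs + spilled_gb / spill_bw_gbs
    return 1.0 / seconds_per_token


# Assumed numbers: 32GB card at 1,792 GB/s, 30 GB/s to system RAM.
fits = decode_tokens_per_second(30, 32, 1792, 30)    # whole model in VRAM
spills = decode_tokens_per_second(43, 32, 1792, 30)  # 11GB lands in system RAM

print(f"fits in VRAM: {fits:.0f} tok/s")
print(f"spills 11GB:  {spills:.1f} tok/s")
```

Under these assumptions the fitted case lands near 60 tok/s and the spilled case under 3 tok/s: the slow pool holds only a quarter of the weights but accounts for almost all of the time per token.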

Sizing the model to the memory

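The arithmetic is simple enough to do in your head. A model needs roughly 2GB of memory per billion parameters at 16-bit (FP16) precision. Quantization, which compresses the weights, cuts that hard: Q8 roughly halves it and Q4 roughly quarters it, with surprisingly modest quality loss, which is exactly why Q4 is what most people actually run. Real figures land somewhat above the bare quarter, because common Q4 formats store a little more than 4 bits per weight and the runtime needs headroom for the KV cache and context.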

So the map from model to memory looks like this (at Q4):

7–8B models (Llama, Qwen, Mistral Small): ~6–8GB. Run on almost anything modern.
26–32B models (Qwen3 32B ~20GB, Gemma 4 ~18–20GB): fit a single 24GB card with room to spare. This is where local models start replacing API calls.
70B models (Llama 3.3 70B ~43GB): need more than one 24GB card. At Q4 that means dual GPUs or a 48–64GB Mac; a single 32GB RTX 5090 works only with aggressive Q3 to squeeze under 30GB.
100B+ and MoE (and the 405B / 671B giants): need 60–130GB+ — multi-GPU or large-unified-memory Macs, and the truly enormous ones stay impractical without heavy offload.
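The sizing rule above can be sketched as a small estimator. The bits-per-weight values and the flat 10% allowance for KV cache and runtime buffers are assumptions chosen for illustration, and `estimate_vram_gb` and `fits` are hypothetical helpers, not part of any runtime.

```python
# Assumed effective bits per weight; real formats vary by quantization scheme.
BITS_PER_WEIGHT = {"fp16": 16.0, "q8": 8.5, "q4": 4.5}


def estimate_vram_gb(params_billion, quant="q4", overhead=0.10):
    """Estimate memory for the weights plus a flat overhead allowance."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8  # 1B params at 8 bits = 1GB
    return weights_gb * (1 + overhead)


def fits(params_billion, memory_gb, quant="q4"):
    """True if the estimated footprint fits in the given memory."""
    return estimate_vram_gb(params_billion, quant) <= memory_gb


for size in (32, 70):
    print(f"{size}B at Q4: ~{estimate_vram_gb(size):.0f}GB")

print("32B Q4 on a 24GB card:", fits(32, 24))
print("70B Q4 on a 24GB card:", fits(70, 24))
print("70B Q4 on 48GB (two 24GB cards):", fits(70, 48))
```

With these assumptions the estimates land close to the figures in the list: roughly 20GB for a 32B model and 43GB for a 70B. Long contexts need more than a flat 10%, so treat the result as a floor rather than a guarantee.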

A note that matters for value: Mixture-of-Experts models punch above their weight. Qwen3’s 30B MoE activates only ~3B parameters per token, so it generates at close to small-model speed while delivering near-32B quality. All 30B parameters still have to sit in memory, so the saving is in speed rather than VRAM, but it is the closest thing to a free lunch the squeeze offers.
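A rough way to see the MoE trade: memory scales with total parameters, while bandwidth-bound speed scales with the parameters actually read per token. The sketch below assumes about 4.5 bits per weight and compares the 30B-total, ~3B-active model mentioned above against a dense 32B. It ignores routing overhead and shared layers, so the ratio is an upper bound, and all names are hypothetical.

```python
BYTES_PER_PARAM = 4.5 / 8  # assumed ~4.5 bits per weight at Q4


def footprint_gb(total_params_billion):
    """Memory needed to hold all weights, active or not."""
    return total_params_billion * BYTES_PER_PARAM


def gb_read_per_token(active_params_billion):
    """Weights streamed from memory to produce one token."""
    return active_params_billion * BYTES_PER_PARAM


dense = {"total": 32, "active": 32}  # dense model: every weight is active
moe = {"total": 30, "active": 3}     # MoE figures quoted in the text

for name, m in (("dense 32B", dense), ("MoE 30B", moe)):
    print(f"{name}: holds {footprint_gb(m['total']):.1f}GB, "
          f"reads {gb_read_per_token(m['active']):.1f}GB per token")

speed_ratio = gb_read_per_token(dense["active"]) / gb_read_per_token(moe["active"])
print(f"bandwidth-bound speedup, at most: ~{speed_ratio:.0f}x")
```

The two footprints come out nearly identical while the per-token read differs by about a factor of ten, which is the whole appeal: a card sized for the 30B class gets small-model latency.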

The counterintuitive value play: VRAM-per-dollar, not newest

Here’s where most buyers overspend. Faced with the cliff, the instinct is to buy the newest, biggest card. But for inference, the smart metric isn’t performance — it’s gigabytes of VRAM per dollar, and on that metric the newest cards lose badly.

A used RTX 3090 (24GB) runs about $600–850 and, by the sources’ count, delivers roughly five times the VRAM-per-dollar of an RTX 5090 (closer to 2× at the 5090’s ~$2,000 MSRP, more at street prices). It’s two generations old, sold without warranty, often ex-mining, and for inference, where VRAM capacity beats raw speed, it’s the value champion. It also keeps a feature the 4090 and 5090 dropped: NVLink, a fast bridge that links two 3090s so a model split across the pair can use their combined 48GB. That makes multi-3090 the cheapest serious path to big models: four used 3090s give you 96GB of combined VRAM (bridged in pairs, with the runtime splitting the model across all four) for roughly $2,400–3,400 in cards, enough to run a 70B model at high quality or a 120B at Q4, a capacity no single consumer flagship can touch.
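The metric is easy to compute for any shortlist. This sketch uses the prices quoted in this article (a used 3090 at $600–850, an RTX 5090 at about $2,000 MSRP); the $4,000 street price is a hypothetical input added only to show how sensitive the ratio is, not a quoted figure, and the helper names are made up for the example.

```python
def gb_per_dollar(vram_gb, price_usd):
    """VRAM capacity bought per dollar spent."""
    return vram_gb / price_usd


def value_ratio(card_a, card_b):
    """How many times more VRAM-per-dollar card_a offers than card_b."""
    return gb_per_dollar(*card_a) / gb_per_dollar(*card_b)


# Each card is (VRAM in GB, price in USD).
used_3090_low = (24, 600)     # low end of the quoted range
used_3090_high = (24, 850)    # high end of the quoted range
rtx_5090_msrp = (32, 2000)    # approximate MSRP from the text
rtx_5090_street = (32, 4000)  # hypothetical street price, for sensitivity only

for label, flagship in (("MSRP", rtx_5090_msrp), ("street (assumed)", rtx_5090_street)):
    worst = value_ratio(used_3090_high, flagship)
    best = value_ratio(used_3090_low, flagship)
    print(f"3090 vs 5090 at {label}: {worst:.1f}x to {best:.1f}x")

# Card cost of a four-way used 3090 build (96GB total).
print(f"4x used 3090: ${4 * 600:,} to ${4 * 850:,}")
```

The spread between the MSRP and street rows shows how much the headline multiple depends on what the flagship actually sells for, so it is worth rerunning with current prices before buying.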

The flagship still has its place. The RTX 5090 (32GB) is the only single consumer card reported to run a 70B model entirely in VRAM at 40–50 tok/s, though at the ~43GB Q4 size above that means a tighter quantization (around Q3) to fit under 32GB. Its ~1,792 GB/s bandwidth (about 78% more than the 4090) translates directly into speed because inference is bandwidth-bound. If you want one card, no NVLink fuss, and gaming on the side, it’s the pick, at ~$2,000 MSRP (often a good deal more on the street) and a 575W power draw. But “one expensive card” and “smartest dollar” are rarely the same answer in 2026.

The build tiers

Map your “target intelligence” — the model class you’ll actually run daily — to hardware, and stop there:

Entry (7–14B): RTX 5070 Ti 16GB (~$750, the current value sweet spot) or a used 3090. Runs coding assistants and local agents at 100+ tok/s.
Mid (26–32B): a single 24GB card — used 3090 or 4090. The point where a local model genuinely replaces many API calls; Qwen3 32B and Gemma 4 live here.
Pro (70B): an RTX 5090 32GB (single card, at a tight ~Q3 quantization), or dual/quad 3090s for the combined VRAM at Q4, or an M4 Max with 48–64GB unified memory.
Frontier (100B+): large-unified-memory Macs (128GB+) or multi-GPU rigs — the only consumer routes to models that rival commercial APIs.

The high-value threshold, the one upgrade worth stretching for, is getting to 24GB. A 24GB card costs only marginally more than a 16GB one but unlocks the entire 26–32B class — the tier where local inference becomes a real substitute for the cloud. Past that, every additional gigabyte should be justified by a model you genuinely run, not a model you might.
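The tier map can be written as a simple lookup. The band edges below follow the tiers above, but the article leaves gaps between bands (for example between 14B and 26B), so rounding a model up to the next tier is an assumption of this sketch, and `pick_tier` is a hypothetical helper.

```python
# (largest model size in billions of parameters, tier name, example hardware from the article)
TIERS = [
    (14, "Entry", "RTX 5070 Ti 16GB or a used 3090"),
    (32, "Mid", "single 24GB card (used 3090 or 4090)"),
    (70, "Pro", "RTX 5090 32GB, dual/quad 3090s, or M4 Max 48-64GB"),
    (float("inf"), "Frontier", "Mac with 128GB+ unified memory or a multi-GPU rig"),
]


def pick_tier(params_billion):
    """Return the smallest tier whose upper bound covers the model size."""
    for max_params, name, hardware in TIERS:
        if params_billion <= max_params:
            return name, hardware
    raise ValueError("unreachable: the last tier has no upper bound")


for size in (8, 32, 70, 120):
    name, hardware = pick_tier(size)
    print(f"{size}B -> {name}: {hardware}")
```

Feeding it the largest model you run daily, rather than the largest you might try once, is the discipline the tiers are meant to enforce.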

The Apple Silicon shortcut

There’s a second path that breaks the GPU rules entirely, and it’s the natural home for the biggest local models: Apple Silicon’s unified memory. On an M-series chip, system RAM is VRAM — any of it is usable by the GPU — which makes Macs the only consumer machines that reach 100GB+ of effective video memory at all. An M5 Max with 64GB can run models that would otherwise demand an H100; a Mac Studio with 128GB+ becomes a genuine local rival to commercial-grade models.

The trade-off is speed: a big Mac generates tokens slower than a discrete GPU — a Mac Ultra in the teens of tokens per second against an RTX 4090’s 50+. You buy capacity, not raw throughput. But for running the largest models that simply won’t fit on consumer GPUs, capacity is the whole game — and it’s exactly why this path gets its own chapter next. (NVIDIA’s announced $3,000 Project DIGITS desktop, 128GB unified memory for 200B+ models, signals this category is now a product, not just a DIY hack — when and at what real spec it ships remains to be seen.)

The rest of the rig

Memory is the story, but the supporting cast matters. Budget a fast NVMe SSD with 100–500GB free: a single Q4 model file runs from a few gigabytes to 40GB+ for a 70B, the largest models exceed 200GB, and you’ll reload them often (and yes, NAND prices climbed too, per Part 4, so size this deliberately). System RAM wants 32GB as a comfortable floor, 128GB if you plan to offload large models to the CPU. The CPU itself barely matters when a GPU is present; any modern 8-core chip is fine. And on software, Ollama, llama.cpp, and vLLM are the runtimes that matter, with NVIDIA’s 2026 optimizations reported to add up to ~35% faster token generation for free.

The take

The memory squeeze reframes the local rig the same way it reframed every other front in this series: the winning move is discipline, not maximalism. VRAM and unified memory are precisely the memory under the most pressure, so over-buying capacity is the same expensive mistake as the 128GB “to be safe” DDR5 kit — only worse, because GPU VRAM costs far more per gigabyte. Size the build to the model class you actually run; take the cheap, high-value step to 24GB; reach for used 3090s and MoE models where they beat the flagships on value; and use quantization to reach the next tier without buying more silicon — a lever Part 9 is built around.

Do that, and the rig pays for itself against the cloud’s ever-rising, ever-hidden bill — which was the whole point. The squeeze made memory expensive everywhere; it also made owning the right amount of it one of the few moves that still puts you ahead.

Next in the series, the path that quietly turned the memory shortage into an advantage: Apple Silicon’s Quiet Memory Advantage.

Sources: Core Lab, Kunal Ganglani, BSWEN, Local AI Master, Compute Market, IntuitionLabs, Overchat AI Hub (VRAM-per-dollar tiers, GPU prices and bandwidth, model-to-VRAM sizing, multi-GPU/NVLink configurations, Apple Silicon unified-memory capability, runtime performance); benchmark figures for tokens/sec reflect community testing (r/LocalLLaMA and cited labs). Hardware prices are point-in-time, late June 2026, and fast-moving. Analysis and recommendations are the author’s and not financial advice.

Author

Thorsten Meyer
