Deep Explainer — Large Language Models
Weights, Harnesses, and the Machine That Thinks
What LLMs actually are under the hood — and why the scaffolding around them matters just as much as the model itself
Large Language Models are the defining technology of this decade. Yet for most practitioners — and even many builders — what actually lives inside them, how they become intelligent, and what the infrastructure around them does remain opaque. This article closes that gap: a ground-up explainer covering what LLMs are, what weights are, and the critical but underappreciated concept of the harness — the system that turns raw model parameters into a working AI product.
Part I — What Is a Large Language Model?
A Large Language Model is, at its most fundamental level, a probability engine for tokens. Given a sequence of tokens — the basic units into which text is broken — it predicts what token should come next, assigning a probability to every possible continuation. Do this repeatedly, sampling from those probabilities at each step, and you get generated text.
This description sounds almost banal. But the architecture that makes it work at scale — the transformer — is a remarkable engineering achievement, and the scale at which modern LLMs operate is difficult to overstate.
Tokens: The Atomic Unit
Before any model sees text, that text is broken into tokens — chunks that roughly correspond to word fragments. The word unbelievable might be three tokens: un, believ, able. A space, a comma, or a numeral might each be a single token. Modern LLMs use tokenisers (typically BPE — Byte Pair Encoding) trained on large corpora to build a compact vocabulary of roughly 30,000 to 100,000 token types.
This matters because the model never sees raw characters or words — it sees integers. Each token maps to an integer, and that integer maps to a high-dimensional vector (an embedding). That vector is what the transformer actually operates on.
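To make this concrete, here is a minimal sketch using the Hugging Face transformers library and the GPT-2 tokeniser (one common BPE tokeniser; the exact splits and integer IDs differ from tokeniser to tokeniser):
# Text -> token fragments -> integers (Hugging Face transformers, GPT-2 tokeniser)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # GPT-2's BPE vocabulary, ~50k token types
ids = tokenizer.encode("unbelievable")              # a short list of integers
pieces = tokenizer.convert_ids_to_tokens(ids)       # the sub-word fragments those integers name
print(pieces, ids)                                  # the model only ever sees the integers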
The Transformer Architecture
Introduced in the landmark 2017 paper Attention Is All You Need by Vaswani et al. at Google, the transformer replaced recurrent networks (RNNs, LSTMs) with a mechanism called self-attention. This turned out to be transformative — pun entirely intended.
The core insight of self-attention: to understand any token in a sequence, you should be able to attend to — look at and weight — every other token in the sequence simultaneously. This resolves the fatal flaw of RNNs, which processed tokens sequentially and struggled to retain context across long distances.
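In code, that insight is a handful of matrix multiplications. Below is a minimal single-head, unmasked self-attention in NumPy; production models add many heads, causal masking, and a fresh copy of these matrices in every layer, but the mechanism is exactly this:
# Scaled dot-product self-attention: every token attends to every other token
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token vectors; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values for each token
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # each token scores every other token
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows are attention distributions
    return weights @ V                               # each output mixes the whole sequence

# Toy usage: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # (4, 8): context-mixed representations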
A modern LLM stacks dozens to hundreds of these transformer blocks. Each block refines the representation of every token, allowing the model to build up increasingly abstract, context-sensitive understanding. By the final layer, each token’s representation has been influenced by every other token in the context window — a form of distributed meaning.
Scale: Why “Large” Matters
The “large” in LLM is not marketing. Scale — measured in both parameter count and training data — produces qualitative capability jumps that researchers refer to as emergent abilities: capabilities that appear suddenly above certain thresholds and were not present in smaller versions of the same architecture.
| Model Era | Parameters | Training Tokens | Capability Horizon |
|---|---|---|---|
| GPT-2 (2019) | 1.5B | ~40B | Coherent paragraphs |
| GPT-3 (2020) | 175B | 300B | Few-shot learning, reasoning |
| PaLM (2022) | 540B | 780B | Chain-of-thought, multilingual |
| GPT-4 / Claude 3 era | ~1T+ (est.) | Trillions | Expert-level problem solving |
| Frontier 2025+ | Undisclosed | Undisclosed | Multimodal, agentic reasoning |
The Chinchilla scaling laws (DeepMind, 2022) offered a key refinement: optimal training requires roughly 20 tokens of training data per parameter (so a 70-billion-parameter model wants around 1.4 trillion training tokens). This shifted industry practice from building ever-larger models toward training right-sized models on far more data.

Part II — What Are Weights?
If the transformer architecture is the blueprint, weights are the substance. Weights — also called parameters — are the numerical values stored in every matrix in the network. They are what the model actually is. Strip away the code, the training infrastructure, the API layer — and what you have left is a file (or set of files) containing billions of floating-point numbers.
What Weights Actually Represent
At initialisation, weights are random noise. Training is the process of adjusting them — through millions of gradient descent steps, each nudging billions of parameters — until the model reliably predicts the next token. After training, those adjusted numbers encode something extraordinary: a compressed, lossy representation of a vast swath of human knowledge, linguistic structure, logical reasoning patterns, and world models.
When you ask a model who wrote Hamlet, no lookup table is consulted. Instead, the sequence of tokens for your query activates specific patterns across the weight matrices, and the resulting computation produces the correct continuation. The knowledge is implicit in the weights, not stored explicitly. This is at once the most powerful and the most puzzling property of neural language models.
“The weights of a large language model are a strange kind of object: they are both the program and the data, both the algorithm and its output. They do not compute — they are the computation, crystallised.” — Conceptual framing, ThorstenmeyerAI.com
Weight Structure: A Taxonomy
Not all weights do the same thing. Within a transformer, weights fall into several functional groups (a rough parameter tally follows the list):
- Embedding matrix — maps token IDs to dense vectors. This is where discrete symbols become continuous geometry.
- Attention weights (Q, K, V, O matrices) — control which tokens attend to which, and how information is mixed. These encode syntactic and semantic relationships.
- Feed-forward weights — the FFN layers, which operate on each token independently. Research suggests these act as a kind of key-value factual memory store.
- Layer normalisation parameters — small but critical for training stability.
- Unembedding matrix — projects the final hidden state back onto the vocabulary to produce next-token logits.
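To see how these groups add up, here is a back-of-the-envelope tally for a hypothetical GPT-style decoder. Every dimension below is invented for illustration, and real models differ in detail (tied embeddings, gated FFNs, grouped-query attention):
# Rough parameter tally for a hypothetical GPT-style decoder
vocab, d_model, n_layers, d_ff = 50_000, 4_096, 32, 16_384

embedding   = vocab * d_model                    # token IDs -> dense vectors
attention   = n_layers * 4 * d_model * d_model   # Q, K, V, O projections, every layer
feedforward = n_layers * 2 * d_model * d_ff      # FFN up- and down-projection, every layer
layernorm   = n_layers * 2 * 2 * d_model         # scale + bias, two norms per layer
unembedding = d_model * vocab                    # final hidden state -> vocabulary logits

total = embedding + attention + feedforward + layernorm + unembedding
print(f"{total / 1e9:.1f}B parameters")          # ~6.9B with these made-up dimensions
Note where the bulk lives: in this sketch the FFN layers alone account for well over half of the total, which fits their suspected role as the model's factual memory.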
How Weights Are Created: The Training Pipeline
Weights are not designed — they are discovered through optimisation. The pipeline has three major phases:
Phase 1 — Pre-Training
The model is trained on a massive corpus of text — web pages, books, code, academic papers, multilingual data — using next-token prediction as the self-supervised objective. No human labels are required. The model simply tries to predict the next token in each training example, its predictions are compared to the actual token via cross-entropy loss, and gradients flow backward through the network, nudging every weight slightly in the direction that reduces error.
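A sketch of one optimisation step under that objective, assuming a PyTorch model object that maps token IDs to next-token logits (the model and optimiser here are placeholders; real pre-training wraps this loop in the distributed machinery described in Part III):
# Next-token prediction: the self-supervised pre-training objective (PyTorch sketch)
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, batch):
    """batch: (B, T) integer token IDs. `model` maps IDs to (B, T-1, vocab) logits."""
    inputs, targets = batch[:, :-1], batch[:, 1:]   # each position predicts the NEXT token
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))     # compare predictions to actual tokens
    loss.backward()                                 # gradients flow back through every weight
    optimizer.step()                                # nudge all weights to reduce the error
    optimizer.zero_grad()
    return loss.item()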
This process, repeated across trillions of examples over weeks on thousands of GPUs, produces a base model: an extraordinarily capable predictor that has absorbed vast knowledge but is not yet a useful assistant. It will complete text in any direction — including harmful, incoherent, or unhelpful directions.
Phase 2 — Supervised Fine-Tuning (SFT)
The base model is fine-tuned on a smaller, high-quality dataset of (instruction, ideal response) pairs, curated or written by human trainers. This teaches the model to behave as an assistant — following instructions, formatting outputs helpfully, staying on topic. Fine-tuning adjusts the weights only slightly (far fewer steps, on far less data), layering new behaviour on top of the pre-trained foundation.
Phase 3 — Alignment (RLHF / RLAIF / DPO)
Reinforcement Learning from Human Feedback (RLHF) is the technique that turned instruction-tuned models into the polished assistants the public encountered with ChatGPT and Claude. Human raters compare pairs of model outputs and indicate which is better. A separate reward model learns to predict these preferences, then guides further training of the main model via RL (typically PPO). More recently, Direct Preference Optimisation (DPO) achieves similar alignment goals without requiring a separate reward model, and RLAIF (RL from AI feedback) scales the feedback process using a model itself as the rater.
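Part of DPO's appeal is how compact its objective is. Below is a sketch of the published loss (Rafailov et al., 2023); the tensor names are mine, and each input is the summed log-probability of a complete response under either the policy being trained or the frozen reference model:
# Direct Preference Optimisation loss (sketch of the published objective)
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Inputs: sequence log-probs of preferred/rejected responses under the
    training policy and under the frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen        # how far the policy has moved
    rejected_margin = logp_rejected - ref_logp_rejected  # on each response, vs. reference
    # Push the preferred response up relative to the rejected one, scaled by beta
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()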
Weight Formats and Quantisation
Full-precision weights are stored as 32-bit or 16-bit floating point numbers. A 70-billion parameter model in fp16 requires ~140 GB of VRAM — beyond what a single consumer GPU holds. Quantisation reduces this footprint by representing weights in fewer bits (int8, int4, even int2), trading a small quality degradation for dramatically lower hardware requirements. The GGUF format, popularised by the llama.cpp project, became the standard for quantised local inference, enabling models like Llama 3 to run on a MacBook.
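A minimal sketch of the idea using symmetric int8 quantisation over a whole tensor; real formats such as GGUF quantise in small blocks, each with its own scale, which preserves considerably more quality:
# Symmetric int8 quantisation: 1 byte per weight instead of 2 (fp16) or 4 (fp32)
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0              # map the largest magnitude onto int8's range
    q = np.round(w / scale).astype(np.int8)      # lossy: nearby weights collapse together
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale          # approximate weights used at inference

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(dequantize(q, scale) - w).mean()  # small but non-zero: the quality trade-off
The footprint arithmetic is straightforward: at 2 bytes per weight, 70B parameters need ~140 GB; int8 halves that to ~70 GB, and int4 brings it to ~35 GB plus overhead.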
Weights are immutable at inference time. The model does not learn from conversations. Every time you send a message, the same frozen weights process your input. What changes is only what’s in the context window — the runtime state. This is why the harness matters so much.

Part III — The Harness
Here is the insight that separates practitioners who truly understand LLM deployment from those who don’t: the model weights are necessary but not sufficient. A raw model file, no matter how capable, does nothing useful without the infrastructure that surrounds it. That infrastructure — in its totality — is what we call the harness.
The term “harness” is borrowed from software testing (a “test harness” is the infrastructure for running tests) and has expanded in AI to describe any scaffolding that controls, constrains, and routes model execution. It operates at multiple layers, each with distinct responsibilities.
Layer 1 — The Training Harness
Before inference, there is training. The training harness is the orchestration layer that manages the pre-training and fine-tuning process itself. Its components include:
- Distributed training framework — tools like DeepSpeed, Megatron-LM, or PyTorch FSDP that partition the model and data across thousands of GPUs, coordinating gradient accumulation across a cluster.
- Data pipeline — the preprocessing, tokenisation, shuffling, and streaming of training data. Quality filtering at this stage has an outsized effect on the resulting model.
- Optimiser state — during training, the harness maintains momentum buffers and second-moment estimates (for optimisers like Adam) that are several times larger than the weights themselves.
- Checkpointing and experiment tracking — periodic snapshots of weight state, loss curves, and evaluation metrics logged to tools like Weights & Biases or MLflow.
The training harness is invisible to end users but shapes every property of the resulting model. The data mix, the learning rate schedule, the batch size, the sequence length — all of these decisions, implemented in the training harness, directly determine what the weights learn to do.
Layer 2 — The Evaluation Harness
How do you know if a model is good? The evaluation harness answers this. The most widely used open-source example is EleutherAI’s lm-evaluation-harness, which provides a standardised framework for running language models against hundreds of benchmark tasks — MMLU, HellaSwag, ARC, GSM8K, HumanEval, and many more.
# Running a model through the eval harness
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-70B-Instruct \
  --tasks mmlu,arc_challenge,gsm8k \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path ./results
The evaluation harness matters because benchmarks are not neutral. How prompts are formatted, how few-shot examples are chosen, whether chain-of-thought is elicited — all of these are decisions made in the harness, and they can shift benchmark scores dramatically. The same weights, evaluated with different harness configurations, can produce strikingly different numbers. This is why model leaderboards should always be read with awareness of harness methodology.
Benchmark contamination is the risk that training data includes examples from evaluation datasets, inflating scores. A rigorous evaluation harness includes contamination detection — checking whether benchmark examples appear in the training corpus. Without this, leaderboard numbers can be deeply misleading.
Layer 3 — The Inference Harness
At serving time, the inference harness is the runtime environment that loads weights, manages computation, and handles requests. This is where raw capability is turned into low-latency, scalable service.
Key Inference Harness Components
KV Cache: During autoregressive generation, the model repeatedly processes the same preceding tokens. A KV (key-value) cache stores the attention representations of already-processed tokens, so each new token only requires computing attention over the new additions. Without this, inference cost grows quadratically with context length. Managing the KV cache — its memory, eviction policy, and sharing across requests — is a primary concern of every production inference system.
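A sketch of the generation loop the cache enables, with a hypothetical model interface that accepts and returns its cache; production engines implement the same idea with far more elaborate memory management (vLLM's PagedAttention, for instance):
# Autoregressive generation with a KV cache (hypothetical model interface)
def generate(model, prompt_ids, n_new):
    ids = list(prompt_ids)
    kv_cache = None
    next_input = list(prompt_ids)               # first pass processes the whole prompt once
    for _ in range(n_new):
        logits, kv_cache = model(next_input, kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())   # greedy sampling, for simplicity
        ids.append(next_token)
        next_input = [next_token]               # every later pass feeds ONE token; keys and
    return ids                                  # values for the rest come from the cache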
Continuous Batching: Naive batching waits until a batch is full before starting computation, then holds the whole batch until its longest sequence finishes. Continuous batching (introduced by Orca, implemented in vLLM and TensorRT-LLM) admits and retires requests at the token level, dramatically improving GPU utilisation when request lengths vary.
Speculative Decoding: A small “draft” model proposes several tokens at once, and the large model verifies them all in a single parallel forward pass. Tokens the large model agrees with are accepted for free; at the first disagreement, the large model’s own token is substituted and drafting resumes from there. This can yield 2–3× speedups with no change in output quality.
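A sketch reduced to the greedy-decoding case; the published algorithms use a rejection-sampling acceptance rule that provably preserves the target model's output distribution, but the control flow looks like this:
# Speculative decoding, greedy variant: draft cheaply, verify in one parallel pass
def speculative_step(target, draft, ids, k=4):
    proposal = list(ids)
    for _ in range(k):                                   # small model drafts k tokens
        proposal.append(int(draft(proposal)[-1].argmax()))
    target_logits = target(proposal)                     # ONE forward pass of the big model
    accepted = list(ids)
    for i in range(len(ids), len(proposal)):
        expected = int(target_logits[i - 1].argmax())    # what the big model would emit here
        if proposal[i] == expected:
            accepted.append(proposal[i])                 # agreement: a free token
        else:
            accepted.append(expected)                    # first disagreement: take the target's
            break                                        # token, discard the remaining drafts
    return accepted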
Popular open-source inference harnesses include vLLM (PagedAttention for memory efficiency, OpenAI-compatible API), Ollama (developer-friendly local deployment, GGUF support), TGI — Text Generation Inference (Hugging Face, production-grade), and llama.cpp (cross-platform CPU+GPU inference, the foundation of many local tools).
Layer 4 — The Prompt Harness
This is where most practitioners spend most of their time, and where the harness concept is most practically important. The prompt harness is the layer that shapes what the model receives — structuring inputs, injecting context, enforcing formats, and post-processing outputs.
The System Prompt
Every modern LLM deployed as an assistant accepts a system prompt — an instruction block prepended to the conversation that is invisible to end users but defines the model’s persona, constraints, knowledge, and behaviour. The system prompt is the most powerful single lever in the prompt harness.
# Minimal system prompt structure
{
  "role": "system",
  "content": "You are an expert regulatory compliance
              assistant for a GxP pharmaceutical environment.
              Always cite 21 CFR Part 11 requirements.
              Never speculate about clinical outcomes.
              Respond in structured sections with explicit
              confidence levels."
}
Engineering the system prompt is one of the highest-leverage activities in LLM product development. A well-crafted one can radically change the effective behaviour of a model without touching any weights. This is why companies guard their system prompts closely — they represent genuine engineering work and competitive differentiation.
Retrieval-Augmented Generation (RAG)
The prompt harness is also responsible for RAG — the practice of retrieving relevant documents at query time and injecting them into the context before the model generates a response. RAG addresses the fundamental limitation that weights are frozen: they cannot know about events after training, or about private data.
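A minimal sketch of the assembly step; the retriever and llm objects here are stand-ins for whatever vector store and chat API a deployment actually uses:
# RAG in the prompt harness: retrieve, inject, then generate
def answer_with_rag(retriever, llm, question, k=3):
    docs = retriever.search(question, top_k=k)            # semantic search over private data
    context = "\n\n".join(f"[{i + 1}] {d.text}" for i, d in enumerate(docs))
    prompt = (
        "Answer using only the numbered sources below, citing them as [1], [2], ...\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)    # the weights stay frozen; new knowledge arrives via the context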
Tool Use and Function Calling
Modern LLMs can be prompted to invoke tools — external functions, APIs, or services — rather than simply generating text. The harness defines what tools are available, formats the tool definitions into the model’s context, intercepts tool-call outputs, executes the actual function, and feeds results back to the model for continued generation. This is the foundation of agentic AI.
# Tool definition in the prompt harness (Anthropic-style input_schema shown)
tools = [
    {
        "name": "search_regulatory_database",
        "description": "Search FDA guidance documents",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "regulation": {"type": "string",
                               "enum": ["21CFR11", "21CFR820", "ICH-Q10"]},
            },
            "required": ["query"],
        },
    }
]
# Harness: call model → detect tool_use block
# → execute search → inject result → call model again
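Fleshing out those two comment lines: a generic sketch of the loop, with message shapes and attribute names invented for illustration rather than taken from any particular provider's SDK:
# The agentic loop: call the model until it stops asking for tools
def run_with_tools(llm, tools, messages, executors):
    while True:
        response = llm(messages=messages, tools=tools)   # model answers OR requests a tool
        if response.tool_call is None:
            return response.text                         # plain answer: the loop is done
        call = response.tool_call
        result = executors[call.name](**call.arguments)  # the HARNESS runs the real function
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "content": str(result)})  # feed the result back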
Layer 5 — The Governance Harness
The final and increasingly important harness layer sits above individual requests: the governance harness. This encompasses systems that ensure model behaviour aligns with policy, law, and organisational risk tolerance.
- Input filters — classifiers that detect and block policy-violating inputs before they reach the model.
- Output classifiers — scanning model responses for harmful, confidential, or non-compliant content before delivery.
- Rate limiting and access control — ensuring model capabilities are appropriately gated by user role and tier.
- Audit logging — capturing inputs and outputs for compliance, debugging, and fine-tuning data collection.
- Model routing — directing different query types to different models based on cost, latency, and quality trade-offs.
The governance harness is where GxP-regulated industries (pharma, medical devices, clinical trials) have the most work to do. 21 CFR Part 11, Annex 11, and ICH-Q10 all impose requirements on auditability, access control, and validation that translate directly into harness engineering requirements — not model requirements. A compliant LLM deployment is primarily a compliant harness.
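In code, this layer collapses into a wrapper around every model call. A minimal sketch, with the filter functions and logger standing in for whatever policy stack and audit store an organisation actually runs:
# Governance wrapper: filter in, filter out, log everything
import json
import time

def governed_call(llm, user, prompt, input_filter, output_filter, audit_log):
    if not input_filter(prompt):
        raise PermissionError("input blocked by policy")    # stop before the model sees it
    response = llm(prompt)
    if not output_filter(response):
        response = "[response withheld: policy violation]"  # scan before delivery
    audit_log.write(json.dumps({                            # durable record for compliance
        "ts": time.time(), "user": user,
        "prompt": prompt, "response": response,
    }) + "\n")
    return response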

AI & Language Systems: Designing LLM Workflows for Experts: From Concept to Deployment – Build, Automate, Maintain and Scale Advanced Machine Learning … the Future of Intelligent Infrastructure)
As an affiliate, we earn on qualifying purchases.
As an affiliate, we earn on qualifying purchases.
Part IV — Weights vs. Harness: The Real Capability Equation
The AI industry talks obsessively about model capabilities — benchmark scores, parameter counts, context lengths. But for anyone building real products, a more honest formulation is this: the same model weights, wrapped in a thoughtless harness, will dramatically underperform compared to the same weights in a carefully engineered deployment. Conversely, a modestly capable model in an excellent harness can outperform a frontier model in a poor one on specific tasks.
Chain-of-thought prompting — asking the model to “think step by step” — reliably improves reasoning performance on arithmetic and logic tasks. Same weights, different harness, measurable capability delta. Few-shot examples shift model behaviour toward specific output formats without any fine-tuning. RAG consistently matches or exceeds fine-tuning for knowledge-intensive tasks, with lower cost and more updateable knowledge. Tool use enables frontier models to achieve near-perfect accuracy on mathematical tasks that the same models score around 50% on without tool access. The calculator is part of the harness, not the weights.
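The cheapest of those interventions is a single added line of text. A sketch, with an invented question:
# Chain-of-thought as a harness decision: same weights, one extra instruction
question = "A train leaves at 09:40 and arrives at 11:05. How long is the journey?"

plain_prompt = question
cot_prompt = question + "\nThink step by step, then state the final answer on its own line."

# The second prompt elicits intermediate reasoning tokens before the answer,
# which measurably improves accuracy on arithmetic and logic tasks.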
The Harness as Competitive Moat
As base model capabilities commoditise — driven by open-weights models from Meta, Mistral, and others — the harness becomes the primary site of competitive differentiation. For builders, this is good news. You don’t need to train a frontier model to build a frontier product. Domain-specific RAG pipelines, well-crafted system prompts, thoughtful tool integration, and rigorous output validation can produce product experiences that a raw frontier model cannot replicate without them.

Part V — The Emerging Architecture: Harnesses All the Way Down
The most significant shift in 2024–2025 is the emergence of agentic systems — where the harness doesn’t just wrap a single model call, but orchestrates sequences of model calls, tool invocations, memory operations, and sub-agent delegation. In these systems, the harness becomes the architecture.
Multi-Agent Harnesses
An orchestrator model receives a high-level task and decomposes it into subtasks, each delegated to specialised sub-agents — each with their own system prompt, their own tool access, and potentially their own weight checkpoint. The orchestrator synthesises their outputs. No single set of weights does everything; the harness coordinates the ensemble.
Memory Harnesses
Since weights don’t update at inference time, long-term memory is entirely a harness concern. Emerging memory harnesses include vector stores for semantic memory (Pinecone, Weaviate, pgvector), episodic memory via summarised conversation histories injected into future contexts, procedural memory storing learned task strategies as text, and external state managed through tool-accessible databases. Each of these is harness engineering, not model engineering.
Provider-Agnostic Routing Harnesses
As organisations deploy multiple models from multiple providers — Anthropic, OpenAI, Google, Mistral, open-source — a routing harness becomes necessary. This layer abstracts the model interface, routes requests based on cost, latency, capability, and availability, handles fallbacks, and maintains a unified audit log. Projects like LiteLLM and OpenRouter are mature implementations; many sophisticated deployments maintain in-house AI gateways for tighter control.
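A sketch of the routing decision itself; the request fields, client objects, and model names below are invented, and gateways like LiteLLM implement the same pattern with real provider adapters:
# Provider-agnostic routing: pick a model by policy, fall back on failure
def route(request, clients, audit_log):
    if request.needs_tools or request.difficulty == "high":
        order = ["frontier-model", "mid-tier-model"]     # quality first, fall back on cost
    else:
        order = ["mid-tier-model", "small-local-model"]  # cheap first for easy traffic
    for name in order:
        try:
            response = clients[name].complete(request.prompt)
            audit_log.record(model=name, request=request, response=response)
            return response
        except Exception:                                # timeout or outage: fall through
            continue
    raise RuntimeError("all routes exhausted")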
Invest at least as much engineering effort in your harness as in model selection. A good harness can swap models underneath without changing product behaviour. A bad harness makes every model look bad, and turns every model upgrade into a migration project.
Conclusion — The Inseparability of Weights and Harness
LLMs are genuinely remarkable artefacts. The transformer architecture, trained at scale on the sum of human text, produces weights that encode something that resembles understanding — pattern recognition so sophisticated it crosses into apparent reasoning. That is not nothing. It may be the most significant engineering achievement of the last decade.
But weights alone are inert. They are crystallised potential — the distillation of trillion-token exposure into a static numerical object. The harness is what brings that potential to life: the training harness that shapes what the weights learn, the evaluation harness that tells us what the weights know, the inference harness that serves those weights at scale, the prompt harness that guides what the weights say, and the governance harness that ensures those words are trustworthy and compliant.
To understand LLMs deeply — to build on them wisely — is to understand both halves of this equation. The weights are the brain. The harness is everything else that makes it a mind worth trusting.