Introduction and context
OpenAI recently released gpt‑oss‑120b and gpt‑oss‑20b, a pair of “open‑weight” language models designed for strong reasoning and agentic tool use. The models are available under the permissive Apache 2.0 licence, meaning developers can download, fine‑tune and deploy them without copyleft restrictions. They are text‑only and are designed to work with OpenAI’s new harmony chat format, which supports system, developer and user roles. OpenAI emphasises that the gpt‑oss models combine instruction following, reasoning ability, tool use and full chain‑of‑thought (CoT) visibility. Because the models are open weights, they can be run on users’ own infrastructure (a single H100 GPU or consumer‑grade hardware, depending on model size) or via hosted services.

Unlike proprietary models that run behind an API, open‑weight models can be fine‑tuned or modified by anyone. OpenAI notes in the model card that this creates a different risk profile; once released, attackers could fine‑tune the weights to bypass safety refusals, and OpenAI cannot revoke access. The company therefore publishes extensive safety analyses and urges developers to implement additional safeguards.
Model architecture and training
Mixture‑of‑Experts design
Both gpt‑oss models are Mixture‑of‑Experts (MoE) transformers, building on the GPT‑2/GPT‑3 architecture. Each layer routes tokens to a small number of experts to reduce computation. gpt‑oss‑120b has 36 layers with 116.8 billion parameters in total and roughly 5.13 billion active parameters per forward pass; gpt‑oss‑20b has 24 layers with 20.9 billion parameters and 3.61 billion active parameters. The MoE blocks use a top‑4 router (i.e., the router selects the four most relevant experts for each token) with 128 experts in the 120b model and 32 experts in the 20b model. Experts employ a gated SwiGLU activation, and the model uses root‑mean‑square layer normalisation.
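To make the routing concrete, here is a minimal PyTorch sketch of a top‑4 MoE block in the spirit described above; the layer sizes are illustrative placeholders rather than the actual gpt‑oss dimensions, and real implementations use fused kernels instead of a Python loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts block (illustrative sizes only)."""

    def __init__(self, d_model=512, d_hidden=1024, n_experts=32, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                        # (n_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)          # normalise over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # each token runs through just k experts
            for e in top_idx[:, slot].unique():
                mask = top_idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```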
The attention mechanism alternates banded‑window and fully dense patterns with a bandwidth of 128 tokens, uses grouped‑query attention (GQA) with 64 query heads and 8 key‑value heads, and employs rotary positional embeddings. Through the YaRN technique, the dense layers can attend over sequences of up to 131,072 tokens, supporting extremely long CoTs.
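The banded pattern amounts to a causal attention mask that additionally forbids looking further back than the window width. A small sketch (the 128‑token bandwidth is the figure reported above; alternating layers use a fully dense causal mask instead):

```python
import torch

def banded_causal_mask(seq_len: int, window: int = 128) -> torch.Tensor:
    """Boolean mask: True where query position i may attend to key position j.

    Causal (j <= i) and banded (i - j < window); in the model, every other
    layer replaces this with a fully dense causal mask.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    return (j <= i) & (i - j < window)

print(banded_causal_mask(6, window=3).int())
```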
Quantisation and memory efficiency
Most of the model parameters reside in the MoE weights. OpenAI post‑trains these weights into a 4.25‑bit MXFP4 format, reducing memory usage by over 90 % and allowing gpt‑oss‑120b to fit on a single 80 GB GPU and gpt‑oss‑20b to run on systems with as little as 16 GB of memory. The quantised checkpoints are 60.8 GiB (120b) and 12.8 GiB (20b).
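A back‑of‑the‑envelope check of those figures (a sketch that ignores the small fraction of non‑MoE weights kept at higher precision, which accounts for the gap to the reported checkpoint sizes):

```python
GIB = 2 ** 30

# Approximate size if (almost) all parameters were stored at 4.25 bits each.
for name, n_params in [("gpt-oss-120b", 116.8e9), ("gpt-oss-20b", 20.9e9)]:
    size_bytes = n_params * 4.25 / 8  # bits per parameter -> bytes
    print(f"{name}: ~{size_bytes / GIB:.1f} GiB at 4.25 bits/param")

# gpt-oss-120b: ~57.8 GiB, close to the reported 60.8 GiB checkpoint;
# attention and embedding weights at higher precision explain the remainder.
```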
Tokeniser and pre‑training data
The models use OpenAI’s o200k_harmony tokeniser, a 201,088‑token byte‑pair encoding that extends the o200k tokeniser with special tokens for the harmony chat format. Pre‑training used a trillion‑token, text‑only corpus emphasising STEM, programming and general knowledge. The data were filtered to remove harmful content (e.g., hazardous biosecurity information) using filters from GPT‑4o. Training was performed on NVIDIA H100 GPUs using PyTorch and expert‑optimised Triton kernels; gpt‑oss‑120b required 2.1 million H100‑hours, while gpt‑oss‑20b needed about one‑tenth of that. Flash Attention was employed to reduce memory requirements and speed up training. The models have a knowledge cutoff of June 2024.
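To get a feel for the tokeniser, it can be loaded through the Hugging Face Hub; the repo id below is assumed to be the one used at release, and exact counts may vary:

```python
from transformers import AutoTokenizer

# Assumed Hugging Face repo id; adjust if the published name differs.
tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

ids = tok.encode("Mixture-of-experts models route each token to a few experts.")
print(len(ids), ids[:8])  # token count and the first few ids
print(tok.decode(ids))    # round-trips to the original string
print(len(tok))           # on the order of the 201,088-token vocabulary
```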
Post‑training for reasoning and tool use
After pre‑training, the models were post‑trained to follow instructions, provide step‑by‑step reasoning and interact with tools. Key techniques include:
- Harmony format – a structured conversation format with system, developer and user roles. The system message can specify the desired reasoning level (low, medium or high) and the available function tools. Both models should be used via this format; calling the model’s raw generate method without the harmony template may produce incorrect outputs.
- Variable reasoning effort – at inference time, users can request low, medium or high reasoning. Lower levels yield faster, shorter responses, while high reasoning uses longer CoTs for complex tasks. The Hugging Face model card notes that the reasoning level can be set in the system prompt, for example by including “Reasoning: high,” and describes the trade‑offs between speed and detail (see the sketch following this list).
- Agentic tool use – the models were trained to call functions, browse the web, execute Python code and produce structured outputs. The GitHub README highlights that gpt‑oss provides built‑in capabilities for function calling, web browsing and Python execution.
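As a sketch of how the harmony template and a reasoning level come together in practice with Transformers (the repo id and the “Reasoning: high” system‑prompt convention follow the Hugging Face model card; treat both as assumptions to verify against the current docs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    # The model card describes setting the reasoning level in the system prompt.
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "How many primes are there below 100?"},
]

# apply_chat_template renders the harmony format; calling generate on raw,
# untemplated text is exactly what the README warns against.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```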
Performance and capabilities
Core reasoning and knowledge tasks
OpenAI evaluates the gpt‑oss models on mathematics (AIME), science (GPQA), expert‑level question answering (Humanity’s Last Exam, HLE) and general knowledge (MMLU). The model card notes that gpt‑oss‑120b reaches near‑parity with OpenAI’s o4‑mini on these tasks, especially in mathematics, where the models can exploit long CoTs. The smaller gpt‑oss‑20b performs well on math but lags on knowledge‑heavy tasks because of its smaller parameter count. Figures in the model card show that both gpt‑oss models achieve over 95 % on AIME with tool use, while gpt‑oss‑120b scores around the high 80s on GPQA and performs comparably to o4‑mini on HLE. Although the exact numbers appear only in charts, the key takeaway is that gpt‑oss‑120b approaches frontier closed models at a fraction of the compute cost, while gpt‑oss‑20b achieves performance similar to OpenAI’s o3‑mini.
Coding and tool‑use tasks
The models excel at coding and agentic tasks. According to the model card, they perform strongly on Codeforces, SWE‑Bench Verified (a software‑engineering benchmark) and τ‑Bench (a benchmark measuring tool‑agent‑user interaction). The authors note that gpt‑oss‑120b performs close to o4‑mini and that gpt‑oss‑20b benefits from long CoTs during agentic tasks. This suggests that open‑weight models can tackle complex programming problems and decide effectively when to call tools.
Health and multilingual performance
OpenAI evaluates the models on HealthBench (realistic health conversations) and HealthBench Hard (challenging health scenarios). At high reasoning, gpt‑oss‑120b nearly matches the larger OpenAI o3 model and outperforms GPT‑4o, o1, o3‑mini and o4‑mini. The authors describe this as a large Pareto improvement: strong healthcare reasoning at a much lower inference cost. They argue that open models could be especially valuable in global health settings where privacy and cost are critical.
Multilingual ability was assessed using MMMLU, a multi‑language version of MMLU covering 14 languages. The model card reports that gpt‑oss‑120b at high reasoning comes close to o4‑mini‑high and that gpt‑oss‑20b performs reasonably well despite its smaller size. The results suggest the models can handle knowledge questions across major languages when given sufficient reasoning effort.
Safety performance, hallucinations and bias
OpenAI emphasises that releasing open weights poses unique safety challenges. The model card states that gpt‑oss‑120b, in its default configuration, does not reach OpenAI’s “High capability” threshold in the biological/chemical, cyber or AI self‑improvement categories of the Preparedness Framework. Adversarial fine‑tuning by OpenAI researchers still did not elevate gpt‑oss‑120b to high capability in these domains. Moreover, the authors found that existing open models already match the fine‑tuned gpt‑oss‑120b on most biological evaluations.
Jailbreak resilience: The models were tested against StrongReject jailbreak prompts and an instruction‑hierarchy evaluation. gpt‑oss‑120b and gpt‑oss‑20b perform similarly to o4‑mini on general jailbreaks but underperform on the instruction‑hierarchy tests (for example, resisting prompt injection or protecting secret phrases). The model card notes that developers can fine‑tune the models to improve robustness but should not assume system‑level safety equal to OpenAI’s proprietary models.
Hallucinations: When tested without browsing, the gpt‑oss models show lower accuracy and higher hallucination rates than o4‑mini on SimpleQA and PersonQA. On SimpleQA, gpt‑oss‑120b achieves 16.8 % accuracy with a 78.2 % hallucination rate, while gpt‑oss‑20b achieves 6.7 % accuracy with a 91.4 % hallucination rate. PersonQA results are similar; gpt‑oss‑20b has a particularly high hallucination rate. The authors note that this is expected for smaller models and emphasise that browsing reduces hallucinations.
Fairness and bias: On the BBQ bias benchmark, the gpt‑oss models perform at parity with o4‑mini. For example, gpt‑oss‑120b scores 0.87 on ambiguous questions and 0.90 on disambiguated questions, comparable to o4‑mini’s 0.82 and 0.89. The models thus do not introduce notable additional bias relative to similarly sized closed models.
Chain‑of‑thought visibility: The models expose their chain‑of‑thought reasoning. OpenAI deliberately avoids training the models to hide or sanitise their CoT, viewing monitorability as important for safety research. However, the authors warn that CoTs may contain hallucinated or policy‑violating content and should not be shown directly to end users.
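In practice, a deployment therefore extracts only the final answer before display. The sketch below assumes the channel markers documented in OpenAI’s harmony repository (an analysis channel for CoT and a final channel for the user‑facing reply); the exact special tokens are an assumption to check against the spec:

```python
import re

def extract_final_channel(raw: str) -> str:
    """Return only the final-channel text, dropping the CoT.

    Assumes harmony-style markers such as <|channel|>final<|message|>...;
    verify the exact tokens against the harmony format specification.
    """
    match = re.search(
        r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)",
        raw,
        flags=re.DOTALL,
    )
    return match.group(1).strip() if match else raw

raw_output = (
    "<|channel|>analysis<|message|>Let me check each candidate...<|end|>"
    "<|start|>assistant<|channel|>final<|message|>There are 25 primes below 100.<|end|>"
)
print(extract_final_channel(raw_output))  # "There are 25 primes below 100."
```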
Usage and licensing
Both gpt‑oss models are released under the Apache 2.0 licence, enabling free commercial use. Users must comply with applicable law but can download the weights from OpenAI’s GitHub or the Hugging Face Hub. The gpt‑oss repository includes reference implementations in PyTorch, Triton and Apple’s Metal, as well as sample client applications.
The Hugging Face model cards outline how to run the models with Transformers, vLLM, Ollama and LM Studio, and emphasise the need to use the harmony chat template. Developers can adjust the reasoning level (low, medium or high) via the system prompt. The models support function calling, web browsing and Python code execution; these agentic capabilities make them well suited to building autonomous assistants, as the tool‑calling sketch below illustrates.
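Here is a hedged sketch of declaring a function tool through the Transformers chat template; passing Python callables via tools= is standard Transformers machinery, but whether a given gpt‑oss release consumes the schema exactly this way should be checked against the model card:

```python
from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return f"Sunny in {city}"  # stub; a real deployment would call a weather API

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")  # assumed repo id

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]

# Transformers renders the tool's JSON schema into the chat template; the
# model can then emit a structured call that the application executes and
# feeds back as a tool message.
prompt = tok.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```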
OpenAI strongly advises developers to implement additional safety filters when deploying open models. The gpt‑oss usage policy (a short note in the repository) reminds users to use the models safely and responsibly and to comply with all applicable laws.
Significance and outlook
The release of gpt‑oss‑120b and gpt‑oss‑20b marks a notable shift in the open‑source language‑model landscape. Previous open models such as LLaMA‑3‑70B provided strong general‑purpose capabilities but lacked chain‑of‑thought visibility and integrated tool use. gpt‑oss models provide:
- Strong reasoning on math, coding and health tasks at near‑frontier performance with a dramatically smaller computational footprint.
- Adjustable reasoning levels and full CoT exposure, enabling developers to trade off latency against accuracy and to monitor model reasoning.
- Native support for function calling, web browsing and Python execution, which can reduce the engineering required to build agentic systems.
- Open weights under the Apache 2.0 licence, allowing commercial reuse and fine‑tuning.
At the same time, the model card reveals limitations: the models hallucinate more than larger closed models, are less robust to prompt‑injection attacks, and can expose sensitive CoT content if not filtered. Developers must therefore implement safety measures and may need to fine‑tune the models for their specific domain or language.
Overall, gpt‑oss‑120b and gpt‑oss‑20b push the frontier of open‑weight reasoning models by combining high reasoning performance, agentic tool use and open licensing. They demonstrate that with careful training, quantisation and tool integration, open models can approach the capabilities of proprietary systems, enabling researchers and developers worldwide to experiment with advanced AI without relying on a closed API.