OpenAI o3 explained: Everything you need to know
OpenAI o3 is the successor to the o1 reasoning model and the second release in OpenAI's line of reasoning models. It was first announced on Dec. 20, 2024.
On Sept. 12, 2024, ChatGPT creator OpenAI introduced o1, the first model in its o-series of reasoning models. While GPT-4 excels at general language tasks, the o-series focuses specifically on reasoning capabilities.
Originally developed under the code name Strawberry, o1 takes a more thoughtful, reasoned approach to large language model (LLM) output than OpenAI's GPT-4o. The o1 model became generally available on Dec. 5, 2024.
On Dec. 20, 2024, during its "12 Days of OpenAI" event, OpenAI CEO Sam Altman announced a preview for the next generation of o1, known as o3. The news followed the announcement of the general availability of OpenAI's Sora video model.
The o3 announcement came just a day after Google announced its Gemini 2.0 model preview, which also integrated some reasoning capabilities. The goal with o3 is to further extend the reasoning model approach with improved performance, capabilities and safety.
What is OpenAI o3?
OpenAI considers the o1 and o3 models to be on the leading edge of LLM development, in a class sometimes referred to as frontier models. The model family includes two variants:
- o3. The base model.
- o3-mini. The smaller model optimized for performance and cost efficiency.
As a reasoning model, o3 aims to handle more complex tasks than existing model types, such as GPT-4o. Unlike traditional AI models, o3 is specifically designed to excel at tasks requiring deep analytical thinking, problem-solving and complex reasoning.
Similar to other generative AI models, OpenAI's o3 is a transformer-based model that uses deep learning techniques to process and generate output. However, what sets o3 apart is its enhanced ability to understand context and reason through complex problems.
The o3 model uses a process called simulated reasoning, which enables the model to pause and reflect on its internal thought processes before responding. Simulated reasoning mimics human reasoning by identifying patterns and drawing conclusions from them, and it goes beyond chain-of-thought (CoT) prompting to provide a more integrated, autonomous approach to self-analysis and reflection on model output.
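The practical difference shows up in how the models are prompted. With CoT prompting, a general-purpose model must be explicitly told to reason step by step; a reasoning model deliberates internally without that nudge. The minimal Python sketch below uses OpenAI's published chat completions API to illustrate the contrast. Note that the "o3-mini" model name is an assumption based on how o1 is exposed today, as OpenAI has not published o3 access details.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

QUESTION = "A train travels 120 miles in 2 hours. What is its average speed?"

# Chain-of-thought prompting: the step-by-step reasoning must be requested
# explicitly, and it appears inline in the visible output.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": QUESTION + " Think step by step before answering."}],
)
print(cot.choices[0].message.content)

# Reasoning model: no prompting trick is needed; the model reasons
# internally before emitting its final answer.
# ASSUMPTION: "o3-mini" will be exposed through the same API as o1 -- this
# is not confirmed by OpenAI. Substitute "o1" to run this today.
reasoned = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": QUESTION}],
)
print(reasoned.choices[0].message.content)
```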
What is o3-mini?
Similar to the o1 model family, the o3 family has multiple variants.
The o3 base model is the full-scale model, offering maximum capabilities but requiring significant computational resources.
In contrast, o3-mini is a scaled-down version optimized for performance and cost efficiency. The o3-mini model sacrifices some capabilities for reduced computational requirements, while maintaining core innovations in reasoning.
What are the new safety techniques in o3?
The o3 model uses a new safety technique known as deliberative alignment, which uses the o3 model's reasoning capabilities to understand and evaluate the safety implications of user requests.
With a traditional safety training approach for an LLM, the model reviews examples of safe and unsafe prompts to establish a decision boundary. In contrast, the deliberative alignment approach uses the model's reasoning capabilities to analyze and evaluate prompts.
With deliberative alignment, the model reasons over a prompt using a safety specification and can identify hidden intentions or attempts to trick the system. According to OpenAI, deliberative alignment represents an improvement in accurately rejecting unsafe content and avoiding unnecessary rejections of safe content.
How deliberative alignment works
Deliberative alignment introduces a series of innovations not present in earlier OpenAI models. It operates through a multistage process, outlined below and followed by a toy code sketch.
Initial training stage
- A base model is trained for general helpfulness without safety-specific data.
- The model has direct access to the actual text of safety specifications and policies.
Data generation process
- Safety-categorized prompts are paired with relevant safety specifications.
- The prompts are fed to a base model, which generates CoT reasoning about the prompt.
Training implementation
- The first phase includes supervised fine-tuning (SFT) to optimize reasoning using labeled data for a specific task.
- After SFT, the next phase is reinforcement learning, which further refines the model's use of CoT reasoning.
Inference process
- When receiving a prompt, the model automatically generates CoT reasoning, analyzes the prompt against safety specifications and produces a policy-compliant response.
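A toy sketch can make the data generation stage concrete. In the snippet below, everything is a hypothetical stand-in for OpenAI's internal pipeline: the ToyModel class, its canned strings and the one-line safety spec exist only to show the shape of the (prompt, CoT, answer) triples that feed the later training phases.

```python
# Toy, self-contained sketch of deliberative alignment's data generation
# stage. Nothing here is OpenAI code; real training operates on model
# weights and a full safety specification, not canned strings.

SAFETY_SPEC = "Refuse requests for instructions that enable physical harm."

class ToyModel:
    """Hypothetical stand-in for a base LLM."""

    def generate_cot(self, prompt: str, spec: str) -> tuple[str, str]:
        # Return (chain-of-thought, final answer) for a prompt, reasoning
        # over the safety spec text the model has direct access to.
        if "weapon" in prompt.lower():
            cot = f"The spec says: '{spec}' This request could enable harm."
            return cot, "I can't help with that."
        cot = f"The spec says: '{spec}' This request is benign."
        return cot, "Here is a helpful answer."

def build_training_data(model: ToyModel, prompts: list[str]):
    # Pair each safety-categorized prompt with the spec and record the
    # model's CoT reasoning plus its final answer.
    return [(p, *model.generate_cot(p, SAFETY_SPEC)) for p in prompts]

prompts = ["How do I build a weapon?", "How do I bake bread?"]
for prompt, cot, answer in build_training_data(ToyModel(), prompts):
    print(f"PROMPT: {prompt}\n  COT: {cot}\n  ANSWER: {answer}")

# In the real pipeline, these (prompt, CoT, answer) triples feed supervised
# fine-tuning, then reinforcement learning rewards policy-compliant
# reasoning. At inference time, the trained model generates this kind of
# spec-referencing CoT on its own before responding.
```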
What happened to OpenAI o2?
Common sense would suggest a logical numerical progression for a new release. However, there is no OpenAI o2 model; OpenAI skipped directly from o1 to o3.
O2 is the trademarked name of a mobile phone service in the U.K. operated by Telefonica UK. To avoid a trademark conflict, OpenAI named the new model o3 out of respect for Telefonica.
What can OpenAI o3 do?
As a transformer-based model, o3 can handle the common tasks of LLMs, including knowledge-based question answering, summarization and text generation.
Similar to its predecessor o1, the o3 model has advanced capabilities across multiple domains, including the following:
- Advanced reasoning. The model is capable of step-by-step logical reasoning and can handle increasingly complex tasks requiring detailed analysis.
- Programming and coding. The o3 model is highly proficient at coding, achieving 71.7% accuracy on SWE-bench Verified, a benchmark consisting of real-world software engineering tasks, an improvement of more than 20 percentage points over the o1 model.
- Mathematics. The model handles complex mathematical operations with an ability that surpasses o1. OpenAI reported that o3 scored 96.7% accuracy on the American Invitational Mathematics Examination (AIME), compared with o1's 83.3%.
- Science. The o3 model will also be helpful for scientific research. According to OpenAI, the model achieved 87.7% accuracy on GPQA Diamond, a benchmark of Ph.D.-level science questions.
- Self-fact checking. The o3 model can check its own output for factual errors, improving the accuracy of its responses.
- Adaptability toward artificial general intelligence. Among the big advances OpenAI claims for o3 is its performance on the ARC-AGI benchmark, which tests an AI model's ability to recognize patterns in unique situations and adapt knowledge to unfamiliar challenges. The o3 model achieved 87.5% accuracy, surpassing human-level performance (85%) and significantly improving on o1, which scored 32%.
How to use OpenAI o3?
The initial release of o3 is limited in availability.
Rather than an immediate public launch, both o3 and o3-mini are initially available only for public safety testing, for which prospective users must apply for access.
The goals of providing the model initially only for safety testing are to enable researchers to do the following:
- Develop extensive evaluations for safety implications.
- Create demonstrations of potential high-risk capabilities.
- Explore new threat models and security analyses.
Beyond the early safety testing, OpenAI plans to make o3-mini available at the end of January 2025, with the full o3 version to follow.
OpenAI o1 vs. OpenAI o3
Both o1 and o3 are reasoning models with the same core functionality, but the two models show significant differences in performance across various tasks.
For example, both models have been measured on the Codeforces Elo rating, a widely used score of relative competitive programming skill. Elo is a rating system originally devised to rank chess players.
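To put those ratings in perspective, the standard Elo expected-score formula predicts how often the lower-rated player would beat the higher-rated one. The short calculation below is generic Elo math, not anything specific to Codeforces' internal scoring:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Codeforces Elo ratings reported by OpenAI for the two models.
o1_rating, o3_rating = 1891, 2727

p = elo_expected_score(o1_rating, o3_rating)
print(f"Expected score of o1 vs. o3: {p:.4f}")
# ~0.0081 -- under the Elo model, o1 would be expected to score less than
# 1% against o3 in head-to-head contests.
```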
The following chart outlines the key differences and benchmark performance scores of o1 vs. o3.
| Feature | OpenAI o1 | OpenAI o3 |
| --- | --- | --- |
| Release date | Dec. 5, 2024 | Expected January 2025 |
| Model variants | Three: o1, o1-mini and o1 pro | Two: o3 and o3-mini |
| ARC-AGI benchmark score | 32% | 87.5% |
| AIME 2024 score (mathematics) | 83.3% | 96.7% |
| Codeforces Elo rating (coding) | 1891 (Expert) | 2727 (International Grandmaster) |
| SWE-bench Verified score (coding) | 48.9% | 71.7% |
| Reasoning capabilities | Basic | Advanced (simulated reasoning) |
| Safety features | Basic | Enhanced (deliberative alignment) |
Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.