OpenAI o3 explained: Everything you need to know
OpenAI o3 is the successor to the o1 reasoning model and the second release in OpenAI's line of reasoning models. It was first announced on Dec. 20, 2024.
On Sept. 12, 2024, ChatGPT creator OpenAI introduced o1, the first model in its o-series of reasoning models. While GPT-4 excels at general language tasks, the o-series focuses specifically on reasoning capabilities.
Originally developed under the code name Strawberry, o1 takes a more thoughtful, reasoned approach to large language model (LLM) output than OpenAI's GPT-4o. The o1 model became generally available on Dec. 5, 2024.
On Dec. 20, 2024, during its "12 Days of OpenAI" event, OpenAI CEO Sam Altman announced a preview for the next generation of o1, known as o3. The news followed the announcement of the general availability of OpenAI's Sora video model.
The o3 announcement came just a day after Google announced its Gemini 2.0 model preview, which also integrated some reasoning capabilities. The goal with o3 is to further extend the reasoning model approach with improved performance, capabilities and safety.
What is OpenAI o3?
OpenAI considers the o1 and o3 models to be on the leading edge of LLM development, in a class sometimes referred to as frontier models. The model family includes two variants:
- o3. The base model.
- o3-mini. The smaller model optimized for performance and cost efficiency.
As a reasoning model, o3 aims to handle more complex tasks than existing model types, such as GPT-4o. Unlike traditional AI models, o3 is specifically designed to excel at tasks requiring deep analytical thinking, problem-solving and complex reasoning.
Similar to other generative AI models, OpenAI's o3 is a transformer-based model that uses deep learning techniques to process and generate output. However, what sets o3 apart is its enhanced ability to understand context and reason through complex problems.
The o3 model uses a process called simulated reasoning, which enables the model to pause and reflect on its internal thought processes before responding. Simulated reasoning mimics human reasoning by identifying patterns and drawing conclusions from them, and it goes beyond chain-of-thought (CoT) prompting to provide a more integrated, autonomous approach to self-analysis and reflection on model output.
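The practical difference shows up in how the models are prompted. With CoT prompting, a general-purpose model must be explicitly told to reason step by step; a reasoning model deliberates internally without that nudge. The minimal Python sketch below uses OpenAI's published chat completions API to illustrate the contrast. Note that the "o3-mini" model name is an assumption based on how o1 is exposed today, as OpenAI has not published o3 access details.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

QUESTION = "A train travels 120 miles in 2 hours. What is its average speed?"

# Chain-of-thought prompting: the step-by-step reasoning must be requested
# explicitly, and it appears inline in the visible output.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": QUESTION + " Think step by step before answering."}],
)
print(cot.choices[0].message.content)

# Reasoning model: no prompting trick is needed; the model reasons
# internally before emitting its final answer.
# ASSUMPTION: "o3-mini" will be exposed through the same API as o1 -- this
# is not confirmed by OpenAI. Substitute "o1" to run this today.
reasoned = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": QUESTION}],
)
print(reasoned.choices[0].message.content)
```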
What is o3-mini?
Similar to the o1 model family, the o3 family has multiple variants.
The o3 base model is the full-scale model, offering maximum capabilities but requiring significant computational resources.
In contrast, o3-mini is a scaled-down version optimized for performance and cost efficiency. The o3-mini model sacrifices some capabilities for reduced computational requirements, while maintaining core innovations in reasoning.
What are the new safety techniques in o3?
The o3 model uses a new safety technique known as deliberative alignment, which uses the o3 model's reasoning capabilities to understand and evaluate the safety implications of user requests.
With a traditional safety training approach for an LLM, the model reviews examples of safe and unsafe prompts to establish a decision boundary. In contrast, the deliberative alignment approach uses the model's reasoning capabilities to analyze and evaluate prompts.
With deliberative alignment, the model reasons over a prompt using a safety specification and can identify hidden intentions or attempts to trick the system. According to OpenAI, deliberative alignment represents an improvement in accurately rejecting unsafe content and avoiding unnecessary rejections of safe content.
How deliberative alignment works
Deliberative alignment introduces a series of innovations not present in earlier OpenAI models. It operates through a multistage process, outlined below and followed by a toy code sketch.
Initial training stage
- A base model is trained for general helpfulness without safety-specific data.
- The model has direct access to the actual text of safety specifications and policies.
Data generation process
- Safety-categorized prompts are paired with relevant safety specifications.
- The prompts are fed to a base model, which generates CoT reasoning about the prompt.
Training implementation
- The first phase includes supervised fine-tuning (SFT) to optimize reasoning using labeled data for a specific task.
- After SFT, the next phase is reinforcement learning, which further refines the model's use of CoT reasoning.
Inference process
- When receiving a prompt, the model automatically generates CoT reasoning, analyzes the prompt against safety specifications and produces a policy-compliant response.
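A toy sketch can make the data generation stage concrete. In the snippet below, everything is a hypothetical stand-in for OpenAI's internal pipeline: the ToyModel class, its canned strings and the one-line safety spec exist only to show the shape of the (prompt, CoT, answer) triples that feed the later training phases.

```python
# Toy, self-contained sketch of deliberative alignment's data generation
# stage. Nothing here is OpenAI code; real training operates on model
# weights and a full safety specification, not canned strings.

SAFETY_SPEC = "Refuse requests for instructions that enable physical harm."

class ToyModel:
    """Hypothetical stand-in for a base LLM."""

    def generate_cot(self, prompt: str, spec: str) -> tuple[str, str]:
        # Return (chain-of-thought, final answer) for a prompt, reasoning
        # over the safety spec text the model has direct access to.
        if "weapon" in prompt.lower():
            cot = f"The spec says: '{spec}' This request could enable harm."
            return cot, "I can't help with that."
        cot = f"The spec says: '{spec}' This request is benign."
        return cot, "Here is a helpful answer."

def build_training_data(model: ToyModel, prompts: list[str]):
    # Pair each safety-categorized prompt with the spec and record the
    # model's CoT reasoning plus its final answer.
    return [(p, *model.generate_cot(p, SAFETY_SPEC)) for p in prompts]

prompts = ["How do I build a weapon?", "How do I bake bread?"]
for prompt, cot, answer in build_training_data(ToyModel(), prompts):
    print(f"PROMPT: {prompt}\n  COT: {cot}\n  ANSWER: {answer}")

# In the real pipeline, these (prompt, CoT, answer) triples feed supervised
# fine-tuning, then reinforcement learning rewards policy-compliant
# reasoning. At inference time, the trained model generates this kind of
# spec-referencing CoT on its own before responding.
```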
What happened to OpenAI o2?
Common sense would suggest a logical numerical progression for a new release. However, there is no OpenAI o2 model; OpenAI skipped directly from o1 to o3.
O2 is the trademarked name of a mobile phone service in the U.K. operated by Telefonica UK. To avoid a trademark conflict, OpenAI named the new model o3 out of respect for Telefonica.
What can OpenAI o3 do?
As a transformer-based model, o3 can handle the common tasks of LLMs, including knowledge-based question answering, summarization and text generation.
Similar to its predecessor o1, the o3 model has advanced capabilities across multiple domains, including the following:
- Advanced reasoning. The model is capable of step-by-step logical reasoning and can handle increasingly complex tasks requiring detailed analysis.
- Programming and coding. The o3 model is highly proficient at coding, achieving 71.7% accuracy on SWE-bench Verified, a benchmark consisting of real-world software engineering tasks, an improvement of more than 20 percentage points over the o1 model.
- Mathematics. The model handles complex mathematical operations with an ability that surpasses o1. OpenAI reported that o3 scored 96.7% accuracy on the American Invitational Mathematics Examination (AIME), compared with o1's 83.3%.
- Science. The o3 model will also be helpful for scientific research. According to OpenAI, the model achieved 87.7% accuracy on GPQA Diamond, a benchmark of Ph.D.-level science questions.
- Self-fact checking. The o3 model can check its own output for factual errors, improving the accuracy of its responses.
- Adaptability toward artificial general intelligence. Among the big advances OpenAI claims for o3 is its performance on the ARC-AGI benchmark, which tests an AI model's ability to recognize patterns in unique situations and adapt knowledge to unfamiliar challenges. The o3 model achieved 87.5% accuracy, surpassing human-level performance (85%) and significantly improving on o1, which scored 32%.
How to use OpenAI o3?
The initial release of o3 is limited in availability.
Rather than an immediate public launch, both o3 and o3-mini are initially available only for public safety testing, for which prospective users must apply for access.
The goals of providing the model initially only for safety testing are to enable researchers to do the following:
- Develop extensive evaluations for safety implications.
- Create demonstrations of potential high-risk capabilities.
- Explore new threat models and security analyses.
Beyond the early safety testing, OpenAI plans to make o3-mini available at the end of January 2025, with the full o3 version to follow.
OpenAI o1 vs. OpenAI o3
Both o1 and o3 are reasoning models with the same core functionality, but the two models show significant differences in performance across various tasks.
For example, both models have been measured on the Codeforces Elo rating, a widely used score of relative competitive programming skill. Elo is a rating system originally devised to rank chess players.
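To put those ratings in perspective, the standard Elo expected-score formula predicts how often the lower-rated player would beat the higher-rated one. The short calculation below is generic Elo math, not anything specific to Codeforces' internal scoring:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Codeforces Elo ratings reported by OpenAI for the two models.
o1_rating, o3_rating = 1891, 2727

p = elo_expected_score(o1_rating, o3_rating)
print(f"Expected score of o1 vs. o3: {p:.4f}")
# ~0.0081 -- under the Elo model, o1 would be expected to score less than
# 1% against o3 in head-to-head contests.
```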
The following chart outlines the key differences and benchmark performance scores of o1 vs. o3.
| Feature | OpenAI o1 | OpenAI o3 |
| --- | --- | --- |
| Release date | Dec. 5, 2024 | Expected January 2025 |
| Model variants | Three: o1, o1-mini and o1 pro | Two: o3 and o3-mini |
| ARC-AGI benchmark score | 32% | 87.5% |
| AIME 2024 score (mathematics) | 83.3% | 96.7% |
| Codeforces Elo rating (coding) | 1891 (Expert) | 2727 (International Grandmaster) |
| SWE-bench Verified score (coding) | 48.9% | 71.7% |
| Reasoning capabilities | Basic | Advanced (simulated reasoning) |
| Safety features | Basic | Enhanced (deliberative alignment) |
Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.