Phi-4 Reasoning: Models, Architecture, Benchmarks & Usage

May 2, 2025

The release of Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning marks a significant milestone in the evolution of small language models (SLMs). With this series, Microsoft challenges a long-standing assumption in AI: that high-quality reasoning capabilities require massive-scale models. Instead, these models demonstrate how targeted fine-tuning, meticulous data curation, and inference-time optimization can elevate compact models to a level of performance previously considered exclusive to multi-hundred-billion-parameter systems.

Dissecting the Phi-4-Reasoning Series

At the core of this advancement is Phi-4-reasoning, a 14-billion parameter open-weight model built on top of the original Phi-4 base. Unlike general-purpose LLMs, which often spread their capacity across a wide array of tasks, Phi-4-reasoning is optimized specifically for structured multi-step reasoning. This includes tasks like mathematical problem solving, scientific inference, and symbolic manipulation.

Key highlights:

  • Supervised Fine-Tuning (SFT): Phi-4-reasoning is trained on a curated dataset of high-quality reasoning traces. Notably, this includes synthetic data derived from OpenAI’s o3-mini, crafted to expose the model to clean, logically coherent step-by-step problem-solving patterns (a sketch of this trace format follows the list below).

  • Reasoning Chain Generation: Rather than producing shallow, shortcut answers, Phi-4-reasoning is trained to emulate the intermediate reasoning steps that lead to a final conclusion—mimicking human cognitive strategies.
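To make the SFT data concrete, here is a hypothetical sketch of what one training example might look like. The released models emit their chain of thought inside <think> … </think> tags before the final answer; the field names below are illustrative, not Microsoft's actual schema:

# Hypothetical SFT example; field names are illustrative, not Microsoft's schema.
# Phi-4-reasoning models wrap their reasoning in <think> ... </think> tags,
# followed by a concise final answer.
sft_example = {
    "messages": [
        {"role": "user", "content": "What is the sum of the first 100 positive integers?"},
        {
            "role": "assistant",
            "content": (
                "<think>Pair the terms: 1+100, 2+99, ..., 50+51 gives 50 pairs, "
                "each summing to 101, so the total is 50 * 101 = 5050.</think>\n"
                "The sum is 5050."
            ),
        },
    ]
}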

Taking this further, Phi-4-reasoning-plus extends the SFT approach with a phase of outcome-based reinforcement learning. The model is trained to make better use of inference-time compute, generating on average about 1.5× more tokens than Phi-4-reasoning. This additional budget lets it sustain longer chains of logic, maintain coherence over larger contexts, and deliver higher accuracy on complex tasks.

Compact Models, Competitive Results

The standout achievement of these models lies in their ability to outperform or match models many times their size on demanding benchmarks. Both Phi-4-reasoning and Phi-4-reasoning-plus exceed:

  • OpenAI o1-mini, a compact frontier reasoning model

  • DeepSeek-R1-Distill-Llama-70B, a 70B dense variant distilled from DeepSeek's flagship MoE model

  • And remarkably, on AIME 2025 (the American Invitational Mathematics Examination, a qualifier for the USA Mathematical Olympiad), they outperform DeepSeek-R1, a 671B-parameter MoE model

This raises a critical point in AI research: model scale is not synonymous with reasoning quality. With the right architectural choices and post-training strategies, smaller models can surpass even state-of-the-art giants in targeted domains.

Reasoning Performance Across Benchmarks

Figure: Phi-4-reasoning and Phi-4-reasoning-plus in the context of state-of-the-art open models.

Interpretation:
This figure presents comparative results across diverse reasoning benchmarks—mathematical and scientific in nature. It positions the Phi-4 variants against:
  • DeepSeek-R1 (671B MoE)

  • DeepSeek-R1-Distill-Llama-70B (dense distilled variant)

  • OpenAI’s o1-mini and o3-mini

Key insights:
  • Phi-4-reasoning-plus, despite being over 47× smaller than DeepSeek-R1, delivers competitive or better performance on tasks involving algebra, symbolic manipulation, logic puzzles, and science QA.

  • The margin of improvement between Phi-4 and Phi-4-reasoning reflects the power of reasoning-centric SFT.

  • Adding reinforcement learning on top of SFT (Phi-4-reasoning-plus) compounds these gains, especially on multi-hop reasoning tasks where stepwise logic is essential.

This showcases a core strength of Phi-4-reasoning: it’s not just about answering correctly; it’s about understanding how to answer correctly.
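Because the models expose their chain of thought in <think> … </think> tags, an application can separate the reasoning trace from the final answer. A minimal sketch in Python, assuming the output follows that tag convention:

import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a Phi-4-reasoning response into (reasoning, answer).

    Assumes the chain of thought is wrapped in <think> ... </think>.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()          # no tagged reasoning found
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()  # everything after the closing tag
    return reasoning, answer

# Example on a synthetic output string
reasoning, answer = split_reasoning("<think>2 + 2 = 4.</think>The answer is 4.")
print(answer)  # -> The answer is 4.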

Broad Capabilities Beyond Reasoning

Although these models are optimized for reasoning, Microsoft has ensured they maintain general-purpose language capability. This dual strength makes the Phi-4-reasoning family not just a proof-of-concept for efficient reasoning, but also viable as general AI agents for complex tasks.

Accuracy on General Benchmarks

Figure: General-purpose performance across QA, coding, instruction following, safety, and factual knowledge.

Benchmarks included:
  • FlenQA – Long-context question answering

  • IFEval – Instruction-following

  • HumanEvalPlus – Code generation and correctness

  • MMLU-Pro – Conceptual and factual understanding across subjects

  • ToxiGen – Toxic content detection and safety alignment

  • ArenaHard / PhiBench – General reasoning and multitask understanding

Interpretation:
Phi-4-reasoning models demonstrate high robustness across domains, rivaling models like DeepSeek-R1-Distill-70B and approaching DeepSeek-R1 itself. Their performance on:
  • HumanEvalPlus shows clear competence in code reasoning

  • IFEval and FlenQA reflect strong instruction-following and long-context comprehension

  • ToxiGen indicates well-calibrated safety mechanisms—often a weak spot for small models

This generalization is critical: reasoning-specialized models must also act responsibly and adaptably in broader language environments. Phi-4-reasoning achieves that.

Phi-4-Mini-Reasoning: Compact Model, Maximum Reasoning

As demand rises for small yet intelligent models, Microsoft introduces Phi-4-mini-reasoning—a 3.8B parameter transformer fine-tuned specifically for mathematical and symbolic reasoning. While its footprint is modest, its capabilities are anything but. Phi-4-mini-reasoning is crafted to thrive in resource-constrained environments—mobile apps, embedded systems, and edge devices—without compromising on logic depth or answer quality.
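A quick back-of-envelope check shows why a 3.8B-parameter model fits such environments (illustrative arithmetic for weights only; the KV cache and activations add more on top):

# Approximate weight memory for a 3.8B-parameter model at common precisions.
# Weights only; KV cache and activations add overhead, so treat as lower bounds.
params = 3.8e9
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.1f} GiB")
# fp16/bf16: ~7.1 GiB, int8: ~3.5 GiB, int4: ~1.8 GiB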

Built for Reasoning Under Constraints

Phi-4-mini-reasoning is fine-tuned using synthetic data generated by the DeepSeek-R1 model, which allowed Microsoft to maintain a high signal-to-noise ratio in the training dataset while covering a wide variety of math problems—spanning middle school algebra to Ph.D.-level combinatorics.

Unlike general-purpose small models, which often overfit to shallow instruction-following patterns, Phi-4-mini-reasoning is optimized for:

  • Step-by-step problem-solving

  • Long-context answer generation

  • Symbolic manipulation and multi-hop reasoning

This makes it ideally suited for:

  • Educational tools and math tutors

  • Low-latency reasoning assistants on mobile

  • Edge deployment in STEM-specific AI use cases

Outperforming Larger Models on Math Benchmarks

Figure: Phi-4-mini-reasoning vs. baseline and larger models on AIME 24, MATH-500, and GPQA Diamond.

This figure illustrates the performance of Phi-4-mini-reasoning (3.8B) against:

  • Its base model (Phi-4-mini)

  • Multiple distilled and instruction-tuned models, including:


    • OpenThinker-7B

    • DeepSeek-R1-Distill-Qwen-7B

    • DeepSeek-R1-Distill-Llama-8B

    • Bespoke-Stratos-7B

    • OpenAI’s o1-mini

Key Observations:
  • On AIME 24, Phi-4-mini-reasoning achieves 57.5%, a massive leap from its base model’s 10% and comfortably ahead of larger models like the DeepSeek-R1-Distill 7B/8B variants and OpenThinker-7B.

  • On MATH-500, it scores 94.6%, surpassing even o1-mini (90%) and outperforming every other model in the comparison.

  • On GPQA Diamond, which tests graduate-level reasoning and factual recall, it hits 52%, outperforming models up to 2× its size.

The takeaway is clear:
Targeted fine-tuning and reasoning-centric design allow Phi-4-mini-reasoning to rival or surpass 7B–8B models, despite being half the size.

This efficiency-to-performance ratio is a strong indicator that size alone doesn’t dictate capability—especially when the task is structured, symbolic, and domain-specific.

How to Use Phi-4-Reasoning Models

The Phi-4-reasoning models, including Phi-4-mini-reasoning, are open-source and available for use via Hugging Face. These models come with the pre-trained weights, and the code snippet below outlines how to load the models locally.

You can access the model on Hugging Face through the following link:
microsoft/Phi-4-reasoning · Hugging Face

Model Usage:

The snippet in the “Loading the Model Locally” section below loads the Phi-4-mini-reasoning model, processes a math problem, and generates a solution. You can replace the content of the messages list to query the model with other reasoning tasks.

Installation Requirements

To run the models, first install the required dependencies:
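The snippet below needs torch and transformers; accelerate is assumed here as well, since device_map is used when loading the model:

pip install torch transformers accelerate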

Loading the Model Locally

Here’s a code snippet that shows how to load Phi-4-mini-reasoning locally for your own use case:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fix the seed so sampled generations are reproducible
torch.random.manual_seed(0)

model_id = "microsoft/Phi-4-mini-reasoning"

# Load the checkpoint onto the GPU, letting transformers choose the dtype
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A single-turn chat asking the model to solve a quadratic equation
messages = [{
    "role": "user",
    "content": "How to solve 3*x^2+4*x+5=1?",
}]

# Apply the chat template and return model-ready tensors
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

# Use a generous token budget so the reasoning chain is not cut off
outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=32768,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
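If you prefer the higher-level pipeline API, here is a minimal sketch reusing the model and tokenizer loaded above (assuming a recent transformers version whose text-generation pipeline accepts chat-style message lists):

from transformers import pipeline

# Wrap the already-loaded model and tokenizer in a text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [{"role": "user", "content": "How to solve 3*x^2+4*x+5=1?"}]

# The pipeline applies the chat template internally; the returned chat
# includes the assistant's reply as the last message
result = pipe(messages, max_new_tokens=4096, temperature=0.8, top_p=0.95, do_sample=True)
print(result[0]["generated_text"][-1]["content"])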

When to Use Which Model?

Phi-4-reasoning models are versatile, but choosing the right one depends on the specific task at hand. Here's a guide on when to use each model variant:

1. Phi-4-Reasoning
  • Use Case: Ideal for complex, multi-step reasoning tasks, such as advanced mathematical problem solving, scientific inference, and symbolic manipulation.

  • Best For: Tasks requiring deep, structured reasoning over moderately long contexts.

  • Example Tasks: Algebraic proofs, scientific problem-solving, logic puzzles, symbolic manipulation.

2. Phi-4-Reasoning-Plus
  • Use Case: When you need enhanced multi-step reasoning with the ability to handle larger contexts.

  • Best For: Complex reasoning tasks involving long-term logical connections (e.g., multi-hop reasoning).

  • Example Tasks: Multi-hop logical puzzles, complex scientific inquiries, scenarios requiring holding long sequences of reasoning.

3. Phi-4-Mini-Reasoning
  • Use Case: For applications requiring smaller models, such as those running in resource-constrained environments (e.g., mobile apps, edge devices).

  • Best For: Tasks that require compact models but still benefit from reasoning capabilities, such as educational tools, STEM applications, and real-time reasoning on low-latency platforms.

  • Example Tasks: Mobile math tutors, educational apps, and edge deployments handling reasoning-specific tasks like algebra and combinatorics.

Each model is optimized to balance size, reasoning quality, and performance, allowing developers to select based on their specific needs, whether they require lightweight models for portability or powerful models for extensive reasoning across vast contexts.
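As a hedged illustration of that decision logic, here is a hypothetical helper (the model IDs are the public Hugging Face checkpoints; the decision rules are illustrative, not official guidance):

# Hypothetical selection helper; the rules are illustrative, not official guidance.
def pick_phi4_model(on_device: bool, needs_long_chains: bool) -> str:
    if on_device:
        return "microsoft/Phi-4-mini-reasoning"   # 3.8B, edge-friendly
    if needs_long_chains:
        return "microsoft/Phi-4-reasoning-plus"   # RL-tuned, ~1.5x longer traces
    return "microsoft/Phi-4-reasoning"            # 14B SFT model

print(pick_phi4_model(on_device=False, needs_long_chains=True))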

The Phi-4 Reasoning models prove that with smart fine-tuning and architectural choices, small models can deliver big on reasoning. From mobile apps to AI agents, they're setting new standards for efficient intelligence. At GoCodeo, we’re excited by what this means for building compact, high-performance AI into real-world developer workflows. The future of reasoning isn’t just large—it’s optimized.
