The release of Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning marks a significant milestone in the evolution of small language models (SLMs). With this series, Microsoft challenges a long-standing assumption in AI: that high-quality reasoning capabilities require massive-scale models. Instead, these models demonstrate how targeted fine-tuning, meticulous data curation, and inference-time optimization can elevate compact models to a level of performance previously considered exclusive to multi-hundred-billion-parameter systems.
At the core of this advancement is Phi-4-reasoning, a 14-billion parameter open-weight model built on top of the original Phi-4 base. Unlike general-purpose LLMs, which often spread their capacity across a wide array of tasks, Phi-4-reasoning is optimized specifically for structured multi-step reasoning. This includes tasks like mathematical problem solving, scientific inference, and symbolic manipulation.
Taking this further, Phi-4-reasoning-plus extends the supervised fine-tuning (SFT) approach with an additional reinforcement learning stage. The model is trained to make better use of inference-time compute, generating roughly 1.5× more tokens than Phi-4-reasoning. This extra token budget allows it to hold longer chains of logic, maintain coherence over larger contexts, and deliver higher accuracy on complex tasks.
The standout achievement of these models lies in their ability to outperform or match models many times their size on demanding benchmarks: both Phi-4-reasoning and Phi-4-reasoning-plus exceed much larger state-of-the-art open models on reasoning-heavy tasks.
This underscores a critical point in AI research: model scale is not synonymous with reasoning quality. With the right architectural choices and post-training strategies, smaller models can surpass even state-of-the-art giants in targeted domains.
Phi-4-reasoning and Phi-4-reasoning-plus in context of state-of-the-art open models
This showcases a core strength of Phi-4-reasoning: it’s not just about answering correctly; it’s about understanding how to answer correctly.
Although these models are optimized for reasoning, Microsoft has ensured they retain strong general-purpose language capability. This dual strength makes the Phi-4-reasoning family not just a proof of concept for efficient reasoning, but also a viable foundation for general AI agents tackling complex tasks.
General-purpose performance across QA, coding, instruction following, safety, and factual knowledge
This generalization is critical: reasoning-specialized models must also act responsibly and adaptably in broader language environments. Phi-4-reasoning achieves that.
As demand rises for small yet intelligent models, Microsoft introduces Phi-4-mini-reasoning—a 3.8B parameter transformer fine-tuned specifically for mathematical and symbolic reasoning. While its footprint is modest, its capabilities are anything but. Phi-4-mini-reasoning is crafted to thrive in resource-constrained environments—mobile apps, embedded systems, and edge devices—without compromising on logic depth or answer quality.
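To give a sense of how small the footprint can get on memory-constrained (but GPU-equipped) machines, here is a minimal sketch of loading the model with 4-bit quantization through Hugging Face Transformers. It assumes the bitsandbytes library is installed, and the quantization settings are common defaults rather than anything Phi-specific:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative 4-bit loading for memory-constrained GPUs; requires bitsandbytes.
model_id = "microsoft/Phi-4-mini-reasoning"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)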
Phi-4-mini-reasoning is fine-tuned using synthetic data generated by the DeepSeek-R1 model, which allowed Microsoft to maintain a high signal-to-noise ratio in the training dataset while covering a wide variety of math problems—spanning middle school algebra to Ph.D.-level combinatorics.
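Microsoft's exact curation pipeline isn't detailed here, but the idea of keeping the signal-to-noise ratio high can be illustrated with a toy answer-verification filter over synthetic traces; the helper names and the "Answer:" format below are hypothetical:

# Toy illustration of answer-verified filtering for synthetic reasoning traces.
# This is NOT Microsoft's actual pipeline; the format and helpers are hypothetical.
def extract_final_answer(trace: str) -> str:
    # Assume each trace ends with a line like "Answer: 2"
    for line in reversed(trace.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return ""

def keep_trace(trace: str, reference: str) -> bool:
    # Keep only traces whose final answer matches the known reference answer
    return extract_final_answer(trace) == reference.strip()

synthetic = [
    ("Solve 2x + 3 = 7.\nx = (7 - 3) / 2\nAnswer: 2", "2"),
    ("Solve 2x + 3 = 7.\nx = 7 / 2 - 3\nAnswer: 0.5", "2"),
]
filtered = [trace for trace, ref in synthetic if keep_trace(trace, ref)]
print(len(filtered))  # 1 -- only the verified trace is kept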
Unlike general-purpose small models, which often overfit to shallow instruction-following patterns, Phi-4-mini-reasoning is optimized for multi-step mathematical and symbolic problem solving. This makes it ideally suited for on-device math tutoring and other educational applications, as well as embedded and edge deployments where larger reasoning models are impractical.
Phi-4-mini-reasoning vs. baseline and large models on AIME 24, MATH-500, and GPQA Diamond
This figure compares the performance of Phi-4-mini-reasoning (3.8B) against its base model and substantially larger open reasoning models in the 7B–8B range.
The takeaway is clear:
Targeted fine-tuning and reasoning-centric design allow Phi-4-mini-reasoning to rival or surpass 7B–8B models, despite being half the size.
This efficiency-to-performance ratio is a strong indicator that size alone doesn’t dictate capability—especially when the task is structured, symbolic, and domain-specific.
The Phi-4 reasoning models, including Phi-4-mini-reasoning, are released as open-weight models and are available via Hugging Face. The pre-trained weights can be downloaded directly, and the code snippet below outlines how to load a model locally.
You can access the models on Hugging Face, for example: https://huggingface.co/microsoft/Phi-4-reasoning
The code below loads the Phi-4-mini-reasoning model, processes a math problem, and generates a solution. You can replace the content of the messages list to query the model with other reasoning tasks.
Installation Requirements
To run the models, first install the required dependencies:
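At a minimum, the snippet below needs PyTorch and Transformers, plus Accelerate for device_map-based loading; exact versions depend on your environment:

pip install torch transformers accelerate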
Here’s a code snippet that shows how to load Phi-4-mini-reasoning locally for your own use case:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fix the random seed so sampled generations are reproducible
torch.random.manual_seed(0)

model_id = "microsoft/Phi-4-mini-reasoning"

# Load the model onto the GPU in its native precision
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A single-turn chat prompt containing the math problem
messages = [{
    "role": "user",
    "content": "How to solve 3*x^2+4*x+5=1?",
}]

# Apply the chat template and tokenize the prompt
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

# Sample a long reasoning trace followed by the solution
outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=32768,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
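The Phi-4 reasoning models generally wrap their chain of thought in <think>...</think> tags before the final solution. Assuming that output format, a simple way to separate the reasoning trace from the answer:

# Split the decoded text into the reasoning trace and the final answer.
# Assumes the <think>...</think> convention used by the Phi-4 reasoning models.
text = outputs[0]
if "</think>" in text:
    reasoning, answer = text.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
else:
    reasoning, answer = "", text
print(answer.strip())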
Phi-4-reasoning models are versatile, but choosing the right one depends on the task at hand. Phi-4-mini-reasoning is the natural fit for mobile, embedded, and edge deployments where memory and latency are tight; Phi-4-reasoning covers most structured multi-step reasoning workloads at 14B parameters; and Phi-4-reasoning-plus is the choice when maximum accuracy justifies spending extra inference-time compute on longer reasoning traces.
Each model is optimized to balance size, reasoning quality, and performance, allowing developers to select based on their specific needs, whether they require lightweight models for portability or powerful models for extensive reasoning across vast contexts.
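As a rough rule of thumb, the guidance above maps onto the public Hugging Face repositories like this (a sketch, not official Microsoft guidance):

# Rough selection guide; repo ids are the public Hugging Face releases.
MODEL_FOR_USE_CASE = {
    "edge / mobile math and symbolic reasoning": "microsoft/Phi-4-mini-reasoning",
    "general structured multi-step reasoning": "microsoft/Phi-4-reasoning",
    "maximum accuracy with extra inference-time compute": "microsoft/Phi-4-reasoning-plus",
}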
The Phi-4 Reasoning models prove that with smart fine-tuning and architectural choices, small models can deliver big on reasoning. From mobile apps to AI agents, they're setting new standards for efficient intelligence. At GoCodeo, we’re excited by what this means for building compact, high-performance AI into real-world developer workflows. The future of reasoning isn’t just large—it’s optimized.