Microsoft Phi-4 Install Guide: All Platforms 2026
Quick Summary: Microsoft Phi-4 is the most efficient open-source language model of 2026: its 14B parameters outperform many 30B+ models thanks to Microsoft's "data quality over data quantity" training philosophy. Released under the MIT license with commercial use allowed, Phi-4 is the ideal choice for developers who want high-quality AI inference on a laptop or an 8GB GPU.
What Is Microsoft Phi-4?
Phi-4 is Microsoft Research's fourth-generation small language model, part of the Phi series that started with Phi-1 in 2023. What distinguishes the Phi series is Microsoft's unconventional training approach: rather than simply scaling up data volume, Microsoft focused intensively on data quality and synthetic data generation. The training dataset for Phi-4 was carefully curated to include only high-quality educational and reasoning-focused content, then augmented with AI-generated "textbook-quality" synthetic examples.
The result is extraordinary: Phi-4 with just 14 billion parameters achieves scores on MATH and GPQA Diamond benchmarks that exceed models two to three times larger. On the MMLU-Pro benchmark, Phi-4 scores higher than Llama 3.3-70B on several categories despite being five times smaller, which translates to five times lower memory requirements and significantly faster inference speed.
Phi-4 is released under the MIT license, which is the most permissive open-source license available. This means you can use Phi-4 in commercial products, modify it freely, and distribute derivatives without any attribution requirement beyond preserving the license text. For startups and enterprises building AI-powered applications, this is a significant advantage over more restrictive licenses like Llama 4's community license.
Why Phi-4 Beats Bigger Models
The conventional wisdom in AI is that bigger models are better. Phi-4 challenges this assumption decisively:
| Model | Params | MATH Score | GPQA Diamond | Min VRAM |
|---|---|---|---|---|
| Phi-4 | 14B | 80.4% | 56.1% | 8GB |
| Llama 3.3-70B | 70B | 77.0% | 50.5% | 40GB |
| Gemma 3-27B | 27B | 75.4% | 46.1% | 16GB |
| Mistral-22B | 22B | 68.5% | 43.2% | 12GB |
Phi-4's 14B model beats Llama 3.3-70B (5x larger) on both MATH and GPQA Diamond while requiring just 8GB VRAM vs 40GB. This is the power of training data quality over raw scale. For users with 8-16GB GPUs, Phi-4 delivers the best reasoning quality available in 2026.
Install Ollama (Windows, Mac, Linux)
macOS
Phi-4 runs exceptionally well on Apple Silicon. An M2 MacBook Pro (24GB) can run Phi-4 at 30+ tokens/second, some of the fastest local LLM inference available on a laptop in 2026.
brew install ollama
ollama pull phi4
ollama run phi4
Windows
Download OllamaSetup.exe from ollama.com. On RTX 3060 (12GB VRAM), Phi-4 runs at 20-30 t/s with CUDA acceleration, making it ideal for real-time coding assistance.
# After install:
ollama pull phi4
ollama run phi4
Linux
One-command install. Phi-4 with AMD ROCm on Linux performs comparably to NVIDIA CUDA, making it a great choice for AMD GPU owners who want MIT-licensed local AI.
curl -fsSL https://ollama.com/install.sh | sh
Pull and Run Phi-4 with Ollama
GPU Acceleration Notes: Windows users with NVIDIA GPUs get automatic CUDA support. AMD GPU users on Linux should install ROCm before Ollama for GPU acceleration. Apple Silicon Mac users get automatic Metal GPU support: an M2 MacBook Pro with 16GB handles Phi-4 at 25-40 tokens/second, making it one of the best laptops for running Phi-4 locally in 2026.
# Pull and run Phi-4 (one size: 14B is all you need):
ollama run phi4
# Download without starting immediately:
ollama pull phi4
# Ask a math problem directly:
ollama run phi4 "Solve: integrate x^2 * sin(x) dx step by step"
Phi-4 is a single-size model (14B), which simplifies the choice dramatically compared to multi-size families like Qwen3.5 or Gemma 3. There's no need to decide between variants: just pull phi4 and you get the model already tuned for the best balance of quality, speed, and hardware requirements. The model file is about 8GB in Q4 quantization format.
Phi-4 Excels at Code
Phi-4's training data included a large proportion of code examples and mathematical reasoning content. In practical tests, Phi-4 outperforms all other sub-20B models on code generation, debugging, and algorithm explanation. For developers using Cursor, VS Code with Continue, or any local coding assistant, Phi-4 is the top recommendation for the 8GB VRAM tier.
Beyond code generation, Phi-4 excels at explaining complex algorithms, reviewing code for bugs and security issues, suggesting performance optimizations, and generating comprehensive unit tests. Its deep understanding of programming logic (a direct result of Microsoft's training data curation methodology) makes it a genuinely useful pair programming partner even compared to much larger models.
Web Interface with Open WebUI
For a browser-based interface with conversation history, custom system prompts, and a user-friendly chat experience, Open WebUI is the most popular option in 2026. It connects directly to your local Ollama instance and provides a polished interface that feels like ChatGPT, but everything runs locally:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser, create a local account, and select phi4 from the model dropdown. Open WebUI saves your conversation history locally, lets you create custom system prompt templates for different tasks (coding assistant, document reviewer, math tutor), and supports multi-turn conversations with context preservation.
Use Phi-4 as a Coding Assistant in VS Code
Install the Continue extension for VS Code (free, open-source). In Continue's config, set the model provider to "Ollama" and select phi4. Continue will use your locally running Phi-4 for code autocomplete, docstring generation, and inline chat. Zero token costs, complete privacy, and surprising quality for a 14B model.
# Continue config.json (models section):
{
  "models": [
    {"title": "Phi-4 Local", "provider": "ollama", "model": "phi4"}
  ]
}
Run Phi-4 on Android
At 14B parameters, Phi-4 requires higher-end Android hardware. The 8GB Q4 file size and inference requirements push it beyond what mid-range Android devices can handle, but flagship devices released since 2024 are capable. Here's what works:
Flagship Android (12GB+ RAM)
Samsung Galaxy S24 Ultra (12GB RAM), Xiaomi 14 Ultra (16GB RAM), or Asus ROG Phone 8 Pro (16GB RAM) can run Q4 quantized Phi-4 using PocketPal AI or AnythingLLM. Expect 3-6 tokens per second: slow but functional for thoughtful queries. Ideal for private, offline AI access on the go.
Remote Desktop Connection (Recommended)
The most practical approach: run Phi-4 on your desktop via Ollama, then connect from your Android phone over local Wi-Fi. Use Enchanted, AnythingLLM, or any OpenAI-compatible client app. Desktop inference speed (20-30 t/s) makes this feel near-instant on your phone screen.
Run Phi-4 on iPhone / iPad
iPad Pro M4 (16GB RAM) is the sweet spot for running Phi-4 on iOS. Apple's M4 chip includes a 16-core Neural Engine rated at 38 TOPS, and the MLX framework runs transformer inference efficiently on Apple silicon's unified memory. Phi-4 runs at approximately 15-20 tokens per second on iPad Pro M4, comfortably fast for interactive conversations, document analysis, and coding assistance on the go:
LM Studio iOS
Search "phi-4" in LM Studio's model browser to find Microsoft's official GGUF-quantized versions. The Q4_K_M quantization is recommended: it's about 8GB and runs well on iPad Pro M4. LM Studio handles downloading, GPU allocation, and inference automatically with MLX optimization.
Enchanted (Mac Bridge)
If running on-device isn't fast enough, install Enchanted on your iPhone and bridge to a Mac running Ollama with phi4. MacBook Pro M3 handles Phi-4 at 35+ t/s, providing near-instant responses on your iPhone over local Wi-Fi.
API Usage and Integration
Ollama exposes Phi-4 through a REST API that's fully compatible with the OpenAI API specification. This means any application, library, or tool built for OpenAI's API will work with your local Phi-4 with just a base URL change β no code modifications needed beyond pointing to http://localhost:11434/v1 instead of https://api.openai.com/v1.
Phi-4 through Ollama's API works identically to the OpenAI format:
# Python example:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
result = client.chat.completions.create(
    model="phi4",
    messages=[{"role": "user", "content": "Review this Python code for bugs: ..."}],
)
print(result.choices[0].message.content)
Phi-4 for Production Applications
Because Phi-4 uses the MIT license, it's fully cleared for commercial production deployments. Many startups in 2026 are using Phi-4 as their primary AI backbone for document processing, customer support bots, and code review tools, running entirely on-premise to avoid per-token cloud AI costs.
Advanced Configuration Options
Once Phi-4 is running, you can fine-tune its behavior through Ollama's configuration parameters. These settings are particularly useful for specialized use cases:
# Lower temperature = more deterministic output (better for code/math).
# Parameters are set inside an interactive session with /set:
ollama run phi4
>>> /set parameter temperature 0.1
# Extend the context window for long documents:
>>> /set parameter num_ctx 16384
# Keep model loaded in memory between requests:
OLLAMA_KEEP_ALIVE=1h ollama serve
# Run as background service (Linux systemd):
sudo systemctl enable ollama && sudo systemctl start ollama
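The same parameters can also be set per request through the options field of Ollama's native /api/chat endpoint, which is useful when different tasks need different settings. A minimal sketch with only the standard library (assumes a local `ollama serve` is running when `ask` is called):

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # Ollama's native chat endpoint


def build_request(prompt: str, temperature: float = 0.1, num_ctx: int = 16384) -> dict:
    """Build an /api/chat payload with per-request inference options."""
    return {
        "model": "phi4",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one complete response instead of chunks
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }


def ask(prompt: str) -> str:
    """Send the request to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

This way a math query can run at temperature 0.1 while a brainstorming query in the same process runs at 0.8, without restarting the model.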
Troubleshooting
Problem: Download very slow from HuggingFace
Fix: ollama pull phi4 downloads from Ollama's own registry and resumes interrupted downloads, so simply retrying usually completes the ~8GB file. If you are instead fetching raw GGUF weights from HuggingFace Hub (which can be slow in some regions), use the huggingface-cli download command with HF_HUB_ENABLE_HF_TRANSFER=1 enabled, or pull from a regional mirror.
Problem: Phi-4 gives short or incomplete answers
Fix: Increase the response length limit. Phi-4's default response length via Ollama may be conservative. Set a higher num_predict inside an interactive session (/set parameter num_predict 4096) or via the API's options field. For complex reasoning tasks, also provide explicit instructions: "Think step by step and provide a detailed explanation."
Problem: 8GB GPU runs out of memory
Fix: Phi-4 at Q4 quantization needs about 8.5GB VRAM. If you have exactly 8GB, it may fail to load. Solution: use Q3_K_M quantization (ollama run phi4:q3_k_m if available, or use llama.cpp with manual GGUF download and specify Q3_K_M format). Alternatively, let Ollama run it in split GPU+CPU mode by reducing GPU layers.
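The split GPU+CPU mode mentioned above can be pinned down explicitly with Ollama's num_gpu parameter, which controls how many transformer layers are offloaded to the GPU. A sketch of a custom Modelfile (the layer count of 28 is an illustrative starting point, not a verified value for Phi-4; lower it until the model loads without OOM errors):

```
# Modelfile: partial GPU offload for 8GB cards
FROM phi4
PARAMETER num_gpu 28

# Build and run the reduced-offload variant:
#   ollama create phi4-8gb -f Modelfile
#   ollama run phi4-8gb
```

Each layer moved to CPU costs some speed, so the goal is the highest num_gpu value that still fits in VRAM alongside the context cache.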
Phi-4 in Production: Real-World Use Cases
Phi-4's MIT license and outstanding reasoning capability make it the top choice for production deployment in 2026. Here are the most successful use cases developers are building with Phi-4 locally:
AI-Powered Code Review Bot
Many engineering teams in 2026 integrate Phi-4 directly into their CI/CD pipeline. When a developer opens a pull request, an automated bot (powered by the local Phi-4 API) reviews the code diff, identifies potential bugs, suggests improvements, and checks for security vulnerabilities. Because it's local, sensitive proprietary code never reaches external AI providers, satisfying security and compliance requirements.
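A minimal sketch of that wiring, using only the standard library: diff the current branch, send the diff to the local Phi-4 endpoint, and return the review text. The prompt wording, base branch name, and temperature are illustrative choices, not a specific product's implementation:

```python
import json
import subprocess
import urllib.request

REVIEW_PROMPT = (
    "You are a code reviewer. Identify bugs, security issues, and "
    "improvements in this diff. Be specific.\n\n{diff}"
)


def build_review_request(diff: str) -> dict:
    """Payload for Ollama's OpenAI-compatible chat completions endpoint."""
    return {
        "model": "phi4",
        "messages": [{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
        "temperature": 0.2,  # low temperature keeps reviews focused and repeatable
    }


def review_pull_request(base: str = "main") -> str:
    """Diff against `base` and ask the local Phi-4 to review (needs ollama serve)."""
    diff = subprocess.run(
        ["git", "diff", base], capture_output=True, text=True
    ).stdout
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(build_review_request(diff)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

In CI this function would run on each pull request, with its output posted back as a review comment through your forge's API.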
Math Tutoring Application
Phi-4's exceptional mathematical reasoning makes it ideal for educational applications. Startups are building offline math tutors that guide students through problems step by step: explaining concepts, identifying where a student made a logical error, and generating similar practice problems. The MIT license allows these educational products to be distributed commercially without licensing fees.
Legal Document Analysis
Law firms increasingly use Phi-4 locally to analyze contracts, extract obligations and deadlines, flag unusual clauses, and generate document summaries. The on-premise nature ensures client confidentiality. Phi-4's 16K context handles most standard legal documents in a single pass, and its precise reasoning correctly interprets complex conditional language that simpler models misunderstand.
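For documents that do exceed the context window, a simple chunk-summarize-merge pass works well. A sketch of the chunking half (the 24,000-character default is a rough heuristic, roughly 6,000 tokens at ~4 characters per token, leaving ample room in a 16K window for the prompt and response):

```python
def chunk_text(text: str, max_chars: int = 24000, overlap: int = 500) -> list[str]:
    """Split a long document into overlapping chunks that fit Phi-4's context.

    The overlap keeps clauses that straddle a chunk boundary visible in both
    chunks, so obligations aren't lost at the seams.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

Each chunk is summarized independently, then the per-chunk summaries are concatenated and passed to Phi-4 once more for a final merged summary.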
For developers building commercial products with Phi-4, the MIT license is a significant competitive advantage. Unlike models under Llama's community license (which restricts use for platforms over 700M MAU) or MRL licenses (which restrict commercial deployment without a separate agreement), Phi-4 under MIT can be embedded in any product, redistributed, and even sold as part of an application without any licensing hurdles.
Another powerful use case for Phi-4 is as a local AI evaluation engine. Because Phi-4 excels at precise reasoning, many teams use it as a "judge model" to automatically evaluate the output quality of other AI systems β checking factual accuracy, logical consistency, and instruction adherence. This meta-AI role is perfectly suited to Phi-4's strengths and can run continuously on local hardware without cloud costs, enabling automated regression testing for AI-powered products.
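A sketch of a judge-model loop: ask Phi-4 to grade an answer against a rubric and end with a machine-parseable score line. The "SCORE: n" convention and 1-10 scale are our own illustrative choices, not a standard protocol:

```python
import json
import re
import urllib.request

JUDGE_PROMPT = (
    "Rate the ANSWER for factual accuracy and instruction adherence on a "
    '1-10 scale. End your reply with a line "SCORE: <n>".\n\n'
    "QUESTION: {question}\nANSWER: {answer}"
)


def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge model's reply."""
    match = re.search(r"SCORE:\s*(\d+)", judge_reply)
    if not match:
        raise ValueError("judge reply contained no SCORE line")
    return int(match.group(1))


def judge(question: str, answer: str) -> int:
    """Ask local Phi-4 to grade another model's answer (needs ollama serve)."""
    payload = {
        "model": "phi4",
        "messages": [{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        "temperature": 0.0,  # grading should be as deterministic as possible
    }
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_score(json.load(resp)["choices"][0]["message"]["content"])
```

Running this over a fixed test set after every prompt or model change gives a cheap regression signal for AI-powered products.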
For teams working with sensitive data (healthcare records, financial information, or proprietary business documents) Phi-4's local deployment provides genuine privacy guarantees that cloud AI cannot match. When you run Phi-4 on your own hardware through Ollama, no data is transmitted to Microsoft's servers, no prompts are logged externally, and no conversation history leaves your infrastructure. This is especially important for industries with strict data handling regulations, where even sending anonymized data to a third-party AI provider may require extensive compliance reviews. Local Phi-4 deployment eliminates these concerns entirely while delivering frontier-class AI capability.
Microsoft Research has indicated that the Phi series will continue expanding in capability with future releases. The Phi approach (training smaller, more efficient models on carefully curated, high-quality synthetic and real data) has proven remarkably successful and is now influencing how many other AI research labs worldwide approach small model training methodology and data curation strategies. Staying updated with Microsoft's releases via the HuggingFace model hub (where all Phi models are published) is worthwhile for developers who need the most capable small model available at any given time.
Phi-4 Performance Benchmarks by Platform
Phi-4's compact 14B size means impressive speeds across all hardware tiers. Here's what to expect:
| Hardware | Speed (t/s) | Memory Used | Acceleration | Rating |
|---|---|---|---|---|
| MacBook Pro M4 Pro 24GB | 35-50 | ~9GB RAM | Metal GPU | ★★★★★ |
| MacBook Air M3 16GB | 25-40 | ~9GB RAM | Metal GPU | ★★★★★ |
| RTX 4090 24GB (Windows) | 55-80 | ~9GB VRAM | CUDA | ★★★★★ |
| RTX 3060 12GB | 25-40 | ~9GB VRAM | CUDA | ★★★★ |
| RX 7900 XTX 24GB (Linux) | 40-60 | ~9GB VRAM | ROCm | ★★★★★ |
| iPad Pro M4 16GB | 18-25 | ~9GB RAM | MLX (Neural Engine) | ★★★★ |
| CPU only (AMD 7950X 64GB) | 5-10 | ~9GB RAM | CPU (AVX-512) | ★★★ |
Best Value Setup: For developers looking for maximum Phi-4 performance per dollar in 2026, the RTX 3060 12GB ($280 used market) paired with 32GB DDR5 RAM provides 25-40 tokens/second with Phi-4, fast enough for real-time coding assistance and document analysis. This setup runs Phi-4 entirely in GPU VRAM with no CPU offloading required.
Phi-4's 14B parameter count fits perfectly in the "sweet spot" for 2026 GPU VRAM: 8-12GB GPUs can run it at full GPU speed without any CPU offloading. The result is consistently fast inference across the entire mid-range GPU tier, making Phi-4 accessible to a much wider audience than larger flagship models.
Phi-4 Quick Reference: Commands and Configuration
Complete command reference for using Phi-4 with Ollama across all platforms:
# -- Install Ollama -------------------------------------
brew install ollama # macOS
curl -fsSL https://ollama.com/install.sh | sh # Linux
# -- Download and Run Phi-4 -----------------------------
ollama pull phi4 # download Phi-4 (8GB)
ollama run phi4 # start interactive chat
# -- Example Prompts for Phi-4 --------------------------
ollama run phi4 "Find the bug in this Python code: def fact(n): return n*fact(n-1)"
ollama run phi4 "Prove that there are infinitely many prime numbers"
ollama run phi4 "Write a binary search tree implementation in Rust"
# -- Custom Modelfile for Code Assistant ----------------
FROM phi4
PARAMETER temperature 0.2
SYSTEM "You are an expert software engineer. Analyze code thoroughly, identify all bugs, and suggest improvements with explanations."
# ollama create phi4-coder -f Modelfile
# -- API ------------------------------------------------
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"phi4","messages":[{"role":"user","content":"Review this code"}]}'
Frequently Asked Questions
Q: Is Phi-4 really better than Llama 3.3-70B?
On mathematical reasoning and science benchmarks: yes, definitively. Phi-4 at 14B parameters scores higher than Llama 3.3-70B on MATH, GPQA Diamond, and several coding benchmarks. On general knowledge and creative tasks, Llama 3.3-70B has an advantage due to its larger model capacity and broader training data. For STEM-focused applications where precision and reasoning are paramount, Phi-4 is the better choice despite being 5x smaller.
Q: Can I use Phi-4 in a commercial product?
Absolutely. Phi-4's MIT license is the most permissive available: you can use it in commercial products, SaaS applications, embedded devices, and enterprise software without any additional licensing fees or restrictions. You can also modify and redistribute it. The only requirement is preserving the MIT license notice. This makes Phi-4 uniquely attractive for startups and enterprises building AI-powered products.
Q: What hardware do I need for Phi-4?
Phi-4 at Q4 quantization requires approximately 8.5GB VRAM for GPU inference, or 16GB system RAM for CPU-only operation. Ideal minimum hardware: NVIDIA RTX 3060 (12GB) or better for smooth GPU acceleration. On Apple Silicon, an M2/M3 chip with 16GB unified memory handles Phi-4 smoothly at 25+ tokens per second. For CPU-only on an Intel/AMD system, a modern CPU with 32GB RAM gives acceptable performance around 5-8 tokens per second.
Q: Is Phi-4 good for creative writing?
Phi-4 is capable at creative tasks but this isn't its primary strength. Its training data was heavily weighted toward mathematical and scientific content, so while it can write well, larger models with broader training data (like Llama 4 Scout or Mistral Large 2) produce more varied, nuanced creative writing. For fiction, poetry, and creative content generation, those models may serve better. Phi-4 excels at tasks where precision and logical structure matter more than creative flair.
Q: How often is Phi-4 updated?
Microsoft releases Phi model updates on an irregular basis, typically every 6-12 months for major versions. Between major releases, they sometimes publish updated checkpoints (e.g., Phi-4-mini variants or fine-tuned versions for specific tasks). The best way to track updates is to follow Microsoft's HuggingFace profile at huggingface.co/microsoft or subscribe to the model's page notifications. Ollama typically adds new Phi variants within a few days of their HuggingFace publication.
Next Steps
Why Phi-4 Is the #1 Small Model for 2026
MIT License: Fully commercial, no restrictions on use, modification, or distribution
8GB VRAM fit: Runs entirely in GPU VRAM on mid-range hardware without CPU offloading
Beats 70B models: Outperforms Llama 3.3-70B on MATH and GPQA despite being 5x smaller
Code excellence: Best sub-20B model for coding, debugging, and algorithm design in 2026
Fast inference: 35-80 t/s depending on hardware, suitable for real-time applications
All platforms: Windows CUDA, macOS Metal, Linux ROCm, iOS MLX all supported
Coding Integration
Learn to integrate Phi-4 as a VS Code coding assistant with Continue
Related Articles
Gemma 3 Local Install: Windows, Mac & Linux 2026
Install Google Gemma 3 on all platforms. Runs on just 4GB VRAM, 1B to 27B model sizes, vision support included.
Mistral Large 2 Local Install: All Platforms 2026
Install Mistral Large 2 (123B) locally. Europe's top open model for code & multilingual tasks. Complete Ollama guide 2026.