Microsoft Phi-4 Install Guide: All Platforms 2026
Quick Summary: Microsoft Phi-4 is the most efficient open-source language model of 2026: its 14B parameters outperform many 30B+ models thanks to Microsoft's "data quality over data quantity" training philosophy. Released under the MIT license with commercial use allowed, Phi-4 is the ideal choice for developers who want high-quality AI inference on a laptop or an 8GB GPU.
What Is Microsoft Phi-4?
Phi-4 is Microsoft Research's fourth-generation small language model, part of the Phi series that started with Phi-1 in 2023. What distinguishes the Phi series is Microsoft's unconventional training approach: rather than simply scaling up data volume, Microsoft focused intensively on data quality and synthetic data generation. The training dataset for Phi-4 was carefully curated to include only high-quality educational and reasoning-focused content, then augmented with AI-generated "textbook-quality" synthetic examples.
The result is extraordinary: Phi-4 with just 14 billion parameters achieves scores on MATH and GPQA Diamond benchmarks that exceed models two to three times larger. On the MMLU-Pro benchmark, Phi-4 scores higher than Llama 3.3-70B on several categories despite being five times smaller, which translates to five times lower memory requirements and significantly faster inference speed.
Phi-4 is released under the MIT license, which is the most permissive open-source license available. This means you can use Phi-4 in commercial products, modify it freely, and distribute derivatives without any attribution requirement beyond preserving the license text. For startups and enterprises building AI-powered applications, this is a significant advantage over more restrictive licenses like Llama 4's community license.
Why Phi-4 Beats Bigger Models
The conventional wisdom in AI is that bigger models are better. Phi-4 challenges this assumption decisively:
| Model | Params | MATH Score | GPQA Diamond | Min VRAM |
|---|---|---|---|---|
| Phi-4 | 14B | 80.4% | 56.1% | 8GB |
| Llama 3.3-70B | 70B | 77.0% | 50.5% | 40GB |
| Gemma 3-27B | 27B | 75.4% | 46.1% | 16GB |
| Mistral-22B | 22B | 68.5% | 43.2% | 12GB |
Phi-4's 14B model beats Llama 3.3-70B (5x larger) on both MATH and GPQA Diamond while requiring just 8GB VRAM vs 40GB. This is the power of training data quality over raw scale. For users with 8-16GB GPUs, Phi-4 delivers the best reasoning quality available in 2026.
Install Ollama (Windows, Mac, Linux)
macOS
Phi-4 runs exceptionally well on Apple Silicon. An M2 MacBook Pro (24GB) can run Phi-4 at 30+ tokens/second, some of the fastest local LLM inference available on a laptop in 2026.
brew install ollama
ollama pull phi4
ollama run phi4
Windows
Download OllamaSetup.exe from ollama.com. On RTX 3060 (12GB VRAM), Phi-4 runs at 20-30 t/s with CUDA acceleration, making it ideal for real-time coding assistance.
# After install:
ollama pull phi4
ollama run phi4
Linux
One-command install. Phi-4 with AMD ROCm on Linux performs comparably to NVIDIA CUDA, making it a great choice for AMD GPU owners who want MIT-licensed local AI.
curl -fsSL https://ollama.com/install.sh | sh
Pull and Run Phi-4 with Ollama
GPU Acceleration Notes: Windows users with NVIDIA GPUs get automatic CUDA support. AMD GPU users on Linux should install ROCm before Ollama for GPU acceleration. Apple Silicon Mac users get automatic Metal GPU support: an M2 MacBook Pro with 16GB handles Phi-4 at 25-40 tokens/second, making it one of the best laptops for running Phi-4 locally in 2026.
# Pull and run Phi-4 (one size: 14B is all you need):
ollama run phi4
# Download without starting immediately:
ollama pull phi4
# Ask a math problem directly:
ollama run phi4 "Solve: integrate x^2 * sin(x) dx step by step"
Phi-4 is a single-size model (14B), which simplifies the choice dramatically compared to multi-size families like Qwen3.5 or Gemma 3. There's no need to decide between variants: just pull phi4 and you get the model already tuned for the best balance of quality, speed, and hardware requirements. The model file is about 8GB in Q4 quantization format.
Phi-4 Excels at Code
Phi-4's training data included a large proportion of code examples and mathematical reasoning content. In practical tests, Phi-4 outperforms all other sub-20B models on code generation, debugging, and algorithm explanation. For developers using Cursor, VS Code with Continue, or any local coding assistant, Phi-4 is the top recommendation for the 8GB VRAM tier.
Beyond code generation, Phi-4 excels at explaining complex algorithms, reviewing code for bugs and security issues, suggesting performance optimizations, and generating comprehensive unit tests. Its deep understanding of programming logic (a direct result of Microsoft's training data curation methodology) makes it a genuinely useful pair programming partner even compared to much larger models.
Web Interface with Open WebUI
For a browser-based interface with conversation history, custom system prompts, and a user-friendly chat experience, Open WebUI is the most popular option in 2026. It connects directly to your local Ollama instance and provides a polished interface that feels like ChatGPT, but everything runs locally:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser, create a local account, and select phi4 from the model dropdown. Open WebUI saves your conversation history locally, lets you create custom system prompt templates for different tasks (coding assistant, document reviewer, math tutor), and supports multi-turn conversations with context preservation.
Use Phi-4 as a Coding Assistant in VS Code
Install the Continue extension for VS Code (free, open-source). In Continue's config, set the model provider to "Ollama" and select phi4. Continue will use your locally running Phi-4 for code autocomplete, docstring generation, and inline chat. Zero token costs, complete privacy, and surprising quality for a 14B model.
# Continue config.json (models section):
{
  "models": [
    {"title": "Phi-4 Local", "provider": "ollama", "model": "phi4"}
  ]
}
Run Phi-4 on Android
At 14B parameters, Phi-4 requires higher-end Android hardware. The 8GB Q4 file size and inference requirements push it beyond what mid-range Android devices can handle, but flagship devices released since 2024 are capable. Here's what works:
Flagship Android (12GB+ RAM)
Samsung Galaxy S24 Ultra (12GB RAM), Xiaomi 14 Ultra (16GB RAM), or Asus ROG Phone 8 Pro (16GB RAM) can run Q4 quantized Phi-4 using PocketPal AI or AnythingLLM. Expect 3-6 tokens per second: slow but functional for thoughtful queries. Ideal for private, offline AI access on the go.
Remote Desktop Connection (Recommended)
The most practical approach: run Phi-4 on your desktop via Ollama, then connect from your Android phone over local Wi-Fi. Use Enchanted, AnythingLLM, or any OpenAI-compatible client app. Desktop inference speed (20-30 t/s) makes this feel near-instant on your phone screen.
Run Phi-4 on iPhone / iPad
iPad Pro M4 (16GB RAM) is the sweet spot for running Phi-4 on iOS. Apple's M4 chip includes a 16-core Neural Engine rated at 38 TOPS, and the MLX framework runs transformer inference efficiently on Apple silicon's unified memory. Phi-4 runs at approximately 15-20 tokens per second on iPad Pro M4, comfortably fast for interactive conversations, document analysis, and coding assistance on the go:
LM Studio iOS
Search "phi-4" in LM Studio's model browser to find Microsoft's official GGUF-quantized versions. The Q4_K_M quantization is recommended: it's about 8GB and runs well on iPad Pro M4. LM Studio handles downloading, GPU allocation, and inference automatically with MLX optimization.
Enchanted (Mac Bridge)
If running on-device isn't fast enough, install Enchanted on your iPhone and bridge to a Mac running Ollama with phi4. MacBook Pro M3 handles Phi-4 at 35+ t/s, providing near-instant responses on your iPhone over local Wi-Fi.
API Usage and Integration
Ollama exposes Phi-4 through a REST API that's fully compatible with the OpenAI API specification. This means any application, library, or tool built for OpenAI's API will work with your local Phi-4 with just a base URL change β no code modifications needed beyond pointing to http://localhost:11434/v1 instead of https://api.openai.com/v1.
Phi-4 through Ollama's API works identically to the OpenAI format:
# Python example:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
result = client.chat.completions.create(
    model="phi4",
    messages=[{"role": "user", "content": "Review this Python code for bugs: ..."}],
)
print(result.choices[0].message.content)
Phi-4 for Production Applications
Because Phi-4 uses the MIT license, it's fully cleared for commercial production deployments. Many startups in 2026 are using Phi-4 as their primary AI backbone for document processing, customer support bots, and code review tools, running entirely on-premise to avoid per-token cloud AI costs.
Advanced Configuration Options
Once Phi-4 is running, you can fine-tune its behavior through Ollama's configuration parameters. These settings are particularly useful for specialized use cases:
# Lower temperature = more deterministic output (better for code/math).
# Parameters are set inside an interactive session with /set:
ollama run phi4
>>> /set parameter temperature 0.1
# Extend the context window for long documents:
>>> /set parameter num_ctx 16384
# Keep model loaded in memory between requests:
OLLAMA_KEEP_ALIVE=1h ollama serve
# Run as background service (Linux systemd):
sudo systemctl enable ollama && sudo systemctl start ollama
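The same parameters can also be set per request through the options field of Ollama's native /api/chat endpoint, which is useful when different tasks need different settings. A minimal sketch with only the standard library (assumes a local `ollama serve` is running when `ask` is called):

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # Ollama's native chat endpoint


def build_request(prompt: str, temperature: float = 0.1, num_ctx: int = 16384) -> dict:
    """Build an /api/chat payload with per-request inference options."""
    return {
        "model": "phi4",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one complete response instead of chunks
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }


def ask(prompt: str) -> str:
    """Send the request to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

This way a math query can run at temperature 0.1 while a brainstorming query in the same process runs at 0.8, without restarting the model.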
Troubleshooting
Problem: Download very slow from HuggingFace
Fix: ollama pull phi4 downloads from Ollama's own registry and resumes interrupted downloads, so simply retrying usually completes the ~8GB file. If you are instead fetching raw GGUF weights from HuggingFace Hub (which can be slow in some regions), use the huggingface-cli download command with HF_HUB_ENABLE_HF_TRANSFER=1 enabled, or pull from a regional mirror.
Problem: Phi-4 gives short or incomplete answers
Fix: Increase the response length limit. Phi-4's default response length via Ollama may be conservative. Set a higher num_predict inside an interactive session (/set parameter num_predict 4096) or via the API's options field. For complex reasoning tasks, also provide explicit instructions: "Think step by step and provide a detailed explanation."
Problem: 8GB GPU runs out of memory
Fix: Phi-4 at Q4 quantization needs about 8.5GB VRAM. If you have exactly 8GB, it may fail to load. Solution: use Q3_K_M quantization (ollama run phi4:q3_k_m if available, or use llama.cpp with manual GGUF download and specify Q3_K_M format). Alternatively, let Ollama run it in split GPU+CPU mode by reducing GPU layers.
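The split GPU+CPU mode mentioned above can be pinned down explicitly with Ollama's num_gpu parameter, which controls how many transformer layers are offloaded to the GPU. A sketch of a custom Modelfile (the layer count of 28 is an illustrative starting point, not a verified value for Phi-4; lower it until the model loads without OOM errors):

```
# Modelfile: partial GPU offload for 8GB cards
FROM phi4
PARAMETER num_gpu 28

# Build and run the reduced-offload variant:
#   ollama create phi4-8gb -f Modelfile
#   ollama run phi4-8gb
```

Each layer moved to CPU costs some speed, so the goal is the highest num_gpu value that still fits in VRAM alongside the context cache.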
Phi-4 in Production: Real-World Use Cases
Phi-4's MIT license and outstanding reasoning capability make it the top choice for production deployment in 2026. Here are the most successful use cases developers are building with Phi-4 locally:
AI-Powered Code Review Bot
Many engineering teams in 2026 integrate Phi-4 directly into their CI/CD pipeline. When a developer opens a pull request, an automated bot (powered by the local Phi-4 API) reviews the code diff, identifies potential bugs, suggests improvements, and checks for security vulnerabilities. Because it's local, sensitive proprietary code never reaches external AI providers, satisfying security and compliance requirements.
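A minimal sketch of that wiring, using only the standard library: diff the current branch, send the diff to the local Phi-4 endpoint, and return the review text. The prompt wording, base branch name, and temperature are illustrative choices, not a specific product's implementation:

```python
import json
import subprocess
import urllib.request

REVIEW_PROMPT = (
    "You are a code reviewer. Identify bugs, security issues, and "
    "improvements in this diff. Be specific.\n\n{diff}"
)


def build_review_request(diff: str) -> dict:
    """Payload for Ollama's OpenAI-compatible chat completions endpoint."""
    return {
        "model": "phi4",
        "messages": [{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
        "temperature": 0.2,  # low temperature keeps reviews focused and repeatable
    }


def review_pull_request(base: str = "main") -> str:
    """Diff against `base` and ask the local Phi-4 to review (needs ollama serve)."""
    diff = subprocess.run(
        ["git", "diff", base], capture_output=True, text=True
    ).stdout
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(build_review_request(diff)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

In CI this function would run on each pull request, with its output posted back as a review comment through your forge's API.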
Math Tutoring Application
Phi-4's exceptional mathematical reasoning makes it ideal for educational applications. Startups are building offline math tutors that guide students through problems step by step: explaining concepts, identifying where a student made a logical error, and generating similar practice problems. The MIT license allows these educational products to be distributed commercially without licensing fees.
Legal Document Analysis
Law firms increasingly use Phi-4 locally to analyze contracts, extract obligations and deadlines, flag unusual clauses, and generate document summaries. The on-premise nature ensures client confidentiality. Phi-4's 16K context handles most standard legal documents in a single pass, and its precise reasoning correctly interprets complex conditional language that simpler models misunderstand.
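For documents that do exceed the context window, a simple chunk-summarize-merge pass works well. A sketch of the chunking half (the 24,000-character default is a rough heuristic, roughly 6,000 tokens at ~4 characters per token, leaving ample room in a 16K window for the prompt and response):

```python
def chunk_text(text: str, max_chars: int = 24000, overlap: int = 500) -> list[str]:
    """Split a long document into overlapping chunks that fit Phi-4's context.

    The overlap keeps clauses that straddle a chunk boundary visible in both
    chunks, so obligations aren't lost at the seams.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

Each chunk is summarized independently, then the per-chunk summaries are concatenated and passed to Phi-4 once more for a final merged summary.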
For developers building commercial products with Phi-4, the MIT license is a significant competitive advantage. Unlike models under Llama's community license (which restricts use for platforms over 700M MAU) or MRL licenses (which restrict commercial deployment without a separate agreement), Phi-4 under MIT can be embedded in any product, redistributed, and even sold as part of an application without any licensing hurdles.
Another powerful use case for Phi-4 is as a local AI evaluation engine. Because Phi-4 excels at precise reasoning, many teams use it as a "judge model" to automatically evaluate the output quality of other AI systems β checking factual accuracy, logical consistency, and instruction adherence. This meta-AI role is perfectly suited to Phi-4's strengths and can run continuously on local hardware without cloud costs, enabling automated regression testing for AI-powered products.
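A sketch of a judge-model loop: ask Phi-4 to grade an answer against a rubric and end with a machine-parseable score line. The "SCORE: n" convention and 1-10 scale are our own illustrative choices, not a standard protocol:

```python
import json
import re
import urllib.request

JUDGE_PROMPT = (
    "Rate the ANSWER for factual accuracy and instruction adherence on a "
    '1-10 scale. End your reply with a line "SCORE: <n>".\n\n'
    "QUESTION: {question}\nANSWER: {answer}"
)


def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge model's reply."""
    match = re.search(r"SCORE:\s*(\d+)", judge_reply)
    if not match:
        raise ValueError("judge reply contained no SCORE line")
    return int(match.group(1))


def judge(question: str, answer: str) -> int:
    """Ask local Phi-4 to grade another model's answer (needs ollama serve)."""
    payload = {
        "model": "phi4",
        "messages": [{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        "temperature": 0.0,  # grading should be as deterministic as possible
    }
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_score(json.load(resp)["choices"][0]["message"]["content"])
```

Running this over a fixed test set after every prompt or model change gives a cheap regression signal for AI-powered products.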
For teams working with sensitive data (healthcare records, financial information, or proprietary business documents) Phi-4's local deployment provides genuine privacy guarantees that cloud AI cannot match. When you run Phi-4 on your own hardware through Ollama, no data is transmitted to Microsoft's servers, no prompts are logged externally, and no conversation history leaves your infrastructure. This is especially important for industries with strict data handling regulations, where even sending anonymized data to a third-party AI provider may require extensive compliance reviews. Local Phi-4 deployment eliminates these concerns entirely while delivering frontier-class AI capability.
Microsoft Research has indicated that the Phi series will continue expanding in capability with future releases. The Phi approach (training smaller, more efficient models on carefully curated, high-quality synthetic and real data) has proven remarkably successful and is now influencing how many other AI research labs worldwide approach small model training methodology and data curation strategies. Staying updated with Microsoft's releases via the HuggingFace model hub (where all Phi models are published) is worthwhile for developers who need the most capable small model available at any given time.
Phi-4 Performance Benchmarks by Platform
Phi-4's compact 14B size means impressive speeds across all hardware tiers. Here's what to expect:
| Hardware | Speed (t/s) | Memory Used | Acceleration | Rating |
|---|---|---|---|---|
| MacBook Pro M4 Pro 24GB | 35-50 | ~9GB RAM | Metal GPU | ★★★★★ |
| MacBook Air M3 16GB | 25-40 | ~9GB RAM | Metal GPU | ★★★★★ |
| RTX 4090 24GB (Windows) | 55-80 | ~9GB VRAM | CUDA | ★★★★★ |
| RTX 3060 12GB | 25-40 | ~9GB VRAM | CUDA | ★★★★ |
| RX 7900 XTX 24GB (Linux) | 40-60 | ~9GB VRAM | ROCm | ★★★★★ |
| iPad Pro M4 16GB | 18-25 | ~9GB RAM | MLX (Neural Engine) | ★★★★ |
| CPU only (AMD 7950X 64GB) | 5-10 | ~9GB RAM | CPU (AVX-512) | ★★★ |
Best Value Setup: For developers looking for maximum Phi-4 performance per dollar in 2026, the RTX 3060 12GB ($280 used market) paired with 32GB DDR5 RAM provides 25-40 tokens/second with Phi-4, fast enough for real-time coding assistance and document analysis. This setup runs Phi-4 entirely in GPU VRAM with no CPU offloading required.
Phi-4's 14B parameter count fits perfectly in the "sweet spot" for 2026 GPU VRAM: 8-12GB GPUs can run it at full GPU speed without any CPU offloading. The result is consistently fast inference across the entire mid-range GPU tier, making Phi-4 accessible to a much wider audience than larger flagship models.
Phi-4 Quick Reference: Commands and Configuration
Complete command reference for using Phi-4 with Ollama across all platforms:
# -- Install Ollama -------------------------------------
brew install ollama # macOS
curl -fsSL https://ollama.com/install.sh | sh # Linux
# -- Download and Run Phi-4 -----------------------------
ollama pull phi4 # download Phi-4 (8GB)
ollama run phi4 # start interactive chat
# -- Example Prompts for Phi-4 --------------------------
ollama run phi4 "Find the bug in this Python code: def fact(n): return n*fact(n-1)"
ollama run phi4 "Prove that there are infinitely many prime numbers"
ollama run phi4 "Write a binary search tree implementation in Rust"
# -- Custom Modelfile for Code Assistant ----------------
FROM phi4
PARAMETER temperature 0.2
SYSTEM "You are an expert software engineer. Analyze code thoroughly, identify all bugs, and suggest improvements with explanations."
# ollama create phi4-coder -f Modelfile
# -- API ------------------------------------------------
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"phi4","messages":[{"role":"user","content":"Review this code"}]}'
Frequently Asked Questions
Q: Is Phi-4 really better than Llama 3.3-70B?
On mathematical reasoning and science benchmarks: yes, definitively. Phi-4 at 14B parameters scores higher than Llama 3.3-70B on MATH, GPQA Diamond, and several coding benchmarks. On general knowledge and creative tasks, Llama 3.3-70B has an advantage due to its larger model capacity and broader training data. For STEM-focused applications where precision and reasoning are paramount, Phi-4 is the better choice despite being 5x smaller.
Q: Can I use Phi-4 in a commercial product?
Absolutely. Phi-4's MIT license is the most permissive available: you can use it in commercial products, SaaS applications, embedded devices, and enterprise software without any additional licensing fees or restrictions. You can also modify and redistribute it. The only requirement is preserving the MIT license notice. This makes Phi-4 uniquely attractive for startups and enterprises building AI-powered products.
Q: What hardware do I need for Phi-4?
Phi-4 at Q4 quantization requires approximately 8.5GB VRAM for GPU inference, or 16GB system RAM for CPU-only operation. Ideal minimum hardware: NVIDIA RTX 3060 (12GB) or better for smooth GPU acceleration. On Apple Silicon, an M2/M3 chip with 16GB unified memory handles Phi-4 smoothly at 25+ tokens per second. For CPU-only on an Intel/AMD system, a modern CPU with 32GB RAM gives acceptable performance around 5-8 tokens per second.
Q: Is Phi-4 good for creative writing?
Phi-4 is capable at creative tasks but this isn't its primary strength. Its training data was heavily weighted toward mathematical and scientific content, so while it can write well, larger models with broader training data (like Llama 4 Scout or Mistral Large 2) produce more varied, nuanced creative writing. For fiction, poetry, and creative content generation, those models may serve better. Phi-4 excels at tasks where precision and logical structure matter more than creative flair.
Q: How often is Phi-4 updated?
Microsoft releases Phi model updates on an irregular basis, typically every 6-12 months for major versions. Between major releases, they sometimes publish updated checkpoints (e.g., Phi-4-mini variants or fine-tuned versions for specific tasks). The best way to track updates is to follow Microsoft's HuggingFace profile at huggingface.co/microsoft or subscribe to the model's page notifications. Ollama typically adds new Phi variants within a few days of their HuggingFace publication.
Next Steps
Why Phi-4 Is the #1 Small Model for 2026
MIT License: Fully commercial, no restrictions on use, modification, or distribution
8GB VRAM fit: Runs entirely in GPU VRAM on mid-range hardware without CPU offloading
Beats 70B models: Outperforms Llama 3.3-70B on MATH and GPQA despite being 5x smaller
Code excellence: Best sub-20B model for coding, debugging, and algorithm design in 2026
Fast inference: 35-80 t/s depending on hardware, suitable for real-time applications
All platforms: Windows CUDA, macOS Metal, Linux ROCm, iOS MLX all supported
Coding Integration
Learn to integrate Phi-4 as a VS Code coding assistant with Continue
Related Articles
Gemma 3 Local Install: Windows, Mac & Linux 2026
Install Google Gemma 3 on all platforms. Runs on just 4GB VRAM, 1B to 27B model sizes, vision support included.
Mistral Large 2 Local Install: All Platforms 2026
Install Mistral Large 2 (123B) locally. Europe's top open model for code & multilingual tasks. Complete Ollama guide 2026.