
Ollama Tutorial 2026: Install & Run Any LLM Free on Windows, Mac & Linux

March 6, 2026 · 18 min read · Ollama, Local LLM, Free AI

Quick Summary: Ollama is the easiest way to run open-source large language models on your own computer — completely free, completely private, no API keys required. With a single command you can run DeepSeek R1, Llama 4, Qwen 3.5, Gemma 3, Phi-4, and dozens of other models locally. This guide covers everything from first installation on Windows, macOS, and Linux to advanced API server configuration, Open WebUI setup, and model management for power users.

What Is Ollama and Why Should You Use It?

Ollama is an open-source tool that packages large language models with everything needed to run them — model weights, inference engine, configuration, and a clean API — into a single executable that just works on your machine. Launched in 2023 by a team of former Apple engineers, Ollama has become the standard way for developers, researchers, and enthusiasts to run LLMs locally.

Before Ollama, running a local LLM meant compiling llama.cpp from source, manually converting model formats, writing custom loading scripts, and wrestling with CUDA drivers. Ollama reduces all of that to one command: ollama run llama4. It handles model download, quantization selection, GPU detection, and inference automatically.

Free & open source · 200+ models available · Built-in REST API · GPU and CPU inference supported

Key reasons to use Ollama in 2026:

  • Privacy — your conversations never leave your machine.
  • Speed — local inference on a modern GPU is often faster than cloud API calls for shorter responses.
  • Cost — no per-token pricing, no subscription fees.
  • Customization — full control over system prompts, temperature, context length, and quantization level.
  • Offline access — works without internet once models are downloaded.

Hardware Requirements

Model Size     | Min RAM (CPU) | Min VRAM (GPU) | Example Models
1B–3B params   | 4GB           | 2GB            | Gemma 3 1B, MiniCPM 3B, Phi-4 mini
7B–9B params   | 8GB           | 6GB            | Qwen 3.5 7B, GLM-4 9B, Gemma 3 9B
14B–27B params | 16GB          | 12GB           | Phi-4 14B, Gemma 3 27B, Qwen 3.5 14B
70B+ params    | 64GB          | 48GB           | Llama 4 Scout, DeepSeek R1 70B

Best Hardware Picks for Ollama in 2026

Apple Silicon MacBooks (M3/M4) are the best value for local LLMs — unified memory means a MacBook Pro with 24GB RAM effectively has 24GB of "VRAM" for model layers, enabling 27B models to run at 30+ tokens/second. On the PC side, an RTX 4060 Ti 16GB handles 14B–27B models comfortably. For smaller models, any 2020+ laptop with 8GB RAM can run 7B–9B models via CPU inference at 5–15 tokens/second — slow but usable for testing.

Installing Ollama on Windows

Windows installation is a standard .exe setup — no terminal knowledge required for the basic install. NVIDIA CUDA and AMD ROCm GPU acceleration are both supported automatically.

Step 1: Download Ollama Installer

Visit ollama.com and click "Download for Windows". This downloads OllamaSetup.exe. Run it as Administrator for best results (right-click → Run as administrator). The installer adds Ollama to your system PATH automatically, so you can use it from any terminal.

Step 2: Open PowerShell or Command Prompt

After installation, open PowerShell (Win+X → Terminal) or CMD. Ollama runs as a background service automatically on Windows startup. Verify the install worked:

ollama --version
# Should output: ollama version 0.x.x

Step 3: Run Your First Model

Pull and run a model with one command. Ollama downloads the model automatically on first run:

ollama run llama4
# Downloads Llama 4 Scout (7.9B) and starts a chat session
# Or try a smaller model first:
ollama run gemma3:1b
# Only 815MB, very fast even on CPU

Windows GPU Acceleration

Ollama automatically detects NVIDIA GPUs (requires CUDA 11.3+) and AMD GPUs on Windows 10/11. If Ollama is not using your GPU, run a model and look for "using CUDA" or "using ROCm" in the logs; if neither appears, update your GPU drivers. For NVIDIA, install the latest Game Ready Driver from nvidia.com. Ollama does NOT support Intel Arc GPUs on Windows.

Installing Ollama on macOS

macOS users get the best local LLM experience thanks to Apple Silicon's unified memory architecture. A Mac Mini M4 with 24GB RAM is genuinely competitive with high-end Windows gaming PCs for LLM inference.

Method 1: App Download (Recommended)

Download Ollama.app from ollama.com, drag to /Applications, and double-click. The app installs a menu bar icon and CLI automatically. Apple Silicon (M1/M2/M3/M4) and Intel Macs are both supported.

# After app install, use in terminal:
ollama run qwen3.5:7b

Method 2: Homebrew (For Developers)

If you use Homebrew, install the Ollama formula. This is better for server environments where you don't want the GUI menu bar app:

brew install ollama
ollama serve &
ollama run deepseek-r1:7b

# macOS Apple Silicon performance examples (M3 Pro, 18GB):
ollama run gemma3:1b       # ~80 tokens/sec — almost instant responses
ollama run phi4            # ~35 tokens/sec — 14B model, very fast
ollama run qwen3.5:14b     # ~28 tokens/sec — excellent quality
ollama run deepseek-r1:7b  # ~45 tokens/sec — great reasoning

Apple Silicon Memory Tips

On Apple Silicon, RAM and VRAM are shared (unified memory). Ollama uses as much as the model needs. To check how much memory a model needs: a Q4_K_M 7B model ≈ 4.5GB, 14B ≈ 8.5GB, 27B ≈ 16GB. Leave ~4GB for macOS and other apps. So on a 16GB Mac you can comfortably run 9B models; on 24GB, up to 14B; on 32GB, up to 27B at full quality.
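The back-of-envelope memory math above can be sketched as a tiny helper. The bits-per-weight and overhead constants below are rough assumptions (Q4_K_M averages a bit under 5 bits per weight, plus runtime buffers), not exact Ollama internals:

```python
def est_model_gb(params_billion: float, bits_per_weight: float = 4.85,
                 overhead: float = 1.1) -> float:
    """Rough memory footprint of a quantized model, in decimal GB.

    bits_per_weight ~= 4.85 approximates Q4_K_M's mixed 4/6-bit layout;
    overhead covers embeddings and runtime buffers. Both constants are
    rules of thumb, not exact Ollama internals.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return total_bytes / 1e9

for size in (7, 14, 27):
    print(f"{size}B -> ~{est_model_gb(size):.1f} GB")
```

The estimates land within roughly half a gigabyte of the figures quoted above; remember to leave ~4GB free for macOS itself.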

Installing Ollama on Linux

Linux offers the most flexibility for Ollama deployment, from desktop workstations to headless servers. The one-line installer supports all major distributions including Ubuntu, Debian, Fedora, Arch, and CentOS.

# Universal Linux installer (Ubuntu, Debian, Fedora, Arch, etc.):
curl -fsSL https://ollama.com/install.sh | sh

# The installer automatically:
# - Detects your NVIDIA or AMD GPU
# - Installs appropriate GPU libraries
# - Creates an ollama systemd service
# - Starts the service and enables autostart

# Verify installation:
systemctl status ollama   # Check service is active

# Run your first model:
ollama run mistral-large2

Docker Installation (Best for Servers)

For production server deployments or isolated environments, use the official Ollama Docker image:

# NVIDIA GPU:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

# CPU only:
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
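If you prefer Docker Compose, the same container can be described declaratively. This is a minimal sketch assuming Compose v2 with NVIDIA device reservations; drop the deploy: block on CPU-only hosts:

```yaml
services:
  ollama:
    image: ollama/ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama    # same named volume as the docker run example
    deploy:                     # NVIDIA GPU passthrough — remove for CPU-only
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```

Start it with docker compose up -d from the directory containing this file.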

Remote Access (Allow Network Connections)

By default Ollama only accepts connections from localhost. To expose it to your local network or a specific IP:

OLLAMA_HOST=0.0.0.0 ollama serve
# Or set permanently via systemd override:
sudo systemctl edit ollama
# Add: [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
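For reference, the complete override file that the commented lines describe might look like this (the OLLAMA_ORIGINS line is an optional extra for browser-based clients):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# (opened for editing by: sudo systemctl edit ollama)
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
# Optional — allow cross-origin browser clients:
Environment="OLLAMA_ORIGINS=*"
```

After saving, apply it with sudo systemctl restart ollama.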

Essential Ollama Commands

Learning these core commands gives you full control over Ollama's 200+ models. All commands work identically on Windows, macOS, and Linux:

# ===== MODEL MANAGEMENT =====
ollama pull llama4          # Download a model without running it
ollama run deepseek-r1:7b   # Download (if needed) and start chat
ollama list                 # List all downloaded models
ollama rm phi4              # Delete a model to free up disk space

# ===== RUNNING OPTIONS =====
ollama run qwen3.5 "Write me a Python function to sort a list"
ollama run llama4 --num-ctx 32768   # Set 32K context window
ollama run gemma3 --verbose         # Show timing and GPU stats

# ===== MODEL SIZES (use tags) =====
ollama run qwen3.5:0.6b   # Smallest, fastest (0.6B params)
ollama run qwen3.5:7b     # Default size, good balance
ollama run qwen3.5:32b    # Larger, requires 20GB+ RAM

# ===== SYSTEM STATUS =====
ollama ps            # Show currently running models
ollama show llama4   # Show model details and parameters
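These commands are easy to script around. As a sketch, here is a small Python wrapper that parses ollama list output into dictionaries — the column layout (NAME, ID, SIZE, MODIFIED) is an assumption based on current releases, demonstrated on a canned sample:

```python
import subprocess
from typing import Optional

def list_models(sample: Optional[str] = None) -> list:
    """Parse `ollama list` output into dicts of name, id, and size.

    When `sample` is None the real CLI is invoked; otherwise the given
    text is parsed, which keeps the function testable offline.
    """
    if sample is None:
        sample = subprocess.run(["ollama", "list"], capture_output=True,
                                text=True, check=True).stdout
    models = []
    for line in sample.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 4:
            models.append({"name": parts[0], "id": parts[1],
                           "size": " ".join(parts[2:4])})
    return models

# Canned sample in the assumed column layout:
demo = """NAME             ID            SIZE    MODIFIED
gemma3:1b        a1b2c3d4e5f6  815 MB  2 days ago
deepseek-r1:7b   f6e5d4c3b2a1  4.7 GB  5 hours ago"""
print(list_models(demo))
```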

💡 Best Models to Start With

  • gemma3:1b — Ultra-fast, works on any hardware
  • phi4 — Microsoft 14B, excellent reasoning
  • deepseek-r1:7b — Top reasoning model
  • qwen3.5:7b — Best multilingual model
  • llama4 — Meta's latest flagship

⚡ Performance Tips

  • Close other GPU apps to free VRAM
  • Use Q4_K_M quantization for best speed/quality
  • Set OLLAMA_NUM_GPU=1 to force GPU use
  • Increase --num-ctx only when needed
  • Keep models under 80% of your total VRAM
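The 80% VRAM rule of thumb from the tips above can be expressed as a one-line check (the threshold is a heuristic, not an Ollama setting):

```python
def fits_in_vram(model_gb: float, vram_gb: float, budget: float = 0.80) -> bool:
    """Rule of thumb: model weights should stay under ~80% of VRAM so the
    KV cache and runtime buffers have headroom."""
    return model_gb <= vram_gb * budget

# A ~8.5 GB 14B Q4_K_M model on a 12 GB card fits (8.5 <= 9.6);
# a ~16 GB 27B model does not.
print(fits_in_vram(8.5, 12.0))
print(fits_in_vram(16.0, 12.0))
```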

Open WebUI — Browser Chat Interface

Once Ollama is running, Open WebUI transforms it into a full ChatGPT-like interface accessible from any browser on your network. It supports conversation history, file uploads, image analysis (for multimodal models), system prompts, and side-by-side model comparisons.

# Install Open WebUI with Docker (one command):
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open your browser at http://localhost:3000, create an admin account, and you'll see all your Ollama models in the dropdown. You can also install Open WebUI without Docker via pip install open-webui if you prefer Python-only environments.

Open WebUI Key Features

✅ Full conversation history with search
✅ Image upload for multimodal models (Gemma 3, LLaVA)
✅ Document RAG — upload PDFs for Q&A
✅ Web search integration
✅ Multi-user with separate conversation histories
✅ Side-by-side model comparison

Ollama REST API — Build Your Own Apps

Ollama exposes a REST API on port 11434. Alongside its native endpoints (/api/chat, /api/generate), it serves an OpenAI-compatible API under /v1 — so any tool or application built for OpenAI's API can be pointed at Ollama instead with no code changes, just a different base URL.

# Direct API call with curl (native /api/chat endpoint):
curl http://localhost:11434/api/chat -d '{
  "model": "llama4",
  "messages": [{"role": "user", "content": "Explain quantum entanglement simply"}],
  "stream": false
}'

# Python with openai library (drop-in replacement):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
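With "stream": true, the native /api/chat endpoint returns newline-delimited JSON chunks instead of one response object. Here is a sketch of reassembling them, assuming each chunk carries a message.content fragment and the final chunk sets "done": true; the canned values below are illustrative, not real model output:

```python
import json

def assemble_stream(ndjson_lines):
    """Join the content fragments of a streamed /api/chat response.

    Each line is one JSON object; the text fragment is assumed to live at
    message.content, with the final chunk carrying "done": true.
    """
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Canned chunks shaped like Ollama's stream (illustrative values):
chunks = [
    '{"message": {"role": "assistant", "content": "Entangled particles "}, "done": false}',
    '{"message": {"role": "assistant", "content": "share one state."}, "done": true}',
]
print(assemble_stream(chunks))  # Entangled particles share one state.
```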

Compatible tools and applications that work with Ollama's API include: LangChain, LlamaIndex, Cursor IDE, Continue (VS Code plugin), Msty, Page Assist (Chrome extension), and Enchanted (iOS/macOS native app). Any of these can connect to your local Ollama instance for private AI assistance.

Top Models to Run with Ollama in 2026

Ollama's library includes over 200 models. Here are the top picks from our Open Source LLM Hub, with Ollama pull commands and what they're best for:

DeepSeek R1 — Best Reasoning

ollama run deepseek-r1:7b

Top reasoning model globally. Matches GPT-4 class models on math and coding tasks. The 7B version runs on 6GB VRAM and delivers chain-of-thought reasoning that other 7B models can't match.

Llama 4 — Best Overall

ollama run llama4

Meta's latest multimodal model. Excellent at general conversation, coding, and image analysis. The Scout variant (7.9B active params from 109B total) delivers frontier-level quality on mid-range hardware.

Qwen 3.5 — Best Multilingual

ollama run qwen3.5:7b

Alibaba's flagship open model. Top performance in Chinese, Japanese, Korean, Arabic and 29 other languages. Also excellent at coding. The 235B A22B MoE version rivals GPT-5 on benchmarks.

Gemma 3 — Best for Low-End Hardware

ollama run gemma3:1b

Google's efficient model family. The 1B version runs on a Raspberry Pi 5 with 4GB RAM at acceptable speed. The 9B version fits in 6GB VRAM and offers multimodal vision support. Perfect for resource-constrained devices.

Phi-4 — Best for Coding

ollama run phi4

Microsoft's 14B parameter precision model. Outperforms most 70B models on coding benchmarks despite being 5× smaller. MIT license allows commercial use. Ideal for AI coding assistants integrated with VS Code or Cursor.

Troubleshooting Common Issues

❌ "GPU not found" or model running on CPU unexpectedly

Windows: Update NVIDIA/AMD drivers. Run nvidia-smi to confirm the driver sees your GPU. Linux: run nvidia-smi (or rocm-smi for AMD) to confirm the same. Reinstall Ollama after GPU driver updates. macOS: Metal is automatic on Apple Silicon — make sure you're not running the Intel binary on an Apple Silicon Mac (download the right version from ollama.com).

❌ "error: model not found" when pulling

This usually means a typo in the model name or the model requires Ollama version 0.5+. Run ollama --version and update if needed. Check the exact model name at ollama.com/library. For models not in Ollama's library, you can import GGUF files directly with ollama create mymodel -f Modelfile.
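For the ollama create path, a minimal Modelfile might look like the following — the GGUF filename, parameter values, and system prompt are all hypothetical placeholders:

```dockerfile
# Modelfile — import a local GGUF file (all values below are placeholders)
FROM ./my-model.Q4_K_M.gguf

# Optional runtime defaults
PARAMETER temperature 0.7
PARAMETER num_ctx 8192

SYSTEM "You are a concise, helpful assistant."
```

Build and run it with ollama create mymodel -f Modelfile, then ollama run mymodel.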

❌ Slow download speed when pulling models

Ollama downloads from its CDN which may be throttled or blocked in certain regions. Large models like DeepSeek R1 671B can be 400GB+. For fast, unrestricted downloads, use VPN07's 1000Mbps servers — our global CDN network bypasses regional throttling and delivers full download speed from any location. The same applies to downloading GGUF files from HuggingFace.

❌ "out of memory" when running large models

Your hardware doesn't have enough VRAM/RAM for the selected model at that quantization. Solutions: (1) Use a lower quantization tag like :q3_k_m, or a smaller size tag like :4b. (2) Use a smaller model variant. (3) Shift more layers to the CPU by reserving GPU headroom with OLLAMA_GPU_OVERHEAD=512MiB. (4) Close other GPU-using applications before running Ollama.

Frequently Asked Questions

Q: Is Ollama completely free? Are there any limits?

Yes, Ollama is 100% free and open-source (MIT license). There are no rate limits, no usage caps, no subscriptions, and no per-token fees. The only costs are the hardware you already own, the electricity it uses, and the internet bandwidth for the initial model download. Once downloaded, models run offline with zero ongoing cost.

Q: How does Ollama compare to running llama.cpp directly?

Ollama uses llama.cpp as its inference backend, so the raw performance is similar. Ollama adds model management (download, version tracking, automatic updates), a clean API server, multi-model support, and easy GPU configuration. For most users, Ollama is strictly better. For advanced users who need custom llama.cpp build flags or experimental features, direct llama.cpp is still an option.

Q: Can I run Ollama on a VPS or cloud server?

Absolutely. Ollama runs on any Linux VPS. For GPU inference, you'll need a cloud GPU instance (Vast.ai, RunPod, Lambda Labs, or cloud providers with GPU VMs). For CPU-only inference, any VPS with 8GB+ RAM can run smaller models. Set OLLAMA_HOST=0.0.0.0 and use a VPN (like VPN07) or SSH tunnel to securely access your remote Ollama instance from anywhere.

Q: Which quantization level should I use?

For most users, Q4_K_M is the sweet spot — roughly 4 bits per parameter, less than 5% quality loss from full precision, and half the VRAM of FP16. If you have plenty of VRAM, Q6_K or Q8_0 give near-lossless quality. For maximum speed at lower quality, try Q3_K_M. Ollama defaults to Q4_K_M, so running ollama run llama4 automatically gives you the optimal quantization.


VPN07 — Download LLMs at Full Speed

1000Mbps · 70+ Countries · Trusted Since 2015

Ollama model downloads can be huge — DeepSeek R1 671B is over 400GB, Llama 4 is 50GB+, and even 7B models are 4–5GB. Without a fast, unrestricted connection, these downloads take hours. VPN07 delivers 1000Mbps bandwidth with servers in 70+ countries, bypassing regional throttling on model hosting CDNs (Ollama's servers, HuggingFace, and model mirrors). Our network has been rock-solid for over 10 years. $1.5/month with a 30-day money-back guarantee — the fastest way to get your local AI running.

$1.5 per month · 1000Mbps bandwidth · 70+ countries · 30-day money-back guarantee
