
MiniMax M2 Install Guide 2026: Run Locally on All Platforms

March 5, 2026 · 18 min read · Tags: MiniMax M2, Local LLM, All Platforms

Quick Summary: MiniMax M2 (also known as MiniMax-M1-40k) is a powerhouse 456B Mixture-of-Experts model with a 40,000-token context window. This guide covers API-based access for all platforms, full self-hosting setup for GPU servers, and mobile apps for Android and iOS so you can start using MiniMax M2 today regardless of your hardware.

What Is MiniMax M2?

MiniMax M2 is the flagship open-source large language model from MiniMax, a leading Chinese AI research company. Released in 2025 and widely deployed in 2026, the model uses a Mixture-of-Experts (MoE) architecture with 456 billion total parameters but only 45.9 billion active per forward pass — making it far more computationally efficient than its total parameter count suggests.

The defining feature of MiniMax M2 is its extended context window of 40,000 tokens, with support for even longer contexts through the model's efficient attention mechanism. This makes MiniMax M2 exceptionally well-suited for processing entire codebases, lengthy legal documents, full research papers, and complex multi-turn conversations without truncating important context.

Unlike some large open-source models that sacrifice quality at smaller sizes, MiniMax M2's MoE architecture means each token is processed by a highly specialized subset of parameters — delivering performance that competes with models double its active parameter count while using comparable memory during inference.
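To make the routing idea concrete, here is a toy sketch of top-k expert selection. This is illustrative only, not MiniMax's actual router (`route_token` is a hypothetical helper name), but it shows why only a fraction of the parameters run per token:

```python
def route_token(expert_scores, top_k=2):
    """Return the indices of the top-k experts for one token.

    In a MoE layer, a small router network scores every expert for each
    token, and only the top-k experts actually run -- which is how a 456B
    model can activate only ~45.9B parameters per forward pass.
    """
    ranked = sorted(range(len(expert_scores)),
                    key=lambda i: expert_scores[i], reverse=True)
    return ranked[:top_k]

# A token whose router scores favor experts 1 and 3:
print(route_token([0.1, 0.9, 0.3, 0.5]))  # -> [1, 3]
```

The parameters of experts 0 and 2 never run for this token, which is the source of the efficiency described above.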

456B total parameters · 45.9B active parameters · 40K context tokens · Apache 2.0 license

MiniMax M2 is released under the permissive Apache 2.0 license, meaning you can use it freely for commercial applications, integrate it into your products, fine-tune it on your own data, and redistribute your modifications. This license flexibility makes it attractive for enterprise and startup deployments where licensing restrictions of other models can create legal complications.

Hardware Requirements for Local Deployment

Due to its 456B MoE architecture, running MiniMax M2 fully locally requires significant hardware. However, the MoE design means you need less memory than a dense model of equivalent capability:

| Deployment Mode      | Min GPU VRAM | Recommended Hardware | Speed        |
|----------------------|--------------|----------------------|--------------|
| Cloud API (easiest)  | None         | Any device           | Fast (cloud) |
| GGUF Q4 self-host    | 160GB+       | 4× A100 80GB         | 8–12 t/s     |
| FP16 full precision  | 400GB+       | 8× A100 80GB         | 5–8 t/s      |
| CPU only (Q2)        | 512GB RAM    | High-RAM server      | 0.3–1 t/s    |

Recommendation for Most Users

Unless you have access to a GPU cluster, the most practical approach for MiniMax M2 in 2026 is to use the official MiniMax API (which has a generous free tier) on desktop and mobile. This gives you access to the full model quality without hardware costs. For developers building applications, the API approach is also far more reliable and maintainable than self-hosting a 456B model.

Windows Installation

On Windows, there are two primary approaches: using the MiniMax API through a Python environment, or (for those with sufficient GPU hardware) setting up vLLM via WSL2. Here's the recommended API-based setup for most Windows users:

Step 1: Get Your MiniMax API Key

Visit platform.minimaxi.com and create a free account. The free tier includes generous API credits to evaluate the model. Copy your API key from the dashboard — you'll need this for all platforms.

Step 2: Install Python and SDK

Open PowerShell (run as Administrator) and install Python 3.11+ from the Microsoft Store or python.org. Then install the MiniMax SDK:

pip install minimax-python openai
pip install requests httpx

Step 3: Test Your Connection

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MINIMAX_API_KEY",
    base_url="https://api.minimaxi.chat/v1"
)

response = client.chat.completions.create(
    model="MiniMax-Text-01",
    messages=[{"role": "user", "content": "Hello, MiniMax!"}]
)
print(response.choices[0].message.content)

MiniMax's API is OpenAI-compatible, so any tool that supports the OpenAI SDK also works with MiniMax M2 by changing the base URL and API key.

Windows LM Studio (GGUF, High-End Hardware)

For users with 4× RTX 4090 or better (192GB+ total VRAM), quantized GGUF versions of MiniMax M1 are available on community HuggingFace repositories. Download LM Studio from lmstudio.ai and use the model search to find MiniMax-M1-40k-GGUF. Select the Q4_K_M quantization for the best quality/memory balance. Warning: expect download sizes of 200GB+ and load times of 10–15 minutes on NVMe storage.

macOS Installation

macOS users — especially those on Apple Silicon — can access MiniMax M2 through the same API approach, with the added advantage that Apple's unified memory architecture theoretically supports running quantized large models if you have an M-Ultra Mac with 192GB RAM. Here's the setup:

API Setup (All Macs)

Works on any Mac from MacBook Air M1 to Mac Pro. Install Python via Homebrew, then set up the MiniMax client:

brew install [email protected]
pip install openai
export MINIMAX_API_KEY="your_key"

Mac Studio Ultra (192GB RAM)

The Mac Studio Ultra with 192GB unified memory can run MiniMax M2 in Q4 quantization via llama.cpp, making it one of the very few consumer machines capable of local MiniMax M2 inference:

brew install cmake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build -j
./build/bin/llama-cli -m minimax-m1-q4.gguf \
    --ctx-size 40960 -ngl 99

Quick Test Command (macOS): After API setup, run python3 -c "from openai import OpenAI; c=OpenAI(api_key='KEY',base_url='https://api.minimaxi.chat/v1'); r=c.chat.completions.create(model='MiniMax-Text-01',messages=[{'role':'user','content':'Hello'}]); print(r.choices[0].message.content)" to verify your connection is working.

Linux Installation

Linux is the preferred platform for serious MiniMax M2 self-hosting due to superior CUDA driver support and the ability to use frameworks like vLLM and SGLang. Here's a complete setup guide for both API use and full self-hosting:

Option A: API Setup (All Linux Systems)

Works on Ubuntu, Debian, Fedora, Arch, and any Linux distribution with Python 3.11+:

sudo apt update && sudo apt install python3-pip -y
pip3 install openai requests
export MINIMAX_API_KEY="your_key_here"
echo "export MINIMAX_API_KEY='your_key'" >> ~/.bashrc
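To verify the setup without installing any SDK, a minimal standard-library sketch can assemble the same request the openai client sends. The endpoint and model name come from the examples above; `build_chat_request` is a hypothetical helper name:

```python
import json
import os
import urllib.request

API_BASE = "https://api.minimaxi.chat/v1"

def build_chat_request(prompt, model="MiniMax-Text-01", api_key=None):
    """Assemble the HTTP request for a single chat completion call."""
    key = api_key or os.environ.get("MINIMAX_API_KEY", "")
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```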

Option B: vLLM Self-Hosting (GPU Cluster)

For teams with 4×A100 80GB or 8×H100 80GB GPU clusters, vLLM provides production-grade inference serving:

# Install vLLM with tensor parallelism support
pip install "vllm>=0.5.0" ray

# Download model weights
huggingface-cli download MiniMaxAI/MiniMax-M1-40k \
--local-dir ./minimax-m1

# Launch with 4-GPU tensor parallel
python -m vllm.entrypoints.openai.api_server \
--model ./minimax-m1 \
--tensor-parallel-size 4 \
--max-model-len 40960 \
--port 8000
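Once the server is up, a quick smoke test against its OpenAI-compatible /v1/models endpoint confirms it is serving. This is a standard-library sketch; `served_models` and `models_url` are hypothetical helper names:

```python
import json
import urllib.request

def models_url(base="http://localhost:8000/v1"):
    """Build the /models listing URL for an OpenAI-compatible server."""
    return base.rstrip("/") + "/models"

def served_models(base="http://localhost:8000/v1", timeout=10):
    """Ask the running vLLM server which models it is serving."""
    with urllib.request.urlopen(models_url(base), timeout=timeout) as resp:
        return [m["id"] for m in json.load(resp)["data"]]

# With the launch command above running on the same machine:
# print(served_models())
```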

Option C: SGLang (Faster Serving)

SGLang offers 2–3× higher throughput than vLLM for MoE models like MiniMax M2 through RadixAttention and speculative execution:

pip install "sglang[all]>=0.3.0"

python -m sglang.launch_server \
--model-path ./minimax-m1 \
--tp-size 4 \
--port 30000 \
--max-total-tokens 131072

Docker Compose for Multi-GPU Deployment

For production deployments, use this Docker Compose template to manage your MiniMax M2 server with automatic restart and GPU allocation:

services:
  minimax-server:
    image: vllm/vllm-openai:latest
    command: --model /models/minimax-m1 --tp 4 --port 8000
    volumes: ["./models:/models"]
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: 4, capabilities: [gpu]}]
    ports: ["8000:8000"]
    restart: unless-stopped

Android Installation

Android users can access MiniMax M2 through several apps that support API-based connectivity. Since MiniMax M2's full model is too large for on-device inference on consumer phones, the best approach uses the MiniMax API through a polished chat interface:

ChatGPT-Compatible Apps (Best Option)

Apps like LM Chat, ChatHub, or TypingMind for Android support custom OpenAI-compatible API endpoints. Since MiniMax uses the OpenAI API format, just configure:

Base URL: https://api.minimaxi.chat/v1
API Key: [Your MiniMax API Key]
Model: MiniMax-Text-01

Connect to Your Self-Hosted Server

If you have a Linux server running vLLM with MiniMax M2, you can connect your Android device directly over your local network or through a VPN. Use Open WebUI mobile or any OpenAI-compatible Android app pointed at your server's IP address. Make sure to run VPN07 to secure the connection if accessing over the internet, and to ensure low-latency routing to your server location.

Termux API Script (Advanced)

For developers who prefer terminal-based access, Termux on Android lets you run Python scripts that call the MiniMax API. Install Termux from F-Droid, run pkg install python && pip install openai, and then run your Python scripts directly from your phone's terminal. Great for quick API tests and automation tasks.
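A minimal one-shot script along those lines might look like this. It is a sketch using only the standard library, so nothing beyond Python itself needs installing in Termux; `ask` and `extract_reply` are hypothetical helper names:

```python
import json
import os
import urllib.request

def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style chat response."""
    return response_json["choices"][0]["message"]["content"]

def ask(prompt, model="MiniMax-Text-01"):
    """Send one chat turn to the MiniMax API and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://api.minimaxi.chat/v1/chat/completions",
        data=body,
        headers={
            "Authorization": "Bearer " + os.environ["MINIMAX_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# Usage from the Termux shell (requires MINIMAX_API_KEY to be exported):
#   python -c "from ask import ask; print(ask('Explain MoE in two sentences'))"
```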

iPhone / iPad Installation

iOS users have excellent options for interacting with MiniMax M2 through API-connected apps and browser-based interfaces. The MiniMax API's OpenAI compatibility means dozens of existing iOS tools work out of the box:

Enchanted (Recommended)

Enchanted is a free, open-source iOS app originally designed for Ollama but now supporting any OpenAI-compatible API. Download from the App Store, go to Settings → Custom API Server, enter the MiniMax API base URL and your API key. The app provides a clean chat interface with conversation history and supports the 40K context window for long documents.

Web-Based Access (Safari / Chrome)

MiniMax provides a chat web interface at chat.minimaxi.com that works excellently on iOS devices through Safari or Chrome. After signing in with your account, you get access to the full MiniMax M2 model with the web interface optimized for mobile screens. This requires no app installation and always uses the latest model version.

Swift / Shortcuts Integration

For iOS developers, the MiniMax API works with URLSession in Swift and Apple Shortcuts. You can build custom Siri shortcuts that send text to MiniMax M2 and read back the response. This enables hands-free voice interaction with the model through the Siri interface, with full 40K context for complex tasks. Set up the shortcut with an HTTP POST action to the MiniMax API endpoint with your Authorization header.
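In the Shortcut, point a "Get Contents of URL" action at https://api.minimaxi.chat/v1/chat/completions, set the method to POST, and add the headers Authorization: Bearer YOUR_MINIMAX_API_KEY and Content-Type: application/json. The request body is the same JSON used throughout this guide; the Siri-dictation placeholder and max_tokens value below are illustrative:

```json
{
  "model": "MiniMax-Text-01",
  "messages": [
    {"role": "user", "content": "Dictated text from Siri goes here"}
  ],
  "max_tokens": 1024
}
```

Have the Shortcut read choices[0].message.content from the JSON response back to you via the Speak Text action.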

API Integration and OpenAI Compatibility

MiniMax M2's API is fully compatible with the OpenAI client SDK, making migration from GPT-4 or other OpenAI-compatible models trivial. Here's a comprehensive integration example showing streaming, function calling, and long-context usage:

# Full Python integration example with streaming
from openai import OpenAI

client = OpenAI(api_key="MINIMAX_KEY", base_url="https://api.minimaxi.chat/v1")

# Streaming response for long outputs
stream = client.chat.completions.create(
    model="MiniMax-Text-01",
    messages=[{"role": "user", "content": "Analyze this 10,000 word document..."}],
    stream=True,
    max_tokens=4096,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

For JavaScript/Node.js applications, the same openai package works identically — just update the baseURL. For web frontend applications, you can also call the API directly via fetch() using CORS-enabled endpoints that MiniMax provides for browser-based deployments. Always store your API key server-side and never expose it in frontend code.

MiniMax M2 Performance Benchmarks

MiniMax M2 delivers impressive benchmark results that validate its position as a top-tier open-source model in 2026. Despite having only 45.9B active parameters, it achieves performance comparable to much larger dense models:

  • MiniMax M2: 83%
  • DeepSeek-R1: 87%
  • Qwen 3.5-235B: 85%
  • Llama 4 Maverick: 80%

Benchmark scores on MMLU-Pro (knowledge and reasoning). MiniMax M2 performs especially well on long-context reasoning tasks thanks to its efficient attention mechanism and 40K token window.

🌟 Where MiniMax M2 Excels

  • Long Document Analysis: Full 40K context for entire legal contracts, research papers, or codebases
  • Bilingual Tasks: Excellent Chinese-English performance, ideal for global content workflows
  • Instruction Following: Precise and reliable response to complex multi-step instructions
  • Code Generation: Strong performance on multi-file code tasks that benefit from long context

Troubleshooting Common Issues

Problem: API connection times out or returns errors

Fix: The MiniMax API endpoints are hosted in Asia and can experience higher latency from some regions. Enable VPN07 before making API calls — VPN07's 1000Mbps bandwidth and 70+ country server network provides optimized routing to MiniMax's API infrastructure. Users in North America and Europe consistently see 2–3× faster response times with VPN07 compared to direct connections.

Problem: vLLM crashes during model loading

Fix: MiniMax M2 requires at least 160GB of GPU memory for Q4 quantization. If you're seeing OOM errors, either use a more aggressive quantization (such as Q3_K_S) or add more GPUs to your tensor parallel configuration. Also ensure you're using vLLM 0.5.0+, which has specific optimizations for MoE model loading. Set --max-model-len 20480 to reduce memory requirements at the cost of shorter context.

Problem: Slow download speeds from HuggingFace

Fix: MiniMax M1-40k's model weights are approximately 200GB in Q4 quantization. With VPN07's 1000Mbps bandwidth, downloading all shards takes about 30–40 minutes. Without VPN, users in restricted regions often see speeds below 1Mbps, making the download impractical. Always use VPN07 when downloading large model files from HuggingFace's CDN.

Problem: API rate limits on free tier

Fix: The MiniMax free tier has generous but finite rate limits. For production applications, upgrade to a paid API plan. For development testing, implement exponential backoff in your API calls: wait 1s, then 2s, then 4s between retries when you receive 429 errors. The SDK includes built-in retry logic if you configure it properly.
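That backoff pattern can be sketched in plain Python as follows. `with_backoff` is a hypothetical helper name; in production you would catch the SDK's specific rate-limit exception rather than a bare Exception:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff (1s, 2s, 4s, ...) plus jitter.

    `call` should raise when the API returns a 429; the final failure is
    re-raised so the caller still sees persistent errors.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.25))

# Example: a call that fails twice (simulated 429s) before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # -> ok
```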

Best Use Cases for MiniMax M2

📄 Enterprise Document Processing

MiniMax M2's 40K token context window enables processing of entire legal documents, financial reports, and technical specifications in a single pass. Law firms use it to extract key clauses from contracts, compare document versions, and generate summaries. The model's strong instruction-following means it reliably returns structured JSON output that integrates with existing document management systems.
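One way to sketch such a structured-output request: pin the schema down in the system prompt so the reply can be parsed directly. The schema, clause names, and `clause_extraction_messages` helper are illustrative, not a fixed MiniMax API feature; the model is simply instructed to emit JSON:

```python
import json

def clause_extraction_messages(contract_text, clauses=("termination", "liability")):
    """Build chat messages that ask the model for structured JSON output."""
    schema = {name: "quoted text or null" for name in clauses}
    system = (
        "Extract the requested clauses from the contract. "
        "Respond with ONLY a JSON object matching this schema: "
        + json.dumps(schema)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": contract_text},
    ]

# Pass the result as `messages=` to client.chat.completions.create(),
# then feed the reply straight into json.loads().
```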

💻 Large Codebase Analysis

Developers use MiniMax M2 to analyze large codebases — the 40K context can hold an entire Python project or multiple TypeScript files simultaneously. Ask it to trace data flow across files, identify potential security vulnerabilities, explain architectural decisions, or generate comprehensive test suites for complex modules. Unlike smaller models that lose context mid-task, MiniMax M2 maintains full awareness of your entire codebase throughout the analysis.

🌐 Bilingual Content Workflows

MiniMax's research team has produced a model with exceptional bilingual capabilities. Content teams use MiniMax M2 for high-quality Chinese-English translation, content localization, and creating bilingual documentation. The model understands cultural nuances and idiomatic expressions in both languages, producing translations that read naturally rather than as machine-translated text.

Frequently Asked Questions

Q: Can MiniMax M2 run on a regular gaming PC?

Not in full precision — the 456B model requires GPU clusters. However, you can use the MiniMax API from any device including a gaming PC with minimal latency. For true local inference on consumer hardware, consider smaller models like Qwen 3.5-32B or DeepSeek-R1-32B that run well on a single RTX 4090.

Q: Is MiniMax M2 free to use?

MiniMax offers a free tier with API credits for new accounts — enough for extensive testing and small applications. The model weights are freely downloadable under Apache 2.0. For production workloads, a paid API plan provides higher rate limits and priority access. Self-hosting is entirely free but requires the GPU hardware investment described above.

Q: How does MiniMax M2 compare to GPT-4o?

MiniMax M2 is competitive with GPT-4o on most benchmarks, particularly excelling at long-context tasks where its 40K token native window beats GPT-4o's practical performance. MiniMax M2 also has a cost advantage for high-volume applications. The main difference is that GPT-4o has more extensive multimodal capabilities (audio, real-time vision) while MiniMax M2 focuses on text-first excellence with bilingual optimization.


VPN07 — Unlock MiniMax M2 at Full Speed

1000Mbps · 70+ Countries · Trusted Since 2015

Accessing MiniMax M2's API from outside Asia can be slow and unreliable without a proper VPN. VPN07's 1000Mbps bandwidth and optimized routing to Asian data centers means your API calls are fast and consistent. Downloading model weights from HuggingFace? VPN07 routes your traffic at full 1000Mbps speed — a 200GB download takes minutes, not hours. Trusted by developers in 70+ countries for over 10 years. $1.5/month with a 30-day money-back guarantee.

