VPN07

OpenClaw Fallback Model Not Switching When Claude Goes Down

March 10, 2026 17 min read Outage Fix OpenClaw Failover

The Scenario: Anthropic had an outage on March 2, 2026. OpenClaw users who thought they were protected by backup providers (Gemini, OpenAI) discovered that their agents went completely offline for hours — even though those backup providers were fully operational. The fallback system retried the same broken Anthropic endpoint instead of switching. This guide explains why OpenClaw's failover often doesn't work and how to configure it correctly.

When Anthropic's API went down, the OpenClaw community learned a painful lesson: having backup providers configured is not the same as having working failover. Many users had Gemini and OpenAI listed in their config, but during the outage their agents stayed silent. The failover system was making the same wrong choice repeatedly, hitting the dead Anthropic endpoint instead of escalating to the healthy backups.

This is GitHub Issue #32533 — "Fallback does not escalate to different provider on overload errors." The issue was filed after the outage and quickly accumulated over 200 comments from affected users. The root problem isn't that OpenClaw can't do failover — it's that the default failover logic is flawed in several specific ways that cause it to loop within a single provider rather than escalating to a different one.

How OpenClaw's Fallback Is Supposed to Work

In theory, OpenClaw supports multiple model providers and should automatically switch between them when one fails. The intended flow looks like this:

1
Primary model (e.g., anthropic/claude-opus-4-6) receives request
2
If primary returns 503/429/overload error, trigger fallback
3
Switch to first fallback provider (e.g., openai/gpt-5)
4
If that also fails, escalate to second fallback (e.g., google/gemini-2-pro)
5
Agent continues running with backup provider, notifies user of the switch

Why It Actually Fails: The 3 Bugs

Bug 1: Provider-Level vs. Profile-Level Fallback

The fallback system treats different authentication profiles of the same provider as separate fallback targets. So if you have two Anthropic API keys configured — your personal key and a team key — OpenClaw will try both before giving up on Anthropic. During a full Anthropic outage, both fail, but OpenClaw interprets this as "both fallback options exhausted" and stops trying instead of moving to the next provider (OpenAI, Gemini, etc.).

// What OpenClaw does (wrong): Try anthropic/claude key1 → 503 Try anthropic/claude key2 → 503 "All fallbacks exhausted" → Agent offline ✗ // What it should do (correct): Try anthropic/claude key1 → 503 Escalate to: openai/gpt-5 → 200 ✓

Bug 2: Error Type Discrimination

OpenClaw's fallback only triggers on specific error codes by default: rate limit (429) and overload (529). During the March 2026 Anthropic outage, the API was returning 500 (Internal Server Error) and connection timeout errors — not 429 or 529. The fallback logic didn't recognize these as "trigger fallback" conditions, so it simply retried the same endpoint repeatedly without escalating.

Bug 3: Missing Fallback Configuration

Many users assume that listing multiple providers in the config automatically enables smart failover. It doesn't. The fallback order must be explicitly configured with the fallback key. If you haven't set this, OpenClaw uses the first provider in the list and never tries others, regardless of what errors occur.

Fix: Correct Fallback Configuration

Here's the complete working configuration for multi-provider failover. This configuration addresses all three bugs above:

Working Fallback Config (~/.config/openclaw/openclaw.json5)

{ "models": { "providers": { "anthropic": { "apiKey": "$ANTHROPIC_API_KEY", "timeout": 30000 }, "openai": { "apiKey": "$OPENAI_API_KEY", "timeout": 30000 }, "google": { "apiKey": "$GOOGLE_API_KEY", "timeout": 30000 } }, "defaults": { "provider": "anthropic", "model": "claude-opus-4-20260301" }, // ✅ Key: explicit fallback chain by PROVIDER (not profile) "fallback": { "enabled": true, "chain": [ "anthropic/claude-opus-4-20260301", "openai/gpt-5", "google/gemini-2-pro-latest" ], // ✅ Key: expand error trigger conditions "triggerOn": [429, 500, 502, 503, 529, "timeout", "ECONNRESET"], "maxRetries": 1, // Only retry once per provider "retryDelay": 5000, // 5 seconds between attempts "escalateOnRetryFail": true // Move to next provider after 1 failure } } }

Apply and Verify

# Step 1: Edit your config nano ~/.config/openclaw/openclaw.json5 # Step 2: Validate the config syntax openclaw config validate # Step 3: Restart gateway openclaw gateway restart # Step 4: Verify fallback chain is loaded openclaw models status # Should show all 3 providers listed with their status # Step 5: Test fallback manually (optional) # Temporarily set an invalid API key for anthropic # Then send a test message — should switch to openai automatically openclaw test --model fallback

Getting Notified When Failover Happens

One of the frustrating aspects of the March 2026 outage was that users didn't know their agent was offline until hours later. Configure OpenClaw to send you a notification whenever it falls back to a different provider:

Failover Notification Setup

# Add to openclaw.json5 { "models": { "fallback": { "notifyOnSwitch": true, "notifyChannel": "your-telegram-chat-id", "notifyMessage": "⚠️ Provider switched: {from} → {to}. Reason: {error}" } } } # Add to HEARTBEAT.md for manual monitoring: ## Provider Health Check Every 5 minutes, silently verify the primary model is responding. If it fails to respond within 10 seconds, notify me with: "⚠️ Primary model {model} appears unresponsive. Switched to {fallback}."

OpenRouter as a Drop-In Failover Solution

If configuring multi-provider failover in OpenClaw feels too complex, there's a simpler approach: use OpenRouter as your single model provider. OpenRouter is a routing service that gives you access to dozens of models through one API key and handles failover automatically at the routing layer. When Claude is down, OpenRouter can transparently route to GPT-5 or Gemini without OpenClaw needing to know.

OpenRouter Configuration

# Get your OpenRouter API key from openrouter.ai export OPENROUTER_API_KEY="sk-or-..." # openclaw.json5 config { "models": { "providers": { "openrouter": { "baseUrl": "https://openrouter.ai/api/v1", "apiKey": "$OPENROUTER_API_KEY", "api": "openai-chat" } }, "defaults": { "provider": "openrouter", // Use auto-routing: OpenRouter picks the best available model "model": "anthropic/claude-opus-4" } } } # OpenRouter handles failover at their end: # claude-opus-4 down → automatically uses claude-sonnet-4 or gpt-5

Trade-off: OpenRouter adds slight latency (~100ms) and has its own pricing markup. But for 24/7 reliability without complex configuration, it's worth it for most users.

Provider Reliability Comparison (2026)

Anthropic (Claude)

99.2% uptime

Best model quality, but had a notable 2-hour outage on March 2, 2026. Strong for most use cases.

OpenAI (GPT-5)

99.5% uptime

Excellent reliability. Strong tool-calling support. Good primary or secondary choice.

Google (Gemini 2 Pro)

99.7% uptime

Highest recorded uptime in 2026. Excellent as a tertiary fallback. Large context window.

The Network Factor in Failover Failures

There's a common scenario where OpenClaw appears to have a provider failover problem, but the real issue is network-level blocking or throttling. In some regions, ISPs throttle or block connections to specific AI API domains — particularly Anthropic's API. This looks like an Anthropic outage from OpenClaw's perspective, but other providers (OpenAI, Google) may also be blocked or throttled at the network level. Result: failover appears to work (it switches to OpenAI) but OpenAI also times out, because all AI API connections are being throttled.

The solution is to route all OpenClaw traffic through a reliable VPN. VPN07 provides 1000Mbps bandwidth across 70+ global server locations, ensuring that your connections to all AI model providers are routed through clean, unrestricted paths. This means failover actually works as intended — when Anthropic is down, OpenClaw can reach OpenAI or Gemini without network-level interference.

VPN + OpenClaw Failover: Best Practice

Route via VPN server near API
Choose a VPN server in the same region as the AI API datacenter for lowest latency.
Avoid ISP throttling
VPN encryption prevents ISPs from identifying and throttling AI API traffic.
Enable split tunneling
Route only AI API domains through VPN — keeps other traffic on direct connection.
Test all providers via VPN
Confirm Anthropic, OpenAI, and Gemini are all reachable via your VPN before relying on failover.

Manual Failover: Switching Models via CLI

If automatic failover isn't working and you need your agent operational immediately during an outage, you can manually switch to a working provider via the command line. This doesn't fix the underlying configuration, but it gets your agent running in under two minutes.

Emergency Manual Failover

# Check which providers are available openclaw models status # Switch to OpenAI immediately openclaw models set openai/gpt-5 # Or switch via the chat interface /model openai/gpt-5 # Or with a shorter alias (if configured) /model gpt5 # Verify the switch worked openclaw models status # Should show openai/gpt-5 as active # Switch back to Claude when Anthropic recovers /model anthropic/claude-opus-4-20260301

Manual switching is instant and doesn't require a gateway restart. Your session context is preserved through the switch.

Lessons from the March 2026 Anthropic Outage

The March 2, 2026 Anthropic outage lasted approximately two hours and affected all Claude models. For the OpenClaw community, it was a stark lesson in the importance of properly configured failover. Here are the patterns that emerged from post-outage community discussion:

The number of users who reported being unaware their agent was down until hours later was striking. Many had built automation workflows that were supposed to run overnight — morning briefings, scheduled research tasks, daily reports — and discovered in the morning that nothing had been generated. The silent failure mode (agent offline with no notification) was as damaging as the outage itself.

What failed most commonly

Users with only Anthropic profiles in their config (no OpenAI or Gemini keys at all). Users who had multi-provider configs but hadn't set an explicit fallback chain. Users whose backup provider API keys had expired or been rotated without updating the config.

What stayed online

Users with properly configured fallback chains from different providers. Users using OpenRouter, which handled the failover at the routing layer. Users who happened to be awake and manually switched models within minutes of the outage beginning.

Key takeaway

Having multiple API keys from different providers is table stakes for any serious OpenClaw deployment. The 5 minutes it takes to get an OpenAI or Google API key can save hours of downtime. Treat provider redundancy the same way you treat backup power — you hope you never need it, but you're glad it's there.

Common Mistakes That Break Failover

Even with a well-intentioned failover configuration, several common mistakes prevent it from working. Here are the most frequently seen errors in the OpenClaw community:

Mistake 1: Expired API Keys

The backup OpenAI or Gemini key was added months ago and has since expired, hit a spending limit, or been revoked. OpenClaw tries to fail over, the backup key is rejected, and the agent goes offline anyway. Solution: set up a reminder to verify all backup API keys monthly.

Mistake 2: Incorrect Model Names in Fallback Chain

The fallback chain references an outdated model name that no longer exists in the provider's API. For example, openai/gpt-4 instead of openai/gpt-5. The failover attempt returns a 404 and the chain collapses. Always verify model names against current documentation when setting up fallback chains.

Mistake 3: No Gateway Restart After Config Change

You updated the fallback configuration in openclaw.json5 but didn't restart the gateway. OpenClaw reads the config at startup — changes don't take effect until the gateway restarts. Always run openclaw gateway restart after any configuration change.

Mistake 4: Wrong Error Codes in triggerOn

The default triggerOn list only includes 429 and 529. During a real outage, providers return 500, 503, or connection timeouts. Without expanding this list, failover never triggers for the most common outage error types.

Failover Setup Checklist

Configure explicit fallback.chain with providers from different companies (not just multiple Anthropic profiles)
Add 500, 502, 503, timeout, and ECONNRESET to triggerOn error list
Set maxRetries: 1 and escalateOnRetryFail: true for fast failover
Enable notifyOnSwitch so you know when failover happens
Have valid API keys for at least 2 different providers (Anthropic + OpenAI at minimum)
Consider OpenRouter for simplified failover management
Use a VPN to ensure all provider APIs are reachable from your network
Test your failover config manually before relying on it for production tasks

VPN07 — Keep All AI APIs Reachable

Failover only works if all providers are accessible — VPN07 ensures they are

When your primary model goes down, your failover chain is only as good as your network. VPN07 provides 1000Mbps bandwidth across 70+ countries, routing your OpenClaw traffic through unrestricted paths to Anthropic, OpenAI, and Google APIs. Trusted since 2015, with a 30-day money-back guarantee.

$1.5/mo
Starting price
1000Mbps
Max bandwidth
70+
Countries
10 yrs
Operational

Related Articles

$1.5/mo · 10 Years Trusted
Try VPN07 Free