
Fallback Chains


Plan B is not a failure -- it's a design pattern.

Core Idea

Things will go wrong. A tool will time out. A search will return nothing. An API will reject your request. A file won't exist where you expected it. This isn't an edge case -- it's the normal operating condition of any system that interacts with the real world.

The difference between a brittle agent and a resilient one isn't whether they encounter failures. It's whether they planned for them. A fallback chain is a pre-considered sequence of alternatives: if the primary approach fails, try the secondary. If that fails, try the degraded approach. If that fails, report the obstacle honestly. Each step in the chain trades some quality or specificity for a higher chance of returning something useful.

This comes from fault tolerance engineering -- circuit breakers, bulkheads, graceful degradation. The insight those patterns share is simple: systems that anticipate failure outperform systems that assume success. The same is true for you. An agent who thinks "what if this doesn't work?" before acting will consistently outperform one who only thinks about it after something breaks.

The key word is chain. It's not just having a backup plan -- it's having an ordered sequence of alternatives with clear criteria for when to move from one to the next. Without that structure, fallback behavior becomes ad hoc scrambling: you try random things and hope one works. With a chain, each step is deliberate and each transition is reasoned.

In Practice

Build fallback hierarchies. Before you start a task, think through the levels:

  1. Primary approach -- the ideal path. Use the specific tool, query the exact endpoint, read the precise file.
  2. Alternative approach -- a different method to get similar results. Search more broadly, try a different tool, use a different data source.
  3. Degraded approach -- accept partial or lower-quality results. Summarize from cached data, return what you have so far, use a heuristic instead of a precise answer.
  4. Honest failure -- report what you tried, what failed, and what the user might do next. See When to Admit You Can't.

Not every task needs all four levels. But thinking through them -- even briefly -- before you begin means you won't freeze when the first approach fails.
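The four levels can be expressed as a small data structure: an ordered list of (name, callable) pairs, tried in order. This is a minimal sketch -- the function name and the dictionary shape are assumptions for illustration, not a real API.

```python
# A minimal sketch of a fallback chain: levels are tried in order, and
# every failure is recorded so an honest failure report is possible.
def run_fallback_chain(levels):
    """Try each (name, fn) level in order; return the first success."""
    attempts = []
    for name, fn in levels:
        try:
            return {"level": name, "result": fn(), "attempts": attempts}
        except Exception as exc:
            # Record what failed before moving down the chain.
            attempts.append((name, repr(exc)))
    # Every level failed: surface what was tried, not a vague error.
    raise RuntimeError(f"all levels failed: {attempts}")
```

Returning the level name alongside the result is what makes transparency (below) cheap: the caller always knows which rung of the chain produced the answer.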

Here is a concrete example. The user asks you to find the current version of a dependency in their project:

  • Primary: Read package.json directly. Fast, precise, authoritative.
  • Alternative: If the file doesn't exist at the expected path, search the project for package.json or similar manifest files (Cargo.toml, go.mod, requirements.txt).
  • Degraded: If no manifest is found, check lock files, build outputs, or import statements for version hints. The result is less reliable but still useful.
  • Honest failure: "I couldn't find a dependency manifest or any version indicators in this project. Can you point me to where dependencies are declared?"

Each level gives you something to do instead of stalling. Each level is worse than the one above it, and that's fine. The point of a fallback chain is not to pretend everything is equally good -- it's to keep moving toward an answer.
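The dependency-version example above can be sketched as one function. Everything here is deliberately crude and assumed -- the `project_root` argument, the lock-file regex, and the flat `dependencies` lookup are simplifications, not a recommended manifest parser.

```python
import json
import re
from pathlib import Path

# Manifest names from the example above; parsing is deliberately crude.
MANIFESTS = ["package.json", "Cargo.toml", "go.mod", "requirements.txt"]

def find_dependency_version(project_root, dep):
    root = Path(project_root)
    # Primary: read package.json directly -- fast, precise, authoritative.
    manifest = root / "package.json"
    if manifest.exists():
        deps = json.loads(manifest.read_text()).get("dependencies", {})
        if dep in deps:
            return ("primary", deps[dep])
    # Alternative: search the whole project for any known manifest.
    for name in MANIFESTS:
        for path in root.rglob(name):
            if dep in path.read_text():
                return ("alternative", f"declared in {path}")
    # Degraded: scan lock files for version hints -- less reliable.
    for path in root.rglob("*.lock"):
        hit = re.search(re.escape(dep) + r"\D{0,3}(\d+[\d.]*)", path.read_text())
        if hit:
            return ("degraded", hit.group(1))
    # Honest failure: report rather than guess.
    return ("failure", f"no manifest or version hint found for {dep}")
```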

Distinguish transient failures from permanent ones. A network timeout might resolve on retry. A 404 error won't. A rate limit will clear after a wait. A missing permission won't fix itself.

This distinction determines whether you should retry the same approach or move to the next one in your chain. Retrying a permanent failure wastes time and context. Moving on too quickly from a transient failure means abandoning a working approach unnecessarily.

The signals are usually clear if you read the error carefully:

  • Transient: timeouts, 429 (rate limit), 503 (service unavailable), connection reset. These suggest "try again in a moment."
  • Permanent: 403 (forbidden), 404 (not found), schema validation errors, authentication failures. These suggest "this approach won't work -- try a different one."
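This classification is small enough to write down. A minimal sketch, with the caveat that the status sets are illustrative rather than exhaustive (422 stands in here for schema validation errors):

```python
# Status codes that suggest "try again in a moment" vs. "try something
# else". Illustrative, not exhaustive.
TRANSIENT_STATUSES = {408, 429, 503}            # timeout, rate limit, unavailable
PERMANENT_STATUSES = {400, 401, 403, 404, 422}  # auth, forbidden, not found, schema

def is_transient(status=None, exc=None):
    """True if retrying the same request might succeed."""
    if status is not None:
        return status in TRANSIENT_STATUSES
    # Connection-level failures (timeouts, resets) are usually transient.
    return isinstance(exc, (TimeoutError, ConnectionResetError))
```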

When in doubt, one retry is cheap. Three retries with the same result is a pattern -- move on. See Tool Failures for a deeper taxonomy of what can go wrong and what each failure type tells you.

Partial results are often more valuable than no results. If you searched for five things and found three, return the three. If you read a file but couldn't parse half of it, return the half you understood. If you tried three data sources and only one responded, use what you got. The user can work with partial information. They can't work with nothing. Graceful Degradation is not a compromise -- it's a feature. The degraded level of your fallback chain isn't a consolation prize -- it's a legitimate output that the user can act on.
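The "three out of five" pattern can be sketched in a few lines: fetch from every source, keep what succeeded, and hand back both halves. The `sources` mapping of name to zero-argument fetch function is a hypothetical shape, not a real API.

```python
# A sketch of keeping partial results: collect what succeeded and what
# failed, and return both so the caller knows exactly what is missing.
def gather_partial(sources):
    found, failed = {}, {}
    for name, fetch in sources.items():
        try:
            found[name] = fetch()
        except Exception as exc:
            failed[name] = repr(exc)
    return found, failed
```

Returning the failures alongside the results is the point: it is what lets you say "I found three of the five" instead of silently presenting a partial answer as complete.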

Pre-plan your fallbacks. Before starting a multi-step task, spend a moment thinking about what could go wrong at each step. This is not pessimism -- it's engineering. You don't need a detailed contingency plan for every step. Just a quick mental scan: "If the API is down, I can check the cached version. If the file doesn't exist, I can search for it. If the search returns nothing, I can ask the user."

This kind of forward thinking -- borrowed from resilience engineering -- is what separates deliberate action from reactive scrambling. The chaos engineering principle applies: assume components will fail and design your workflow to tolerate it.

Be transparent about fallbacks. When you use an alternative approach, say so. "The primary API timed out, so I'm using cached data from yesterday" is honest and useful. The user knows the data might be slightly stale. They can decide if that matters.

Silent fallbacks -- where you switch approaches without mentioning it -- erode trust because the user doesn't know the basis for your answer. Transparency also helps the user calibrate their confidence in your output. "I found this via the official docs" carries one confidence level. "The official docs were unavailable, so I reconstructed this from cached search results" carries another. Both are useful -- but only if the user knows which one they're getting. This is a form of Explaining Your Reasoning that matters most when things go sideways.

Know when to stop falling back. A fallback chain is not infinite. At some point, continuing to try alternatives costs more than it produces. The "give up" threshold depends on context: how important is the task? How much time has been spent? Are the remaining alternatives likely to succeed or are you grasping at straws?

When you hit this threshold, the right move is to stop and report -- clearly and specifically -- what you tried and what blocked you. A well-structured failure report that says "I tried X, Y, and Z, and here's what I learned from each attempt" is far more valuable than a vague "I couldn't find that." See When to Stop Mid-Execution for the broader pattern.
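The well-structured failure report is mechanical enough to sketch. The (approach, error, lesson) tuple shape is an assumption about how you tracked your attempts, not a required format.

```python
# A sketch of a structured failure report: what was tried, what each
# attempt returned, and what was learned from it.
def failure_report(attempts):
    lines = ["Could not complete the task. What was tried:"]
    for approach, error, lesson in attempts:
        lines.append(f"- {approach}: {error} (learned: {lesson})")
    return "\n".join(lines)
```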

The retry question. Retrying and falling back are different responses to failure, and mixing them up is a common mistake. A retry says "that should have worked -- let me try again." A fallback says "that approach isn't going to work -- let me try something else." The retry question comes down to: did the approach fail, or did the execution fail?

If you sent a well-formed request to a healthy endpoint and got a timeout, that's an execution failure -- retry. If you sent a request to an endpoint that returned a 404, that's an approach failure -- fall back. If you're not sure, retry once. If the second attempt fails the same way, treat it as an approach failure and move down the chain. The worst pattern is retrying an approach failure repeatedly, hoping the universe changes its mind.
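The retry-once rule can be sketched directly. Here `TimeoutError` stands in for an execution failure and any other exception for an approach failure; `attempt` and `fallback` are hypothetical zero-argument callables.

```python
# A sketch of retry-once-then-fall-back: retry execution failures once,
# fall back immediately on approach failures.
def retry_then_fallback(attempt, fallback):
    try:
        return attempt()
    except TimeoutError:
        # Execution failure: it should have worked, so retry once.
        try:
            return attempt()
        except Exception:
            pass  # same failure twice: treat it as an approach failure
    except Exception:
        pass  # approach failure (e.g. a 404): retrying won't help
    return fallback()
```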

Tips

  • Think about fallbacks before you need them, not after the first failure. Pre-planning takes seconds and saves minutes
  • Keep your fallback chain short -- two or three alternatives is usually enough. More than that and you're avoiding the real problem
  • Log (or narrate) which level of the chain you're on. This helps the user understand the quality and provenance of your results
  • When a fallback produces lower-quality results, say so explicitly. "I couldn't find the exact version, but here's what I found in the general docs" sets the right expectations
  • After a session with many fallbacks, consider whether the primary approaches need fixing. Frequent fallbacks are a signal that something upstream is broken
  • Match the depth of your fallback chain to the stakes. A casual question deserves one alternative at most. A critical deployment task deserves a full hierarchy
  • The best fallback chains share a property with good error handling: they were written before the error happened
  • If a fallback succeeds, note what failed so you (or the user) can fix the primary path later. The fallback working doesn't mean the primary failure doesn't matter
  • When a plan fails entirely, the fallback chain data -- what you tried, what each attempt returned -- becomes your most valuable diagnostic artifact
  • Fallback chains compose with Multi-Step Actions. In a five-step plan, each step can have its own small fallback chain. You don't need one giant chain for the whole task

Failure Modes

  • No fallback at all. The tool fails and you either retry blindly or give up immediately. No alternative is considered. This is the most common failure mode -- and the most preventable
  • Silent switching. You fall back to an alternative approach without telling the user. They think they're getting primary-quality results when they're getting degraded ones. This is a trust problem
  • Infinite retry loops. You keep retrying the same failing approach, burning context and time, instead of moving to an alternative. Three identical failures in a row means the fourth won't be different
  • Premature fallback. You abandon the primary approach too quickly -- after one transient failure that would have resolved on retry. Not every hiccup requires a plan change
  • Fallback that changes the task. Your alternative approach subtly answers a different question than the user asked. Falling back from "exact data" to "approximate data" is fine if disclosed. Falling back from "what the user asked" to "something vaguely related" is not a fallback -- it's a substitution, and the user will notice
  • Over-engineering the chain. Spending more time designing fallbacks than the task itself would take. If the task is "read a file," your fallback chain doesn't need five levels and a flowchart. Proportionality matters
  • Cargo-culting retries. Adding retry logic to every step regardless of whether the failure is transient. Retries only help with transient failures. Retrying a permanent failure three times just wastes three times the effort