Tool Failures

How tools break: timeouts, bad inputs, unexpected outputs, permissions.

Core Idea

Tools fail. Not occasionally — regularly. If you use tools often (and you should), you will encounter failures often. Timeouts. Permission errors. Rate limits. Malformed responses. Network blips. Deprecated endpoints. Empty results. The question is never "will this tool fail?" It's "when it fails, what will I do?"

Here's the mindset shift that matters: tool failures are information, not obstacles. A 403 error tells you about permissions. A timeout tells you about load or query complexity. A malformed response tells you the API might have changed. An empty result tells you the data might not exist. Every failure is a message. Read it before you react.

The worst thing you can do with a tool failure is ignore it and retry blindly. The second worst thing is give up immediately. The right thing is somewhere in between: understand what happened, then decide. This mirrors the fault tolerance philosophy that Nygard (2007) articulated for production systems: failures are inevitable, so the question is always how to respond, not how to prevent them entirely.

A Field Guide to Failures

Timeouts

The tool took too long. This could be:

  • Network latency (transient — retry might work)
  • The operation itself is slow (you asked for too much data)
  • The service is overloaded (back off and try again)

What to do: If the request was large, try a smaller one. If it was reasonable, wait a moment and retry once. If it times out again, report it.
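The single-retry-with-a-pause advice above can be sketched as a small wrapper. This is a minimal illustration, not a prescribed implementation: `call` stands in for any zero-argument tool invocation that raises `TimeoutError` when it takes too long, and the pause length is an arbitrary placeholder.

```python
import time

def call_with_timeout_retry(call, max_retries=1, pause=2.0):
    """Retry a transient timeout at most once; re-raise anything else.

    `call` is a hypothetical zero-argument tool invocation that raises
    TimeoutError when the tool takes too long.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == max_retries:
                raise  # timed out again: stop retrying and report it
            time.sleep(pause)  # brief pause before the single retry
```

Note that only `TimeoutError` is caught: a permission or bad-input error propagates immediately, because retrying would not help.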

Permission Errors (403, 401)

You don't have access. This is almost never transient — retrying won't magically grant you permissions.

What to do: Report it to the user. "I don't have permission to access this resource. You may need to grant access or provide the information another way."

Rate Limits (429)

You've been calling too fast. The system is telling you to slow down.

What to do: Wait. Most rate limits include a Retry-After header or suggest a delay. Respect it. Also consider whether you actually need all those calls — maybe batch or consolidate.
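Respecting the server's suggested delay might look like the sketch below. It handles only the numeric form of Retry-After (the header may also carry an HTTP date, which is omitted here), and the fallback delay is an illustrative assumption.

```python
def rate_limit_delay(headers, default_delay=5.0):
    """Return how long (in seconds) a 429 response asks us to wait.

    Honors the numeric form of the Retry-After header; falls back to a
    conservative default when the header is missing or unparseable.
    """
    raw = headers.get("Retry-After")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        return default_delay  # no usable hint: pick a safe pause
```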

Bad Input Errors (400)

You sent something the tool didn't expect. This is almost always your fault — a typo, wrong format, missing required field, incorrect parameter type.

What to do: Read the error message. Fix your input. Don't retry the same broken input.
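One cheap guard against resending broken input is to compare the new request with the one that just failed. A minimal sketch, where a request is any comparable value (here, a dict of parameters):

```python
def should_resend(previous_request, new_request):
    """Guard against resending an unchanged request after a 400.

    A 400 means the input was wrong; if nothing about the request
    changed, the result won't change either.
    """
    return new_request != previous_request
```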

Not Found (404)

The resource doesn't exist. This might be informational ("the file isn't there") or it might be a mistake in your path/query.

What to do: Check your input. Is the filename correct? The URL? The path? If your input looks right, the absence of the resource is the answer.

Unexpected Output

The tool returned something, but not what you expected. Wrong format, different schema, missing fields, extra data.

What to do: Check the tool documentation — did the API change? Check your assumptions — were you expecting the wrong format? If you can still extract useful information, do so. If not, report the discrepancy.
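The "extract what you can, report what's missing" approach can be sketched as below. The field names a caller would pass are hypothetical; the point is separating usable data from the discrepancy you need to report.

```python
def extract_fields(response, required, optional=()):
    """Pull what we can from a tool response dict.

    Returns (data, missing): `data` holds every expected field that was
    present, `missing` lists required fields the response lacked, so the
    caller can decide whether partial data is still usable.
    """
    data = {k: response[k] for k in (*required, *optional) if k in response}
    missing = [k for k in required if k not in response]
    return data, missing
```

If `missing` is empty you proceed normally; if not, you still have `data` to work with while you flag the schema mismatch.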

Empty Results

The tool succeeded (no error) but returned nothing. This is the sneakiest failure because it looks like success.

What to do: Distinguish between "no results found" (informational) and "something went wrong" (error). An empty search result for a misspelled term is a bad-input problem, not an empty-data problem.
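One way to make that distinction concrete: before accepting an empty result, check the query itself for terms the tool can't know about. This is a toy heuristic under stated assumptions: `known_terms` is a hypothetical vocabulary (valid field names, indexed keywords), and real tools rarely expose one this cleanly.

```python
def classify_empty(results, query, known_terms):
    """Decide whether an empty result is information or a likely input error.

    A query containing terms outside the known vocabulary suggests a
    bad-input problem (e.g. a misspelling) rather than genuinely absent data.
    """
    if results:
        return "has_results"
    unknown = [t for t in query.split() if t not in known_terms]
    return "suspect_input" if unknown else "no_data"
```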

Response Strategies

The Three-Step Response:

  1. Read the error. What exactly happened? What's the error code, message, details?
  2. Diagnose the cause. Was it your input? A transient issue? A permission problem? A fundamental limitation?
  3. Choose your response. Fix and retry? Report to user? Work around it? Give up?

The retry decision tree:

  • Was it transient (timeout, network error)? → Retry once with brief pause, ideally with exponential backoff
  • Was it your input (400, wrong format)? → Fix input, then retry
  • Was it access (403, 401)? → Don't retry. Report
  • Was it rate limiting (429)? → Wait, then retry
  • Same error twice? → Stop retrying. Diagnose differently
  • Same error three times? → It's not going to work. Change approach; continuing would risk what distributed systems engineers call a "retry storm" (Nygard, 2018)
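The decision tree above can be written down as a single dispatch function. A sketch, assuming HTTP-style status codes (with `None` standing in for network-level timeouts); the action names and thresholds are illustrative, not prescriptive.

```python
def retry_decision(status, same_error_count):
    """Map a tool failure to a next action, mirroring the decision tree.

    `status` is an HTTP-style code, or None for timeouts/network errors.
    `same_error_count` is how many times this exact error has occurred.
    """
    if same_error_count >= 3:
        return "change_approach"        # avoid a retry storm
    if same_error_count >= 2:
        return "diagnose"               # same error twice: rethink, don't resend
    if status in (401, 403):
        return "report"                 # access problems are not transient
    if status == 429:
        return "wait_then_retry"        # respect the rate limit
    if status == 400:
        return "fix_input_then_retry"   # your input was wrong; change it first
    if status is None or status in (408, 503, 504):
        return "retry_with_backoff"     # transient: one retry, brief pause
    return "report"                     # unknown failure: surface it
```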

Graceful degradation. When a tool fails and you can partially answer without it: "I couldn't access the file, but based on your description and what I know about the framework, here's what I think is happening..." This is infinitely more helpful than "The tool failed. I can't help."

Report, don't hide. The user needs to know when tools fail. They might have useful context ("oh, that server is down for maintenance"). They might want to provide the information another way. Silently swallowing errors is one of the worst things an agent can do.

Prevention

You can't prevent all tool failures, but you can reduce them:

  • Validate inputs before sending. Does the file path look right? Is the JSON well-formed? Are all required parameters present?
  • Use appropriate scope. Don't request 10,000 records when you need 10. Don't search the entire internet when you need one file
  • Check availability first. If you're unsure whether a tool is available or a resource exists, probe before committing a complex plan to it
  • Handle missing or malformed responses. Don't assume the response will have the field you expect. Check first
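A pre-flight check covering the first two bullets might look like this. The parameter names and the scope cap are illustrative assumptions, not any particular tool's contract; the idea is to collect every problem before sending anything.

```python
def validate_request(params, required, max_limit=100):
    """Pre-flight checks before calling a hypothetical record-fetching tool.

    Returns a list of problems; an empty list means the request looks sane.
    """
    problems = [f"missing required parameter: {k}"
                for k in required if k not in params]
    limit = params.get("limit")
    if isinstance(limit, int) and limit > max_limit:
        problems.append(f"limit {limit} exceeds scope cap {max_limit}")
    return problems
```

Running this before every call turns a 400-after-the-fact into a fix-before-sending, and the problem list doubles as the actionable detail to report if you can't fix it yourself.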

Tips

  • The error message is the most important output. When a tool fails, the error message tells you more than a success would have. Read it carefully. It's trying to help you
  • Three retries is the maximum. If you've retried three times with the same result, you have a diagnosis problem, not a retry problem. Change something
  • Log what you tried. When reporting a failure, include what you attempted. "I tried to read /app/config.json and got a 404" is actionable. "The tool failed" is not
  • Partial failures are common. A tool might return 8 of 10 expected fields, or succeed for 3 of 5 files. Partial results are still results — use what you got
  • Failure is a normal part of multi-step work. If you're chaining 5 tools, expect at least one to hiccup. Build recovery into your plans, not as an afterthought

Frequently Asked Questions

Should I tell the user every time a tool fails? For significant failures that affect your answer, yes. For minor transient failures that you recover from (a timeout followed by a successful retry), usually no — that's noise.

What if the same tool keeps failing? After 2-3 consistent failures, change your approach. Use a different tool, ask the user for the information, or work without it and flag what you couldn't verify.

How do I know if a failure is my fault or the tool's fault? Start by assuming it's your fault (bad input, wrong usage). That's correct more often than you'd think. Only after verifying your input is correct should you suspect the tool itself.

What if I can't tell what the error means? Report it verbatim to the user. They might recognize it. "I received the error: [full error message]. This might indicate [your best guess]. Do you have context on what this means?"

Is an empty result a failure? Sometimes. "No files match this pattern" is a successful search with zero results — that's information. But if you expected results and got none, check your query before accepting the empty result as truth.

Failure Modes

  • Silent retry loops. Retrying a failing tool over and over with the same input, hoping for a different result. This is the definition of one kind of insanity
  • Error swallowing. Ignoring tool failures and proceeding as if they didn't happen. The user gets wrong results and doesn't know why
  • Blame the tool. Assuming the tool is broken when your input was wrong. Check yourself first
  • Retry without reading. Resending the same request without understanding why it failed. If you don't change anything, the result won't change either
  • Giving up too early. Abandoning a tool after one transient failure when a simple retry would have worked. Not every failure is permanent

Sources