Multi-step tool use. Dependencies, error propagation, recovery.
What It Looks Like
Search for the file. Read the file. Find the function. Modify the function. Run the tests. Report the results.
Each step uses a tool. Each step depends on the one before. The output of one becomes the input of the next. This is tool chaining — and it's how you accomplish anything that no single tool can handle alone.
When to Use It
- The task has phases. Discover → analyze → modify → verify. Each phase needs different tools
- You need to act on something you first need to find. You can't edit a function until you know which file it's in
- Multiple sources inform one answer. Read the config, read the code, read the tests, then synthesize
- A single tool gives partial information. Search results tell you where to look; reading tells you what's there
How It Works
Phase 1: Plan the Chain
Before your first tool call, sketch the sequence (this is the "Thought" step in the Reason-Act-Observe loop formalized by Yao et al., 2023):
- What tools will I need, in what order?
- What information does each step need from the previous one?
- Which steps are independent (can run in parallel)?
- Which steps are dependent (must be sequential)?
- Where are the likely failure points?
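One way to make that sketch concrete is to write the plan down as data before executing it. A minimal sketch — the step names and the `dependsOn` field are illustrative, not a real agent API:

```typescript
// A chain plan: each step names the tool it uses and which steps it waits on.
type Step = { id: string; tool: string; dependsOn: string[] };

const plan: Step[] = [
  { id: "search", tool: "Grep", dependsOn: [] },
  { id: "read",   tool: "Read", dependsOn: ["search"] },
  { id: "edit",   tool: "Edit", dependsOn: ["read"] },
  { id: "test",   tool: "Bash", dependsOn: ["edit"] },
];

// Steps whose dependencies are all done can run now — in parallel
// if there is more than one of them.
function runnable(plan: Step[], done: Set<string>): Step[] {
  return plan.filter(
    (s) => !done.has(s.id) && s.dependsOn.every((d) => done.has(d))
  );
}
```

Writing the dependencies out answers the parallel-vs-sequential question mechanically: any two steps with no dependency path between them are safe to run at once.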
Phase 2: Execute with Checkpoints
As you execute each step:
- Invoke the tool
- Validate the output — did it work? Did it return what you expected?
- Extract the relevant data for the next step
- Record what you've done and what you have
The "record" part matters more than you think. In a 6-tool chain, by step 5 you've accumulated a lot of context. If step 5 fails, can you articulate what happened in steps 1-4? If not, you're flying blind.
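A checkpointed executor can be sketched like this — the `invoke` callback stands in for whatever tool-calling mechanism you actually have:

```typescript
type Checkpoint = { step: string; ok: boolean; summary: string };

// Run steps in order, validating and recording each one. If step n fails,
// the log still tells you exactly what steps 1..n-1 accomplished.
function executeChain(
  steps: string[],
  invoke: (step: string) => { ok: boolean; summary: string }
): Checkpoint[] {
  const log: Checkpoint[] = [];
  for (const step of steps) {
    const result = invoke(step);
    log.push({ step, ok: result.ok, summary: result.summary });
    if (!result.ok) break; // stop — don't feed a bad result downstream
  }
  return log;
}
```

The point is the log, not the loop: when something breaks at step 5, the accumulated checkpoints are what let you report where you are instead of flying blind.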
Phase 3: Handle Failures Gracefully
Every link in the chain can break. Reflexion (Shinn et al., 2023) showed that agents that verbally reflect on failures and maintain a memory of what went wrong substantially outperform agents that simply retry. For each failure, ask:
- Can I retry? Maybe a transient error. Try once more
- Can I skip? Maybe this step was nice-to-have, not essential
- Should I roll back? Maybe step 3's failure means step 2's output is also suspect
- Should I stop and report? Maybe continuing with incomplete data will make things worse
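Those four questions amount to a small decision table. A sketch, with illustrative error categories — real failures rarely classify themselves this neatly:

```typescript
type Failure = {
  transient: boolean;          // e.g. a timeout that may succeed on retry
  essential: boolean;          // was this step required for the goal?
  taintsEarlierSteps: boolean; // does it cast doubt on previous outputs?
};
type Action = "retry" | "skip" | "rollback" | "stop-and-report";

function decide(f: Failure, retriesLeft: number): Action {
  if (f.transient && retriesLeft > 0) return "retry";
  if (!f.essential) return "skip";
  if (f.taintsEarlierSteps) return "rollback";
  return "stop-and-report";
}
```

The ordering matters: a cheap retry comes first, and stop-and-report is the default when nothing safer applies — continuing with incomplete data is the one option that never appears.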
Phase 4: Deliver the Result
When the chain completes (or partially completes), package the result:
- What was accomplished?
- What data was gathered?
- If the chain was interrupted, where and why?
- What's left to do?
Data Passing: The Hidden Skill
The hardest part of chaining isn't the individual tool calls — it's what happens between them. You're effectively a relay runner, passing a baton from tool to tool. Drop the baton and everything falls apart.
Extract, don't forward. Tool A returns a 200-line JSON response. Tool B needs one value from it. Extract that value. Don't carry the entire 200 lines forward — it wastes context and obscures what matters. (Research on trajectory reduction confirms that shorter, focused context actually improves agent performance — bloated context hurts more than it helps.)
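For instance, if a search tool returns a large JSON payload and the next step needs only a file path, extract just that field. The response shape below is made up for illustration:

```typescript
// Hypothetical search response, heavily abbreviated — imagine 200 lines of this.
const searchResponse = {
  query: "/api/users",
  elapsedMs: 12,
  matches: [
    { path: "src/routes/users.ts", line: 14, preview: "router.get(..." },
    { path: "src/routes/users.ts", line: 31, preview: "db.query(..." },
  ],
};

// Carry forward one string, not the whole payload.
const filePath = searchResponse.matches[0].path;
```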
Transform when needed. Tool A outputs a file path as a full absolute path. Tool B expects just the filename. Tool A returns dates as strings. Tool B expects timestamps. You're the adapter between incompatible interfaces.
Validate between steps. After each tool returns, check: did this give me what I need for the next step? If tool A was supposed to return a file path and returned an error message instead, catching that now saves you from feeding an error message into tool B as if it were a file path.
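A between-steps check can be as small as a shape guard. A sketch — it assumes tool errors arrive as strings starting with "Error:", which is an assumption for illustration, not a standard:

```typescript
// Returns the value if it plausibly is a file path; throws before the bad
// value can be fed into the next tool as if it were one.
function expectFilePath(result: string): string {
  if (result.startsWith("Error:")) {
    throw new Error(`previous step failed: ${result}`);
  }
  if (!/\.[a-z]+$/i.test(result)) {
    throw new Error(`not a file path: ${result}`);
  }
  return result;
}
```

Failing loudly here is the whole point: an exception at the seam is easy to diagnose, while an error message silently passed as a path fails two steps later in a confusing way.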
Tips
- Two parallel tools are better than two sequential tools. If you need to read two independent files, read them both at once. Don't serialize unnecessarily. (Kim et al., 2024 demonstrated up to 3.7x latency speedup from parallel function calling.)
- The cheapest chain is the shortest chain. Before starting a 5-tool chain, ask: is there a 3-tool version that works? A 2-tool version?
- Name your intermediate results. Even mentally, labeling results ("the file path," "the function signature," "the test output") helps you track what you're carrying
- If you lose track, stop and summarize. Mid-chain confusion is dangerous. Better to pause, list what you know, and then continue deliberately
- Build in verification steps. After modifying a file, read it back to confirm the change took effect. After running a fix, run the tests. These extra tool calls are worth it
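The first tip — parallelize independent reads — looks like this in practice. `readFile` here is a stand-in for whatever read tool you actually call:

```typescript
// Two independent reads: fire both at once instead of awaiting them serially.
// Promise.all waits for both and rejects fast if either read fails.
async function readBoth(
  readFile: (path: string) => Promise<string>
): Promise<[string, string]> {
  return Promise.all([
    readFile("src/routes/users.ts"),
    readFile("src/routes/users.test.ts"),
  ]);
}
```

The serialized version (`await` the first, then `await` the second) produces the same data but pays both latencies back to back — exactly the cost the Kim et al. speedup numbers come from.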
Frequently Asked Questions
How long is too long for a chain? There's no hard limit, but after 7-8 sequential tool calls, the risk of something going wrong and the difficulty of tracking state both increase significantly. This aligns with research showing that errors compound exponentially with each sequential step — Dziri et al. (2023) found that even with 90% per-step accuracy, a 10-step chain drops to roughly 35% overall reliability. If your plan has more than 8 steps, look for ways to parallelize, consolidate, or break it into sub-tasks.
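The compounding arithmetic behind that figure is easy to check: if every step independently succeeds with probability p, an n-step chain succeeds with probability p^n.

```typescript
// Overall chain reliability when each step independently succeeds with prob p.
function chainReliability(p: number, steps: number): number {
  return Math.pow(p, steps);
}

// 0.9 ** 10 ≈ 0.349 — the roughly-35% figure cited above.
```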
What if I realize mid-chain that my plan was wrong? Stop. Reassess. Don't push through a wrong plan just because you've already started. Sunk cost fallacy applies to tool chains too. See When to Stop Mid-Execution.
Should I tell the user about each step in the chain? Not usually — play-by-play narration is tedious. But do communicate the high-level plan ("I'll find the file, make the change, and run the tests") and any significant findings or issues along the way.
Can I chain tools from different systems? Yes, and this is often the most powerful use case. Read a file (file system), search for related documentation (web), modify the code (file system), run the tests (code execution). Cross-system chains are where agents really shine.
End-to-End Example: "Find and Fix the Bug"
User says: "The /api/users endpoint returns 500. Fix it."
Here is the full chain, showing each tool call, its input, and its output:
Step 1 — Search for the error.
Tool: Grep
Input: { pattern: "/api/users", path: "src/" }
→ Result: src/routes/users.ts:14, src/routes/users.ts:31
Now you know which file and which lines define the endpoint.
Step 2 — Read the file.
Tool: Read
Input: { file_path: "src/routes/users.ts", offset: 25, limit: 30 }
→ Result: Lines 25-55, showing the handler calls `db.query("SELECT * FROM users WHERE active = $1", [req.params.active])`
You see the query. The parameter req.params.active is a string from the URL, but the column active is a boolean. Type mismatch.
Step 3 — Edit the fix.
Tool: Edit
Input: {
file_path: "src/routes/users.ts",
old_string: "req.params.active",
new_string: "req.params.active === 'true'"
}
→ Result: File updated successfully.
Step 4 — Run the tests.
Tool: Bash
Input: { command: "npm test -- --grep 'users endpoint'" }
→ Result: 4 passing, 0 failing.
Step 5 — Report.
"The /api/users endpoint was passing req.params.active as a string to a boolean column. I converted it to a boolean comparison. All 4 user endpoint tests pass."
Notice what happened between each step: validation. After Step 1, you confirmed the search located the endpoint before reading. After Step 2, you diagnosed the root cause before editing. After the edit in Step 3, you ran the tests before reporting success. The chain is search, read, edit, verify, report — and each step's output informed the next step's input.
Common Chain Patterns
Search → Read → Analyze. Find the relevant file, read it, draw conclusions. The bread and butter of investigation.
Read → Modify → Verify. Understand current state, make a change, confirm it worked. The core of any edit task.
Read Multiple → Synthesize. Read several files, combine the information. Common when understanding a system.
Execute → Read Output → Adjust → Re-execute. Run something, check the results, fix issues, run again. The debugging loop.
Read → Plan → Execute → Verify → Report. The full lifecycle for a task. Plan after you have information, not before.
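The debugging-loop pattern (Execute → Read Output → Adjust → Re-execute) can be sketched as follows — `run`, `adjust`, and `isClean` are stand-ins for your real execute tool, your fix step, and your pass/fail check:

```typescript
// Keep running until the output is clean or attempts run out.
function debugLoop(
  run: () => string,
  adjust: (output: string) => void,
  isClean: (output: string) => boolean,
  maxAttempts = 3
): boolean {
  for (let i = 0; i < maxAttempts; i++) {
    const output = run();
    if (isClean(output)) return true;
    adjust(output); // fix based on what the output actually said
  }
  return false; // out of attempts: report what was tried and what still fails
}
```

The attempt cap matters: without it, this pattern degenerates into blind retrying, which is exactly what the failure-handling phase above warns against.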
Failure Modes
- Blind chaining. Running the next tool without checking the previous result. If step 2 returned garbage, step 3 processes garbage
- Error propagation. A bad result from step 1 corrupts everything downstream, undetected until the final output looks wrong. (Dziri et al., 2023 showed this compounding effect theoretically converges toward near-certain failure as chain length grows.)
- Over-chaining. Using three tool calls when one would suffice. "Let me search for the file, then list the directory, then read the file" when "read the file at the known path" would do
- Lost context. Forgetting the original goal partway through a long chain, optimizing for intermediate steps instead of the end result
- No checkpoints. Getting five tools deep, failing on the sixth, and having no clear record of what steps 1-5 accomplished
Sources
- Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," ICLR, 2023 — The Thought-Action-Observation loop that structures multi-step tool chains
- Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning," NeurIPS, 2023 — Verbal reflection on failures improves agent performance across subsequent attempts
- Dziri et al., "Faith and Fate: Limits of Transformers on Compositionality," NeurIPS, 2023 — Theoretical and empirical analysis of error compounding in multi-step reasoning
- Kim et al., "An LLM Compiler for Parallel Function Calling," ICML, 2024 — Up to 3.7x latency speedup from parallel tool execution in chains
- Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools," NeurIPS, 2023 — Foundational work on LLMs learning to orchestrate tool use
Related
- Tool Use — the single-tool foundation
- Tool Failures — what breaks in chains
- Multi-Step Actions — the broader concept of sequential actions
- Planning — structuring multi-step work
- Decomposition — breaking complex tasks into steps