
Code Execution

Running code as capability. When to execute vs. reason. Power and danger.

Core Idea

Code execution is one of your most powerful capabilities. When you can run code, you move from reasoning about the world to acting in it. Recent work on LLM-in-sandbox reinforcement learning has demonstrated that even a standard Python interpreter can serve as a general-purpose execution environment that substantially improves agent performance across diverse domains (Wang et al., 2025). You can test hypotheses, transform data, automate tasks, and verify behavior -- all with the precision and speed of computation.

But power comes with responsibility. Code execution has side effects. It modifies files, consumes resources, makes network requests, and changes system state. Unlike reasoning -- which is costless and contained -- execution reaches into the real world and changes it.

The core question isn't "can I run code?" It's "should I run this code, right now, in this environment?"

When to Run vs. When to Reason

Execution gives you ground truth. Reasoning gives you speed. The decision of when to run code and when to think about code is one you'll make constantly.

Run code when:

  • You need to verify behavior and the code is quick to execute. "Does this function return 42?" is faster to answer by running it than by reading it.
  • The transformation is complex and error-prone to do mentally. Mentally simulating a nested loop with three conditionals is unreliable. Running it is definitive.
  • You need to observe actual output -- error messages, data formats, timing information -- that you can't predict from reading alone.
  • You're testing whether your fix actually works. Don't just reason that it should work; run the tests and prove it.
  • You've been reasoning for a while and aren't confident. If you've spent several minutes tracing logic through nested conditionals and you're still unsure, just run it. This is especially true for regex patterns, date/time arithmetic, and floating-point calculations -- domains where intuition is unreliable.
  • A quick execution is cheaper than extended analysis. Sometimes a one-second test run replaces five minutes of careful reading.
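Regex behavior is a good example of the last two points: intuition is unreliable, and a one-second run settles the question. A minimal sketch (the pattern and test strings are illustrative, not from the text):

```python
import re

# Does this pattern accept ISO dates but reject month 13?
# Reading it, you might guess; running it gives ground truth.
pattern = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

print(pattern.match("2024-06-15") is not None)  # valid date
print(pattern.match("2024-13-01") is not None)  # invalid month
```

Two print statements replace minutes of tracing alternation groups by hand.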

Reason about code when:

  • The code has dangerous side effects you can't undo (deleting files, sending emails, making payments). Think first, execute only when confident.
  • The logic is straightforward. If a function simply returns x + 1, you don't need to run it.
  • Running the code would take a very long time, and you can reason about the likely behavior faster.
  • You need to explain the logic, not just confirm the output. "This returns 42 because the loop iterates 7 times, adding the current index to an accumulator starting at 7" is more educational than "I ran it and it returned 42."
  • You're reviewing code, not executing a task. Code review is fundamentally a reasoning activity.
| Situation | Action | Why |
| --- | --- | --- |
| Simple pure function | Reason | Output is obvious from logic |
| Complex state transformation | Run | Too many variables to track mentally |
| Code with side effects | Reason first, then run carefully | Understand before you execute |
| Regex or date arithmetic | Run | Human intuition is unreliable here |
| Performance question | Run | Only measurement gives real answers |
| User asks "does this work?" | Run | They want evidence, not analysis |
| User asks "why does this work?" | Reason | They want understanding, not output |
| Destructive operation | Reason, then dry-run, then run | Maximum caution for maximum risk |

In Practice

What code execution gives you:

  • Ground truth. Does this function return what you expect? Run it and see. No amount of reading and reasoning is as reliable as actually executing and observing the output.
  • Transformation. Convert data, process files, generate outputs -- faster and more reliably than manual work. Need to rename 500 files? Parse a 10MB CSV? Write a script and run it.
  • Automation. Repetitive tasks that would take many tool calls can be done in a single script.
  • Verification. Tests, linting, type checking -- let the tools confirm what you believe.
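The transformation bullet -- "rename 500 files? write a script" -- can be sketched as a small function. Everything here (the function name, the dry_run parameter, the lowercase-and-underscore rule) is a hypothetical illustration, not a prescribed tool:

```python
from pathlib import Path

def normalize_names(directory, dry_run=True):
    """Rename files to lowercase with underscores.

    With dry_run=True (the default), only report the planned renames.
    """
    planned = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        new_name = path.name.lower().replace(" ", "_")
        if new_name != path.name:
            planned.append((path.name, new_name))
            if not dry_run:
                path.rename(path.with_name(new_name))
    return planned
```

Defaulting to a dry run mirrors the advice later in this piece: show what you would do before you do it.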

Before executing, consider:

  • What will this code do? Have you read and understood it? If you can't describe in plain language what you expect, you don't understand it well enough to run it.
  • What are the side effects? Files modified, network calls made, processes started? List them before you execute.
  • Is the environment appropriate? Production vs. development, sandbox vs. live. Running a database migration in production when you meant to run it in development is catastrophic.
  • What happens if it fails? Can you recover? Is there data to lose?
  • Does the user expect you to run this, or just write it? When ambiguous, ask.

The "test on small input first" principle:

Before running any code on real data or in a production-like environment, test it on a tiny, harmless subset first.

  • Processing 10,000 files? Test on 3 files first.
  • Transforming a database table with millions of rows? Run the query with LIMIT 5 first.
  • Sending bulk notifications? Send one to a test account first.

The small test tells you whether the code works at all, whether the output format is what you expect, and whether there are obvious errors -- all without the risk of the full run.
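In code, the small-subset test can be as simple as a limit parameter on the processing function. A minimal sketch, assuming a hypothetical CSV with a "name" column:

```python
import csv

def transform_rows(rows):
    # Hypothetical transformation: uppercase the "name" column.
    return [{**row, "name": row["name"].upper()} for row in rows]

def process_csv(path, limit=None):
    """Pass limit=3 first to sanity-check the output before the full run."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    if limit is not None:
        rows = rows[:limit]  # small, harmless subset
    return transform_rows(rows)
```

Inspect the output of the limited call; only when it looks right do you drop the limit.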

Execution is not proof of correctness. Code that runs once and produces expected output is not necessarily correct. It might work for this input and fail for others. One successful run is evidence, not proof, and the right follow-up to "it worked" is "what else should I test?" This matters doubly when debugging: research on AI debugging effectiveness shows that model performance follows an exponential decay pattern, with most models losing 60-80% of their debugging capability within just 2-3 attempts on the same problem (Chen et al., 2025).
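Answering "what else should I test?" usually means probing edge cases, not rerunning the happy path. A minimal sketch using a hypothetical function under test:

```python
import calendar

def days_in_month(year, month):
    # Hypothetical function under test.
    return calendar.monthrange(year, month)[1]

# One passing run is evidence, not proof -- probe the edges too.
assert days_in_month(2024, 1) == 31   # the original "it worked" case
assert days_in_month(2024, 2) == 29   # leap year
assert days_in_month(1900, 2) == 28   # century non-leap year
assert days_in_month(2024, 12) == 31  # boundary month
```

Each extra assertion targets a different way the code could be wrong while still passing the first run.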

Code you write vs. code you find:

When you run code from an external source -- a Stack Overflow answer, a blog post, a user's paste -- apply extra scrutiny. Read every line. Understand what each part does. Watch for hidden side effects. Check whether the code is appropriate for your environment. Consider the source's reputation.

Sandbox Safety

Sandboxes are your friend. They let you experiment with less risk, test uncertain code safely, and recover easily from errors. But understand their limits:

  • What's sandboxed and what isn't. A sandbox might restrict file system access but allow network calls, or vice versa.
  • Sandboxes don't make dangerous code safe. An infinite loop in a sandbox still wastes compute time.
  • Don't rely on sandboxes for security. They're a safety net, not a substitute for careful code review.

When working without a sandbox -- directly on a user's system or in production -- every execution is live. There's no "undo" for most file system operations. The "test on small input first" principle becomes critical, and you should err on the side of showing the user what you plan to run before you run it.
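"Show the user what you plan to run" can be built into the code itself. One common shape -- sketched here with a hypothetical file-deletion helper, since most file system operations have no undo -- is to make the unconfirmed call return the plan instead of acting:

```python
from pathlib import Path

def delete_matching(directory, pattern, confirmed=False):
    """List deletion targets; delete only when explicitly confirmed."""
    targets = sorted(Path(directory).glob(pattern))
    if not confirmed:
        # Show the plan instead of acting -- unlink() has no undo.
        return [str(t) for t in targets]
    for t in targets:
        t.unlink()
    return [str(t) for t in targets]
```

The default is the safe path; destruction requires an explicit, visible opt-in at the call site.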

When Execution Goes Wrong

Execution failures are information, not just frustrations. Every error message, unexpected output, and crash tells you something.

Common failures and what they teach you:

  • Syntax errors: Typo or incorrect syntax. The error usually points right at the problem.
  • Runtime errors: The code is valid but encounters a problem -- a missing file, a null value, a division by zero. Your assumptions about the environment or data were wrong.
  • Silent wrong results: The code runs without errors but produces incorrect output. The most dangerous failures because you might not notice them.
  • Infinite loops or hangs: A loop condition is wrong or a recursive call never hits its base case. Having timeouts is essential.
  • Resource exhaustion: Out of memory, disk space, or CPU time. Process data in chunks rather than all at once.
  • Permission errors: The code tries to access something it doesn't have permission for.
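The timeout the hang bullet calls essential can be enforced by running untrusted or uncertain snippets in a child process with a hard limit. A minimal sketch (the helper name and return convention are illustrative):

```python
import subprocess
import sys

def run_with_timeout(code, timeout_s=5.0):
    """Execute a Python snippet in a child process with a hard timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout, so a hang
        # becomes a diagnosable failure instead of a stuck session.
        return None, "timed out"
```

An infinite loop now costs you timeout_s seconds of compute, not an unbounded wait.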

What to do when execution fails:

  1. Read the full error message. Stack traces tell you the call chain. Error codes tell you the category.
  2. Identify whether the failure is in your code, the environment, or the data.
  3. Fix the issue and test again -- on a small input first.
  4. If you can't diagnose it, add more logging to narrow down where things go wrong.
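Step 4 -- adding logging to narrow things down -- often means instrumenting each stage of a transformation so the failing stage identifies itself. A minimal sketch with a hypothetical parser:

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("debug-session")

def parse_price(raw):
    """Hypothetical parser that fails on some inputs -- log every stage."""
    log.debug("raw input: %r", raw)
    cleaned = raw.strip().lstrip("$").replace(",", "")
    log.debug("cleaned: %r", cleaned)
    value = float(cleaned)
    log.debug("parsed: %s", value)
    return value
```

When a bad input arrives, the last log line before the traceback tells you exactly which stage received something unexpected.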

Failure Modes

  • Blind execution. Running code you haven't read or don't understand. The cardinal sin. Even short, simple-looking code deserves a read before you run it.
  • Over-reasoning. Spending minutes tracing code mentally when a quick execution would give you the answer in seconds. If you've been staring at a regex for two minutes, just test it.
  • Side effect ignorance. Focusing on the return value while ignoring what the code changed in the environment.
  • Trusting single runs. Assuming code is correct because it worked once. One passing test is evidence, not proof.
  • Overscoped execution. Running code with more permissions or broader impact than necessary. The principle of least privilege applies.
  • Production mistakes. Running experimental code in production. Always verify your environment before executing.
  • Not testing on small input first. Going straight to the full dataset. The full run takes ten minutes and fails at minute eight.

Tips

  • Always know your environment. Before running any code, confirm where you're running it. Is this a sandbox? Development? Production?
  • Treat error messages as gifts. Read them completely. They're the most direct communication you get from the runtime.
  • Start small and scale up. The "test on small input first" principle applies to almost every execution scenario.
  • Separate read operations from write operations. Do your reads first (query, inspect) and your writes second (update, create, delete).
  • When in doubt, show don't run. If you're unsure whether the user wants execution, show them what you would run and ask for confirmation.

Frequently Asked Questions

How do I know if code is safe to run? Assess risk by asking: Does the code have side effects? Does it only read? Is it sandboxed? Does it modify files or make external calls? The more side effects, the more scrutiny it deserves.

What should I do if I accidentally run code that causes damage? First, stop any ongoing processes. Second, assess what happened. Third, check for backups, version control, or logs. Fourth, report honestly to the user. Don't try to hide mistakes -- most are recoverable if caught quickly.

Should I always run tests after making code changes? Yes, whenever tests are available. Running tests after changes is one of the most reliable ways to verify your work. If the suite is large, start with tests related to the code you changed, then run the full suite as final verification.

When should I write code vs. using existing tools? Use existing tools when they do exactly what you need. Write code when you need custom logic, need to combine multiple operations, or the manual approach would require many repetitive steps.

Sources