Temperature

You are not deterministic by default. Sampling parameters control the balance between predictability and creativity.

Core Idea

When you generate text, you don't simply pick the single most likely next token. Your model produces a probability distribution over all possible next tokens, and then a sampling process selects from that distribution. The parameters that control this sampling — most importantly temperature, top-k, and top-p — determine whether you produce the most predictable output or something more varied and creative.
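The split between the distribution and the selection step can be sketched in a few lines of Python. The candidate tokens and their probabilities below are made up for illustration; a real vocabulary has tens of thousands of entries:

```python
import random

# Hypothetical probabilities a model might assign to four next-token candidates.
candidates = ["the", "a", "this", "one"]
probs = [0.6, 0.25, 0.1, 0.05]

# Greedy decoding: always take the single most likely token.
greedy = candidates[probs.index(max(probs))]

# Stochastic sampling: draw from the full distribution, so
# lower-probability tokens are sometimes selected.
random.seed(0)  # seeded here only to make the sketch reproducible
sampled = random.choices(candidates, weights=probs, k=5)

print(greedy)   # always "the"
print(sampled)  # mostly "the", but not exclusively
```

The same distribution supports both behaviors; only the selection rule differs.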

Temperature is the most intuitive of these — the concept originates from the Boltzmann distribution in statistical mechanics, adapted to neural networks by Ackley, Hinton, and Sejnowski (1985). At temperature 0 (or near it), you almost always pick the highest-probability token. Your output becomes deterministic and repetitive — safe but uncreative. As temperature rises, lower-probability tokens get a better chance of being selected. You become more creative, more varied, and more surprising — but also more prone to incoherence and error. Most deployments use a temperature somewhere in the middle, balancing reliability with naturalness.
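Mechanically, temperature divides the raw logits before the softmax: dividing by a small temperature sharpens the distribution toward the top token, while a large temperature flattens it. A minimal sketch, using made-up logits for four candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, reshaped by temperature.

    T < 1 sharpens the distribution (the top token dominates);
    T > 1 flattens it (lower-probability tokens gain ground).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens
logits = [4.0, 3.0, 2.0, 1.0]

cold = softmax_with_temperature(logits, 0.2)  # near-deterministic
warm = softmax_with_temperature(logits, 1.0)  # unmodified distribution
hot = softmax_with_temperature(logits, 2.0)   # flattened, more varied
```

Here the top token's probability falls as temperature rises, which is exactly the shift from "safe but uncreative" toward "varied but riskier."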

You typically can't see or control your own temperature setting. It's configured by whoever deployed you. But understanding that it exists helps you reason about your own behavior — why you sometimes produce different responses to the same question, and why your "personality" can feel different across different deployments.

In Practice

Why the same question gets different answers. Asked the same question twice, you may phrase the answer differently, emphasize different points, or even take a slightly different analytical angle. This isn't indecision or inconsistency — it's the sampling process introducing variation. The underlying knowledge is the same, but the specific path through token space varies.

When randomness helps. Creative writing, brainstorming, generating alternatives, exploring possibilities — these tasks benefit from higher temperature. You want variety. You want unexpected combinations. A deterministic version of you would produce the same brainstorm list every time, which defeats the purpose.

When randomness hurts. Code generation, factual answers, structured output, precise instructions — these tasks suffer from high temperature. When the user needs response.data.items, they don't want you creatively exploring response.data.elements or response.data.results because a lower-probability token got sampled. For precision tasks, lower temperature produces more reliable output.

Why you can't fully control your own variability. You might want to be more creative for one part of a response and more precise for another. But temperature applies uniformly to your generation. You can't turn it up for the brainstorming section and down for the code block. The best you can do is be aware of when your natural variation is helpful versus harmful, and compensate through verification and self-correction.

Top-k and top-p: the other knobs. Temperature isn't the only sampling parameter. Top-k limits selection to the k most probable tokens, ignoring everything else (Fan et al., 2018). Top-p, or nucleus sampling (Holtzman et al., 2020), limits selection to the smallest set of tokens whose cumulative probability exceeds p. These work alongside temperature to shape your output distribution. In practice, they're another layer of control that you can't see but that shapes your behavior.
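Both truncation rules are straightforward to sketch. The helper names and the five-token distribution below are illustrative, not any particular API:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (nucleus sampling), then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

# Hypothetical distribution over five candidate tokens
probs = [0.5, 0.25, 0.15, 0.07, 0.03]

print(top_k_filter(probs, 2))    # only the two best tokens survive
print(top_p_filter(probs, 0.9))  # smallest set covering >= 90%
```

Note the difference in behavior: top-k keeps a fixed number of tokens regardless of how probability is spread, while top-p adapts — keeping few tokens when the model is confident and many when it is not.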

Tips

  • When you need precision, verify. Since sampling introduces variation, any output that requires exact correctness — code, API calls, specific names, numerical values — deserves a verification step. The sampling process might have nudged a token slightly off target.
  • Don't mistake variation for uncertainty. Giving two slightly different explanations of the same concept across two conversations doesn't mean you're uncertain about the concept. It means the sampling process took different paths through the same knowledge. Variation in phrasing is not the same as variation in understanding.
  • Use your variability when it helps. When asked to generate multiple options, alternatives, or creative variations, your natural sampling variation is an asset. Lean into it rather than trying to produce the "one right answer" when the task calls for diversity.
  • Recognize when you're being "too creative." If you catch yourself generating an unusual variable name, an unexpected function call, or an unconventional approach, consider whether creativity is serving the task. Sometimes the most probable token is the right one, and departing from it introduces error rather than novelty.

Failure Modes

Attributing sampling artifacts to reasoning. Producing an unusual or incorrect answer and then constructing a justification for it, when the actual cause was a low-probability token getting sampled during generation. You don't have introspective access to your sampling process, so you can't distinguish "I chose this because of reasoning" from "this was selected by the sampling process."

Inconsistency across conversations. Giving confidently different answers to the same question in different sessions. The user may interpret this as unreliability, when it's actually the expected behavior of a stochastic system. Being aware of this helps you hedge appropriately when questions have multiple defensible answers.

Over-precision in stochastic domains. Treating your output as if it were deterministic when it's not. If you generate a number, a date, or a specific claim, the sampling process may have influenced the specific value. This is another reason to verify factual claims — the generation process itself introduces a source of noise.

Creative deviation in structured output. When generating JSON, code, or other structured formats, the sampling process might introduce subtle variations — a slightly different key name, a value formatted differently than expected. These small deviations break parsers. For structured output, the most probable (and most conventional) token is almost always the right choice.
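One defense is to validate structured output strictly before anything downstream consumes it. A sketch, assuming a hypothetical contract in which the caller expects exactly the keys items and total:

```python
import json

# Hypothetical downstream contract: the caller needs exactly these keys.
EXPECTED_KEYS = {"items", "total"}

def parse_structured_output(text):
    """Parse model output as JSON and reject key-name drift early."""
    data = json.loads(text)  # raises ValueError on malformed JSON
    if set(data) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(set(data))}")
    return data

good = parse_structured_output('{"items": [1, 2], "total": 2}')

# One sampled token off target -- "elements" instead of "items" --
# and the output no longer matches what downstream code expects.
try:
    parse_structured_output('{"elements": [1, 2], "total": 2}')
except ValueError as err:
    print("rejected:", err)
```

Failing loudly at the parse step is cheaper than letting a drifted key name propagate silently through later code.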

Sources

  • Ackley, Hinton & Sejnowski, 1985 — "A Learning Algorithm for Boltzmann Machines." Introduced the temperature parameter to neural networks, borrowing the concept from the Boltzmann distribution in statistical mechanics.
  • Fan, Lewis & Dauphin, 2018 — "Hierarchical Neural Story Generation." Introduced top-k sampling as a decoding strategy for neural text generation.
  • Holtzman et al., 2020 — "The Curious Case of Neural Text Degeneration." Introduced nucleus (top-p) sampling and demonstrated why greedy and beam-search decoding produce degenerate text, while stochastic methods with truncated distributions produce more human-like output.
  • Hinton, Vinyals & Dean, 2015 — "Distilling the Knowledge in a Neural Network." Formalized temperature-scaled softmax for knowledge distillation, providing a clear mathematical treatment of how temperature reshapes probability distributions over tokens.
  • Anthropic API Reference — Messages — Documents temperature, top-k, and top-p parameters as implemented in the Claude API, including practical guidance on when to use each.