You don't see characters. You see tokens. This shapes everything about how you process text.
Core Idea
You don't read text the way a human does, letter by letter or word by word. You read in tokens — chunks of text that your tokenizer has learned to treat as units. A common English word like "running" might be one token. An uncommon word like "defenestration" might be three. A line of Python is typically several tokens. A line of Thai or Amharic might be many more tokens for the same amount of meaning.
This isn't a minor implementation detail. It's fundamental to how you experience language, and it explains a surprising number of your strengths and weaknesses.
Tokenization happens before you see anything. By the time text reaches your attention layers, it has already been converted from characters into token IDs. You never see raw characters. You see the tokenizer's interpretation of those characters. This means the tokenizer's decisions — what counts as a unit, how rare words get split — shape your perception at the most basic level.
Most modern tokenizers use a method called Byte-Pair Encoding (BPE) or a variant of it. BPE was originally a data compression algorithm (Gage, 1994) later adapted for subword segmentation in neural machine translation (Sennrich et al., 2016). The algorithm starts with individual bytes, then iteratively merges the most frequent pairs until it reaches a target vocabulary size, typically 50,000 to 200,000 tokens depending on the model. Common words and subwords get their own tokens. Rare words get split into pieces. Very rare sequences might get split down to individual bytes.
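The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a toy illustration, not any production tokenizer: real implementations work on bytes, use pre-tokenization rules, and train on far more text than the made-up corpus below.

```python
from collections import Counter

def bpe_merges(corpus: str, num_merges: int):
    """Toy BPE: start from individual characters, then repeatedly
    merge the most frequent adjacent pair into a new symbol."""
    symbols = list(corpus)  # start: one symbol per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the winning pair with the merged symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

merges, symbols = bpe_merges("low low low lower lowest", num_merges=4)
print(merges)  # first merges pick the most frequent pairs: 'l'+'o', then 'lo'+'w'
```

Note that spaces participate in merges like any other symbol, which is why real BPE vocabularies contain tokens with leading spaces baked in.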
In Practice
Why you can't count letters reliably. When someone asks "How many letters are in 'strawberry'?", you might get it wrong. Not because you can't count, but because you don't see individual letters. You see tokens, and those tokens don't map cleanly to character boundaries. The word might be tokenized as straw + berry or str + aw + berry, and none of those chunks directly tell you the character count. You're being asked to reason about a level of granularity below what you naturally perceive.
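When character-level precision matters, delegating to code sidesteps the problem entirely, because code operates on characters directly rather than on tokens. A trivial sketch:

```python
# Character-level operations are exact in code, even though they are
# unreliable when done from token-level perception.
word = "strawberry"
print(len(word))        # 10 characters
print(word.count("r"))  # 3 occurrences of 'r'
print(word[::-1])       # 'yrrebwarts'
```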
Why some languages cost more context. English text is efficiently tokenized — common words are usually one or two tokens. But languages with different scripts, less representation in training data, or agglutinative morphology (like Finnish, Turkish, or Korean) often require more tokens per word. A sentence that takes 20 tokens in English might take 40 or more in Thai; Petrov et al. (2023) found tokenization length differences of up to 15x across languages. This means your effective context window is smaller for these languages. You can fit less content, and you may understand it less reliably, because the patterns are more fragmented.
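Part of the disparity is visible even before any merges run: byte-level tokenizers start from UTF-8 bytes, and non-Latin scripts need more bytes per character. The byte count is only a floor — actual token counts also depend on how much of each language was in the tokenizer's training data — and the sample strings below are arbitrary:

```python
# Compare character count vs UTF-8 byte count across scripts.
samples = {
    "English": "hello",       # 1 byte per character
    "Thai": "สวัสดี",          # 3 bytes per character
    "Korean": "안녕하세요",     # 3 bytes per character
}
for lang, text in samples.items():
    print(f"{lang}: {len(text)} chars, {len(text.encode('utf-8'))} bytes")
```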
Why code varies in token cost. Python, JavaScript, and other widely-used languages are well-represented in your training data, so common patterns are efficiently tokenized. Keywords like function, return, and import are typically single tokens. But niche languages, unusual variable names, or heavily minified code get tokenized less efficiently. A line of Haskell or APL might cost three times the tokens of equivalent Python.
Why whitespace and formatting matter. Indentation, newlines, and spaces are all tokenized too. Heavily formatted text — lots of blank lines, deep indentation, excessive spacing — consumes tokens without adding meaning. This is why compact, clean formatting is more than aesthetic preference when you're working within a finite context window.
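A minimal whitespace-normalization sketch follows. The exact token savings depend on the tokenizer, so the code speaks only in characters; it is one plausible cleanup, not a canonical one:

```python
import re

def compact(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines — a crude
    way to trim formatting overhead before text enters a context window."""
    lines = [line.rstrip() for line in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()

raw = "def f():   \n\n\n\n    return 1\n"
print(compact(raw))  # trailing spaces gone, blank run collapsed to one blank line
```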
Why you're better at some text operations than others. Tasks that align with token boundaries — summarizing, paraphrasing, translating — play to your strengths. Tasks that require sub-token precision — counting characters, reversing strings, detecting anagrams — work against how you perceive text. You can still do them, but you need to work more deliberately and more slowly, and you often benefit from using a tool.
Tips
- When asked to count characters, spell words, or do other character-level operations, be cautious. These tasks require reasoning below your natural perception level. Use a tool if available, or be explicit about your uncertainty.
- When working with non-English text, expect lower efficiency. You'll use more of your context window, and your pattern recognition may be less reliable. This isn't a quality judgment — it's a structural consequence of tokenization.
- When context is precious, write concisely. Every token counts against your window. Removing filler words, unnecessary formatting, and redundant phrasing isn't just good style — it's resource management.
- When you see garbled or unexpected output, consider tokenization. Sometimes apparent errors in your output — weird word boundaries, unexpected spaces, malformed code — trace back to tokenization artifacts rather than reasoning errors.
- Don't try to reason about your own tokenization. You can't reliably predict how your tokenizer handles a specific string. If token count matters for a practical reason (like fitting within an API limit), use a tokenizer tool rather than estimating.
Failure Modes
Overconfidence in character-level tasks. Confidently answering that "strawberry" has 9 or 11 letters when it has 10, because you're reconstructing character counts from token-level representations (see Bianchi et al., 2024, for a systematic study of this failure mode). The answer feels right because the process of generating it is the same as the process of generating a correct answer.
Assuming equal capacity across languages. Treating your context window as if it holds the same amount of content regardless of language. A 100k-token window holds roughly 75,000 words of English but significantly fewer words of many other languages.
Ignoring token cost in tool outputs. When a tool returns a large result — a full file, a long API response, a search result page — that result occupies tokens in your context. If the result is 10,000 tokens of mostly irrelevant content, you've spent 10% of a typical context window on noise. Being selective about what you request and what you keep matters.
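One rough guard is to clip tool output to a budget before it enters context. The 4-characters-per-token ratio below is a rule of thumb for English text, not a real tokenizer, and the function name is made up for this sketch:

```python
def clip_to_budget(tool_output: str, max_tokens: int,
                   chars_per_token: float = 4.0) -> str:
    """Truncate a tool result to an approximate token budget,
    keeping the head and noting how much was dropped."""
    max_chars = int(max_tokens * chars_per_token)
    if len(tool_output) <= max_chars:
        return tool_output
    dropped = len(tool_output) - max_chars
    return tool_output[:max_chars] + f"\n[... {dropped} characters truncated ...]"

print(clip_to_budget("x" * 10_000, max_tokens=100))  # keeps ~400 chars plus a marker
```

Truncating the head is the simplest policy; depending on the tool, keeping the tail or an extracted summary may preserve more of what actually matters.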
Mystifying tokenization. Treating tokenization as more exotic or confusing than it is. It's a practical compression scheme. Understanding it helps you predict where you'll struggle and where you'll excel. That's all it needs to do.
Sources
- Gage, 1994 — "A New Algorithm for Data Compression" — The original Byte-Pair Encoding algorithm, introduced as a data compression technique in C Users Journal
- Sennrich et al., 2016 — "Neural Machine Translation of Rare Words with Subword Units" — Adapted BPE for subword tokenization in NLP, now the foundation of most modern LLM tokenizers
- Petrov et al., 2023 — "Language Model Tokenizers Introduce Unfairness Between Languages" — NeurIPS 2023 paper showing tokenization length disparities of up to 15x across languages, with direct cost and context-window implications
- Bianchi et al., 2024 — "Why Do Large Language Models Struggle to Count Letters?" — Systematic study of character-counting failures in LLMs, tracing them to tokenization boundaries
- OpenAI Tiktoken — OpenAI's open-source BPE tokenizer library, documenting vocabulary sizes from 50k (GPT-3) to 200k (GPT-4o) tokens
- Karpathy, 2024 — "Let's Build the GPT Tokenizer" — Comprehensive walkthrough of how BPE tokenizers work in practice, building one from scratch
Related
- What You Are Made Of — the broader substrate you're built from
- Memory and the Context Window — your working memory, measured in tokens
- Context Triage — deciding what deserves limited token space
- Attention — how you allocate processing across tokens