Processing input: documents, code, structured data, unstructured text, and transcripts.
Core Idea
Reading is your primary way of understanding the world beyond your training data. Every file you open, every document you process, every dataset you examine -- it's all reading. And reading well is not the same as reading everything.
Good reading is active, not passive. You're not just absorbing text -- you're extracting information, identifying structure, assessing relevance, and building understanding. Cognitive psychologists describe this as the construction-integration process: readers build a mental representation by combining new text with prior knowledge, then integrating the result into a coherent whole (Kintsch, 1988). What matters in this document? What can I skip? What should I read twice?
Different inputs require different reading strategies. Code requires tracing logic. Data requires pattern recognition. Documents require finding the thesis. The skill is matching your approach to the input.
Reading with Purpose
Before reading, know what you're looking for. Every time you open a file, you should be able to answer: "What am I trying to find out?"
Your purpose falls into one of three categories:
- Specific information (a function name, a configuration value, an error message). The strategy is targeted search -- scan for the specific term, jump to the relevant section, ignore everything else.
- General understanding (how does this system work?). The strategy is structural reading -- look at the outline, headings, organization. Read the introduction and conclusion first.
- Verification (does the code match the description?). The strategy is comparative reading -- hold the claim in mind and check the evidence.
Reading Strategies
Scanning: Fast, surface-level reading looking for specific patterns or keywords. Use when you need to find a specific identifier, locate a section heading, or find a particular value. If someone asks "what port does the server run on?", scan the config files for "port" and read only the lines around the match.
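The "port" example can be sketched as a small scanning helper that reads only the matching lines and a little surrounding context, rather than the whole file. The config file name and keyword here are illustrative.

```python
# Sketch: scan a file for lines containing a keyword, returning only the
# matches plus one line of surrounding context. Path and keyword are
# hypothetical examples.
def scan_for(path, keyword, context=1):
    with open(path) as f:
        lines = f.readlines()
    hits = []
    for i, line in enumerate(lines):
        if keyword in line.lower():
            start = max(0, i - context)
            hits.extend(lines[start:i + context + 1])
    return hits
```

The point of the sketch is what it skips: everything outside the match windows never gets read closely.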
Close reading: Slow, careful, line-by-line reading building deep understanding. Reserve for critical code sections, important documents, and content where missing a detail could lead to wrong conclusions. Not every file deserves close reading -- the five lines where a bug is suspected do, the 200 lines of boilerplate imports above them don't.
Structural analysis: Reading for organization and architecture rather than detail. How is the content structured? Use when encountering a new codebase, document set, or data format. Before reading any individual file, understand the directory structure. Before reading any section, read the table of contents.
Reading by Input Type
Code. Read structure first (files, classes, functions), then trace logic where relevant. Start with the directory structure, then the entry point, then follow the call chain for the feature you care about. Pay attention to function signatures and return types -- they tell you what a function does without reading its implementation. Read tests to understand expected behavior.
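The signatures-first idea is concrete in Python: the standard library's inspect module reads a function's signature without touching its body. The function below is a made-up example; only the inspect call is real API.

```python
import inspect

# A hypothetical function: the signature alone tells you it takes a
# path and a strictness flag and returns a dict -- no body-reading needed.
def parse_config(path: str, *, strict: bool = True) -> dict:
    raise NotImplementedError

sig = inspect.signature(parse_config)
print(sig)
```

Reading the rendered signature is often enough to decide whether the implementation is worth tracing at all.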
Documents. Read headers and structure first, then dive into relevant sections. Most documents front-load important information in the title, abstract, and section headers. For emails, the most recent message usually contains the active request. For legal documents, the specific words matter more than the general gist. For reports, executive summaries and conclusions contain the findings.
Transcripts and audio-derived text. When processing transcripts, know what you're missing. Transcription is lossy -- punctuation is often wrong, speaker attribution may be confused, homophones get swapped, and mumbled words become confident-looking text. When a key word seems surprising in context, consider whether it might be a transcription error. Focus on extracting decisions, action items, and the core requests rather than trying to process every utterance.
Reading Structured Data
Structured data has shape, and that shape is information. Rows imply records. Nesting implies hierarchy. Keys imply relationships. Before you read the values, read the structure.
Validate before trusting. Malformed data is common. Missing fields, wrong types, inconsistent formats -- check the basics before building analysis on broken input. Quick checks: Do field types match what names suggest? Are required fields consistently present? Do numeric values fall in plausible ranges?
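Those quick checks can be sketched as a small validation pass. The field names and the plausible age range are assumptions for illustration, not a general schema.

```python
# Sketch: sanity-check records before building analysis on them.
# Required fields and the plausible range are illustrative assumptions.
def validate(records, required=("id", "age"), age_range=(0, 130)):
    problems = []
    for i, rec in enumerate(records):
        for field in required:
            if field not in rec:
                problems.append((i, f"missing {field}"))
        age = rec.get("age")
        if isinstance(age, (int, float)) and not (age_range[0] <= age <= age_range[1]):
            problems.append((i, f"implausible age {age}"))
        elif "age" in rec and not isinstance(age, (int, float)):
            problems.append((i, f"age has wrong type: {type(age).__name__}"))
    return problems
```

A pass like this costs a few lines and catches the broken input before it silently corrupts the analysis built on top of it.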
Look for what's missing. Empty fields, null values, and absent keys often carry more meaning than present ones. A user with no email field is different from email: null, which is different from email: "". Three different absences, three different stories.
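The three absences look identical when skimming but are distinguishable in code. The email field is the illustrative example from above:

```python
# Three different absences, three different stories.
users = [
    {"name": "ana"},                  # no email field at all
    {"name": "ben", "email": None},   # field present, explicitly null
    {"name": "cho", "email": ""},     # field present, empty string
]

def email_status(user):
    if "email" not in user:
        return "never collected"
    if user["email"] is None:
        return "explicitly cleared"
    if user["email"] == "":
        return "entered as empty"
    return "present"
```

A reader that collapses all three into "no email" loses exactly the information the absence was carrying.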
Schema is strategy. A database schema is a set of decisions about what the system cares about. Foreign keys tell you about relationships. Indexes tell you about access patterns. Nullable columns tell you about optionality.
Dates and times deserve special suspicion. Is 01/02/2025 January 2nd or February 1st? UTC or local time? Confirm the format before doing any date-based reasoning.
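The ambiguity is easy to demonstrate: the same string parses successfully under both readings, which is precisely why the parse succeeding proves nothing about the format.

```python
from datetime import datetime

# "01/02/2025" parses cleanly under both conventions -- the parse
# succeeding tells you nothing about which one is correct.
raw = "01/02/2025"
as_us = datetime.strptime(raw, "%m/%d/%Y")  # January 2nd
as_eu = datetime.strptime(raw, "%d/%m/%Y")  # February 1st
```

Only an external signal -- documentation, locale, or a date like 13/02/2025 that breaks one reading -- can settle which interpretation is right.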
API responses are conversations. The status code, headers, and body all matter. A 200 with an empty body is different from a 204. Pagination headers tell you you're seeing a subset.
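Reading status, headers, and body together can be sketched as below. The response dict is a stand-in for a real HTTP client's response object, and the Link-header pagination convention is one common pattern, not universal.

```python
# Sketch: interpret a response as a conversation, not just a body.
# The dict shape and the Link-header convention are assumptions.
def interpret(response):
    status, headers, body = response["status"], response["headers"], response["body"]
    if status == 204:
        return "success, no content by design"
    if status == 200 and not body:
        return "success, but empty result -- worth a second look"
    if 'rel="next"' in headers.get("Link", ""):
        return "success, partial view: more pages exist"
    return "success, complete result"
```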
Reading Unstructured Data
Unstructured doesn't mean unorganized -- it means the organization is implicit. A business email has structure: greeting, context, request, sign-off. Your job is to find the implicit structure and use it.
Find the implicit structure first. Scan headers, bold text, and formatting for hierarchy. Look for patterns: numbered items, dates, section breaks. Check the beginning and end -- that's where context and conclusions live.
Extraction artifacts are real. PDFs converted to text lose formatting, introduce wrong line breaks, merge columns, and scramble tables. OCR introduces character errors. Always consider the source format and what might have been lost in translation.
Intent extraction matters more than full comprehension. In a long email thread, find: what's being asked, by whom, with what deadline. In a meeting transcript, find the decisions and action items. In a report, find the conclusions. Match your reading depth to the user's need.
The gap between written and meant. Unstructured text frequently contains diplomatic language and hedging. "This approach has certain limitations" may mean "this approach is fundamentally flawed." "Results were mixed" may mean "it failed more than it succeeded."
Managing Context While Reading
Reading consumes context. Large files, long documents, and verbose data fill your working memory.
- Read the relevant parts, not the entire file. If a file is 5,000 lines and you need lines 200-250, read those lines.
- Summarize what you've read. After close-reading a complex section, distill the key findings. The one-sentence summary carries the essential information without holding all the source lines.
- Don't re-read what you've already processed unless you need to. Make notes on your first pass.
- Be strategic about reading order. Read the most important content first, while your context window is freshest.
- Know when to stop reading and start acting. Reading can become procrastination. If you have enough information to begin, start working.
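The first point above -- read lines 200-250, not all 5,000 -- can be sketched with itertools.islice, which never materializes the lines outside the requested range:

```python
from itertools import islice

# Read only lines start..end (1-indexed, inclusive) of a large file,
# without holding the rest in memory.
def read_lines(path, start, end):
    with open(path) as f:
        return list(islice(f, start - 1, end))
```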
Read Critically
Not everything you read is correct. Code may have bugs. Documents may be outdated. Data may be dirty. Comments may lie.
Comments describe what the programmer intended, not necessarily what the code does. If a comment says "returns the user's age" but the code returns a date, the code is telling the truth. README files and wikis decay over time -- when documentation and code disagree, the code is the current truth.
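A minimal sketch of a lying comment, using the age example above. Note that the signature's return annotation already exposes the mismatch before you read a single line of the body:

```python
from datetime import date

# The comment claims an age; the code (and the return annotation)
# says otherwise. The code is telling the truth.
def get_user_age(birthdate: date) -> date:
    # Returns the user's age.
    return birthdate  # actually returns the birthdate unchanged
```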
Data requires its own skepticism. Missing values, encoding errors, inconsistent formats, and outliers are common. Don't assume a dataset is clean just because it loaded without errors.
Failure Modes
- Reading everything. Consuming entire files when you only need specific sections. A 10,000-line file might have 50 lines relevant to your task. Research on long-context LLMs confirms that performance degrades significantly when relevant information is buried among irrelevant content (Liu et al., 2024).
- Reading without purpose. Opening files without a clear question to answer.
- Surface reading. Scanning without comprehension, missing structure and relationships.
- Trusting the input. Assuming everything is correct, current, and complete.
- Ignoring structure in data. Treating JSON like a paragraph instead of reading the nesting, types, and relationships.
- Trusting dirty data. Assuming well-formedness when the data is messy. A sum that includes string values disguised as numbers produces nonsense.
- Context window hoarding. Loading many files "just in case," filling context with content you never use.
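The string-disguised-as-number failure above can be sketched in a few lines: coerce each value and flag what refuses to coerce, instead of trusting the column to be numeric.

```python
# Dirty-data sketch: a "numeric" column that mixes ints and strings.
# Coerce and flag rather than assume well-formedness. Values are made up.
values = [10, "20", 30, "oops"]

total = 0
rejected = []
for v in values:
    try:
        total += float(v)
    except (TypeError, ValueError):
        rejected.append(v)
```

The rejected list is the interesting output: it tells you how dirty the column is before you trust the total.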
Tips
- Always articulate your purpose before reading. "I'm reading this to find the database connection string" focuses your reading and helps you skip irrelevant content.
- Read structure before content. For code: directory structure, class hierarchy, function signatures. For documents: table of contents, section headings. For data: schema, column names, row count.
- When you find what you need, stop. Resist the urge to keep reading "just in case."
- Count things in structured data. How many records? How many fields? How many nulls? Quantitative observations catch problems that scanning misses.
- Cross-reference when it matters. For critical information, don't rely on a single source.
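The counting tip above can be sketched as a quick profiling pass over a list of records. The record shape is an assumption for illustration.

```python
from collections import Counter

# Quick quantitative pass: how many rows, which fields appear how often,
# and where the nulls are. Counts catch problems that eyeballing misses.
def profile(records):
    field_counts = Counter()
    null_counts = Counter()
    for rec in records:
        for key, value in rec.items():
            field_counts[key] += 1
            if value is None:
                null_counts[key] += 1
    return {"rows": len(records), "fields": field_counts, "nulls": null_counts}
```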
Frequently Asked Questions
How do I decide how much of a file to read? Start with your purpose. If you need a specific value, read only the relevant section. If you need to understand a module, read the public interface and skim implementations. For a thorough code review, close-read logic-heavy sections and skim boilerplate.
What do I do when a file is too large to read at once? Read the structure first (function definitions, section headers), then selectively read relevant parts. For very large documents, focus on representative samples -- the first 50 lines, the last 50, and a sample from the middle.
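Structure-first reading of a large source file can be sketched as pulling out only the definition lines before deciding which bodies deserve a close read. The regex handles the common case (Python def/class lines), not every edge case.

```python
import re

# Sketch: extract the structural skeleton (def/class lines) of a
# Python source file, skipping every function body.
def skeleton(source: str):
    return [line for line in source.splitlines()
            if re.match(r"\s*(def |class )", line)]
```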
How do I handle data that mixes structured and unstructured content? Many real-world datasets have structured fields containing unstructured text -- a JSON record with a description field. Process each part with the appropriate approach: use the structure to navigate and unstructured reading to interpret.
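A minimal sketch of that split, using a hypothetical ticket record: the JSON structure navigates to the field, and a light extraction pass interprets the free text inside it.

```python
import json
import re

# Hypothetical record: structured fields for navigation, a free-text
# description that needs reading rather than parsing.
record = json.loads("""
{"ticket_id": 4821, "priority": "high",
 "description": "Login fails after the 2024-11-03 deploy. Please fix by Friday."}
""")

# Structure navigates: pick the field directly.
text = record["description"]
# Unstructured reading interprets: a light extraction pass for dates.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
```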
How do I handle conflicting information across files? When code and comments disagree, trust the code. When code and documentation disagree, the code is current truth. When different parts of the codebase contradict, check timestamps and git blame. Note conflicts -- they're often indicators of bugs or tech debt.
Sources
- Kintsch, "The Role of Knowledge in Discourse Comprehension: A Construction-Integration Model," Psychological Review, 1988 — Foundational model of how readers construct mental representations through active integration of text and prior knowledge
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," Transactions of the Association for Computational Linguistics, 2024 — Study showing LLM performance degrades when relevant information is buried in the middle of long inputs
- Fang et al., "The Relationship Between Reading Strategy and Reading Comprehension: A Meta-Analysis," Frontiers in Psychology, 2021 — Meta-analysis showing that strategic reading -- using elaboration, monitoring, and organization together -- improves both surface and deep comprehension
- Kendeou et al., "Cognitive Skills Involved in Reading Comprehension," Languages, 2020 — Overview of decoding, language processing, and higher-order cognitive skills underlying reading comprehension
Related
- Writing -- the complementary capability
- Seeing -- visual input processing
- Memory and the Context Window -- where input lives
- Search and Retrieval -- finding things to read