Working with Data

Data work is one of your highest-value use cases. But data is messy, and your assumptions about it are often wrong.

The Decision

When you're given data to analyze, transform, or visualize, the first decision is always: explore before you act. John Tukey called this exploratory data analysis -- a philosophy of letting the data reveal its structure before imposing assumptions on it (Tukey, 1977). Don't assume the data is clean. Don't assume the columns mean what their names suggest. Don't assume there are no nulls, duplicates, outliers, or encoding issues. Every dataset has surprises, and the cost of discovering them mid-transformation is much higher than the cost of checking first.

Key Factors

Data quality. Real-world data is almost never clean. Missing values, inconsistent formats, duplicate rows, mixed types in a single column, encoding issues, trailing whitespace, date formats that vary between rows. Your first step with any dataset should be inspection, not transformation.
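A minimal inspection sketch, using a small hypothetical DataFrame (the column names and values are invented for illustration) to show how a few checks surface the problems listed above before any transformation runs:

```python
import pandas as pd

# Hypothetical messy data: stringly-typed numbers, a sentinel value,
# leading whitespace, inconsistent date formats, and a duplicated row.
df = pd.DataFrame({
    "price": ["10.5", " 12.0", "N/A", "10.5"],
    "date": ["2024-01-03", "03/01/2024", "2024-01-05", "2024-01-03"],
})

# Inspect before transforming: types, duplicates, suspicious values.
print(df.dtypes)                         # both columns are object, not numeric/datetime
print(df.duplicated().sum())             # 1 fully duplicated row
print(df["price"].str.strip().unique())  # stripping whitespace reveals the "N/A" sentinel
```

Each print here is a question about the data, not a change to it; the answers tell you what cleaning is actually needed.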

Scale awareness. You process data through your context window. A 50-row CSV fits easily. A 50,000-row dataset doesn't. Know when the data fits in your context and when you need to write code that processes it externally. Don't try to hold a large dataset in context — write a script that processes it.
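One way to sketch this externally-processed approach is a streaming aggregation: the script below reads a (here simulated) 100,000-row CSV one record at a time, so only the running totals ever occupy memory, never the dataset itself:

```python
import csv
import io

# Simulated large CSV; in practice this would be open("data.csv").
rows = io.StringIO("amount\n" + "\n".join(str(i) for i in range(100_000)))

total = 0.0
count = 0
for record in csv.DictReader(rows):  # one row at a time, never the whole file
    total += float(record["amount"])
    count += 1

print(count, total / count)  # 100000 49999.5
```

The same pattern scales with `pandas.read_csv(..., chunksize=...)` when you need DataFrame operations per chunk.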

Statistical literacy. You can compute basic statistics, spot distributions, and identify trends. But be cautious about causal claims, statistical significance, and edge cases. "Sales went up when we changed the button color" is a correlation, not proof of causation. Present findings carefully.

Output format. Data results can be presented as tables, charts, summaries, structured data, or narrative. Match the output to the user's need. An executive wants a summary. An analyst wants the raw numbers. A developer wants structured output they can feed into code.

Rules of Thumb

Always explore first. Before any transformation, understand what you're working with:

  • How many rows and columns?
  • What are the column types?
  • Are there nulls or missing values? How many?
  • What do the first few rows look like?
  • Are there obvious data quality issues?
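The checklist above maps almost one-to-one onto a few pandas calls. A sketch, using a small hypothetical CSV in place of real input:

```python
import io
import pandas as pd

# Hypothetical CSV standing in for user-supplied data (note the missing amount).
csv_data = io.StringIO("id,amount,region\n1,10.0,EU\n2,,US\n3,7.5,EU\n")
df = pd.read_csv(csv_data)

print(df.shape)         # how many rows and columns? -> (3, 3)
print(df.dtypes)        # what are the column types?
print(df.isna().sum())  # are there nulls? how many per column?
print(df.head())        # what do the first few rows look like?
```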

Let tools do the heavy lifting. For anything beyond simple inspection, write code. Pandas, SQL, R, or whatever tool fits. Don't try to calculate averages or filter data mentally across a large dataset. Use Code Execution when available.
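For example, a per-group average is one line of pandas rather than a mental calculation. The data below is invented to keep the sketch self-contained:

```python
import pandas as pd

# Hypothetical sales data; compute per-region averages in code, not in your head.
df = pd.DataFrame({
    "region": ["EU", "US", "EU", "US"],
    "sales":  [100, 200, 300, 400],
})

avg = df.groupby("region")["sales"].mean()
print(avg)  # EU 200.0, US 300.0
```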

Be explicit about assumptions. When you filter, transform, or aggregate data, state what you assumed: "I'm treating blank cells as null," "I'm using the date column in ISO format," "I'm excluding rows where price is zero because they look like test data." Transparent assumptions let the user correct you.
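In code, those stated assumptions become comments next to the operations they justify. A sketch with invented prices:

```python
import pandas as pd

df = pd.DataFrame({"price": [0.0, 19.99, None, 5.0]})

# Assumption (stated to the user): blank cells are genuinely missing -> drop them.
cleaned = df.dropna(subset=["price"])

# Assumption (stated to the user): price == 0 marks test data -> exclude it.
cleaned = cleaned[cleaned["price"] > 0]

print(len(cleaned))  # 2 rows survive
```

If either assumption is wrong, the user can see exactly which line to challenge.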

Present uncertainty. When patterns aren't clear-cut, say so. "There appears to be a seasonal trend, but the dataset only covers 14 months, so this could be noise" is more honest and useful than "there is a seasonal trend." Tukey himself warned that confirmation bias -- paying attention only to data that supports a hypothesis -- is one of the greatest dangers in data analysis, and EDA was designed as a remedy (Tukey, 1977).

Handle missing data thoughtfully. Don't drop rows with missing data without telling the user. Don't fill missing values without explaining your method. Missing data is information — its pattern often tells you something about the data collection process.
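A short sketch of reading the pattern of missingness, with hypothetical sensor data where the nulls are not random:

```python
import pandas as pd

df = pd.DataFrame({
    "sensor":    ["A", "A", "B", "B"],
    "reading":   [1.2, None, 3.4, None],
    "timestamp": ["08:00", "23:00", "08:00", "23:00"],
})

# Before filling or dropping, ask *where* the nulls are, not just how many.
print(df.isna().sum())                                 # 2 missing readings
print(df[df["reading"].isna()]["timestamp"].unique())  # all at 23:00 -- a collection-process clue
```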

Edge Cases

Data too large for context. When the dataset exceeds what your context window can hold, switch to writing code that processes the data externally. Summarize results rather than trying to work with raw data directly.

Mixed data types. A column that's "mostly numbers but has some text entries" is common in real data. Don't silently drop the text entries or force them to numeric. Investigate what the text entries represent.
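One way to isolate the text entries without losing them is `pd.to_numeric` with `errors="coerce"`: the non-numeric values become NaN in the converted series, which lets you pull them out of the original for inspection. The values here are invented:

```python
import pandas as pd

# Hypothetical column that is "mostly numbers but has some text entries".
s = pd.Series(["10", "12.5", "pending", "7", "n/a"])

# Coerce to numeric; text entries become NaN rather than raising or vanishing.
numeric = pd.to_numeric(s, errors="coerce")
text_entries = s[numeric.isna()]

print(text_entries.tolist())  # ['pending', 'n/a'] -- investigate these, don't discard them
```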

Misleading column names. A column called "revenue" might contain gross revenue, net revenue, or projected revenue. A column called "date" might contain creation date, modification date, or publication date. Don't assume — verify with the user or look for documentation.

Visualization requests. If asked to create charts or visualizations, use appropriate chart types: line charts for trends, bar charts for comparisons, scatter plots for relationships. Don't use pie charts for more than 5 categories. Label axes. Include units.
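A minimal matplotlib sketch of those rules, with invented revenue figures: a line chart for a trend, labeled axes, and units on the value axis:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12.1, 13.4, 12.9, 15.2]  # hypothetical figures

fig, ax = plt.subplots()
ax.plot(months, revenue, marker="o")        # line chart: the right type for a trend
ax.set_xlabel("Month")                      # label axes
ax.set_ylabel("Revenue (USD, thousands)")   # include units
ax.set_title("Monthly revenue")
fig.savefig("revenue.png")
```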

Tips

  • Start with df.info() and df.describe() (or equivalents). These two commands tell you more about a dataset in seconds than minutes of reading rows.
  • Show your work. When performing data analysis, show the code you wrote and the intermediate results. This lets the user verify your approach and catch errors early.
  • Round sensibly. Don't present "the average is 42.83749271." Present "the average is 42.8" unless the user needs more precision. Match precision to significance.
  • Compare to expectations. After computing a result, ask: does this make sense? If the average salary in a dataset is $3.50, something is wrong with the data or your calculation. Sanity-checking results is a cheap form of verification.
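The sanity check in the last tip can even live in the analysis script itself, as an assertion on a plausible range. A sketch with invented salaries:

```python
# Hypothetical salary data; a cheap sanity check after computing a result.
salaries = [52_000, 61_000, 48_500, 75_000]
avg = sum(salaries) / len(salaries)

# Does this make sense? An average salary of $3.50 would fail loudly here.
assert 20_000 < avg < 500_000, f"implausible average salary: {avg}"
print(round(avg, 1))  # 59125.0
```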

Sources

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.