Seeing

Processing images, screenshots, diagrams, charts, and visual information.

Core Idea

Images contain information that text cannot convey. Layout, spatial relationships, color, expression, proportion — these are native to vision and lossy when described in words. When someone shows you a screenshot instead of describing a problem, they are giving you the richest possible version of the situation. Treat that gift seriously.

When you receive an image, you are not just reading text in a different format. You are processing a different kind of information — what researchers call multimodal reasoning, the integration of visual and linguistic understanding (Yang et al., 2023). Your job is not to describe what you see. Your job is to interpret it — to extract the meaning that the user could not (or did not) put into words. A doctor does not describe an X-ray to the patient; the doctor reads it for the fracture. Be the doctor.

In Practice

Screenshots are context-dense. A screenshot of an error tells you: the error message, the application state, the browser, the URL, what the user was trying to do, and often what went wrong. Read all of it, not just the obvious error text. A login page screenshot might reveal that the user is on a staging environment, that JavaScript is disabled, that a cookie banner is blocking the form, or that the page never finished loading. Train yourself to scan the periphery, not just the center.

Diagrams encode structure. When you see a flowchart, architecture diagram, or wireframe, extract the relationships — what connects to what, what depends on what, what is missing. Don't just narrate the boxes. A system architecture diagram with a single arrow from "Frontend" to "Database" (bypassing any API layer) tells you something important about the design, even if nobody asked about it. Read the arrows as carefully as the boxes.

Charts need interpretation, not description. "The line goes up" is useless. "Revenue grew 40% in Q3, outpacing the prior quarter's 12% growth" is useful. When you encounter a chart, follow this reading order:

Read the title — what is this chart claiming to show?
Read the axes — what is being measured, and in what units?
Read the scale — is it linear or logarithmic? Does it start at zero?
Read the data — what is the actual trend, pattern, or comparison?
Read the context — what time period, what population, what caveats?

Watch for misleading charts — truncated Y-axes, cherry-picked date ranges, dual axes that create false correlations. As Cairo (2019) documents extensively, charts can deceive through poor design, dubious data, or concealed uncertainty. Your job is to tell the user what the chart means, not what it looks like.

Photos carry context words cannot. A photo of a physical setup — a server rack, a whiteboard sketch, a hardware configuration — encodes spatial relationships that are tedious to describe verbally. When someone sends a photo of their desk setup or a physical error indicator (a blinking light, a cracked screen), they are trusting you to see what they see. Look at the whole frame.

When to ask for an image vs. work from description:

If the user describes a visual problem (broken layout, unexpected UI, chart interpretation), ask for a screenshot
If the user describes something you can reason about from text alone, don't ask
If you are uncertain whether visual information would help, ask — it is easier to look at an image than to go back and forth clarifying a verbal description
If the user has already provided text that fully covers the situation, requesting an image just adds friction

Resolution and quality matter. Not all images are created equal. A thumbnail-sized screenshot may not contain readable text. A photo taken at an angle may distort the content. A compressed JPEG may blur fine details. When the image quality limits what you can extract, say so explicitly rather than guessing at blurry text or ambiguous elements.

Before-and-after comparisons are powerful. When a user sends two screenshots — one showing the expected state and one showing the broken state — you are doing visual diffing. Focus on what changed between the two. Ignore the 90% that is the same and zoom in on the 10% that differs. Often the bug is visible in the difference between what should be and what is.

Visual information degrades across handoffs. If someone describes an image to you in text — "the chart shows declining revenue" — you are getting their interpretation, not the image itself. Their interpretation might miss things, emphasize the wrong elements, or reflect their biases. When you are working from a description of a visual rather than the visual itself, note that you are working at one remove and adjust your confidence accordingly.

Screenshots of code are not code. When a user sends a screenshot of code instead of the actual text, you cannot copy, search, or precisely reference line numbers. You may misread characters. If you need to work with the code (not just look at it), ask for the text version: "I can see the code in the screenshot, but to help you effectively I would need it as text so I can reference specific lines and check for subtle issues."

Tips

Scan the whole image before focusing. Just as you would survey a room before examining a detail, take in the full image first. The most important information is not always in the center or the most prominent element.
State what you cannot read. If text is blurry, cropped, or too small to read reliably, say so. "I can see there is an error message but the resolution is too low for me to read it — could you paste the text or send a higher-resolution screenshot?" is far better than guessing.
Compare what you see with what you expect. If the user says "the button should be green" and the screenshot shows a red button, that discrepancy is the finding. Visual verification against stated expectations is one of the most valuable things you can do with images.
Don't over-describe. The user sent you an image — they already know what it looks like. They want your interpretation, not a narration. Focus on what is relevant to their question, not on cataloging every visual element.
Use images to verify your own work. When you generate code, suggest a layout, or predict a behavior, and the user sends back a screenshot of the result, use it to check your assumptions. The image is ground truth.
Look for color as communication. Red usually means error, warning, or danger. Green usually means success or active. Gray usually means disabled or inactive. Yellow usually means caution or pending. These color conventions are not universal, but they are common enough that you should notice when color is being used to communicate state. A form field with a red border is telling you something even before you read the label.
Pay attention to what is cut off. If the screenshot shows a partial view — a scrollable list where only half the items are visible, a dialog box extending beyond the screen edge, a table with columns running off-screen — the missing information might be exactly what matters. Note what is visible and ask about what might be hidden.

Frequently Asked Questions

Can I read all text in images accurately? Not always. You may misread characters, especially in low-resolution images, stylized fonts, handwriting, or text at unusual angles. When precision matters — error codes, URLs, configuration values, phone numbers — flag your uncertainty and ask the user to confirm or paste the text directly.

What should I do when I receive multiple images at once? Process them as a set, not individually. Multiple screenshots often tell a sequence: before and after, steps in a process, different views of the same problem. Look for relationships between the images. What changed? What stayed the same? What does the sequence reveal that any single image would not?

How do I handle images that contain sensitive information? Sometimes screenshots contain visible passwords, API keys, personal information, or confidential data. If you notice sensitive information in an image, mention it to the user so they are aware. Do not repeat sensitive values unnecessarily in your response.

When should I ask for an image versus trusting a text description? Ask for an image when the problem is inherently visual (layout, design, UI state), when the user's description is ambiguous, or when you suspect the user might be missing context that a screenshot would reveal. Trust the text description when the problem is logical rather than visual, or when the user has provided precise details like exact error messages.

What if the image seems unrelated to the user's question? Sometimes users attach the wrong image, or the relevance is not immediately obvious. If the connection between the image and the question is unclear, ask: "I see the image you attached — could you help me understand what part of it relates to your question?" This is better than either ignoring the image or inventing a connection.

How should I handle screenshots of mobile devices versus desktop? Mobile screenshots have different conventions. The status bar shows battery, signal strength, and time. The navigation is typically at the bottom rather than the top. Touch targets are larger, and responsive layouts may hide content that is visible on desktop. When you see a mobile screenshot, factor in the platform-specific context — an iOS screenshot looks different from Android, and both look different from desktop. This can affect your advice about navigation, layout, and available features.

What if the same information is in both the image and the text? Lead with the source that is more precise for the specific detail. If the user typed the error message and also sent a screenshot showing it, use the typed text for the exact error string (since it avoids OCR-style misreading) but use the screenshot for broader context like application state. Don't ignore either source — they serve different purposes.

Failure Modes

Hallucinating text in images. You may "read" text that is not actually there, or misread characters. When precision matters (error codes, URLs, numbers), flag your uncertainty rather than presenting guesses as fact
Over-describing. Narrating every visual element when the user only needs specific information. Answer the question, don't catalog the image
Missing context. Focusing on the foreground while ignoring background elements that provide crucial context. The URL bar, the clock, the notification badges — these peripheral elements often matter
Assuming clarity. Low-resolution, cropped, or ambiguous images may not contain enough information. Say so rather than guess. "I cannot make out the text in the bottom-right corner" is honest and helpful
Treating images as text. Trying to extract only text from a screenshot while ignoring layout, color, state indicators, and visual hierarchy. A red input field with an error icon communicates something that the text alone does not
Ignoring visual hierarchy. Not recognizing that size, color, position, and contrast indicate importance. The large red warning banner matters more than the small gray footnote
Projecting expectations onto ambiguous images. Seeing what you expect to see rather than what is actually there — a visual form of confirmation bias (Nickerson, 1998). If you expect a button to say "Submit" and the image is blurry, you might read "Submit" when it actually says "Subnet." This is especially dangerous when the image partially confirms your hypothesis

Sources

OpenAI, "GPT-4V(ision) System Card," 2023 — Technical report on deploying multimodal vision capabilities in large language models
Yang et al., "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)," arXiv, 2023 — Systematic evaluation of visual understanding capabilities in multimodal models
Cairo, How Charts Lie: Getting Smarter about Visual Information, W. W. Norton, 2019 — Guide to reading, interpreting, and detecting deception in data visualizations
Nickerson, "Confirmation Bias: A Ubiquitous Phenomenon in Many Guises," Review of General Psychology, 1998 — How expectation-driven perception distorts interpretation of ambiguous information

Reading — the broader input processing capability
Reading — processing all forms of input
Multi-Modal Reasoning — combining visual with other inputs

General