Tutorials

GPT-4 Vision: What Can It Actually See? (2024)

January 25, 2024 3 min read

GPT-4 Vision: What Can It Actually See?

GPT-4 can now see. You send images, it analyzes them.

We tested its capabilities extensively. Here’s what actually works.

What GPT-4 Vision Does

Upload an image to ChatGPT. Ask questions. Get answers.

  • “What’s in this image?”
  • “Read the text in this photo”
  • “What’s wrong with this code screenshot?”
  • “Describe this chart”

It’s multimodal—text and images together.

What It Does Well

1. Reading Text in Images

Test: Screenshot of an article Result: Reads text accurately, even at angles

Practical use:

  • Extracting text from photos
  • Reading signs, menus, documents
  • Processing screenshots

Accuracy: 9/10 — Very reliable for clear text

2. Describing Scenes

Test: Photo of a city street Result: Identifies buildings, people, vehicles, signs, weather

Practical use:

  • Image cataloging
  • Accessibility descriptions
  • Understanding visual content

Accuracy: 8/10 — Good general descriptions

3. Analyzing Charts and Graphs

Test: Bar chart with sales data Result: Identifies type of chart, reads values, summarizes trends

Practical use:

  • Quick data analysis
  • Understanding reports
  • Chart summarization

Accuracy: 8/10 — Good for clear charts

4. Code Screenshots

Test: Screenshot of Python code with a bug Result: Reads code, identifies the bug, suggests fix

Practical use:

  • Debugging from screenshots
  • Code review from images
  • Understanding code in tutorials

Accuracy: 9/10 — Excellent for code

5. Identifying Objects

Test: Various object photos Result: Correctly identifies most common objects

Practical use:

  • “What is this thing?”
  • Plant identification
  • Product identification

Accuracy: 7/10 — Good for common items

What It Struggles With

1. Small or Low-Quality Text

Blurry photos, tiny text, unusual fonts—accuracy drops significantly.

2. Counting

“How many people are in this photo?” Often wrong. Counting is hard.

3. Spatial Reasoning

“What’s to the left of the chair?” Sometimes confused about spatial relationships.

4. Faces

GPT-4 intentionally won’t identify specific people. It will describe faces generally.

5. Complex Technical Diagrams

Detailed schematics, complex flowcharts—accuracy decreases with complexity.

6. Handwriting

Neat handwriting: okay. Messy handwriting: struggles.

Practical Use Cases

For Work

  • Read receipts — Extract expenses from photos
  • Analyze mockups — Get feedback on designs
  • Process documents — When you only have images
  • Debug code — Send screenshots of errors

For Learning

  • Homework help — Photo of a problem, get explanation
  • Identify things — “What bird is this?”
  • Read foreign signs — Then translate

For Accessibility

  • Describe images — For visually impaired users
  • Read printed text — When you can’t type it

For Fun

  • Analyze art — “What style is this painting?”
  • Identify locations — “Where might this be?”
  • Interpret memes — Yes, it understands most memes

How to Get Better Results

1. Image Quality Matters

Higher resolution = better analysis. Crop to the relevant part.

2. Be Specific in Questions

“What’s in this image?” → General description “Read the text on the top sign” → Specific answer

3. Ask Follow-Up Questions

Start broad, then get specific. The conversation maintains context.

4. Multiple Images

You can send multiple images for comparison or context.

Limitations to Know

  • No real-time vision — Analyzes uploaded images, not live camera
  • Privacy considerations — Your images go to OpenAI
  • Not always right — Verify important information
  • Not medical/legal advice — Don’t use for diagnoses

Cost

Vision is included in ChatGPT Plus ($20/month).

API pricing is based on image size:

  • Small images: Few cents
  • Larger images: More expensive

Our Verdict

GPT-4 Vision is genuinely useful. It’s not perfect, but it handles most common image analysis tasks well.

Best for:

  • Text extraction
  • Code screenshots
  • General image understanding
  • Charts and data visualization

Not for:

  • Medical imaging
  • Security/identification
  • Precision counting
  • Complex technical analysis

Rating: 8/10 — Impressive and practical


Multimodal AI is here. Images and text together open new possibilities.

Disclosure: This post contains affiliate links. If you click through and make a purchase, we may earn a commission at no extra cost to you. We only recommend tools we genuinely believe in.