GPT-4 Vision: What Can It Actually See?
GPT-4 can now see. You send images; it analyzes them.
We tested its capabilities extensively. Here’s what actually works.
What GPT-4 Vision Does
Upload an image to ChatGPT. Ask questions. Get answers.
- “What’s in this image?”
- “Read the text in this photo”
- “What’s wrong with this code screenshot?”
- “Describe this chart”
It’s multimodal—text and images together.
What It Does Well
1. Reading Text in Images
Test: Screenshot of an article
Result: Reads text accurately, even at angles
Practical use:
- Extracting text from photos
- Reading signs, menus, documents
- Processing screenshots
Accuracy: 9/10 — Very reliable for clear text
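If you want to do text extraction programmatically rather than through ChatGPT, the API accepts images either as public URLs or as base64 data URLs. A minimal sketch of the encoding step for a local file (the helper name and default MIME type are our choices, not part of any official SDK):

```python
import base64

def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Encode a local image file as a base64 data URL suitable for
    embedding in a vision chat request."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

The resulting string goes into the request wherever an image URL is expected; for photos of documents, sending the original resolution rather than a thumbnail noticeably improves the extracted text.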
2. Describing Scenes
Test: Photo of a city street
Result: Identifies buildings, people, vehicles, signs, weather
Practical use:
- Image cataloging
- Accessibility descriptions
- Understanding visual content
Accuracy: 8/10 — Good general descriptions
3. Analyzing Charts and Graphs
Test: Bar chart with sales data
Result: Identifies type of chart, reads values, summarizes trends
Practical use:
- Quick data analysis
- Understanding reports
- Chart summarization
Accuracy: 8/10 — Good for clear charts
4. Code Screenshots
Test: Screenshot of Python code with a bug
Result: Reads code, identifies the bug, suggests fix
Practical use:
- Debugging from screenshots
- Code review from images
- Understanding code in tutorials
Accuracy: 9/10 — Excellent for code
5. Identifying Objects
Test: Various object photos
Result: Correctly identifies most common objects
Practical use:
- “What is this thing?”
- Plant identification
- Product identification
Accuracy: 7/10 — Good for common items
What It Struggles With
1. Small or Low-Quality Text
Blurry photos, tiny text, unusual fonts—accuracy drops significantly.
2. Counting
“How many people are in this photo?” Often wrong. Counting is hard.
3. Spatial Reasoning
“What’s to the left of the chair?” Sometimes confused about spatial relationships.
4. Faces
GPT-4 intentionally won’t identify specific people. It will describe faces generally.
5. Complex Technical Diagrams
Detailed schematics, complex flowcharts—accuracy decreases with complexity.
6. Handwriting
Neat handwriting: okay. Messy handwriting: struggles.
Practical Use Cases
For Work
- Read receipts — Extract expenses from photos
- Analyze mockups — Get feedback on designs
- Process documents — When you only have images
- Debug code — Send screenshots of errors
For Learning
- Homework help — Photo of a problem, get explanation
- Identify things — “What bird is this?”
- Read foreign signs — Then translate
For Accessibility
- Describe images — For visually impaired users
- Read printed text — When you can’t type it
For Fun
- Analyze art — “What style is this painting?”
- Identify locations — “Where might this be?”
- Interpret memes — Yes, it understands most memes
How to Get Better Results
1. Image Quality Matters
Higher resolution = better analysis. Crop to the relevant part.
2. Be Specific in Questions
“What’s in this image?” → General description
“Read the text on the top sign” → Specific answer
3. Ask Follow-Up Questions
Start broad, then get specific. The conversation maintains context.
4. Multiple Images
You can send multiple images for comparison or context.
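In the API, each image is one entry in the user message’s content array, alongside the text prompt, so sending several images for comparison is just adding more entries. A sketch of the request shape (the helper name is ours, and the default model string is an assumption — check the current model list before using it):

```python
def build_vision_request(prompt: str, image_urls: list[str],
                         model: str = "gpt-4o") -> dict:
    """Build a chat-completions request body with one text prompt
    and any number of images (URLs or base64 data URLs)."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}
```

Because all the images share one prompt, you can ask comparative questions like “Which of these two mockups has better contrast?” in a single request.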
Limitations to Know
- No real-time vision — Analyzes uploaded images, not live camera
- Privacy considerations — Your images go to OpenAI
- Not always right — Verify important information
- Not medical/legal advice — Don’t use for diagnoses
Cost
Vision is included in ChatGPT Plus ($20/month).
API pricing is based on image size and detail level:
- Small or low-detail images: a fraction of a cent to a few cents per request
- Larger high-detail images: cost grows with resolution
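OpenAI has published how image size maps to tokens for GPT-4 vision: a low-detail image costs a flat 85 tokens, while a high-detail image is downscaled (to fit 2048×2048, then shortest side to 768) and billed at 85 base tokens plus 170 per 512-pixel tile. Exact numbers and prices change over time, so treat this as a sketch of the documented scheme at the time of writing:

```python
import math

def estimate_image_tokens(width: int, height: int,
                          detail: str = "high") -> int:
    """Estimate vision input tokens per OpenAI's published tiling rule."""
    if detail == "low":
        return 85  # flat cost regardless of size
    # Downscale to fit within 2048x2048, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Then downscale so the shortest side is at most 768.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Count 512px tiles: 85 base tokens + 170 per tile.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 512×512 image at high detail needs one tile (255 tokens), and a 1024×1024 image needs four tiles (765 tokens); multiply by the current per-token input price to get the cost.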
Our Verdict
GPT-4 Vision is genuinely useful. It’s not perfect, but it handles most common image analysis tasks well.
Best for:
- Text extraction
- Code screenshots
- General image understanding
- Charts and data visualization
Not for:
- Medical imaging
- Security/identification
- Precision counting
- Complex technical analysis
Rating: 8/10 — Impressive and practical
Multimodal AI is here. Images and text together open new possibilities.