GPT-4 Vision: What Can It Actually See?
GPT-4 can now see. You send images; it analyzes them.
We tested its capabilities extensively. Here’s what actually works.
What GPT-4 Vision Does
Upload an image to ChatGPT. Ask questions. Get answers.
- “What’s in this image?”
- “Read the text in this photo”
- “What’s wrong with this code screenshot?”
- “Describe this chart”
It’s multimodal—text and images together.
What It Does Well
1. Reading Text in Images
Test: Screenshot of an article
Result: Reads text accurately, even at angles
Practical use:
- Extracting text from photos
- Reading signs, menus, documents
- Processing screenshots
Accuracy: 9/10 — Very reliable for clear text
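If you want to do text extraction programmatically rather than through ChatGPT, the API accepts images either as public URLs or as base64 data URLs. A minimal sketch of the encoding step for a local file (the helper name and default MIME type are our choices, not part of any official SDK):

```python
import base64

def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Encode a local image file as a base64 data URL suitable for
    embedding in a vision chat request."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

The resulting string goes into the request wherever an image URL is expected; for photos of documents, sending the original resolution rather than a thumbnail noticeably improves the extracted text.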
2. Describing Scenes
Test: Photo of a city street
Result: Identifies buildings, people, vehicles, signs, weather
Practical use:
- Image cataloging
- Accessibility descriptions
- Understanding visual content
Accuracy: 8/10 — Good general descriptions
3. Analyzing Charts and Graphs
Test: Bar chart with sales data
Result: Identifies type of chart, reads values, summarizes trends
Practical use:
- Quick data analysis
- Understanding reports
- Chart summarization
Accuracy: 8/10 — Good for clear charts
4. Code Screenshots
Test: Screenshot of Python code with a bug
Result: Reads code, identifies the bug, suggests fix
Practical use:
- Debugging from screenshots
- Code review from images
- Understanding code in tutorials
Accuracy: 9/10 — Excellent for code
5. Identifying Objects
Test: Various object photos
Result: Correctly identifies most common objects
Practical use:
- “What is this thing?”
- Plant identification
- Product identification
Accuracy: 7/10 — Good for common items
What It Struggles With
1. Small or Low-Quality Text
Blurry photos, tiny text, unusual fonts—accuracy drops significantly.
2. Counting
“How many people are in this photo?” Often wrong. Counting is hard.
3. Spatial Reasoning
“What’s to the left of the chair?” Sometimes confused about spatial relationships.
4. Faces
GPT-4 intentionally won’t identify specific people. It will describe faces generally.
5. Complex Technical Diagrams
Detailed schematics, complex flowcharts—accuracy decreases with complexity.
6. Handwriting
Neat handwriting: okay. Messy handwriting: struggles.
Practical Use Cases
For Work
- Read receipts — Extract expenses from photos
- Analyze mockups — Get feedback on designs
- Process documents — When you only have images
- Debug code — Send screenshots of errors
For Learning
- Homework help — Photo of a problem, get explanation
- Identify things — “What bird is this?”
- Read foreign signs — Then translate
For Accessibility
- Describe images — For visually impaired users
- Read printed text — When you can’t type it
For Fun
- Analyze art — “What style is this painting?”
- Identify locations — “Where might this be?”
- Interpret memes — Yes, it understands most memes
How to Get Better Results
1. Image Quality Matters
Higher resolution = better analysis. Crop to the relevant part.
2. Be Specific in Questions
“What’s in this image?” → General description
“Read the text on the top sign” → Specific answer
3. Ask Follow-Up Questions
Start broad, then get specific. The conversation maintains context.
4. Multiple Images
You can send multiple images for comparison or context.
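In the API, each image is one entry in the user message’s content array, alongside the text prompt, so sending several images for comparison is just adding more entries. A sketch of the request shape (the helper name is ours, and the default model string is an assumption — check the current model list before using it):

```python
def build_vision_request(prompt: str, image_urls: list[str],
                         model: str = "gpt-4o") -> dict:
    """Build a chat-completions request body with one text prompt
    and any number of images (URLs or base64 data URLs)."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}
```

Because all the images share one prompt, you can ask comparative questions like “Which of these two mockups has better contrast?” in a single request.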
Limitations to Know
- No real-time vision — Analyzes uploaded images, not live camera
- Privacy considerations — Your images go to OpenAI
- Not always right — Verify important information
- Not medical/legal advice — Don’t use for diagnoses
Cost
Vision is included in ChatGPT Plus ($20/month).
API pricing is based on image size and detail level:
- Small or low-detail images: a fraction of a cent to a few cents per request
- Larger high-detail images: cost grows with resolution
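OpenAI has published how image size maps to tokens for GPT-4 vision: a low-detail image costs a flat 85 tokens, while a high-detail image is downscaled (to fit 2048×2048, then shortest side to 768) and billed at 85 base tokens plus 170 per 512-pixel tile. Exact numbers and prices change over time, so treat this as a sketch of the documented scheme at the time of writing:

```python
import math

def estimate_image_tokens(width: int, height: int,
                          detail: str = "high") -> int:
    """Estimate vision input tokens per OpenAI's published tiling rule."""
    if detail == "low":
        return 85  # flat cost regardless of size
    # Downscale to fit within 2048x2048, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Then downscale so the shortest side is at most 768.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Count 512px tiles: 85 base tokens + 170 per tile.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 512×512 image at high detail needs one tile (255 tokens), and a 1024×1024 image needs four tiles (765 tokens); multiply by the current per-token input price to get the cost.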
Our Verdict
GPT-4 Vision is genuinely useful. It’s not perfect, but it handles most common image analysis tasks well.
Best for:
- Text extraction
- Code screenshots
- General image understanding
- Charts and data visualization
Not for:
- Medical imaging
- Security/identification
- Precision counting
- Complex technical analysis
Rating: 8/10 — Impressive and practical
Multimodal AI is here. Images and text together open new possibilities.