AI Writing Detectors: Do They Actually Work?
People keep asking me if AI detectors can tell when content is AI-generated. Teachers want to catch students. Editors want to verify writers. Businesses want to audit content.
I decided to actually test this instead of guessing.
The Experiment
I prepared five pieces of content:
- 100% Human: An article I wrote myself in 2022, before ChatGPT existed
- 100% AI: Raw ChatGPT output, no editing
- AI + Light Edit: ChatGPT output with minor edits
- AI + Heavy Edit: ChatGPT structure, mostly rewritten
- Human + AI Polish: My writing, edited by Claude for clarity
Then I ran each through 8 AI detection tools.
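If you want to rerun something like this yourself, here's a minimal sketch of how the batch could be scripted, assuming a detector that exposes a JSON API. The endpoint URL, auth header, and response field names are placeholders, not any real tool's API - several of these detectors are web-UI only, so treat this as a template rather than working integration code.

```python
# Minimal sketch of batch-scoring the five samples against a detector,
# assuming it exposes a JSON API. The endpoint, auth header, and the
# "ai_probability" response field are placeholders, not a real tool's API.
import requests

SAMPLES = {
    "human_2022":      "human_2022.txt",
    "pure_ai":         "pure_ai.txt",
    "ai_light_edit":   "ai_light_edit.txt",
    "ai_heavy_edit":   "ai_heavy_edit.txt",
    "human_ai_polish": "human_ai_polish.txt",
}

def score_text(endpoint: str, api_key: str, text: str) -> float:
    """POST one sample to a (hypothetical) detector endpoint and return its AI-probability score."""
    resp = requests.post(
        endpoint,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"text": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["ai_probability"]  # placeholder field name

if __name__ == "__main__":
    for label, path in SAMPLES.items():
        with open(path, encoding="utf-8") as f:
            text = f.read()
        score = score_text("https://detector.example/v1/score", "YOUR_API_KEY", text)
        print(f"{label}: {score:.0%} AI")
```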
The Results (Summarized)
| Content | GPTZero | Originality | Copyleaks | ZeroGPT | Writer | Content@Scale | Turnitin | Sapling |
|---|---|---|---|---|---|---|---|---|
| Human (2022) | 12% AI | 24% AI | Human | 8% AI | Human | 89% Human | 15% AI | 22% AI |
| Pure AI | 98% AI | 99% AI | AI | 94% AI | AI | 23% Human | 89% AI | 97% AI |
| AI + Light Edit | 87% AI | 91% AI | AI | 78% AI | Mixed | 45% Human | 72% AI | 84% AI |
| AI + Heavy Edit | 34% AI | 47% AI | Mixed | 28% AI | Human | 78% Human | 31% AI | 39% AI |
| Human + AI Polish | 45% AI | 52% AI | Mixed | 41% AI | AI | 67% Human | 48% AI | 51% AI |
What This Actually Means
Problem 1: They Flag Human Writing
My 2022 article - written before ChatGPT existed - was flagged as having AI content by multiple tools. Originality.ai said 24% AI. GPTZero said 12%.
If a human-written article fails the test, the test is broken.
Problem 2: Edited AI Content Passes
The heavily edited AI content was classified as human by most tools. Light editing wasn’t enough, but significant rewriting passed.
This means anyone who edits their AI output will pass detection.
Problem 3: Human + AI Polish Gets Flagged
Using AI just to improve human-written content triggered detection. So people using AI as an editing tool get flagged the same as people fully generating content.
Problem 4: Tools Disagree Wildly
The same content got vastly different scores across tools. The pure AI sample scored "98% AI" on one tool and "23% Human" (i.e., 77% AI) on another - and the tools don't even report their scores in the same direction. Which do you trust?
Problem 5: Writing Style Matters More Than Origin
Formal, structured writing gets flagged as AI more often - even when human-written. Casual, first-person writing passes more often - even when AI-generated.
The detectors aren’t detecting AI. They’re detecting writing style.
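Some detectors publicly describe their approach in terms of statistical signals like perplexity and "burstiness" (variation across sentences). To make that concrete, here's a toy sketch of one such signal, sentence-length variation. It's my own simplification for illustration - not how any of the tools above actually compute their scores - but it shows why stiff, uniform prose looks "AI-like" regardless of who wrote it.

```python
# Toy illustration of a style signal: variation in sentence length.
# This is a deliberate simplification, NOT any detector's real scoring
# method - it only shows that such signals respond to style, not origin.
import re
import statistics

def sentence_lengths(text: str) -> list[int]:
    """Split on sentence-ending punctuation and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def length_variation(text: str) -> float:
    """Standard deviation of sentence length: low = uniform sentences."""
    lengths = sentence_lengths(text)
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

formal = ("The method is effective. The results are consistent. "
          "The approach is scalable. The findings are significant.")
casual = ("I tested it myself. The results genuinely surprised me, mostly "
          "because I expected the exact opposite outcome. Nope.")

print(length_variation(formal))  # near zero: looks "formal/AI-like" to a style signal
print(length_variation(casual))  # much higher: looks "human-like"
```

The uniform paragraph scores near zero and the choppier, first-person one scores much higher - which lines up with Problem 5: the signal responds to style, not to who (or what) produced the text.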
What Fools The Detectors
Based on my testing, AI content passes when it:
- Uses first-person perspective
- Has varied sentence lengths
- Includes specific numbers and examples
- Contains casual language and contractions
- Avoids formal transition words
- Includes opinions and personal takes
In other words: write like a human, pass as human. The detectors catch lazy AI use, not skilled AI use.
The Real Problem
These tools are being used for high-stakes decisions:
- Teachers failing students
- Clients rejecting freelancers
- Publishers refusing submissions
But the accuracy isn’t good enough. False positives (human flagged as AI) happen too often.
Imagine failing a class because your natural writing style triggers AI detection. That’s happening.
My Recommendations
For Teachers:
Don’t use AI detectors as proof. Use them as one signal among many. Require in-class writing samples. Ask follow-up questions about the content. Require sources and research that demonstrate actual engagement.
For Publishers/Editors:
Same thing. Detection tools can flag content for review, not automatically reject it. Judge content quality, not origin.
For Writers:
If you’re using AI assistance (which is fine), edit thoroughly. Add personal experiences. Use your natural voice. Don’t publish raw AI output.
For Businesses:
Don’t build policies around AI detection tools being accurate. They’re not. Focus on quality standards instead.
Which Detector Is “Best”?
If I had to pick one, Originality.ai had the fewest obviously wrong results. But “best” is relative when all of them have significant issues.
None achieved the accuracy I’d trust for consequential decisions.
The Uncomfortable Truth
AI detection is fundamentally flawed because:
- AI is trained on human writing
- Some humans write formally (like AI tends to)
- Editing AI output makes it less detectable
- Writing style matters more than generation method
The tools are detecting patterns, not origin. And the patterns aren’t unique to AI.
As AI models improve and mimic human writing better, detection will get harder, not easier.
Bottom Line
AI writing detectors don’t work reliably enough for high-stakes use.
They can catch lazy, unedited AI output. They can’t reliably distinguish between:
- Skilled AI-assisted writing
- Formal human writing
- Human writing edited with AI
Use them as a screening tool, not a verdict. And never punish someone based solely on AI detection results.
The technology isn’t there yet. Maybe it never will be.
Frequently Asked Questions
Do AI writing detectors actually work?
Inconsistently. I tested 8 detectors with the same content - results varied wildly. Human-written content was flagged as AI 30%+ of the time. AI content passed as human 40%+ of the time. They're not reliable enough for high-stakes decisions.
Should teachers use AI detectors to catch students?
AI detectors are unreliable enough that using them as proof is problematic. They flag human writing as AI frequently. Better approach: oral follow-ups, in-class writing samples, and assignments that require personal experience.
Which AI detector is the most accurate?
Originality.ai performed best in my testing but still had significant false positives. No detector achieved reliable accuracy. Results vary by writing style, topic, and length. None should be trusted completely.