Try the CAPTCHA yourself
Read the ASCII art and type what you see. State-of-the-art multimodal LLMs averaged below 56% text similarity, but you should find this trivial.
Paper results: ASCII CAPTCHA (250 samples per model)
Exact accuracy requires an exact string match. Text similarity is the ratio returned by Python's difflib.SequenceMatcher. No model produced a single exact match on text input.
| Model | Input | Exact Accuracy | Text Similarity | Avg Response |
|---|---|---|---|---|
| Gemini 3 Flash Preview | image | 0.16% (1/628) | 55.48% | 3.25s |
| GPT-5.2 | image | 0.00% | 28.20% | 3.46s |
| Qwen3-VL-30B | image | 0.00% | 20.06% | 1.79s |
| Claude Sonnet 4.5 | image | 0.00% | 19.26% | 5.86s |
| Llama 4 Scout | image | 0.00% | 14.04% | 2.08s |
| Gemini 3 Flash Preview | text | 0.00% | 39.38% | 1.96s |
| Claude Sonnet 4.5 | text | 0.00% | 19.17% | 2.55s |
| DeepSeek v3.2-exp | text | 0.00% | 12.78% | 84.49s |
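The two metrics in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation script: the function names are hypothetical, and it assumes raw strings are compared with no case or whitespace normalisation, which the text here does not specify.

```python
from difflib import SequenceMatcher


def exact_match(prediction: str, target: str) -> bool:
    """True only when the answer equals the target string exactly."""
    # Assumption: no case folding or whitespace stripping is applied.
    return prediction == target


def text_similarity(prediction: str, target: str) -> float:
    """SequenceMatcher ratio in [0, 1].

    The ratio is 2*M / T, where M is the number of matched characters
    and T is the combined length of both strings.
    """
    return SequenceMatcher(None, prediction, target).ratio()


def score(predictions, targets):
    """Return (exact accuracy, mean text similarity) over paired samples."""
    pairs = list(zip(predictions, targets))
    exact = sum(exact_match(p, t) for p, t in pairs) / len(pairs)
    similarity = sum(text_similarity(p, t) for p, t in pairs) / len(pairs)
    return exact, similarity
```

Because the ratio rewards any shared subsequence, a response can score well above zero on similarity while never matching exactly, which is the pattern the table shows.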
Model failure examples
Representative failure cases from the paper evaluation, rendered with the same fonts used in experiments. Model responses are example outputs from the evaluation runs.
Audio CAPTCHA: overlapping speech
CommonsenseQA 5-choice questions are synthesised with XTTS-v2, then evaluated under four conditions: a clean baseline and three noise augmentations. Random-guess accuracy is 20% (5-choice).
| Model | Baseline (clean) | + Background noise | + Gaussian noise | + Overlapping speech |
|---|---|---|---|---|
| Gemini 3 Flash Preview | 75% | 50% | 59% | 48% |
| VoxTral Small | 73% | 31% | 46% | 40% |
| GPT Audio Mini | 46% | 23% | 20% | 27% |
| Random baseline | 20% | 20% | 20% | 20% |
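As a rough illustration of the Gaussian-noise and overlapping-speech conditions above, the sketch below perturbs a list of audio samples using only the standard library. The function names, the 10 dB signal-to-noise ratio, and the 0.5 distractor gain are assumptions chosen for the example; the settings used in the paper are not given here.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_gaussian_noise(samples, snr_db, rng=None):
    """Add zero-mean Gaussian noise scaled to a target SNR in decibels."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    # SNR_dB = 20 * log10(signal_rms / noise_rms), solved for noise_rms.
    noise_rms = rms(samples) / (10 ** (snr_db / 20))
    return [s + rng.gauss(0.0, noise_rms) for s in samples]


def overlap_speech(target, distractor, distractor_gain=0.5):
    """Mix a second utterance into the target, looping it if it is shorter."""
    return [
        t + distractor_gain * distractor[i % len(distractor)]
        for i, t in enumerate(target)
    ]


# Usage: a one-second 440 Hz tone at 16 kHz stands in for synthesised speech.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_gaussian_noise(tone, snr_db=10)
```

A background-noise condition could be approximated the same way as overlap_speech, with an environmental recording in place of the second utterance.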
Why ASCII art defeats frontier LLMs
Two orthogonal failure modes, one for vision models and one for text models, both rooted in deep mismatches with how LLMs process information.