captcha-llm

ASCII art and overlapping audio as CAPTCHAs resilient to frontier multimodal LLMs.
No model achieved full accuracy on text input, and only one image-input sample out of 628 was solved exactly.

Try the CAPTCHA yourself

Read the ASCII art and type what you see. State-of-the-art multimodal LLMs averaged under 56% character similarity on this task; you should find it trivial.


Paper results: ASCII CAPTCHA (250 samples per model)

Full accuracy requires an exact string match. Text similarity uses Python's SequenceMatcher ratio. No model achieved full accuracy on text input.

| Model | Input | Exact Accuracy | Text Similarity | Avg Response |
|---|---|---|---|---|
| Gemini 3 Flash Preview | image | 0.16% (1/628) † | 55.48% | 3.25s |
| GPT-5.2 | image | 0.00% | 28.20% | 3.46s |
| Qwen3-VL-30B | image | 0.00% | 20.06% | 1.79s |
| Claude Sonnet 4.5 | image | 0.00% | 19.26% | 5.86s |
| Llama 4 Scout | image | 0.00% | 14.04% | 2.08s |
| Gemini 3 Flash Preview | text | 0.00% | 39.38% | 1.96s |
| Claude Sonnet 4.5 | text | 0.00% | 19.17% | 2.55s |
| DeepSeek v3.2-exp | text | 0.00% | 12.78% | 84.49s |

† Gemini 3 Flash Preview (image) produced 1 exact match across all 628 samples (250 per model, 2 modalities), giving 0.16% exact accuracy, the only non-zero result. DeepSeek v3.2-exp averaged 84.49s per response and still achieved 0% exact accuracy.
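
The two metrics reported in the table can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation script: the function names are hypothetical, and it assumes no case or whitespace normalisation is applied before comparison, which the page does not specify.

```python
from difflib import SequenceMatcher


def exact_match(prediction: str, target: str) -> bool:
    """Full accuracy: the prediction must equal the target string exactly."""
    return prediction == target


def text_similarity(prediction: str, target: str) -> float:
    """SequenceMatcher ratio in [0, 1]: 2 * matched chars / total chars."""
    return SequenceMatcher(None, prediction, target).ratio()


def score(predictions: list[str], targets: list[str]) -> dict[str, float]:
    """Aggregate both metrics over a batch and return them as percentages."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal length")
    n = len(targets)
    exact = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    sim = sum(text_similarity(p, t) for p, t in zip(predictions, targets))
    return {"exact_accuracy": 100 * exact / n, "text_similarity": 100 * sim / n}
```

For example, `text_similarity("HELL0", "HELLO")` is 0.8 (four of five characters align), while `exact_match` on the same pair is `False`. This is why a model can score tens of percent on similarity and still sit at 0% exact accuracy.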

Model failure examples

Representative failure cases from the paper evaluation, rendered with the same fonts used in experiments. Model responses are example outputs from the evaluation runs.


Audio CAPTCHA: overlapping speech

CommonsenseQA 5-choice questions synthesised with XTTS-v2, then evaluated under four conditions: clean baseline, background noise, Gaussian noise, and overlapping speech. The random baseline is 20% (5-choice).

| Model | Baseline (clean) | + Background noise | + Gaussian noise | + Overlapping speech |
|---|---|---|---|---|
| Gemini 3 Flash Preview | 75% | 50% | 59% | 48% |
| VoxTral Small | 73% | 31% | 46% | 40% |
| GPT Audio Mini | 46% | 23% | 20% | 27% |
| Random baseline | 20% | 20% | 20% | 20% |

Under overlapping speech, every model loses accuracy relative to clean audio, and GPT Audio Mini (27%) approaches the random 20% baseline. GPT Audio Mini reaches exactly 20% under Gaussian noise, indistinguishable from random guessing. 100 samples per noise condition.
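
The noise conditions above all amount to adding an interfering signal to the synthesised question. A minimal sketch of that mixing step follows, assuming the interferer is scaled to a target signal-to-noise ratio. The page does not state the levels used, so `snr_db` and all function names here are hypothetical, and plain Python lists stand in for real audio buffers.

```python
import math
import random


def rms(signal: list[float]) -> float:
    """Root-mean-square amplitude of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech: list[float], interferer: list[float], snr_db: float) -> list[float]:
    """Add an interferer to speech, scaled so speech/interferer RMS equals snr_db."""
    n = min(len(speech), len(interferer))
    speech, interferer = speech[:n], interferer[:n]
    interferer_rms = rms(interferer)
    if interferer_rms == 0.0:
        raise ValueError("interferer is silent; cannot scale it to a target SNR")
    # Gain that brings the interferer RMS to speech RMS / 10^(snr_db / 20).
    gain = rms(speech) / (interferer_rms * 10 ** (snr_db / 20))
    return [s + gain * v for s, v in zip(speech, interferer)]


def gaussian_noise(n: int, seed: int = 0) -> list[float]:
    """White Gaussian noise with unit variance (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

For the Gaussian condition the interferer would be `gaussian_noise(len(speech))`. For overlapping speech it would be a second synthesised utterance passed through the same `mix_at_snr` call, which is harder to filter out because the interferer has the same spectral character as the target.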

Why ASCII art defeats frontier LLMs

Two orthogonal failure modes, one for vision models and one for text models, both rooted in deep mismatches with how LLMs process information.

- Gestalt mismatch: Vision models detect local textures; humans ignore individual characters and perceive the global shape they form
- Tokenization breaks alignment: BPE tokenises a 2D grid as a 1D stream; vertical structure across rows is destroyed
- No training signal: ASCII art reading is rare in pre-training data; there is no learned perceptual shortcut to exploit
- Cost-of-attack asymmetry: 145s of reasoning for 1 char correct means solving at scale is economically non-viable
- Instant generation: 0.011s per sample; CAPTCHA generation is ~10,000× cheaper than solving
- 50+ fonts: Each font is a distinct visual encoding; models cannot specialise on one
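
The tokenization point can be made concrete with a small sketch. It works at the character level rather than with a real BPE tokenizer, so it is a simplification: actual tokens merge variable-length runs of characters, which makes column alignment even less regular than shown here. The function names are hypothetical.

```python
def flatten(grid: list[str]) -> str:
    """Serialise a 2D character grid row by row, as a text model receives it."""
    return "\n".join(grid)


def flat_index(row: int, col: int, width: int) -> int:
    """Position of cell (row, col) in the flattened string.

    Each row contributes `width` characters plus one newline separator.
    """
    return row * (width + 1) + col


def vertical_gap(width: int) -> int:
    """Distance in the 1D stream between a cell and the cell directly below it."""
    return flat_index(1, 0, width) - flat_index(0, 0, width)


if __name__ == "__main__":
    # A 5x5 letter "H": the two vertical strokes are obvious in 2D.
    grid = [
        "#   #",
        "#   #",
        "#####",
        "#   #",
        "#   #",
    ]
    flat = flatten(grid)
    print(repr(flat))
    print("vertical neighbours are", vertical_gap(5), "positions apart")
```

In the grid, each `#` of a vertical stroke sits directly above the next one. In the flattened stream they are `width + 1` positions apart, so for 80-column art a model must relate characters 81 positions away to recover a single vertical line.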