captcha-llm: interactive demo

Try the CAPTCHA yourself

Read the ASCII art and type what you see. State-of-the-art multimodal LLMs averaged under 56% character similarity; you should find this trivial.

Paper results: ASCII CAPTCHA (250 samples per model)

Exact accuracy requires an exact string match. Text similarity is the ratio returned by Python's difflib.SequenceMatcher. No model produced a single exact match on text input.

Model	Input	Exact Accuracy	Text Similarity	Avg Response Time
Gemini 3 Flash Preview	image	0.16% (1/628)†	55.48%	3.25s
GPT-5.2	image	0.00%	28.20%	3.46s
Qwen3-VL-30B	image	0.00%	20.06%	1.79s
Claude Sonnet 4.5	image	0.00%	19.26%	5.86s
Llama 4 Scout	image	0.00%	14.04%	2.08s
Gemini 3 Flash Preview	text	0.00%	39.38%	1.96s
Claude Sonnet 4.5	text	0.00%	19.17%	2.55s
DeepSeek v3.2-exp	text	0.00%	12.78%	84.49s

† Gemini 3 Flash Preview (image) achieved 1 correct answer across all 628 samples (250 per model, 2 modalities), giving 0.16% exact accuracy, the only non-zero result. DeepSeek v3.2-exp averaged 84.49s per response and still achieved 0% accuracy.
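
The two metrics in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation script: the function names are hypothetical, and it assumes responses are compared to the ground truth as-is, with no case folding or whitespace stripping (the source does not say whether any normalisation was applied).

```python
from difflib import SequenceMatcher


def exact_match(prediction: str, truth: str) -> bool:
    """Exact accuracy counts a response only if it equals the answer exactly."""
    return prediction == truth


def text_similarity(prediction: str, truth: str) -> float:
    """SequenceMatcher ratio: 2 * matched characters / total characters."""
    return SequenceMatcher(None, prediction, truth).ratio()


# A response that confuses the letter O with the digit 0 fails the exact
# match but still earns partial credit on similarity.
print(exact_match("HELL0", "HELLO"))      # False
print(text_similarity("HELL0", "HELLO"))  # 0.8
```

Because the ratio rewards any shared subsequence, a model can score well above 0% similarity while never producing a fully correct string, which is the pattern the table shows.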

Model failure examples

Representative failure cases from the paper evaluation, rendered with the same fonts used in experiments. Model responses are example outputs from the evaluation runs.

Audio CAPTCHA: overlapping speech

CommonsenseQA 5-choice questions are synthesised with XTTS-v2 TTS and evaluated under four conditions: clean, background noise, Gaussian noise, and overlapping speech. The random baseline is 20% (5-choice).

Model	Baseline (clean)	+ Background noise	+ Gaussian noise	+ Overlapping speech
Gemini 3 Flash Preview	75%	50%	59%	48%
VoxTral Small	73%	31%	46%	40%
GPT Audio Mini	46%	23%	20%	27%
Random baseline	20%	20%	20%	20%

Under overlapping speech, every model falls well below its clean baseline, and GPT Audio Mini (27%) approaches the random 20% baseline. GPT Audio Mini reaches exactly 20% under Gaussian noise, indistinguishable from random guessing. Each noise condition uses 100 samples.
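
For readers who want a feel for the Gaussian-noise and overlapping-speech conditions, here is a rough sketch of both augmentations on a raw waveform. Everything specific in it is an assumption: the source does not state noise levels or mixing gains, so `snr_db`, `distractor_gain`, and the function names are hypothetical, and sine tones stand in for the XTTS-v2 speech.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a waveform given as a list of floats."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_gaussian_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio."""
    noise_rms = rms(samples) / (10 ** (snr_db / 20))
    return [s + rng.gauss(0.0, noise_rms) for s in samples]


def overlap_speech(target, distractor, distractor_gain=0.5):
    """Mix a second utterance under the target, looping it if it is shorter."""
    return [
        t + distractor_gain * distractor[i % len(distractor)]
        for i, t in enumerate(target)
    ]


# Stand-in waveforms (assumed): a 220 Hz and a 330 Hz tone sampled at 16 kHz.
rate = 16_000
target = [math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
distractor = [math.sin(2 * math.pi * 330 * n / rate) for n in range(rate // 2)]

rng = random.Random(0)  # seeded so the noise is reproducible
noisy = add_gaussian_noise(target, snr_db=10, rng=rng)
mixed = overlap_speech(target, distractor)
```

Lowering `snr_db` or raising `distractor_gain` makes the question harder to hear; the actual settings behind the table are not given in the source.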

Why ASCII art defeats frontier LLMs

Two orthogonal failure modes, one for vision models and one for text models, both rooted in deep mismatches with how LLMs process information.

Gestalt mismatch: Vision models detect local textures; humans ignore individual characters and perceive the global shape they form.

Tokenization breaks alignment: BPE tokenises a 2D grid as a 1D stream; vertical structure across rows is destroyed.

No training signal: ASCII art reading is rare in pre-training data; there is no learned perceptual shortcut to exploit.

Cost-of-attack asymmetry: 145s of reasoning for 1 char correct means solving at scale is economically non-viable.

Instant generation: 0.011s per sample; CAPTCHA generation is ~10,000× cheaper than solving.

50+ fonts: Each font is a distinct visual encoding; models cannot specialise on one.
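
The alignment problem for text input can be made concrete with a small sketch. It works at the character level only and does not run a real BPE tokenizer; the glyph is hand-drawn for illustration and is not one of the demo's fonts, and `flatten` and `flat_index` are hypothetical helper names.

```python
# A hand-drawn 5-row letter "H" (illustrative only).
GLYPH = [
    "#   #",
    "#   #",
    "#####",
    "#   #",
    "#   #",
]


def flatten(rows):
    """Serialise a 2D character grid into the 1D string a text model receives."""
    return "\n".join(rows)


def flat_index(row, col, width):
    """Position of grid cell (row, col) in the flattened string."""
    return row * (width + 1) + col  # +1 for the newline that ends each row


width = len(GLYPH[0])
stream = flatten(GLYPH)

# The left stroke of the "H" sits in column 0 of every row: vertically
# adjacent on the grid, but width + 1 characters apart in the stream.
left_stroke = [flat_index(r, 0, width) for r in range(len(GLYPH))]
print(left_stroke)  # [0, 6, 12, 18, 24]
```

A subword tokenizer would likely make this worse, since runs of spaces and symbols can merge into tokens of different lengths on each row, so even the fixed stride seen here would not survive in token space.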