What is CAPTCHA-LLM?
CAPTCHA-LLM is a research project investigating whether tasks rooted in deep human perceptual advantages can produce CAPTCHAs resilient to frontier multimodal LLMs. Two CAPTCHA classes are introduced and evaluated:
- ASCII art CAPTCHAs: alphanumeric strings rendered using pyfiglet (50+ fonts), presented as either raw text or PNG images
- Overlapping audio CAPTCHAs: CommonsenseQA 5-choice questions synthesised via XTTS-v2 TTS, then augmented with background noise, Gaussian noise, or overlapping speech
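The audio augmentation step can be sketched as mixing an interfering signal into the target at a chosen signal-to-noise ratio. This is a minimal standard-library illustration, not the project's pipeline: the sine tones stand in for synthesised speech, and the function names and SNR values (0 dB and 5 dB) are assumptions made for the example.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target/interferer RMS ratio is snr_db, then add."""
    gain = rms(target) / (rms(interferer) * 10 ** (snr_db / 20))
    return [t + gain * i for t, i in zip(target, interferer)]


def tone(freq_hz, seconds=0.5, rate=16000):
    """Sine tone used here as a placeholder for a speech waveform."""
    n = int(seconds * rate)
    return [math.sin(2 * math.pi * freq_hz * k / rate) for k in range(n)]


rng = random.Random(0)
target = tone(220.0)                               # stand-in for the question audio
competing = tone(330.0)                            # stand-in for a second speaker
gaussian = [rng.gauss(0.0, 1.0) for _ in target]   # white Gaussian noise

overlapped = mix_at_snr(target, competing, snr_db=0.0)  # assumed SNR
noisy = mix_at_snr(target, gaussian, snr_db=5.0)        # assumed SNR
```

Lower SNR values make the interferer louder relative to the target; at 0 dB the two signals have equal RMS energy.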
The core hypothesis: humans evolved specialised neural processing for tasks like Gestalt pattern perception and selective auditory attention (the “cocktail-party effect”). CAPTCHAs that require these capabilities should be trivially solvable by humans but hard for AI systems that lack these specific adaptations.
Paper: arXiv:2604.03612
Key results
ASCII art CAPTCHAs:
- Exact-match accuracy was 0% in every text-input and image-input condition except one
- The single exception was Gemini 3 Flash Preview (image input), with 0.16% exact accuracy (1 correct answer out of 628 total samples)
- Best text similarity was 55.48% (Gemini 3 Flash, image), meaning even the best model gets roughly half the characters right on average but almost never the complete string
- DeepSeek v3.2-exp averaged 84.49 seconds per response while still achieving 0% accuracy
- Gemini 3 Pro with extended thinking spent 145 seconds on a single sample, correctly identifying only 1 character
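The two ASCII metrics quoted above can be illustrated with a short sketch. The exact similarity measure is not specified here, so `difflib.SequenceMatcher` is used as an assumed stand-in, and the sample predictions are invented for illustration only. Case-sensitive comparison is also an assumption.

```python
from difflib import SequenceMatcher


def exact_match(pred, truth):
    """True only when the whole string is right (case-sensitive, an assumption)."""
    return pred.strip() == truth.strip()


def similarity(pred, truth):
    """Character-level similarity in [0, 1]; an assumed stand-in metric."""
    return SequenceMatcher(None, pred.strip(), truth.strip()).ratio()


def summarise(pairs):
    """Return (exact accuracy %, mean similarity %) over (prediction, truth) pairs."""
    n = len(pairs)
    exact = sum(exact_match(p, t) for p, t in pairs) / n
    sim = sum(similarity(p, t) for p, t in pairs) / n
    return 100 * exact, 100 * sim


# Hypothetical predictions, not data from the study.
samples = [("A7KQ", "A7KQ"), ("A1KO", "A7KQ"), ("XXXX", "A7KQ")]
exact_pct, sim_pct = summarise(samples)

# The reported figure: 1 correct answer out of 628 samples.
one_in_628 = round(100 * 1 / 628, 2)  # 0.16
```

A model can score a high mean similarity while its exact accuracy stays near zero, because a single wrong character fails the exact-match test.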
Audio CAPTCHAs:
- Under clean conditions, models perform reasonably (46–75% vs a 20% random baseline)
- Under combined overlapping speech, every model's accuracy falls toward the 20% random baseline: Gemini 3 Flash 48%, VoxTral Small 40%, GPT Audio Mini 27%
- GPT Audio Mini reaches exactly 20% under Gaussian noise, which is indistinguishable from random guessing on a 5-choice question
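Whether an accuracy figure is distinguishable from the 20% baseline depends on how many questions were asked. The sketch below computes a one-sided binomial tail probability under random guessing. The per-condition sample count `N = 100` is a hypothetical value chosen for illustration, since the number of audio samples per condition is not stated here.

```python
from math import comb


def p_at_least(correct, total, chance=0.2):
    """P(X >= correct) for X ~ Binomial(total, chance), i.e. pure guessing."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


N = 100  # hypothetical number of questions per condition (assumption)

for label, accuracy in [("48%", 0.48), ("27%", 0.27), ("20%", 0.20)]:
    correct = round(accuracy * N)
    print(f"{label}: P(at least {correct}/{N} by guessing) = {p_at_least(correct, N):.3g}")
```

A small tail probability means the score is unlikely to come from guessing alone; a value near or above 0.5 means the score is what random guessing would typically produce.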
Why it works
ASCII art (vision models)
Modern vision models (CNNs and Vision Transformers) are optimised to detect local features such as texture, edges, and colour patches. Reading ASCII art requires the opposite: ignoring individual characters and perceiving the global shape they form together (Gestalt principles). Vision models see a noise field of sharp character edges; humans see the letter.
ASCII art (text models)
LLMs see text as a 1D stream of byte-pair encoded tokens, not a 2D grid. A vertical pipe | character across multiple rows tokenises differently depending on neighbouring characters in each row, destroying the spatial alignment that ASCII art encodes. The model cannot reconstruct the 2D structure from a 1D token sequence.
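The loss of alignment can be shown with a toy example. The regex tokeniser below is only a stand-in for byte-pair encoding (it merges runs of spaces or underscores into single tokens); real BPE vocabularies merge differently, but the effect on vertical alignment is of the same kind. The four-row glyph is made up for the illustration.

```python
import re

# A small hand-drawn glyph; the pipe sits in column 4 of every row.
rows = [
    "    |  ",
    "  __|  ",
    " /  |  ",
    "/___|  ",
]
flat = "\n".join(rows)  # what a text model actually receives

# Toy stand-in for BPE merges: runs of spaces or underscores become one token,
# every other character (including the newline) is its own token.
TOKEN = re.compile(r" +|_+|[^ _]")
tokens = [(m.start(), m.group()) for m in TOKEN.finditer(flat)]


def gaps(values):
    """Differences between consecutive entries."""
    return [b - a for a, b in zip(values, values[1:])]


columns = [row.index("|") for row in rows]                          # 2D view
char_offsets = [start for start, tok in tokens if tok == "|"]       # 1D characters
token_indices = [i for i, (_, tok) in enumerate(tokens) if tok == "|"]  # 1D tokens

print("column per row:   ", columns)
print("character gaps:   ", gaps(char_offsets))
print("token-index gaps: ", gaps(token_indices))
```

In the character stream the pipes are a constant row-width-plus-newline apart, but in the token stream the spacing varies from row to row because the neighbouring characters merge differently, so the vertical stroke no longer appears at a regular position.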
Audio (overlapping speech)
The human cocktail-party effect, the ability to selectively attend to one voice in a noisy environment, is an evolutionary adaptation that current audio models lack. Under overlapping speech conditions, LLMs cannot separate target audio from competing signals, causing accuracy to collapse.
Limitations
- No quantitative human benchmark study was conducted (only anecdotal evidence that humans find ASCII CAPTCHAs trivially readable)
- ASCII CAPTCHAs may disadvantage screen-reader users and the visually impaired
- Audio CAPTCHAs may disadvantage non-native speakers and the hearing-impaired
- The paper acknowledges that fine-tuning on ASCII art data would eventually break the ASCII CAPTCHA, so the defence is not permanent
Repository
The codebase provides fully reproducible evaluation pipelines for both CAPTCHA classes, supporting OpenRouter, OpenAI, Gemini, and Anthropic APIs. Data generation (pyfiglet ASCII art, XTTS-v2 TTS), evaluation, and result aggregation scripts are all included.