CAPTCHA-LLM: Overview

What is CAPTCHA-LLM?

CAPTCHA-LLM is a research project investigating whether tasks rooted in deep human perceptual advantages can produce CAPTCHAs resilient to frontier multimodal LLMs. Two CAPTCHA classes are introduced and evaluated:

ASCII art CAPTCHAs: alphanumeric strings rendered using pyfiglet (50+ fonts), presented as either raw text or PNG images
Overlapping audio CAPTCHAs: CommonsenseQA 5-choice questions synthesised via XTTS-v2 TTS, then augmented with background noise, Gaussian noise, or overlapping speech
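
As a rough illustration of the ASCII art generation step, the sketch below draws a random alphanumeric string and renders it in a randomly chosen font. It is not taken from the repository: the function names, the 6-character default length, and the injectable render argument are assumptions made for illustration. The only pyfiglet call used is pyfiglet.figlet_format(text, font=...), imported lazily so the helpers can be exercised without the library installed.

```python
import random
import string

# Assumed character set: uppercase letters and digits.
ALPHABET = string.ascii_uppercase + string.digits


def random_answer(length=6, rng=None):
    """Draw a random alphanumeric string (the default length is an assumption)."""
    rng = rng or random.Random()
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def render_ascii(text, font="standard"):
    """Render text as ASCII art with pyfiglet (imported only when called)."""
    import pyfiglet
    return pyfiglet.figlet_format(text, font=font)


def make_captcha(fonts, length=6, rng=None, render=render_ascii):
    """Return the ground-truth answer, the chosen font, and the rendered art."""
    rng = rng or random.Random()
    answer = random_answer(length, rng)
    font = rng.choice(list(fonts))
    return {"answer": answer, "font": font, "art": render(answer, font)}
```

With pyfiglet installed, make_captcha(["standard", "slant", "banner"]) returns a dictionary whose "art" entry is the multi-line text shown to the model, either directly or after rasterising to PNG.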

The core hypothesis: humans evolved specialised neural processing for tasks like Gestalt pattern perception and selective auditory attention (the “cocktail-party effect”). CAPTCHAs that require these capabilities should be trivially solvable by humans but hard for AI systems that lack these specific adaptations.

Paper: arXiv:2604.03612

Key results

ASCII art CAPTCHAs:

With a single exception, no model produced a fully correct answer on any sample across the text-input and image-input conditions
The exception was Gemini 3 Flash Preview (image input), with 0.16% exact accuracy (1 correct answer out of 628 samples)
Best text similarity was 55.48% (Gemini 3 Flash, image input), meaning even the best model gets roughly half the characters right on average but almost never the complete string
DeepSeek v3.2-exp averaged 84.49 seconds per response while still achieving 0% accuracy
Gemini 3 Pro with extended thinking spent 145 seconds on a single sample, correctly identifying only 1 character
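
The two ASCII metrics quoted above can be sketched as follows. Exact accuracy is a whole-string match; for text similarity, this overview does not give the definition used, so the sketch substitutes difflib.SequenceMatcher.ratio() as a stand-in. The case-insensitive, whitespace-stripped comparison is also an assumption.

```python
from difflib import SequenceMatcher


def _norm(s):
    # Assumed normalisation: trim whitespace and ignore case.
    return s.strip().upper()


def exact_match(pred, truth):
    """True only when the whole predicted string equals the ground truth."""
    return _norm(pred) == _norm(truth)


def similarity(pred, truth):
    """Character-level similarity in [0, 1] (stand-in metric, see lead-in)."""
    return SequenceMatcher(None, _norm(pred), _norm(truth)).ratio()


def summarise(pairs):
    """Average both metrics over (prediction, truth) pairs, as percentages."""
    n = len(pairs)
    exact = sum(exact_match(p, t) for p, t in pairs) / n
    sim = sum(similarity(p, t) for p, t in pairs) / n
    return {"exact_pct": 100 * exact, "similarity_pct": 100 * sim}
```

The reported 0.16% follows directly from the counts: 100 * 1 / 628 rounds to 0.16. A prediction can score well on similarity while still failing the exact match, which is why the two numbers diverge so sharply.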

Audio CAPTCHAs:

Under clean conditions, models perform reasonably well (46–75% accuracy against a 20% random baseline)
Under combined overlapping speech, accuracy drops for all models, with the weakest approaching the random baseline: Gemini 3 Flash 48%, VoxTral Small 40%, GPT Audio Mini 27%
GPT Audio Mini reaches exactly 20% under Gaussian noise, which is statistically indistinguishable from random guessing on a 5-choice question
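
One way to check whether an accuracy figure is distinguishable from the 20% baseline is an exact one-sided binomial test. The sketch below is illustrative only: the function name is hypothetical, and the number of questions per condition is not stated in this overview, so any sample size passed in is an assumption.

```python
from math import comb


def chance_tail(correct, total, p=0.2):
    """P(X >= correct) for X ~ Binomial(total, p).

    This is the probability of getting at least `correct` answers right
    out of `total` by guessing uniformly among 5 choices (p = 0.2).
    """
    return sum(
        comb(total, i) * p**i * (1 - p) ** (total - i)
        for i in range(correct, total + 1)
    )
```

For a hypothetical 100-question condition, chance_tail(20, 100) is large, so 20% accuracy gives no evidence of better-than-random performance, whereas chance_tail(48, 100) is vanishingly small.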

Why it works

ASCII art (vision models)

Modern vision models (CNNs and Vision Transformers) are optimised to detect local features such as texture, edges, and colour patches. Reading ASCII art requires the opposite: ignoring individual characters and perceiving the global shape they form together (Gestalt principles). Vision models see a noise field of sharp character edges; humans see the letter.

ASCII art (text models)

LLMs see text as a 1D stream of byte-pair encoded tokens, not a 2D grid. A vertical pipe | character across multiple rows tokenises differently depending on neighbouring characters in each row, destroying the spatial alignment that ASCII art encodes. The model cannot reconstruct the 2D structure from a 1D token sequence.
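
The loss of alignment can be seen even before tokenisation. The sketch below uses plain character offsets rather than a real BPE tokenizer (a simplifying assumption) and shows that a pipe sitting in the same column of every row ends up at irregular distances once the rows are joined into a single string. The helper names and the toy grids are illustrative.

```python
def pipe_offsets(rows):
    """Offsets of '|' in the newline-joined string, i.e. the 1D view of the grid."""
    flat = "\n".join(rows)
    return [i for i, ch in enumerate(flat) if ch == "|"]


def strides(offsets):
    """Distances between consecutive offsets."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


# The pipe is in column 1 of every row in both grids.
aligned = [" |  ", " |  ", " |  "]  # rows padded to equal width
ragged = [" |", " |   __", " |"]    # rows of different lengths

print(strides(pipe_offsets(aligned)))  # [5, 5]
print(strides(pipe_offsets(ragged)))   # [3, 8]
```

In the ragged grid the vertical stroke is still perfectly straight in 2D, yet its 1D spacing depends on how long each row happens to be. Subword tokenisation adds a further layer of variation on top of this, since the same pipe may be merged with different neighbours on each row.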

Audio (overlapping speech)

The human cocktail-party effect, the ability to selectively attend to one voice in a noisy environment, is an evolutionary adaptation that current audio models lack. Under overlapping speech conditions, LLMs cannot separate target audio from competing signals, causing accuracy to collapse.
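
A minimal sketch of how an overlapping-speech sample might be constructed: the interfering signal is scaled to a chosen signal-to-noise ratio and added to the target sample by sample. This is not the repository's augmentation code. Pure sine tones stand in for speech, and the function names, the 16 kHz sample rate, and the SNR parameter are all assumptions.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def mix_at_snr(target, interferer, snr_db):
    """Overlay interferer on target so their RMS ratio equals snr_db (in dB)."""
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    # Gain that brings the interferer to the requested level relative to the target.
    gain = rms(target) / (rms(interferer) * 10 ** (snr_db / 20))
    return [t + gain * v for t, v in zip(target, interferer)]


def tone(freq_hz, seconds=0.1, rate=16000):
    """Sine tone used here as a stand-in for a speech waveform."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(int(seconds * rate))]
```

At 0 dB the competing signal is as loud as the target, which is the regime where a listener has to rely on selective attention rather than loudness to follow the question.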

Limitations

No quantitative human benchmark study was conducted (only anecdotal evidence that humans find ASCII CAPTCHAs trivially readable)
ASCII CAPTCHAs may disadvantage screen-reader users and the visually impaired
Audio CAPTCHAs may disadvantage non-native speakers and the hearing-impaired
The paper acknowledges that fine-tuning on ASCII art data would eventually break the ASCII CAPTCHA, so the defence is not permanent

Repository

The codebase provides fully reproducible evaluation pipelines for both CAPTCHA classes, supporting OpenRouter, OpenAI, Gemini, and Anthropic APIs. Data generation (pyfiglet ASCII art, XTTS-v2 TTS), evaluation, and result aggregation scripts are all included.

See github.com/horse-3903/captcha-llm.