
CAPTCHA-LLM: Overview

Project overview: the hypothesis, key findings, and why human perceptual advantages create robust CAPTCHAs against frontier LLMs.

What is CAPTCHA-LLM?

CAPTCHA-LLM is a research project investigating whether tasks rooted in deep human perceptual advantages can produce CAPTCHAs resilient to frontier multimodal LLMs. Two CAPTCHA classes are introduced and evaluated:

  1. ASCII art CAPTCHAs: alphanumeric strings rendered using pyfiglet (50+ fonts), presented as either raw text or PNG images
  2. Overlapping audio CAPTCHAs: CommonsenseQA 5-choice questions synthesised via XTTS-v2 TTS, then augmented with background noise, Gaussian noise, or overlapping speech
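To make the first CAPTCHA class concrete, here is a minimal sketch of rendering a random string as multi-row ASCII art. The project renders with pyfiglet fonts; this sketch substitutes a tiny hand-written three-glyph font so that it runs with the standard library alone. The `GLYPHS` table, `render`, and `make_captcha` are hypothetical names for illustration, not the project's code.

```python
import random

# Hypothetical 5-row, 4-column glyphs; a stand-in for pyfiglet's fonts.
GLYPHS = {
    "A": [" ## ", "#  #", "####", "#  #", "#  #"],
    "L": ["#   ", "#   ", "#   ", "#   ", "####"],
    "7": ["####", "   #", "  # ", " #  ", " #  "],
}


def render(text, gap=2):
    """Place the glyphs for `text` side by side, one output row at a time."""
    rows = []
    for r in range(5):
        rows.append((" " * gap).join(GLYPHS[ch][r] for ch in text))
    return "\n".join(rows)


def make_captcha(length=4, seed=None):
    """Return (answer, ascii_art) for a random string over the toy alphabet."""
    rng = random.Random(seed)
    answer = "".join(rng.choice(sorted(GLYPHS)) for _ in range(length))
    return answer, render(answer)


if __name__ == "__main__":
    answer, art = make_captcha(seed=0)
    print(art)
    print("answer:", answer)
```

The returned string corresponds to the raw-text presentation; the PNG presentation would additionally rasterise it, which this sketch does not attempt.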

The core hypothesis: humans evolved specialised neural processing for tasks like Gestalt pattern perception and selective auditory attention (the “cocktail-party effect”). CAPTCHAs that require these capabilities should be trivially solvable by humans but hard for AI systems that lack these specific adaptations.

Paper: arXiv:2604.03612


Key results

ASCII art CAPTCHAs:

Audio CAPTCHAs:


Why it works

ASCII art (vision models)

Modern vision models, both CNNs and Vision Transformers, are optimised to detect local features such as texture, edges, and colour patches. Reading ASCII art requires the opposite: ignoring individual characters and perceiving the global shape they form together (Gestalt principles). Where a vision model sees a noise field of sharp character edges, a human sees the letter.

ASCII art (text models)

LLMs see text as a 1D stream of byte-pair encoded tokens, not a 2D grid. A vertical stroke built from pipe characters (|) stacked across several rows tokenises differently in each row, depending on the neighbouring characters, which destroys the spatial alignment that ASCII art encodes. The model cannot reconstruct the 2D structure from a 1D token sequence.
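A toy illustration of the flattening problem, under the simplifying assumption that every character is one token. Real byte-pair tokenisers merge characters into variable-length tokens, which this sketch does not model. The `flat_positions` helper and the sample grid are hypothetical.

```python
def flat_positions(art, target="|"):
    """For each `target` character, return (row, column, index in the 1D string)."""
    out = []
    row = col = 0
    for i, ch in enumerate(art):
        if ch == "\n":
            row, col = row + 1, 0
            continue
        if ch == target:
            out.append((row, col, i))
        col += 1
    return out


# A vertical stroke in column 2. Rows differ in length, as happens in
# ASCII art whenever trailing whitespace is stripped.
art = "\n".join([
    "  |",
    "  |__",
    "  |",
    "  |____",
])

for row, col, idx in flat_positions(art):
    print(f"row={row} col={col} flat_index={idx}")
```

In the 2D grid all four pipes share column 2. In the flat string they sit at indices 2, 6, 12, and 16, so the distance between vertically adjacent pipes (4, 6, 4) depends on the length of each row rather than on any fixed stride.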

Audio (overlapping speech)

The human cocktail-party effect, the ability to selectively attend to one voice in a noisy environment, is an evolutionary adaptation that current audio models lack. Under overlapping-speech conditions, LLMs cannot separate the target audio from competing signals, and accuracy collapses.
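The overlapping-speech condition can be pictured as mixing a target signal with a competing one at a chosen signal-to-noise ratio (SNR). The sketch below uses synthetic sine tones in place of speech recordings and plain Python lists in place of an audio library. The function names, sample rate, and 0 dB setting are assumptions for illustration, not the project's augmentation code.

```python
import math


def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer ratio is `snr_db`, then add.

    Returns (mixed, scaled_interferer). Assumes `interferer` is not silent.
    """
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    # Desired interferer RMS is rms(target) / 10^(snr_db / 20).
    gain = rms(target) / (rms(interferer) * 10 ** (snr_db / 20))
    scaled = [gain * s for s in interferer]
    return [t + s for t, s in zip(target, scaled)], scaled


def tone(freq_hz, seconds=0.5, rate=16000, amp=0.5):
    """Synthetic stand-in for a speech recording."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(int(seconds * rate))]


target = tone(220.0)
competing = tone(330.0, amp=0.1)
mixed, scaled = mix_at_snr(target, competing, snr_db=0.0)

# Check the ratio that was actually achieved after scaling.
achieved = 20 * math.log10(rms(target) / rms(scaled))
print(f"achieved SNR: {achieved:.2f} dB")
```

At 0 dB the competing signal is as loud as the target, which is the regime where a listener must rely on selective attention rather than loudness to pick out the target voice.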


Limitations


Repository

The codebase provides fully reproducible evaluation pipelines for both CAPTCHA classes, supporting OpenRouter, OpenAI, Gemini, and Anthropic APIs. Data generation (pyfiglet ASCII art, XTTS-v2 TTS), evaluation, and result aggregation scripts are all included.

See github.com/horse-3903/captcha-llm.
