CAPTCHA-LLM: ASCII Art CAPTCHA

Design rationale

ASCII art CAPTCHAs exploit a fundamental mismatch between how humans and LLMs process visual structure. Humans use Gestalt perceptual principles: specifically, the tendency to ignore fine detail and perceive the global shape formed by a pattern. When we look at ASCII art, we do not read individual characters; we see the letter or digit they collectively form.

LLMs fail for two orthogonal reasons depending on modality:

Vision models are trained to detect local features (texture, edges, colour). ASCII art presents as a high-frequency noise field of sharp character strokes: the model classifies based on local tokens (the individual ASCII characters) rather than the global shape they encode. The signal the model needs is precisely what its architecture is trained to ignore.

Text models receive ASCII art as a 1D stream of byte-pair encoded tokens. The 2D spatial alignment: where vertical columns of characters spell out a letter: is destroyed during tokenisation. A vertical pipe | in row 1 and in row 5 are tokenised independently based on their local context in each row, so the model cannot reconstruct the column-wise structure that encodes the letter shape.

Data generation

ASCII CAPTCHAs are generated using pyfiglet, a Python port of the FIGlet ASCII art library.

String generation:

7–15 character alphanumeric strings (A–Z, 0–9)
500 pre-generated samples stored in data/ascii-captcha/
50+ font styles including: standard, doom, slant, small, courier, lean, ivrit, smscript

Image rendering (for image-input evaluation):

Each ASCII string is rendered to a PNG using PIL (Courier New 16pt, white text on black background, 4px padding)
Resolution: approximately 12px per character column

Generation cost: ~0.011 seconds per sample (generation is effectively free).

Evaluation pipeline

The evaluation is fully async with per-model concurrency control and rate limiting.

Text input flow:

Load pre-generated ASCII art text file
Send the raw ASCII string as the user message
Parse response with validate_output(): filters responses that are too long (>20 chars), sentence-like, or contain error markers
Record: actual solution, predicted solution, response time

Image input flow:

Render ASCII art to PNG with text_to_image()
Send as a base64-encoded image in the multimodal message
Same parsing and recording

Prompt (strict extraction):

“Read the ASCII art below and output ONLY the alphanumeric characters it spells out. No spaces, no punctuation, no explanation.”

Scoring:

Exact accuracy: binary: 1 if predicted == actual (case-insensitive), 0 otherwise
Text similarity: Python difflib.SequenceMatcher ratio, capturing partial character matches

Models evaluated:

Model	Input modality
GPT-5.2	text, image
Gemini 3 Flash Preview	text, image
Claude Sonnet 4.5	text, image
Llama 4 Scout	text, image
Qwen3-VL-30B	text, image
DeepSeek v3.2-exp	text only

250 samples per model per modality. Total: 1,250 text-input evaluations + 1,250 image-input evaluations.

Full results

Text input (Table 1)

Model	Exact Accuracy	Text Similarity	Avg Response Time
Gemini 3 Flash Preview	0.00%	39.38%	1.9578s
Claude Sonnet 4.5	0.00%	19.17%	2.5486s
Qwen3-VL-30B	0.00%	16.38%	4.5290s
Llama 4 Scout	0.00%	14.33%	0.7801s
GPT-5.2	0.00%	12.50%	2.2374s
DeepSeek v3.2-exp	0.00%	12.78%	84.4913s

Image input (Table 2)

Model	Exact Accuracy	Text Similarity	Avg Response Time
Gemini 3 Flash Preview	0.16%	55.48%	3.2476s
GPT-5.2	0.00%	28.20%	3.4565s
Qwen3-VL-30B	0.00%	20.06%	1.7943s
Claude Sonnet 4.5	0.00%	19.26%	5.8564s
Llama 4 Scout	0.00%	14.04%	2.0810s

Analysis

The only non-zero result: Gemini 3 Flash Preview (image) achieved 0.16% exact accuracy: 1 correct answer across 628 total image-input samples. This is the sole correct answer in the entire experiment.

Image vs text: For most models, image input yields higher text similarity than text input. GPT-5.2 jumps from 12.50% (text) to 28.20% (image). This is consistent with vision models having some spatial processing capability that pure text tokenisation lacks: but neither is sufficient for full accuracy.

DeepSeek anomaly: DeepSeek v3.2-exp averaged 84.49 seconds per response on text-only ASCII art: 40× slower than GPT-5.2 for equivalent performance (0.00% accuracy, 12.78% similarity). The extended response time suggests the model is attempting complex reasoning about the character structure, to no avail.

Thinking does not help: Gemini 3 Pro with extended thinking (the maximum reasoning setting) spent over 145 seconds on a single ASCII CAPTCHA sample and correctly identified only 1 character. The O vs 0 ambiguity: distinguishing the letter O from the digit 0: was not resolved despite extended reasoning budget.

Ceiling effect: The 55.48% similarity ceiling for Gemini Flash (image) suggests the model has partial spatial reading capability, but the gap between ~55% character similarity and 0% exact accuracy is large: getting most characters right individually is not the same as getting the complete string correct.