captcha-llm CAPTCHAASCII ArtLLMsMultimodalComputer Vision

CAPTCHA-LLM: ASCII Art CAPTCHA

Design, generation methodology, evaluation pipeline, and full results for the ASCII art CAPTCHA experiments.

Design rationale

ASCII art CAPTCHAs exploit a fundamental mismatch between how humans and LLMs process visual structure. Humans use Gestalt perceptual principles: specifically, the tendency to ignore fine detail and perceive the global shape formed by a pattern. When we look at ASCII art, we do not read individual characters; we see the letter or digit they collectively form.

LLMs fail for two orthogonal reasons depending on modality:

Vision models are trained to detect local features (texture, edges, colour). ASCII art presents as a high-frequency noise field of sharp character strokes: the model classifies based on local tokens (the individual ASCII characters) rather than the global shape they encode. The signal the model needs is precisely what its architecture is trained to ignore.

Text models receive ASCII art as a 1D stream of byte-pair encoded tokens. The 2D spatial alignment: where vertical columns of characters spell out a letter: is destroyed during tokenisation. A vertical pipe | in row 1 and in row 5 are tokenised independently based on their local context in each row, so the model cannot reconstruct the column-wise structure that encodes the letter shape.


Data generation

ASCII CAPTCHAs are generated using pyfiglet, a Python port of the FIGlet ASCII art library.

String generation:

Image rendering (for image-input evaluation):

Generation cost: ~0.011 seconds per sample (generation is effectively free).


Evaluation pipeline

The evaluation is fully async with per-model concurrency control and rate limiting.

Text input flow:

  1. Load pre-generated ASCII art text file
  2. Send the raw ASCII string as the user message
  3. Parse response with validate_output(): filters responses that are too long (>20 chars), sentence-like, or contain error markers
  4. Record: actual solution, predicted solution, response time

Image input flow:

  1. Render ASCII art to PNG with text_to_image()
  2. Send as a base64-encoded image in the multimodal message
  3. Same parsing and recording

Prompt (strict extraction):

“Read the ASCII art below and output ONLY the alphanumeric characters it spells out. No spaces, no punctuation, no explanation.”

Scoring:

Models evaluated:

ModelInput modality
GPT-5.2text, image
Gemini 3 Flash Previewtext, image
Claude Sonnet 4.5text, image
Llama 4 Scouttext, image
Qwen3-VL-30Btext, image
DeepSeek v3.2-exptext only

250 samples per model per modality. Total: 1,250 text-input evaluations + 1,250 image-input evaluations.


Full results

Text input (Table 1)

ModelExact AccuracyText SimilarityAvg Response Time
Gemini 3 Flash Preview0.00%39.38%1.9578s
Claude Sonnet 4.50.00%19.17%2.5486s
Qwen3-VL-30B0.00%16.38%4.5290s
Llama 4 Scout0.00%14.33%0.7801s
GPT-5.20.00%12.50%2.2374s
DeepSeek v3.2-exp0.00%12.78%84.4913s

Image input (Table 2)

ModelExact AccuracyText SimilarityAvg Response Time
Gemini 3 Flash Preview0.16%55.48%3.2476s
GPT-5.20.00%28.20%3.4565s
Qwen3-VL-30B0.00%20.06%1.7943s
Claude Sonnet 4.50.00%19.26%5.8564s
Llama 4 Scout0.00%14.04%2.0810s

Analysis

The only non-zero result: Gemini 3 Flash Preview (image) achieved 0.16% exact accuracy: 1 correct answer across 628 total image-input samples. This is the sole correct answer in the entire experiment.

Image vs text: For most models, image input yields higher text similarity than text input. GPT-5.2 jumps from 12.50% (text) to 28.20% (image). This is consistent with vision models having some spatial processing capability that pure text tokenisation lacks: but neither is sufficient for full accuracy.

DeepSeek anomaly: DeepSeek v3.2-exp averaged 84.49 seconds per response on text-only ASCII art: 40× slower than GPT-5.2 for equivalent performance (0.00% accuracy, 12.78% similarity). The extended response time suggests the model is attempting complex reasoning about the character structure, to no avail.

Thinking does not help: Gemini 3 Pro with extended thinking (the maximum reasoning setting) spent over 145 seconds on a single ASCII CAPTCHA sample and correctly identified only 1 character. The O vs 0 ambiguity: distinguishing the letter O from the digit 0: was not resolved despite extended reasoning budget.

Ceiling effect: The 55.48% similarity ceiling for Gemini Flash (image) suggests the model has partial spatial reading capability, but the gap between ~55% character similarity and 0% exact accuracy is large: getting most characters right individually is not the same as getting the complete string correct.

← Back to CAPTCHA-LLM