Design rationale
ASCII art CAPTCHAs exploit a fundamental mismatch between how humans and LLMs process visual structure. Humans use Gestalt perceptual principles: specifically, the tendency to ignore fine detail and perceive the global shape formed by a pattern. When we look at ASCII art, we do not read individual characters; we see the letter or digit they collectively form.
LLMs fail for two orthogonal reasons depending on modality:
Vision models are trained to detect local features (texture, edges, colour). ASCII art presents as a high-frequency noise field of sharp character strokes: the model classifies based on local tokens (the individual ASCII characters) rather than the global shape they encode. The signal the model needs is precisely what its architecture is trained to ignore.
Text models receive ASCII art as a 1D stream of byte-pair encoded tokens. The 2D spatial alignment: where vertical columns of characters spell out a letter: is destroyed during tokenisation. A vertical pipe | in row 1 and in row 5 are tokenised independently based on their local context in each row, so the model cannot reconstruct the column-wise structure that encodes the letter shape.
Data generation
ASCII CAPTCHAs are generated using pyfiglet, a Python port of the FIGlet ASCII art library.
String generation:
- 7–15 character alphanumeric strings (A–Z, 0–9)
- 500 pre-generated samples stored in
data/ascii-captcha/ - 50+ font styles including:
standard,doom,slant,small,courier,lean,ivrit,smscript
Image rendering (for image-input evaluation):
- Each ASCII string is rendered to a PNG using PIL (
Courier New16pt, white text on black background, 4px padding) - Resolution: approximately 12px per character column
Generation cost: ~0.011 seconds per sample (generation is effectively free).
Evaluation pipeline
The evaluation is fully async with per-model concurrency control and rate limiting.
Text input flow:
- Load pre-generated ASCII art text file
- Send the raw ASCII string as the user message
- Parse response with
validate_output(): filters responses that are too long (>20 chars), sentence-like, or contain error markers - Record: actual solution, predicted solution, response time
Image input flow:
- Render ASCII art to PNG with
text_to_image() - Send as a base64-encoded image in the multimodal message
- Same parsing and recording
Prompt (strict extraction):
“Read the ASCII art below and output ONLY the alphanumeric characters it spells out. No spaces, no punctuation, no explanation.”
Scoring:
- Exact accuracy: binary: 1 if predicted == actual (case-insensitive), 0 otherwise
- Text similarity: Python
difflib.SequenceMatcherratio, capturing partial character matches
Models evaluated:
| Model | Input modality |
|---|---|
| GPT-5.2 | text, image |
| Gemini 3 Flash Preview | text, image |
| Claude Sonnet 4.5 | text, image |
| Llama 4 Scout | text, image |
| Qwen3-VL-30B | text, image |
| DeepSeek v3.2-exp | text only |
250 samples per model per modality. Total: 1,250 text-input evaluations + 1,250 image-input evaluations.
Full results
Text input (Table 1)
| Model | Exact Accuracy | Text Similarity | Avg Response Time |
|---|---|---|---|
| Gemini 3 Flash Preview | 0.00% | 39.38% | 1.9578s |
| Claude Sonnet 4.5 | 0.00% | 19.17% | 2.5486s |
| Qwen3-VL-30B | 0.00% | 16.38% | 4.5290s |
| Llama 4 Scout | 0.00% | 14.33% | 0.7801s |
| GPT-5.2 | 0.00% | 12.50% | 2.2374s |
| DeepSeek v3.2-exp | 0.00% | 12.78% | 84.4913s |
Image input (Table 2)
| Model | Exact Accuracy | Text Similarity | Avg Response Time |
|---|---|---|---|
| Gemini 3 Flash Preview | 0.16% | 55.48% | 3.2476s |
| GPT-5.2 | 0.00% | 28.20% | 3.4565s |
| Qwen3-VL-30B | 0.00% | 20.06% | 1.7943s |
| Claude Sonnet 4.5 | 0.00% | 19.26% | 5.8564s |
| Llama 4 Scout | 0.00% | 14.04% | 2.0810s |
Analysis
The only non-zero result: Gemini 3 Flash Preview (image) achieved 0.16% exact accuracy: 1 correct answer across 628 total image-input samples. This is the sole correct answer in the entire experiment.
Image vs text: For most models, image input yields higher text similarity than text input. GPT-5.2 jumps from 12.50% (text) to 28.20% (image). This is consistent with vision models having some spatial processing capability that pure text tokenisation lacks: but neither is sufficient for full accuracy.
DeepSeek anomaly: DeepSeek v3.2-exp averaged 84.49 seconds per response on text-only ASCII art: 40× slower than GPT-5.2 for equivalent performance (0.00% accuracy, 12.78% similarity). The extended response time suggests the model is attempting complex reasoning about the character structure, to no avail.
Thinking does not help: Gemini 3 Pro with extended thinking (the maximum reasoning setting) spent over 145 seconds on a single ASCII CAPTCHA sample and correctly identified only 1 character. The O vs 0 ambiguity: distinguishing the letter O from the digit 0: was not resolved despite extended reasoning budget.
Ceiling effect: The 55.48% similarity ceiling for Gemini Flash (image) suggests the model has partial spatial reading capability, but the gap between ~55% character similarity and 0% exact accuracy is large: getting most characters right individually is not the same as getting the complete string correct.