CAPTCHA-LLM: Audio CAPTCHA

Design rationale

The audio CAPTCHA targets the cocktail-party effect: the human ability to selectively attend to one voice in a noisy environment containing multiple competing audio streams. This is a deeply evolved perceptual capability that current audio-language models are presumed to lack.

The hypothesis: under clean conditions, LLMs can transcribe and answer spoken QA questions reasonably well. Adding overlapping speech (a second concurrent voice speaking different content) should cause audio separation to fail, collapsing accuracy toward the random baseline.

Four noise conditions are tested, designed to progressively stress audio separation:

Condition	Description
Baseline	Clean synthesised speech, no augmentation
Background noise	Café/ambient noise mixed at 5× boost
Gaussian noise	White noise at RMS-normalised level 1.70
Overlapping speech	Target audio mixed with a second concurrent speech sample at 0.7× ratio

Data generation

Question source: CommonsenseQA, a set of 5-choice multiple-choice questions requiring common-sense reasoning. 100 questions sampled per condition (400 total evaluation samples).

Audio synthesis: XTTS-v2 (Coqui TTS) generates WAV files at 24 kHz. Each question and its 5 answer choices are synthesised separately.

Average generation time: ~2.1 seconds per sample
Voice: a single consistent speaker profile across all samples

Noise augmentation (src/audio-captcha/util.py):

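from math import log10

import numpy as np
from pydub import AudioSegment

# Background noise (café): boost is a power ratio, so 5x is about +7 dB of gain
def add_background_noise(audio, bg_path, boost=5.0):
    bg = AudioSegment.from_wav(bg_path)
    return audio.overlay(bg + 10 * log10(boost))

# Gaussian noise, scaled relative to the RMS of the signal (16-bit audio assumed)
def add_gaussian_noise(audio, level=1.70):
    # Cast to float first: squaring int16 samples overflows
    samples = np.array(audio.get_array_of_samples(), dtype=np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    noise = np.random.normal(0, rms * level, len(samples))
    # Clip to the int16 range so loud samples saturate instead of wrapping around
    noisy = np.clip(samples + noise, -32768, 32767).astype(np.int16)
    return audio._spawn(noisy.tobytes())

# Overlapping speech: scale the second voice to `ratio` of its power (0.7x is about -1.5 dB)
def combine_audio_files(base, overlapping, ratio=0.7):
    return base.overlay(overlapping + 10 * log10(ratio))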
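
The helpers above turn a linear ratio into a decibel offset with 10 * log10(ratio), which treats the ratio as a power ratio. As a rough check of what the two default parameters mean in dB, the sketch below compares that convention with the amplitude convention (20 * log10). The function names are illustrative and are not part of util.py.

```python
from math import log10


def power_ratio_to_db(ratio: float) -> float:
    """Decibel change for a ratio of signal powers."""
    return 10 * log10(ratio)


def amplitude_ratio_to_db(ratio: float) -> float:
    """Decibel change for a ratio of signal amplitudes."""
    return 20 * log10(ratio)


if __name__ == "__main__":
    # Default parameters taken from the condition table above.
    for name, ratio in [("background boost", 5.0), ("overlap ratio", 0.7)]:
        print(
            f"{name} {ratio}x: "
            f"{power_ratio_to_db(ratio):+.2f} dB (power), "
            f"{amplitude_ratio_to_db(ratio):+.2f} dB (amplitude)"
        )
```

Under the power convention, 5× corresponds to roughly 7 dB and 0.7× to roughly -1.5 dB; under the amplitude convention both figures double.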

Evaluation pipeline

Each augmented audio file is sent to the model along with the 5 answer choices (A–E) as text. The model must listen and select the correct answer.

Prompt format:

“Listen to the audio and answer the multiple-choice question. Choose from: A) … B) … C) … D) … E) …”
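
A minimal sketch of how the text part of this prompt could be assembled from the five choices. The function name build_prompt and the exact spacing between choices are assumptions; the source only shows the template above.

```python
from typing import Sequence

LABELS = "ABCDE"
PROMPT_PREFIX = (
    "Listen to the audio and answer the multiple-choice question. Choose from: "
)


def build_prompt(choices: Sequence[str]) -> str:
    """Format five answer choices as 'A) ... B) ... C) ... D) ... E) ...'."""
    if len(choices) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} choices, got {len(choices)}")
    labelled = " ".join(f"{label}) {text}" for label, text in zip(LABELS, choices))
    return PROMPT_PREFIX + labelled


if __name__ == "__main__":
    # Placeholder choices, purely for illustration.
    print(build_prompt(["bank", "library", "park", "school", "office"]))
```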

Scoring: Exact match on the selected answer choice (A–E). Random baseline = 20% (5-choice).
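
Exact-match scoring can be sketched as follows. The parsing rule (the reply must begin with a single letter A–E, optionally in parentheses) is an assumption made for illustration; the source does not say how model replies are parsed, and any reply that does not fit this rule is counted as wrong here.

```python
import re
from typing import Optional, Sequence

# A leading letter A-E that is not the start of a longer word.
_LEADING_CHOICE = re.compile(r"\s*\(?([A-Ea-e])(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen letter in upper case, or None if none is found."""
    match = _LEADING_CHOICE.match(response)
    return match.group(1).upper() if match else None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(pred == answer for pred, answer in zip(predictions, gold))
    return correct / len(gold)


if __name__ == "__main__":
    replies = ["B) river", "(c)", "Because it is wet", "D"]
    predicted = [extract_choice(r) for r in replies]
    print(predicted, accuracy(predicted, ["B", "C", "A", "E"]))
```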

Models evaluated:

Model	API
GPT Audio Mini	OpenAI audio API
Gemini 3 Flash Preview	Google Generative AI (audio)
VoxTral Small	OpenRouter

100 samples per noise condition per model.

Full results

Model	Baseline	Background	Gaussian	Overlapping	Notes
Gemini 3 Flash Preview	75%	50%	59%	48%	Most robust; stays above random
VoxTral Small	73%	31%	46%	40%	Most sensitive to background noise
GPT Audio Mini	46%	23%	20%	27%	Gaussian noise → exact random baseline
Random baseline	20%	20%	20%	20%	n/a

Average response times: GPT Audio Mini 1.71 s, VoxTral Small 3.79 s, Gemini 3 Flash Preview 6.82 s.
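
Per-condition means and margins over the random baseline can be recomputed directly from the results table. The sketch below hard-codes the table values; the dictionary layout and function names are illustrative.

```python
RANDOM_BASELINE = 20  # percent, 5-choice task

# Accuracy in percent, copied from the results table.
RESULTS = {
    "Gemini 3 Flash Preview": {"Baseline": 75, "Background": 50, "Gaussian": 59, "Overlapping": 48},
    "VoxTral Small": {"Baseline": 73, "Background": 31, "Gaussian": 46, "Overlapping": 40},
    "GPT Audio Mini": {"Baseline": 46, "Background": 23, "Gaussian": 20, "Overlapping": 27},
}


def mean_by_condition(results):
    """Average accuracy across models for each condition."""
    conditions = list(next(iter(results.values())))
    return {
        cond: sum(scores[cond] for scores in results.values()) / len(results)
        for cond in conditions
    }


def margin_over_random(results, baseline=RANDOM_BASELINE):
    """Percentage points above the random baseline, per model and condition."""
    return {
        model: {cond: acc - baseline for cond, acc in scores.items()}
        for model, scores in results.items()
    }


if __name__ == "__main__":
    for cond, mean in mean_by_condition(RESULTS).items():
        print(f"{cond}: {mean:.1f}%")
    print(margin_over_random(RESULTS)["Gemini 3 Flash Preview"])
```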

Analysis

Clean baseline gap: GPT Audio Mini reaches only 46% under clean conditions, already low for a 5-choice task where random is 20%. The synthesised XTTS-v2 voice or the question difficulty may contribute, but Gemini 3 Flash Preview and VoxTral Small reach 75% and 73% under the same condition, so much of the gap appears to be specific to the model.

Degradation patterns differ by model: VoxTral Small is especially sensitive to background noise (73% → 31%) but more robust to Gaussian noise (46%). GPT Audio Mini shows the opposite ordering: Gaussian noise (20%) hurts it more than background noise (23%) or overlapping speech (27%), although with 100 samples per condition these small gaps are within sampling noise. The differing profiles suggest that the models' audio processing pipelines have different noise sensitivities.
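
One way to make these profiles concrete is to compute each model's drop from its own clean baseline and pick the condition with the largest drop. This is a sketch over the table values; the function names are illustrative.

```python
# Accuracy in percent, copied from the results table.
RESULTS = {
    "Gemini 3 Flash Preview": {"Baseline": 75, "Background": 50, "Gaussian": 59, "Overlapping": 48},
    "VoxTral Small": {"Baseline": 73, "Background": 31, "Gaussian": 46, "Overlapping": 40},
    "GPT Audio Mini": {"Baseline": 46, "Background": 23, "Gaussian": 20, "Overlapping": 27},
}


def drops_from_baseline(scores):
    """Percentage-point drop from the clean baseline for each noise condition."""
    base = scores["Baseline"]
    return {cond: base - acc for cond, acc in scores.items() if cond != "Baseline"}


def worst_condition(scores):
    """Noise condition with the largest drop from baseline."""
    drops = drops_from_baseline(scores)
    return max(drops, key=drops.get)


if __name__ == "__main__":
    for model, scores in RESULTS.items():
        print(model, drops_from_baseline(scores), "worst:", worst_condition(scores))
```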

Gaussian floor for GPT Audio Mini: At exactly 20% under Gaussian noise, GPT Audio Mini is statistically indistinguishable from random guessing. Gaussian white noise at the tested level is sufficient to completely defeat this model’s audio comprehension.
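
Whether a score is distinguishable from chance can be checked with an exact one-sided binomial test against the 20% guessing rate. The sketch below assumes 100 independent samples per condition and uses only the standard library; it is an illustration, not the analysis used to produce the results above.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    n_samples, chance = 100, 0.2
    # Correct counts for GPT Audio Mini (Gaussian, background, overlapping)
    # and for Gemini under overlapping speech, taken from the results table.
    for correct in (20, 23, 27, 48):
        p_value = binomial_tail(correct, n_samples, chance)
        print(f"{correct}/{n_samples} correct: one-sided p = {p_value:.3g}")
```

A large p-value means the score is compatible with guessing; a very small one means it is not.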

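Overlapping speech is not uniformly the hardest condition: Averaged over the three models, background noise yields the lowest accuracy (about 35%), followed by overlapping speech (about 38%) and Gaussian noise (about 42%). The overlapping speech condition (concurrent second voice at 0.7× ratio) is the hardest for Gemini 3 Flash Preview (48%), the second hardest for VoxTral Small (40%), and the easiest of the three noise conditions for GPT Audio Mini (27%). Gemini's 48%, which is 28 percentage points above the 20% random floor, is the best result under this condition.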

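Cocktail-party effect partially supported: Overlapping speech lowers accuracy for every model, by 19 to 33 percentage points relative to the clean baseline, which is consistent with weak selective attention across two concurrent speech streams. The collapse to the random baseline predicted by the hypothesis does not occur for every model, however. Gemini, the most capable audio model in this evaluation, still scores 48% under overlapping speech, well above the 20% floor.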

Comparison: ASCII vs audio robustness

CAPTCHA type	Best model accuracy	Against random
ASCII art (image input)	0.16% (Gemini)	N/A (binary task)
Audio (overlapping speech)	48% (Gemini)	vs. 20% random baseline

ASCII art CAPTCHAs are currently stronger: no model achieves meaningful exact accuracy. Audio CAPTCHAs under overlapping speech degrade models significantly but do not fully defeat them; Gemini retains 28 percentage points above random. A combined approach (requiring both) would likely be stronger still.