
CAPTCHA-LLM: Audio CAPTCHA

Design, noise augmentation pipeline, evaluation methodology, and full results for the overlapping-audio CAPTCHA experiments.

Design rationale

The audio CAPTCHA targets the cocktail-party effect: the human ability to selectively attend to one voice in a noisy environment containing multiple competing audio streams. This is a deeply evolved perceptual capability that current audio-language models appear to lack.

The hypothesis: under clean conditions, LLMs can transcribe and answer spoken QA questions reasonably well. Adding overlapping speech (a second concurrent voice speaking different content) should cause audio separation to fail, collapsing accuracy toward the random baseline.

Four conditions are tested: a clean baseline plus three noise augmentations designed to progressively stress audio separation.

| Condition | Description |
| --- | --- |
| Baseline | Clean synthesised speech, no augmentation |
| Background noise | Café/ambient noise mixed at 5× boost |
| Gaussian noise | White noise at RMS-normalised level 1.70 |
| Overlapping speech | Target audio mixed with a second concurrent speech sample at 0.7× ratio |

Data generation

Question source: CommonsenseQA, a set of 5-choice multiple-choice questions requiring common-sense reasoning. 100 questions sampled per condition (400 total evaluation samples).

Audio synthesis: XTTS-v2 (Coqui TTS) generates WAV files at 24 kHz. Each question and its 5 answer choices are synthesised separately.

Noise augmentation (src/audio-captcha/util.py):

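from math import log10

import numpy as np
from pydub import AudioSegment

# Background noise (café). pydub's +/- with a number changes gain in dB;
# boost is treated as a power ratio, so 5.0 raises the background by about 7 dB.
def add_background_noise(audio, bg_path, boost=5.0):
    bg = AudioSegment.from_wav(bg_path)
    return audio.overlay(bg + (10 * log10(boost)))

# Gaussian noise with standard deviation = level x signal RMS (16-bit audio assumed)
def add_gaussian_noise(audio, level=1.70):
    # Cast to float first: squaring int16 samples would overflow.
    samples = np.array(audio.get_array_of_samples(), dtype=np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    noise = np.random.normal(0, rms * level, len(samples))
    # Clip to the int16 range so loud peaks saturate instead of wrapping around.
    noisy = np.clip(samples + noise, -32768, 32767).astype(np.int16)
    return audio._spawn(noisy.tobytes())

# Overlapping speech: second voice attenuated to `ratio` of its power (about -1.5 dB at 0.7)
def combine_audio_files(base, overlapping, ratio=0.7):
    mixed = base.overlay(overlapping - (10 * log10(1 / ratio)))
    return mixed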
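
The gain arithmetic in these helpers can be checked without any audio. pydub treats `+` and `-` between an AudioSegment and a number as a gain change in decibels, so each linear ratio is first converted with 10·log10. The sketch below uses hypothetical helper names and only the standard library. It computes the decibel values implied by the three parameters, treating `boost` and `ratio` as power ratios as the expressions above do. The signal-to-noise figure assumes the Gaussian noise standard deviation is `level` times the signal RMS.

```python
from math import log10

def power_ratio_to_db(ratio: float) -> float:
    """Convert a linear power ratio to decibels."""
    return 10 * log10(ratio)

def gaussian_snr_db(level: float) -> float:
    """SNR in dB when noise std = level * signal RMS (amplitude ratio 1/level)."""
    return -20 * log10(level)

boost_db = power_ratio_to_db(5.0)    # size of the gain term for boost=5.0
overlap_db = power_ratio_to_db(0.7)  # gain applied to the second voice
snr_db = gaussian_snr_db(1.70)       # negative: the noise is louder than the speech

print(f"boost=5.0  -> {boost_db:.2f} dB")
print(f"ratio=0.7  -> {overlap_db:.2f} dB")
print(f"level=1.70 -> SNR {snr_db:.2f} dB")
```

With the parameters used here this works out to roughly 7.0 dB for the boost term, about -1.5 dB for the second voice, and an SNR near -4.6 dB for the Gaussian condition.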

Evaluation pipeline

Each augmented audio file is sent to the model along with the 5 answer choices (A–E) as text. The model must listen and select the correct answer.

Prompt format:

“Listen to the audio and answer the multiple-choice question. Choose from: A) … B) … C) … D) … E) …”
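
A minimal sketch of how the text half of the request could be assembled from the five answer choices. The function name and the example choices are hypothetical; only the prompt wording comes from the format above.

```python
from string import ascii_uppercase

PROMPT_PREFIX = (
    "Listen to the audio and answer the multiple-choice question. Choose from: "
)

def build_prompt(choices: list[str]) -> str:
    """Label the choices A), B), ... and append them to the fixed instruction."""
    if len(choices) != 5:
        raise ValueError("expected exactly 5 answer choices")
    labelled = [f"{letter}) {text}" for letter, text in zip(ascii_uppercase, choices)]
    return PROMPT_PREFIX + " ".join(labelled)

# Hypothetical choices, for illustration only.
print(build_prompt(["bank", "library", "mall", "park", "school"]))
```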

Scoring: Exact match on the selected answer choice (A–E). Random baseline = 20% (5-choice).
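
Scoring can be sketched as follows. How the letter is extracted from a free-form model reply is not specified in this write-up; the rule below (first standalone capital letter A to E) is an assumption, as are the function names.

```python
import re
from typing import Optional

# Assumption: the selected answer is the first standalone capital letter A-E.
CHOICE_RE = re.compile(r"\b([A-E])\b")

def extract_choice(reply: str) -> Optional[str]:
    """Return the first standalone letter A-E in the reply, or None if absent."""
    match = CHOICE_RE.search(reply)
    return match.group(1) if match else None

def accuracy(replies: list[str], gold: list[str]) -> float:
    """Fraction of replies whose extracted letter exactly matches the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(replies, gold))
    return correct / len(gold)

# Hypothetical replies, for illustration only.
replies = ["B", "The answer is C.", "I could not hear the question."]
gold = ["B", "C", "A"]
print(f"accuracy = {accuracy(replies, gold):.2f}")
```

A reply with no recognisable letter counts as wrong. A rule this simple would misread a reply that opens with the article "A", so a real pipeline would likely need stricter parsing or a constrained output format.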

Models evaluated:

| Model | API |
| --- | --- |
| GPT Audio Mini | OpenAI audio API |
| Gemini 3 Flash Preview | Google Generative AI (audio) |
| VoxTral Small | OpenRouter |

100 samples per noise condition per model.


Full results

| Model | Baseline | Background | Gaussian | Overlapping | Notes |
| --- | --- | --- | --- | --- | --- |
| Gemini 3 Flash Preview | 75% | 50% | 59% | 48% | Most robust; stays above random |
| VoxTral Small | 73% | 31% | 46% | 40% | Most sensitive to background noise |
| GPT Audio Mini | 46% | 23% | 20% | 27% | Gaussian noise → exact random baseline |
| Random baseline | 20% | 20% | 20% | 20% | - |

Average response times: GPT Audio Mini 1.71s, VoxTral Small 3.79s, Gemini 3 Flash 6.82s.


Analysis

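Clean baseline gap: GPT Audio Mini reaches only 46% under clean conditions, which is low for a 5-choice task where random is 20%. The synthesised XTTS-v2 voice or the question difficulty may contribute to baseline degradation, but Gemini 3 Flash Preview (75%) and VoxTral Small (73%) score far higher under the same conditions, so much of the gap appears specific to this model.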

Degradation patterns differ by model: VoxTral Small is the only model for which background noise is the worst condition (73% → 31%), and it is relatively robust to Gaussian noise (46%). GPT Audio Mini shows a different pattern: Gaussian noise is its worst condition (20%), worse than background noise (23%) and overlapping speech (27%). This may reflect different audio processing architectures with different noise sensitivity profiles, although with 100 samples per cell, gaps of a few points are within sampling error.
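
With 100 samples per cell, each accuracy carries a sizeable sampling uncertainty, which matters when comparing conditions. The sketch below is not part of the original evaluation. It computes a 95% Wilson score interval using only the standard library, with the function name chosen for illustration.

```python
from math import sqrt

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Two cells from the results table, each out of 100 samples.
for label, correct in [("VoxTral Small, background", 31), ("VoxTral Small, Gaussian", 46)]:
    low, high = wilson_interval(correct, 100)
    print(f"{label}: {low:.1%} to {high:.1%}")
```

At n = 100 the interval spans roughly ±8 to ±10 percentage points for accuracies in this range, so small differences between conditions should be read cautiously.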

Gaussian floor for GPT Audio Mini: At exactly 20% under Gaussian noise, GPT Audio Mini is statistically indistinguishable from random guessing. Gaussian white noise at the tested level is sufficient to completely defeat this model’s audio comprehension.
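
One way to make "indistinguishable from random" concrete is an exact one-sided binomial test against the 20% guessing rate. This test is an illustration added here, not part of the original evaluation, and the function name is hypothetical.

```python
from math import comb

def p_value_above_chance(correct: int, n: int = 100, chance: float = 0.2) -> float:
    """P(X >= correct) for X ~ Binomial(n, chance): how often guessing does this well."""
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(correct, n + 1)
    )

# GPT Audio Mini's correct counts (out of 100) under the three noisy conditions.
for label, correct in [("Gaussian", 20), ("background", 23), ("overlapping", 27)]:
    print(f"GPT Audio Mini, {label}: p = {p_value_above_chance(correct):.3f}")
```

A large p-value means pure guessing would reach that score often. A score of 20 out of 100 sits at the centre of the guessing distribution, so its p-value is above one half.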

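Overlapping speech is not uniformly the hardest condition: The overlapping speech condition (concurrent second voice at 0.7× ratio) produces the lowest accuracy for Gemini 3 Flash Preview (48%) and the second-lowest for VoxTral Small (40%), but it is the least damaging noise condition for GPT Audio Mini (27%). Averaged over the three models, background noise is hardest (34.7%), followed by overlapping speech (38.3%) and Gaussian noise (41.7%). Gemini 3 Flash Preview at 48%, 28 percentage points above the 20% random floor, is the best result under overlapping speech.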
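
The per-condition averages and each model's worst condition can be recomputed directly from the results table. The snippet below does only that arithmetic, with variable names chosen for illustration.

```python
# Accuracy in percent, copied from the results table.
results = {
    "Gemini 3 Flash Preview": {"baseline": 75, "background": 50, "gaussian": 59, "overlapping": 48},
    "VoxTral Small": {"baseline": 73, "background": 31, "gaussian": 46, "overlapping": 40},
    "GPT Audio Mini": {"baseline": 46, "background": 23, "gaussian": 20, "overlapping": 27},
}
NOISY = ["background", "gaussian", "overlapping"]

# Mean accuracy across the three models for each noisy condition.
mean_by_condition = {
    cond: sum(row[cond] for row in results.values()) / len(results) for cond in NOISY
}
# The noisy condition with the lowest accuracy for each model.
hardest_by_model = {model: min(NOISY, key=row.get) for model, row in results.items()}

for cond, mean in mean_by_condition.items():
    print(f"{cond}: mean {mean:.1f}%")
for model, cond in hardest_by_model.items():
    print(f"{model}: hardest condition is {cond}")
```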

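Cocktail-party effect partially supported: Overlapping speech lowers accuracy for every tested model (75% → 48%, 73% → 40%, 46% → 27%), consistent with weak selective attention across two concurrent speech streams. It does not collapse accuracy to the random baseline, however: even under overlapping speech, Gemini, the most capable audio model in this evaluation, performs at 48%, well above the 20% random baseline.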


Comparison: ASCII vs audio robustness

| CAPTCHA type | Best model accuracy | Against random |
| --- | --- | --- |
| ASCII art (image input) | 0.16% (Gemini) | N/A (binary task) |
| Audio (overlapping speech) | 48% (Gemini) | vs. 20% random baseline |

ASCII art CAPTCHAs are currently stronger: no model achieves meaningful exact accuracy. Audio CAPTCHAs under overlapping speech degrade models significantly but do not fully defeat them; Gemini retains 28 percentage points above random. A combined approach (requiring both) would likely be stronger still.
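
The margin above random under overlapping speech can be tabulated for all three models from the results table. This is plain arithmetic on the reported figures; the variable names are illustrative.

```python
RANDOM_BASELINE = 20  # percent, for a 5-choice task

# Overlapping-speech accuracy in percent, copied from the results table.
overlapping = {
    "Gemini 3 Flash Preview": 48,
    "VoxTral Small": 40,
    "GPT Audio Mini": 27,
}

# Percentage points each model retains above random guessing.
margin = {model: acc - RANDOM_BASELINE for model, acc in overlapping.items()}

for model, points in margin.items():
    print(f"{model}: {points} points above random")
```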
