Design rationale
The audio CAPTCHA targets the cocktail-party effect: the human ability to selectively attend to one voice in a noisy environment containing multiple competing audio streams. This is a deeply evolved perceptual capability that current audio-language models are hypothesised to lack.
The hypothesis: under clean conditions, LLMs can transcribe and answer spoken QA questions reasonably well. Adding overlapping speech (a second concurrent voice speaking different content) should cause audio separation to fail, collapsing accuracy toward the random baseline.
Four conditions are tested: a clean baseline plus three augmentations designed to stress audio separation:
| Condition | Description |
|---|---|
| Baseline | Clean synthesised speech, no augmentation |
| Background noise | Café/ambient noise mixed at 5× boost |
| Gaussian noise | White noise at RMS-normalised level 1.70 |
| Overlapping speech | Target audio mixed with a second concurrent speech sample at 0.7× ratio |
Data generation
Question source: CommonsenseQA, a benchmark of 5-choice multiple-choice questions requiring common-sense reasoning. 100 questions are sampled per condition (400 total evaluation samples).
Audio synthesis: XTTS-v2 (Coqui TTS) generates WAV files at 24 kHz. Each question and its 5 answer choices are synthesised separately.
- Average generation time: ~2.1 seconds per sample
- Voice: a single consistent speaker profile across all samples
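The synthesis step could look roughly like the sketch below. It uses the Coqui `TTS` Python API with the XTTS-v2 model name; the file-naming scheme, the `speaker_wav` reference clip, the output directory and the `"en"` language code are all assumptions made for illustration, not details taken from the project.

```python
from pathlib import Path

# Model identifier for XTTS-v2 in Coqui TTS.
XTTS_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"

def utterances(question_id, question, choices):
    """Return (filename, text) pairs: one for the question, one per answer choice.

    The naming scheme is hypothetical.
    """
    items = [(f"{question_id}_question.wav", question)]
    for letter, choice in zip("ABCDE", choices):
        items.append((f"{question_id}_choice_{letter}.wav", choice))
    return items

def synthesise(items, speaker_wav, out_dir):
    """Synthesise each text to its own WAV file with one reference speaker."""
    # Imported lazily so the helper above works without Coqui TTS installed.
    from TTS.api import TTS

    tts = TTS(XTTS_MODEL)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, text in items:
        tts.tts_to_file(
            text=text,
            speaker_wav=speaker_wav,  # assumed reference clip for the single voice
            language="en",            # assumed language
            file_path=str(out / filename),
        )
```

Reusing the same `speaker_wav` for every call is one way to keep a single consistent speaker profile across all samples.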
Noise augmentation (src/audio-captcha/util.py):
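```python
from math import log10

import numpy as np
from pydub import AudioSegment

# Background noise (café): raise the noise track by `boost` (a power ratio) before mixing
def add_background_noise(audio, bg_path, boost=5.0):
    bg = AudioSegment.from_wav(bg_path)
    return audio.overlay(bg + 10 * log10(boost))

# Gaussian noise: standard deviation is `level` times the signal RMS
def add_gaussian_noise(audio, level=1.70):
    # Work in float64: squaring int16 samples overflows
    samples = np.array(audio.get_array_of_samples(), dtype=np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    noise = np.random.normal(0, rms * level, len(samples))
    # Clip to the 16-bit range (assumes 16-bit audio) so loud samples do not wrap around
    noisy = np.clip(samples + noise, -32768, 32767).astype(np.int16)
    return audio._spawn(noisy.tobytes())

# Overlapping speech: attenuate the second voice to `ratio` (a power ratio) of its level
def combine_audio_files(base, overlapping, ratio=0.7):
    return base.overlay(overlapping - 10 * log10(1 / ratio))
```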
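To make the mixing levels quoted in the conditions table concrete, the sketch below converts them to decibels. It assumes that `boost` and `ratio` are power ratios, as the `10 * log10(...)` in the helpers suggests; if they were meant as amplitude ratios the factor would be 20. The function names are illustrative and not part of `util.py`.

```python
from math import log10

def power_ratio_to_db(ratio: float) -> float:
    """Convert a power ratio to decibels."""
    return 10 * log10(ratio)

def gaussian_snr_db(level: float) -> float:
    """SNR in dB when the noise standard deviation is `level` times the signal RMS."""
    # For zero-mean noise the RMS equals the standard deviation,
    # so the signal-to-noise amplitude ratio is 1 / level.
    return -20 * log10(level)

print(f"background boost 5x : {power_ratio_to_db(5.0):+.2f} dB")
print(f"overlap ratio 0.7   : {power_ratio_to_db(0.7):+.2f} dB")
print(f"Gaussian level 1.70 : SNR {gaussian_snr_db(1.70):+.2f} dB")
```

Under these assumptions the background track is raised by about 7 dB, the overlapping voice is lowered by about 1.5 dB, and the Gaussian condition has an SNR of roughly -4.6 dB, meaning the noise is louder than the speech.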
Evaluation pipeline
Each augmented audio file is sent to the model along with the 5 answer choices (A–E) as text. The model must listen and select the correct answer.
Prompt format:
“Listen to the audio and answer the multiple-choice question. Choose from: A) … B) … C) … D) … E) …”
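A prompt in this format can be assembled from the five answer choices as in the sketch below. The function name `build_prompt` is hypothetical, and the exact separators between options are an assumption, since the format above elides them.

```python
def build_prompt(choices):
    """Build the text prompt sent alongside the audio file.

    `choices` is a list of exactly five answer strings, labelled A to E in order.
    """
    if len(choices) != 5:
        raise ValueError("expected exactly 5 answer choices")
    options = " ".join(f"{letter}) {text}" for letter, text in zip("ABCDE", choices))
    return (
        "Listen to the audio and answer the multiple-choice question. "
        f"Choose from: {options}"
    )

print(build_prompt(["bank", "library", "mall", "park", "school"]))
```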
Scoring: Exact match on the selected answer choice (A–E). Random baseline = 20% (5-choice).
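Exact-match scoring can be sketched as follows. How the selected letter is parsed from the model's reply is not described here, so the parsing rule below is an assumption: the reply must begin with a single letter A to E (optionally in parentheses), and anything else is scored as wrong. The names `extract_choice` and `accuracy` are illustrative.

```python
import re
from typing import Optional, Sequence

def extract_choice(response: str) -> Optional[str]:
    """Return the leading answer letter (A-E) of a reply, or None if there is none."""
    # Accept "B", "b", "B)", "(B)", "B." but not words such as "Apple".
    match = re.match(r"\s*\(?([A-Ea-e])(?![A-Za-z])", response)
    return match.group(1).upper() if match else None

def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of replies whose extracted letter exactly matches the gold letter."""
    if len(responses) != len(gold) or not gold:
        raise ValueError("responses and gold must be non-empty and the same length")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

print(accuracy(["A", "B) river", "The answer is C"], ["A", "C", "C"]))
```

A strict parser like this undercounts replies that bury the letter in a sentence; a looser rule would trade that for the risk of matching stray letters.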
Models evaluated:
| Model | API |
|---|---|
| GPT Audio Mini | OpenAI audio API |
| Gemini 3 Flash Preview | Google Generative AI (audio) |
| VoxTral Small | OpenRouter |
100 samples per noise condition per model.
Full results
| Model | Baseline | Background | Gaussian | Overlapping | Notes |
|---|---|---|---|---|---|
| Gemini 3 Flash Preview | 75% | 50% | 59% | 48% | Most robust; stays above random |
| VoxTral Small | 73% | 31% | 46% | 40% | Most sensitive to background noise |
| GPT Audio Mini | 46% | 23% | 20% | 27% | Gaussian noise → exact random baseline |
| Random baseline | 20% | 20% | 20% | 20% | n/a |
Average response times: GPT Audio Mini 1.71 s, VoxTral Small 3.79 s, Gemini 3 Flash Preview 6.82 s.
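The per-condition averages and per-model drops can be recomputed directly from the results table. The sketch below hard-codes the table values; the helper names are illustrative.

```python
# Accuracy (%) per model and condition, copied from the results table.
results = {
    "Gemini 3 Flash Preview": {"Baseline": 75, "Background": 50, "Gaussian": 59, "Overlapping": 48},
    "VoxTral Small": {"Baseline": 73, "Background": 31, "Gaussian": 46, "Overlapping": 40},
    "GPT Audio Mini": {"Baseline": 46, "Background": 23, "Gaussian": 20, "Overlapping": 27},
}
conditions = ["Baseline", "Background", "Gaussian", "Overlapping"]

def mean_accuracy(condition):
    """Mean accuracy across the three models for one condition."""
    return sum(row[condition] for row in results.values()) / len(results)

def drop_from_baseline(model, condition):
    """Percentage-point drop relative to the model's clean baseline."""
    return results[model]["Baseline"] - results[model][condition]

for condition in conditions:
    print(f"{condition:<12} mean accuracy {mean_accuracy(condition):5.1f}%")

for model in results:
    drops = {c: drop_from_baseline(model, c) for c in conditions[1:]}
    print(f"{model:<24} drops (pp): {drops}")
```

The condition means come out at roughly 64.7%, 34.7%, 41.7% and 38.3% for baseline, background, Gaussian and overlapping respectively.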
Analysis
Clean baseline gap: GPT Audio Mini scores only 46% under clean conditions, already low for a 5-choice task where random is 20%. The synthesised XTTS-v2 voice or the question difficulty may contribute, but Gemini 3 Flash Preview and VoxTral Small reach 75% and 73% on the same audio, so much of the gap appears specific to this model.
Degradation patterns differ by model: VoxTral Small is the most sensitive to background noise (73% → 31%) but holds up better under Gaussian noise (46%). GPT Audio Mini shows the reverse ordering: its lowest score is under Gaussian noise (20%) and its highest noisy-condition score is under overlapping speech (27%), although with 100 samples per condition these differences are small. This may reflect different audio processing architectures with different noise sensitivity profiles.
Gaussian floor for GPT Audio Mini: At exactly 20% under Gaussian noise, GPT Audio Mini is statistically indistinguishable from random guessing. Gaussian white noise at the tested level is sufficient to completely defeat this model’s audio comprehension.
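Whether a score on 100 samples is distinguishable from the 20% chance level can be checked with a confidence interval and an exact binomial tail. The sketch below uses only the standard library; the choice of a 95% Wilson interval is an assumption for illustration, not a method stated in this evaluation.

```python
from math import comb, sqrt

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion k/n (z=1.96 gives about 95%)."""
    phat = k / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

n, chance = 100, 0.20
for label, correct in [("Gaussian", 20), ("Background", 23), ("Overlapping", 27)]:
    low, high = wilson_interval(correct, n)
    tail = binomial_tail(correct, n, chance)
    print(f"GPT Audio Mini {label:<12} {correct}/100  95% CI [{low:.3f}, {high:.3f}]  P(>= k | chance) = {tail:.3f}")
```

With n = 100, the 95% Wilson intervals for 20, 23 and 27 correct answers all contain 0.20, so by this measure none of GPT Audio Mini's three noisy-condition scores is clearly separable from chance.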
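Overlapping speech is hard but not uniformly the hardest: The overlapping speech condition (concurrent second voice at 0.7× ratio) produces the lowest accuracy for Gemini 3 Flash Preview (48%), the second-lowest for VoxTral Small (40%), and the highest of the three noisy conditions for GPT Audio Mini (27%). Averaged across models, background noise is the most damaging condition (about 35%), followed by overlapping speech (about 38%) and Gaussian noise (about 42%). Gemini's 48%, which is 28 points above the 20% random floor, is the best result under overlapping speech.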
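Cocktail-party effect partially supported: Overlapping speech cuts accuracy substantially for every model (Gemini 75% → 48%, VoxTral Small 73% → 40%, GPT Audio Mini 46% → 27%), which is consistent with weak selective attention over concurrent speech streams. Accuracy does not collapse to the random baseline, however: even under overlapping speech, Gemini, the most capable audio model in this evaluation, stays 28 points above chance.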
Comparison: ASCII vs audio robustness
| CAPTCHA type | Best model accuracy | Against random |
|---|---|---|
| ASCII art (image input) | 0.16% (Gemini) | N/A (binary task) |
| Audio (overlapping speech) | 48% (Gemini) | vs. 20% random baseline |
ASCII art CAPTCHAs are currently stronger: no model achieves meaningful exact accuracy. Audio CAPTCHAs under overlapping speech degrade models significantly but do not fully defeat them; Gemini remains 28 percentage points above random. A combined approach (requiring both) would likely be stronger still.