Audio Models

Overview

Audio models handle speech recognition, audio understanding, and speech/audio generation.
The choice of model depends on the output type:
- Text output (ASR)
- Class/embedding output (classification, retrieval)
- Waveform output (TTS, vocoders, generative audio)

Automatic Speech Recognition (ASR)

Main Approaches

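CTC-based models: predict per-frame token probabilities (including a blank symbol) under a monotonic alignment.
Encoder-decoder seq2seq: an attention-based decoder generates the transcript autoregressively.
Transducer (RNN-T): combines a frame-synchronous encoder with a prediction network over previously emitted tokens; a streaming-friendly compromise between CTC and seq2seq.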

CTC Objective

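$$\mathcal{L}_{\text{CTC}}=-\log P(y\mid x),\qquad P(y\mid x)=\sum_{\pi\in\mathcal{B}^{-1}(y)}\prod_{t=1}^{T}P(\pi_t\mid x)$$

$x$ is the acoustic input and $y$ is the transcript.
A frame is a short time slice of audio features produced after windowing the waveform (for example, a ~25 ms window advanced in ~10 ms steps).
CTC sums over all valid alignments $\pi$ between the $T$ frames and the tokens; $\mathcal{B}$ maps an alignment to a transcript by merging repeated tokens and then removing blanks.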
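
The sum over alignments is normally computed with a forward (dynamic programming) recursion rather than by enumeration. The following is a minimal pure-Python sketch of that recursion for a single utterance. The function names, the list-of-lists input layout, and the choice of token id 0 for the blank are assumptions made for illustration; production code would use a framework's batched CTC loss instead.

```python
import math

def log_sum_exp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    m = max(finite)
    return m + math.log(sum(math.exp(v - m) for v in finite))

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Return -log P(y | x) for one utterance.

    log_probs: list of T frames, each a list of per-token log-probabilities.
    target:    list of token ids without blanks (assumed layout).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    S = len(ext)

    # alpha[s]: log-probability of all partial alignments ending at ext[s]
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * S
        for s in range(S):
            candidates = [alpha[s]]                 # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])     # advance by one symbol
            # Skipping a blank is allowed only between two different tokens
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new[s] = log_sum_exp(*candidates) + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last token or on the trailing blank
    total = log_sum_exp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total
```

With two frames, uniform probabilities over {blank, a}, and target "a", three alignments are valid (a a, blank a, a blank), so the function returns -log(0.75).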

Typical ASR Pipeline

Step 1: Audio preprocessing

Resample, normalize, and chunk long recordings.
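
A rough sketch of the three preprocessing operations, using only the standard library. The helper names are hypothetical, and the linear-interpolation resampler is a deliberate simplification: it applies no anti-aliasing filter, so a real pipeline should use a proper resampler from an audio library.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Crude linear-interpolation resampler (no anti-aliasing filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional index into the source
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

def peak_normalize(samples, peak=0.95):
    """Scale so the largest absolute sample equals `peak`."""
    top = max((abs(s) for s in samples), default=0.0)
    if top == 0.0:
        return list(samples)
    return [s * peak / top for s in samples]

def chunk(samples, chunk_len, overlap=0):
    """Split a long recording into chunks that share `overlap` samples."""
    step = chunk_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_len")
    chunks, start = [], 0
    while start < len(samples):
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break
        start += step
    return chunks
```

Overlapping chunks are one way to avoid cutting words at chunk boundaries; the transcripts of adjacent chunks then have to be merged downstream.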

Step 2: Feature/encoder forward

Extract log-mel features or use raw-waveform frontend.
Run encoder to produce frame representations.
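
To make the notion of frames concrete, here is a small sketch that slices a waveform into overlapping windows, which is the step that precedes log-mel computation. The function name and the 25 ms / 10 ms defaults are assumptions; the settings that matter in practice are the ones the chosen checkpoint was trained with.

```python
def frame_signal(samples, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Slice a waveform into overlapping frames (no padding)."""
    win = int(sample_rate * win_ms / 1000)   # samples per window
    hop = int(sample_rate * hop_ms / 1000)   # samples between frame starts
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be at least one sample")
    if len(samples) < win:
        return []
    n_frames = 1 + (len(samples) - win) // hop
    return [samples[i * hop:i * hop + win] for i in range(n_frames)]
```

Under these defaults, one second of 16 kHz audio gives 400-sample windows every 160 samples, that is 1 + (16000 - 400) // 160 = 98 frames.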

Step 3: Decode text

Greedy or beam search for CTC outputs; beam search over decoder hypotheses for encoder-decoder and transducer models.
Add language model fusion to the beam search when needed.
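
Greedy CTC decoding is the simplest of these options: take the most likely token per frame, merge consecutive repeats, then drop blanks. A minimal sketch, assuming token id 0 is the blank and the input is a list of per-frame score lists (both are illustrative choices):

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded, prev = [], None
    for tok in best:
        # A token is emitted only when it differs from the previous frame
        if tok != prev and tok != blank:
            decoded.append(tok)
        prev = tok
    return decoded
```

A blank between two identical tokens keeps them separate, which is how CTC represents doubled tokens such as the "ll" in "hello".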

Step 4: Evaluate

Report word error rate (WER) and character error rate (CER).
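
Both metrics are the edit distance (substitutions + deletions + insertions) between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. A minimal sketch with hypothetical function names; text normalization (casing, punctuation) is left out and would need to match the evaluation protocol in use.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate against a non-empty reference string."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate against a non-empty reference string."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.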

Audio Classification and Event Detection

Tasks

Speech emotion recognition.
Speaker identification/verification.
Acoustic scene and event classification.

Model Families

CNN/CRNN on log-mel spectrograms.
Transformer encoders on spectrogram patches.
Pretrained audio encoders + linear head.

Typical Objective

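$$\mathcal{L}_{\text{CE}}=-\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log \hat{p}_{i,c}$$

$N$ is the number of examples, $C$ is the number of classes, $y_{i,c}$ is the one-hot label, and $\hat{p}_{i,c}$ is the predicted probability of class $c$ for example $i$.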
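
A direct transcription of the cross-entropy sum into Python, as a sanity-check sketch rather than a training loss. The function name and the `eps` clamp (which avoids log(0)) are assumptions; the sum is taken over examples as in the formula, not averaged.

```python
import math

def cross_entropy(one_hot_labels, predicted_probs, eps=1e-12):
    """Summed cross-entropy over N examples and C classes."""
    total = 0.0
    for y_i, p_i in zip(one_hot_labels, predicted_probs):
        for y_ic, p_ic in zip(y_i, p_i):
            if y_ic:  # only the true class contributes for one-hot labels
                total -= y_ic * math.log(max(p_ic, eps))
    return total
```

For two examples whose true classes receive probabilities 0.5 and 0.75, the loss is -(log 0.5 + log 0.75).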

Practical Notes

Handle class imbalance explicitly

Use class-balanced sampling or loss reweighting for rare events.
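
One common reweighting heuristic is inverse class frequency, scaled so that a perfectly balanced dataset gets weight 1.0 for every class. The sketch below assumes that heuristic and a hypothetical function name; other schemes (for example, smoothed or capped weights) are equally valid.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
```

The resulting weights can multiply the per-example loss, or be used as sampling weights so that rare events appear more often in each batch.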

Use augmentation aligned with audio noise conditions

SpecAugment (time and frequency masking), background noise, and reverb often improve robustness.
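
As an illustration of the masking idea, here is a sketch of time masking in the style of SpecAugment on a spectrogram stored as a list of frames. The function name, the list-of-lists layout, and the zero fill value are assumptions; frequency masking follows the same pattern along the other axis.

```python
import random

def time_mask(spectrogram, max_width, rng=None, fill=0.0):
    """Mask one random run of consecutive frames; returns a new spectrogram.

    spectrogram: list of frames, each a list of mel-bin values (assumed layout).
    """
    rng = rng or random.Random()
    n_frames = len(spectrogram)
    width = rng.randint(0, min(max_width, n_frames))  # mask length in frames
    start = rng.randint(0, n_frames - width)          # first masked frame
    masked = [list(frame) for frame in spectrogram]   # copy, keep input intact
    for t in range(start, start + width):
        masked[t] = [fill] * len(masked[t])
    return masked
```

Passing a seeded `random.Random` makes the augmentation reproducible, which helps when debugging a data pipeline.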

Audio-Language Models

Core Idea

Combine an audio encoder with a language-model decoder, typically bridged by a projector.
Support instruction following over audio, speech QA, and multimodal chat.

Typical Architecture

Audio encoder produces token/segment embeddings.
Projection layer maps audio embeddings into LLM embedding space.
LLM decodes text conditioned on audio context.
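
The projection step can be sketched as a single linear map applied to every audio embedding, after which the projected sequence is joined with the text token embeddings. Everything named here is hypothetical: real systems may use an MLP or a cross-attention adapter instead of one linear layer, and placing audio before text is only one possible layout.

```python
def project(audio_embeddings, weight, bias):
    """Apply y = W x + b to each audio embedding.

    audio_embeddings: list of vectors of size d_audio
    weight:           d_llm rows, each of size d_audio
    bias:             vector of size d_llm
    """
    return [
        [sum(w * x for w, x in zip(row, emb)) + b for row, b in zip(weight, bias)]
        for emb in audio_embeddings
    ]

def build_llm_input(audio_embeddings, text_embeddings, weight, bias):
    """Concatenate projected audio embeddings with text embeddings."""
    return project(audio_embeddings, weight, bias) + list(text_embeddings)
```

The LLM then attends over the combined sequence, so the audio context conditions every generated text token.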

Applications

Speech understanding with long-form reasoning.
Audio captioning.
Cross-modal retrieval and question answering.

Speech and Audio Generation

Text-to-Speech (TTS)

Input text -> acoustic model (for example, predicting a mel spectrogram) -> vocoder -> waveform.
Many modern systems use diffusion, flow-based, or neural-codec decoders to improve quality.

Voice Conversion

Preserve linguistic content while changing speaker identity/style.
Often uses disentangled speaker/content embeddings.

Music/Sound Generation

Autoregressive token models or diffusion over audio latents.
Conditioning can include text, melody, or style prompts.

Example Models

Whisper: robust multilingual ASR encoder-decoder.
wav2vec 2.0 / HuBERT-based ASR stacks: strong speech encoder backbones.
Conformer ASR models: strong local+global sequence modeling for speech.
Qwen-Audio-style models: audio-language instruction systems.

Model Selection Checklist

Need streaming? Choose a transducer or streaming Conformer.
Need the best offline transcription quality? Consider encoder-decoder ASR with beam search.
Need a low-label setup? Use a pretrained encoder plus a lightweight task head.
Need multimodal interaction? Use an audio-language model with instruction tuning.

Practical Notes

Preprocessing Consistency

Match the sample rate and frontend settings to what the pretrained checkpoint assumes (for example, Whisper and wav2vec 2.0 expect 16 kHz audio).

Domain Robustness

Domain mismatch (accent, channel, background noise) can dominate real-world errors.

Evaluation Strategy

Evaluate by domain slice, not only aggregate metrics.
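
A sketch of slice-level reporting: WER is accumulated per domain tag and also over the whole set, so a weak slice cannot hide behind a good aggregate. The `(domain, reference, hypothesis)` tuple layout, the function names, and the `"__all__"` key are assumptions made for illustration.

```python
from collections import defaultdict

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]

def wer_by_slice(samples):
    """samples: iterable of (domain, reference, hypothesis) tuples."""
    errors, words = defaultdict(int), defaultdict(int)
    for domain, ref, hyp in samples:
        ref_words = ref.split()
        errors[domain] += edit_distance(ref_words, hyp.split())
        words[domain] += len(ref_words)
    # Pool errors and words per slice instead of averaging per-utterance WER
    report = {d: errors[d] / words[d] for d in words if words[d] > 0}
    total_words = sum(words.values())
    report["__all__"] = sum(errors.values()) / total_words if total_words else 0.0
    return report
```

For example, a set with eight error-free "clean" words and four "noisy" words containing two errors has an aggregate WER of about 0.17, while the noisy slice alone sits at 0.5.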
