Overview
- Audio models handle speech recognition, audio understanding, and speech/audio generation.
- Choice of model depends on output type:
  - Text output (ASR).
  - Class/embedding output (classification, retrieval).
  - Waveform output (TTS, vocoders, generative audio).
Automatic Speech Recognition (ASR)
Main Approaches
- CTC-based models: predict frame-level token probabilities with monotonic alignment.
- Encoder-decoder seq2seq: decoder generates transcript autoregressively.
- Transducer (RNN-T): streaming-friendly compromise between CTC and seq2seq.
CTC Objective
- x is the acoustic input (a sequence of feature frames) and y is the target transcript (a sequence of tokens).
- A frame is a short time slice of audio features (for example, ~10-25 ms) produced after windowing the waveform.
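- CTC sums over all valid monotonic alignments between frames and tokens: P(y | x) is the total probability of every frame-level path (with blanks and repeated tokens) that collapses to y, and the training loss is -log P(y | x).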
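The sum over alignments can be sketched in a few lines. This is a minimal illustration, not a production loss: it assumes per-frame probabilities are already available as a NumPy array, uses index 0 as the blank token, and works in probability space (real implementations use log space for numerical stability). The function names are hypothetical. A brute-force enumerator is included only to check the dynamic program on tiny inputs.

```python
import itertools
import numpy as np

def ctc_prob_dp(probs, target, blank=0):
    """P(target | x) via the CTC forward recursion.
    probs: (T, V) array of per-frame token probabilities."""
    T = probs.shape[0]
    # Extended target: a blank before, between, and after every token.
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]              # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]     # advance by one position
            # Skipping a blank is allowed only between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]
    result = alpha[T - 1, S - 1]
    if S > 1:
        result += alpha[T - 1, S - 2]
    return result

def ctc_loss(probs, target, blank=0):
    return -np.log(ctc_prob_dp(probs, target, blank))

def collapse(path, blank=0):
    """Merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

def ctc_prob_brute(probs, target, blank=0):
    """Enumerate every frame-level path; only feasible for tiny T and V."""
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path, blank) == list(target):
            total += np.prod([probs[t, k] for t, k in enumerate(path)])
    return total
```

The dynamic program does the same sum as the enumerator but in O(T * len(target)) time instead of O(V^T).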
Typical ASR Pipeline
Step 1: Audio preprocessing
- Resample, normalize, and chunk long recordings.
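A rough sketch of this step with NumPy only. The helper names, the 30-second chunk length, and the 1-second overlap are illustrative assumptions, not values from these notes. The linear-interpolation resampler has no anti-aliasing filter, so a dedicated resampling library is preferable in practice.

```python
import numpy as np

def resample_linear(wave, orig_sr, target_sr):
    """Crude resampler: linear interpolation, no anti-aliasing."""
    if orig_sr == target_sr:
        return wave.astype(np.float32)
    n_out = int(round(len(wave) * target_sr / orig_sr))
    old_t = np.arange(len(wave)) / orig_sr
    new_t = np.arange(n_out) / target_sr
    return np.interp(new_t, old_t, wave).astype(np.float32)

def peak_normalize(wave, eps=1e-8):
    """Scale so the largest absolute sample is about 1."""
    return wave / (np.max(np.abs(wave)) + eps)

def chunk(wave, sr, chunk_s=30.0, overlap_s=1.0):
    """Split into fixed-length chunks that overlap by overlap_s seconds."""
    size = int(chunk_s * sr)
    step = size - int(overlap_s * sr)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    chunks = []
    for start in range(0, len(wave), step):
        chunks.append(wave[start:start + size])
        if start + size >= len(wave):
            break  # the last chunk already reaches the end of the recording
    return chunks
```

The overlap gives a decoder some shared context at chunk boundaries; how the overlapping transcripts are merged afterwards is a separate design choice.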
Step 2: Feature/encoder forward
- Extract log-mel features or use raw-waveform frontend.
- Run encoder to produce frame representations.
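For the log-mel path, a minimal NumPy sketch follows. The 25 ms window, 10 ms hop, 512-point FFT, and 80 mel bins are common defaults chosen here as assumptions; a pretrained checkpoint may expect different settings, and its own feature extractor should be used when available.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with centers evenly spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, center, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel(wave, sr=16000, win_ms=25.0, hop_ms=10.0, n_fft=512, n_mels=80):
    """Return log-mel features of shape (n_frames, n_mels)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(wave) - win) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)
```

Each output row is one frame in the sense used under "CTC Objective": a short window of audio summarized as a feature vector.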
Step 3: Decode text
- Greedy/beam search for CTC.
- Beam search with language model fusion when needed.
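Greedy CTC decoding is simple enough to show in full: take the best token per frame, merge consecutive repeats, then drop blanks. The sketch below assumes a (frames, vocabulary) score matrix, a character vocabulary, and blank at index 0; the function name is hypothetical. Beam search and language model fusion are not shown.

```python
import numpy as np

def ctc_greedy_decode(log_probs, vocab, blank=0):
    """log_probs: (T, V) per-frame scores; vocab: list of V symbols."""
    best = np.argmax(log_probs, axis=1)
    tokens = []
    prev = None
    for idx in best:
        # Emit only when the symbol changes and is not the blank.
        if idx != prev and idx != blank:
            tokens.append(vocab[idx])
        prev = idx
    return "".join(tokens)
```

Because a blank resets `prev`, the frame sequence a, a, blank, a decodes to "aa", which is how CTC represents genuinely doubled characters.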
Step 4: Evaluate
- Report WER and CER.
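Both metrics are edit distance divided by reference length, computed over words for WER and over characters for CER. A minimal sketch, assuming whitespace tokenization and no text normalization (casing and punctuation handling usually need to be fixed before scoring); function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)
```

Both values can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.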
Audio Classification and Event Detection
Tasks
- Speech emotion recognition.
- Speaker identification/verification.
- Acoustic scene and event classification.
Model Families
- CNN/CRNN on log-mel spectrograms.
- Transformer encoders on spectrogram patches.
- Pretrained audio encoders + linear head.
Typical Objective
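For single-label tasks such as those listed above, a common choice is the cross-entropy between the predicted class distribution and the label. The notation below is assumed for illustration: N training examples, C classes, logits z_{i,c} from the model, and y_i the correct class of example i.

```latex
\mathcal{L}_{\mathrm{CE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp(z_{i,y_i})}{\sum_{c=1}^{C} \exp(z_{i,c})}
```

Multi-label event detection, where several events can be active at once, typically replaces the softmax with a per-class sigmoid and binary cross-entropy.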
Practical Notes
Handle class imbalance explicitly
- Use class-balanced sampling or loss reweighting for rare events.
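One way to reweight the loss is with inverse-frequency class weights. The sketch below is one possible scheme among several, with hypothetical function names; it normalizes the weights so that a perfectly balanced dataset gives every class weight 1.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Weight each class by n_samples / (n_seen_classes * class_count)."""
    counts = np.bincount(labels, minlength=n_classes).astype(np.float64)
    weights = np.zeros(n_classes)
    seen = counts > 0  # classes absent from the data keep weight 0
    weights[seen] = len(labels) / (seen.sum() * counts[seen])
    return weights

def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy where each example is weighted by its class weight."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    w = class_weights[labels]
    return float((w * nll).sum() / w.sum())
```

The same weights can instead drive a sampler (draw each example with probability proportional to its class weight), which is the class-balanced sampling alternative.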
Use augmentation aligned with audio noise conditions
- SpecAugment, time masking, background noise, and reverb often improve robustness.
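Two of these augmentations can be sketched with NumPy: SpecAugment-style time and frequency masking on a spectrogram, and mixing background noise into a waveform at a chosen signal-to-noise ratio. Mask counts, maximum widths, and the mean fill value are illustrative assumptions; time warping and reverb are not shown.

```python
import numpy as np

def spec_augment(spec, rng, n_time_masks=2, max_time=20,
                 n_freq_masks=2, max_freq=8):
    """Mask random time and frequency bands of a (n_frames, n_mels) array."""
    out = spec.copy()
    n_frames, n_mels = out.shape
    fill = out.mean()
    for _ in range(n_time_masks):
        width = min(int(rng.integers(0, max_time + 1)), n_frames)
        start = int(rng.integers(0, n_frames - width + 1))
        out[start:start + width, :] = fill
    for _ in range(n_freq_masks):
        width = min(int(rng.integers(0, max_freq + 1)), n_mels)
        start = int(rng.integers(0, n_mels - width + 1))
        out[:, start:start + width] = fill
    return out

def mix_at_snr(speech, noise, snr_db):
    """Add noise scaled so the speech-to-noise power ratio equals snr_db.
    Assumes noise is at least as long as speech and is not silent."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Choosing the SNR range and noise recordings to resemble the deployment environment is what keeps the augmentation aligned with real conditions.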
Audio-Language Models
Core Idea
- Combine an audio encoder with a language model decoder, typically bridged by a projection module.
- Support instruction following over audio, speech QA, and multimodal chat.
Typical Architecture
- Audio encoder produces token/segment embeddings.
- Projection layer maps audio embeddings into LLM embedding space.
- LLM decodes text conditioned on audio context.
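The data flow through these three components can be shown with shapes alone. This is a toy sketch: the projector is assumed to be a single linear layer (real systems may use an MLP or a cross-attention adapter), the audio embeddings are assumed to be prepended to the text prompt, and all names and dimensions are hypothetical.

```python
import numpy as np

def project_audio(audio_emb, W, b):
    """Map (n_audio, d_audio) encoder outputs into the LLM embedding space."""
    return audio_emb @ W + b

def build_llm_inputs(audio_emb, text_emb, W, b):
    """Prepend projected audio embeddings to the text prompt embeddings."""
    projected = project_audio(audio_emb, W, b)
    return np.concatenate([projected, text_emb], axis=0)

rng = np.random.default_rng(0)
d_audio, d_llm = 16, 32
W = rng.normal(scale=0.02, size=(d_audio, d_llm))  # learned in practice
b = np.zeros(d_llm)
audio_emb = rng.normal(size=(50, d_audio))  # 50 audio segments from the encoder
text_emb = rng.normal(size=(7, d_llm))      # 7 prompt tokens, already embedded
llm_inputs = build_llm_inputs(audio_emb, text_emb, W, b)
```

The LLM then attends over `llm_inputs` as if the projected audio rows were ordinary token embeddings.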
Applications
- Speech understanding with long-form reasoning.
- Audio captioning.
- Cross-modal retrieval and question answering.
Speech and Audio Generation
Text-to-Speech (TTS)
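- Input text -> acoustic model (predicts mel-spectrograms or other intermediate acoustic features) -> vocoder -> waveform.
- Many modern systems use diffusion, flow-based, or neural-codec decoders to improve output quality.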
Voice Conversion
- Preserve linguistic content while changing speaker identity/style.
- Often uses disentangled speaker/content embeddings.
Music/Sound Generation
- Autoregressive token models or diffusion over audio latents.
- Conditioning can include text, melody, or style prompts.
Example Models
- Whisper: robust multilingual ASR encoder-decoder.
- wav2vec 2.0 / HuBERT-based ASR stacks: strong speech encoder backbones.
- Conformer ASR models: strong local+global sequence modeling for speech.
- Qwen-Audio-style models: audio-language instruction systems.
Model Selection Checklist
- Need streaming? Choose a transducer or a streaming Conformer.
- Need the best offline transcription quality? Choose encoder-decoder ASR with beam search.
- Need to work with few labels? Choose a pretrained encoder plus a lightweight task head.
- Need multimodal interaction? Choose an instruction-tuned audio-language model.
Practical Notes
Preprocessing Consistency
- Match sample rate and frontend settings to pretrained checkpoint assumptions.
Domain Robustness
- Domain mismatch (accent, channel, background noise) can dominate real-world errors.
Evaluation Strategy
- Evaluate by domain slice, not only aggregate metrics.
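Slice-level evaluation amounts to grouping errors by a domain tag before dividing. The sketch below assumes each test sample carries a domain label and pools word errors within each slice; the "__all__" key for the aggregate and the function names are illustrative.

```python
from collections import defaultdict

def word_errors(ref, hyp):
    """Return (edit distance over words, number of reference words)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1], len(r)

def wer_by_slice(samples):
    """samples: iterable of (domain, reference, hypothesis) triples."""
    errs, words = defaultdict(int), defaultdict(int)
    for domain, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        for key in (domain, "__all__"):
            errs[key] += e
            words[key] += n
    return {k: errs[k] / max(words[k], 1) for k in errs}

samples = [
    ("clean", "a b c d", "a b c d"),
    ("clean", "e f g h", "e f g h"),
    ("noisy", "a b", "x y"),
]
report = wer_by_slice(samples)
```

In this toy data the aggregate WER is 0.2 while the "noisy" slice is at 1.0, which is the kind of gap an aggregate number hides.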