Voice Detection, Turn Detection and Diarization
with PyAnnote, NVIDIA NeMo and Pipecat Smart Turn
Turn Detection and Diarization Re-emerge as Important
As voice assistants become more popular, techniques for detecting a) when a user stops speaking (so the AI responds in time, without interrupting) and b) when multiple speakers are present have taken on renewed importance.
The two tasks are closely related, so I describe the theory behind both, then survey the common libraries, namely Pyannote and NVIDIA NeMo, as well as Pipecat's new Smart Turn library.
Cheers, Ronan
🛠 Get the Voice Detection Scripts
💡 Consulting (Technical Assistance OR Market Insights)
🤝 Join the Trelis Team / Trelis Developer Colabs
Turn Detection and Diarization
Turn detection and diarization serve distinct but related purposes in speech processing. Turn detection identifies when a speaker has finished speaking, while diarization attributes speech segments to specific speakers.
Turn Detection Fundamentals
Turn detection involves two key components:
Voice Activity Detection (VAD) to classify 30 ms frames as speech vs. non-speech (a minimal sketch follows this list)
Turn completion detection to determine if a speaker has finished
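As a concrete example of the first component, here is a minimal frame-level VAD sketch. It uses the webrtcvad package, which is my choice for illustration (the libraries surveyed below ship their own VAD models):

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most aggressive)
SAMPLE_RATE = 16000     # webrtcvad supports 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30           # frames must be 10, 20 or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

def speech_flags(pcm: bytes):
    """Yield (frame_index, is_speech) for each complete 30 ms frame."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield i // FRAME_BYTES, vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
```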
The core challenge is handling:
Long pauses during thought
Filler words ("um", "and") that don't indicate completion
Tonal indicators of completion vs continuation
Pipecat's Smart Turn project addresses this using:
A ~2.3GB wav2vec-BERT model for audio embeddings
Additional classification layers (<1MB) to predict completion (sketched after this list)
Training on both synthetic and human speech data (4000+ samples)
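A rough sketch of that architecture: a frozen wav2vec-BERT backbone produces embeddings, and a small trainable head predicts whether the turn is complete. This is illustrative only, not Smart Turn's actual code; the checkpoint name, mean pooling, and head sizes are assumptions.

```python
# Requires transformers >= 4.38 for Wav2Vec2BertModel.
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
backbone = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")  # ~2.3GB in fp32
head = nn.Sequential(  # the small trainable part, roughly 1MB of weights
    nn.Linear(backbone.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

def p_turn_complete(audio_16khz: torch.Tensor) -> float:
    """Probability that the speaker has finished their turn."""
    inputs = extractor(audio_16khz.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(**inputs).last_hidden_state  # (1, frames, hidden)
    pooled = hidden.mean(dim=1)                        # mean-pool over time
    return torch.sigmoid(head(pooled)).item()

print(p_turn_complete(torch.randn(16000 * 2)))  # 2 s of dummy audio
```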
Diarization Architecture
The standard diarization pipeline consists of:
Voice activity detection (30ms frames)
Speech segmentation (combining VAD frames)
Speaker embedding extraction
Clustering to identify distinct speakers (a toy sketch follows this list)
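To make the final step concrete, here is a toy sketch of clustering per-segment speaker embeddings. The embeddings are dummy data and the distance threshold is an assumption:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Dummy data: 6 segment embeddings (192-dim) from two distinct speakers.
base_a, base_b = rng.normal(size=192), rng.normal(size=192)
embeddings = np.vstack([base_a + 0.1 * rng.normal(size=(3, 192)),
                        base_b + 0.1 * rng.normal(size=(3, 192))])

# A distance threshold avoids fixing the speaker count in advance
# (scikit-learn >= 1.2; older versions use `affinity` instead of `metric`).
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                                    metric="cosine", linkage="average")
labels = clusterer.fit_predict(embeddings)
print(labels)  # e.g. [0 0 0 1 1 1]: each segment assigned to a speaker
```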
Two main approaches exist:
Pyannote Pipeline:
Uses a dedicated segmentation model instead of a separate VAD step
Handles up to 3 concurrent speakers with overlap detection
Employs a bidirectional LSTM for context
~25MB embedding models
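Minimal usage of the Pyannote pipeline looks roughly like this (the gated "pyannote/speaker-diarization-3.1" checkpoint and token handling are assumptions; check the pyannote docs for your version):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # accept the model terms on Hugging Face first
)
diarization = pipeline("audio.wav")  # optionally pass num_speakers=2 if known

# Each track is a (segment, track_id, speaker_label) triple.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:6.1f}s - {segment.end:6.1f}s  {speaker}")
```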
NVIDIA NeMo Pipeline:
Uses MarbleNet VAD (smaller and faster than alternatives)
Extracts multiscale embeddings over different time windows
TitaNet embeddings (~100MB)
Employs neural refinement for overlap detection via the Multi-scale Diarization Decoder (MSDD)
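A sketch of running NeMo's clustering diarizer; the YAML config and pretrained model names below come from the NeMo examples and should be treated as assumptions:

```python
import json
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# NeMo reads the audio to diarize from a JSON-lines manifest file.
with open("manifest.json", "w") as f:
    json.dump({"audio_filepath": "audio.wav", "offset": 0, "duration": None,
               "label": "infer", "text": "-",
               "num_speakers": None, "rttm_filepath": None}, f)
    f.write("\n")

# The inference YAML ships with the NeMo repo (examples/speaker_tasks).
cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "manifest.json"
cfg.diarizer.out_dir = "diar_out"
cfg.diarizer.vad.model_path = "vad_multilingual_marblenet"    # MarbleNet VAD
cfg.diarizer.speaker_embeddings.model_path = "titanet_large"  # TitaNet, ~100MB

ClusteringDiarizer(cfg=cfg).diarize()  # writes RTTM output to diar_out/
```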
Key Technical Challenges
Performance degrades significantly with:
Overlapping speech
Background noise
Multiple speakers with short utterances
Varying speaker tones/styles
Current limitations:
Turn detection models are large (2.3GB) and slow
Diarization struggles with speaker overlap
Short utterances are difficult to attribute
Speaker count detection is unreliable without prior knowledge
The field remains active with ongoing work to:
Reduce model sizes
Improve overlap detection
Handle more dynamic speaking scenarios
Better integrate with downstream applications
For practical implementations, testing with representative audio samples and potentially combining approaches (e.g., Pyannote segmentation with NeMo speaker embeddings) may yield better results than any single solution; a rough sketch of such a hybrid follows.
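As a starting point, the sketch below uses pyannote for voice activity detection, NeMo's TitaNet for per-segment embeddings, and scikit-learn for clustering. The overall flow, file names, and parameters are assumptions, not a tested recipe:

```python
import tempfile

import numpy as np
import soundfile as sf
from nemo.collections.asr.models import EncDecSpeakerLabelModel
from pyannote.audio import Pipeline
from sklearn.cluster import AgglomerativeClustering

AUDIO = "meeting.wav"  # hypothetical mono 16 kHz input file

# 1. Find speech regions with pyannote's VAD pipeline (needs a HF token).
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection",
                               use_auth_token="YOUR_HF_TOKEN")
regions = vad(AUDIO).get_timeline().support()

# 2. Embed each speech region with NeMo's TitaNet.
titanet = EncDecSpeakerLabelModel.from_pretrained("titanet_large")
audio, sr = sf.read(AUDIO)
segments, embeddings = [], []
for region in regions:
    clip = audio[int(region.start * sr):int(region.end * sr)]
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        sf.write(f.name, clip, sr)
        embeddings.append(titanet.get_embedding(f.name).squeeze().cpu().numpy())
    segments.append(region)

# 3. Cluster the embeddings into speakers (speaker count assumed known here).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(np.array(embeddings))
for seg, label in zip(segments, labels):
    print(f"{seg.start:6.1f}s - {seg.end:6.1f}s  speaker_{label}")
```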
Apologies: I pressed send on this "Turn Detection and Diarization" issue and then realised I have a very slow internet connection, so the YouTube video is not live yet. It should be up by 6 pm Irish time today, Monday 17th March 2025.