Voice Detection, Turn Detection and Diarization
with PyAnnote, NVIDIA NeMo and Pipecat Smart Turn
Turn Detection and Diarization Re-emerge as Important
As voice assistants become more popular, techniques for detecting a) when a user stops speaking (so the AI responds in time, without interrupting) and b) when multiple speakers are present have taken on renewed importance.
The two tasks are closely related, so I describe the theory behind both, then survey the common libraries, namely Pyannote and NVIDIA NeMo, as well as Pipecat's new Smart Turn library.
Cheers, Ronan
🛠 Get the Voice Detection Scripts
💡 Consulting (Technical Assistance OR Market Insights)
🤝 Join the Trelis Team / Trelis Developer Colabs
Turn Detection and Diarization
Turn detection and diarization serve distinct but related purposes in speech processing. Turn detection identifies when a speaker has finished speaking, while diarization attributes speech segments to specific speakers.
Turn Detection Fundamentals
Turn detection involves two key components:
Voice Activity Detection (VAD) to classify 30 ms frames as speech vs. non-speech (a minimal sketch follows this list)
Turn completion detection to determine if a speaker has finished
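As a concrete example of the first component, here is a minimal frame-level VAD sketch. It uses the webrtcvad package, which is my choice for illustration (the libraries surveyed below ship their own VAD models):

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most aggressive)
SAMPLE_RATE = 16000     # webrtcvad supports 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30           # frames must be 10, 20 or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

def speech_flags(pcm: bytes):
    """Yield (frame_index, is_speech) for each complete 30 ms frame."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield i // FRAME_BYTES, vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
```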
The core challenge is handling:
Long pauses during thought
Filler words ("um", "and") that don't indicate completion
Tonal indicators of completion vs continuation
Pipecat's Smart Turn project addresses this using:
A ~2.3GB wav2vec-BERT model for audio embeddings
Additional classification layers (<1MB) to predict completion (sketched after this list)
Training on both synthetic and human speech data (4000+ samples)
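A rough sketch of that architecture: a frozen wav2vec-BERT backbone produces embeddings, and a small trainable head predicts whether the turn is complete. This is illustrative only, not Smart Turn's actual code; the checkpoint name, mean pooling, and head sizes are assumptions.

```python
# Requires transformers >= 4.38 for Wav2Vec2BertModel.
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
backbone = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")  # ~2.3GB in fp32
head = nn.Sequential(  # the small trainable part, roughly 1MB of weights
    nn.Linear(backbone.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

def p_turn_complete(audio_16khz: torch.Tensor) -> float:
    """Probability that the speaker has finished their turn."""
    inputs = extractor(audio_16khz.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(**inputs).last_hidden_state  # (1, frames, hidden)
    pooled = hidden.mean(dim=1)                        # mean-pool over time
    return torch.sigmoid(head(pooled)).item()

print(p_turn_complete(torch.randn(16000 * 2)))  # 2 s of dummy audio
```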
Diarization Architecture
The standard diarization pipeline consists of:
Voice activity detection (30ms frames)
Speech segmentation (combining VAD frames)
Speaker embedding extraction
Clustering to identify distinct speakers (a toy sketch follows this list)
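To make the final step concrete, here is a toy sketch of clustering per-segment speaker embeddings. The embeddings are dummy data and the distance threshold is an assumption:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Dummy data: 6 segment embeddings (192-dim) from two distinct speakers.
base_a, base_b = rng.normal(size=192), rng.normal(size=192)
embeddings = np.vstack([base_a + 0.1 * rng.normal(size=(3, 192)),
                        base_b + 0.1 * rng.normal(size=(3, 192))])

# A distance threshold avoids fixing the speaker count in advance
# (scikit-learn >= 1.2; older versions use `affinity` instead of `metric`).
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                                    metric="cosine", linkage="average")
labels = clusterer.fit_predict(embeddings)
print(labels)  # e.g. [0 0 0 1 1 1]: each segment assigned to a speaker
```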
Two main approaches exist:
Pyannote Pipeline:
Uses a dedicated segmentation model instead of a separate VAD step
Handles up to 3 concurrent speakers with overlap detection
Employs a bidirectional LSTM for context
~25MB embedding models
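Minimal usage of the Pyannote pipeline looks roughly like this (the gated "pyannote/speaker-diarization-3.1" checkpoint and token handling are assumptions; check the pyannote docs for your version):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # accept the model terms on Hugging Face first
)
diarization = pipeline("audio.wav")  # optionally pass num_speakers=2 if known

# Each track is a (segment, track_id, speaker_label) triple.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:6.1f}s - {segment.end:6.1f}s  {speaker}")
```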
NVIDIA NeMo Pipeline:
Uses MarbleNet VAD (smaller and faster than alternatives)
Extracts multiscale embeddings over different time windows
TitaNet embeddings (~100MB)
Employs neural refinement for overlap detection via the Multi-scale Diarization Decoder (MSDD)
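A sketch of running NeMo's clustering diarizer; the YAML config and pretrained model names below come from the NeMo examples and should be treated as assumptions:

```python
import json
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# NeMo reads the audio to diarize from a JSON-lines manifest file.
with open("manifest.json", "w") as f:
    json.dump({"audio_filepath": "audio.wav", "offset": 0, "duration": None,
               "label": "infer", "text": "-",
               "num_speakers": None, "rttm_filepath": None}, f)
    f.write("\n")

# The inference YAML ships with the NeMo repo (examples/speaker_tasks).
cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "manifest.json"
cfg.diarizer.out_dir = "diar_out"
cfg.diarizer.vad.model_path = "vad_multilingual_marblenet"    # MarbleNet VAD
cfg.diarizer.speaker_embeddings.model_path = "titanet_large"  # TitaNet, ~100MB

ClusteringDiarizer(cfg=cfg).diarize()  # writes RTTM output to diar_out/
```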
Key Technical Challenges
Performance degrades significantly with:
Overlapping speech
Background noise
Multiple speakers with short utterances
Varying speaker tones/styles
Current limitations:
Turn detection models are large (2.3GB) and slow
Diarization struggles with speaker overlap
Short utterances are difficult to attribute
Speaker count detection is unreliable without prior knowledge
The field remains active with ongoing work to:
Reduce model sizes
Improve overlap detection
Handle more dynamic speaking scenarios
Better integrate with downstream applications
For practical implementations, testing with representative audio samples and potentially combining approaches (e.g., Pyannote segmentation with NeMo speaker embeddings) may yield better results than any single solution; a rough sketch of such a hybrid follows.
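As a starting point, the sketch below uses pyannote for voice activity detection, NeMo's TitaNet for per-segment embeddings, and scikit-learn for clustering. The overall flow, file names, and parameters are assumptions, not a tested recipe:

```python
import tempfile

import numpy as np
import soundfile as sf
from nemo.collections.asr.models import EncDecSpeakerLabelModel
from pyannote.audio import Pipeline
from sklearn.cluster import AgglomerativeClustering

AUDIO = "meeting.wav"  # hypothetical mono 16 kHz input file

# 1. Find speech regions with pyannote's VAD pipeline (needs a HF token).
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection",
                               use_auth_token="YOUR_HF_TOKEN")
regions = vad(AUDIO).get_timeline().support()

# 2. Embed each speech region with NeMo's TitaNet.
titanet = EncDecSpeakerLabelModel.from_pretrained("titanet_large")
audio, sr = sf.read(AUDIO)
segments, embeddings = [], []
for region in regions:
    clip = audio[int(region.start * sr):int(region.end * sr)]
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        sf.write(f.name, clip, sr)
        embeddings.append(titanet.get_embedding(f.name).squeeze().cpu().numpy())
    segments.append(region)

# 3. Cluster the embeddings into speakers (speaker count assumed known here).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(np.array(embeddings))
for seg, label in zip(segments, labels):
    print(f"{seg.start:6.1f}s - {seg.end:6.1f}s  speaker_{label}")
```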
Apologies: I pressed send on this "Turn Detection and Diarization" issue and then realised I have a very slow internet connection, so the YouTube video is not live yet. It should be up by 6 pm Irish time today, Monday 17th March 2025.