Methodology
How We Evaluate Audio Quality
The test sets and metrics behind every claim on our speech enhancer page. We publish the methodology so the results can be checked, replicated, and updated.
Our Approach
Objective metrics first, listener panels for the close calls
Objective metrics — PESQ, STOI, DNSMOS — let us track regressions in CI on every model change. For close calls, we add a small blind listening panel so the published winner reflects what a human actually hears.
Reproducible
Every benchmark uses fixed seeds, versioned model weights, and a documented audio chain so the same files yield the same results on re-run.
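As a rough illustration of what "fixed seeds, versioned model weights, and a documented audio chain" can look like in practice, the sketch below records all three in one run manifest and fingerprints it. The field names, the `audio_chain` step labels, and the helper names are hypothetical and are not taken from our actual pipeline.

```python
import hashlib
import json
import random


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_run_manifest(seed: int, weights: bytes, audio_chain: list[str]) -> dict:
    """Collect what is needed to repeat a benchmark run (hypothetical fields)."""
    return {
        "seed": seed,
        "weights_sha256": sha256_bytes(weights),  # pins the exact model weights
        "audio_chain": audio_chain,               # ordered processing steps
    }


def manifest_fingerprint(manifest: dict) -> str:
    """Hash the manifest with sorted keys so equal manifests hash identically."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return sha256_bytes(canonical)


def sample_test_files(files: list[str], k: int, seed: int) -> list[str]:
    """Pick k files with a local seeded RNG; sorting removes input-order effects."""
    rng = random.Random(seed)
    return rng.sample(sorted(files), k)
```

Two runs with the same seed, weights, and audio chain produce the same fingerprint, so a changed fingerprint signals that something in the setup moved before any scores are compared.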
Dated, not eternal
Every claim on the marketing pages is dated to its last verification run. Models change; our snapshots age and get refreshed.
Tied to deployed model
Benchmarks track the exact model weights shipped to production — not a research checkpoint. If the deployed model changes, the page changes.
Metrics
The objective metrics we track
No single metric captures speech quality. We watch several in parallel and flag any release that regresses on the core quality scores.
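A minimal sketch of such a regression flag, assuming every tracked score is higher-is-better (true for PESQ, STOI, and DNSMOS). The metric keys and tolerance values are illustrative assumptions, not our published thresholds.

```python
# Hypothetical tolerances: how far each score may drop before it counts as a regression.
TOLERANCES = {"pesq": 0.05, "stoi": 0.01, "dnsmos_ovrl": 0.05}


def find_regressions(baseline: dict, candidate: dict, tolerances: dict = TOLERANCES) -> dict:
    """Return {metric: drop} for each metric where the candidate fell by more than its tolerance."""
    regressions = {}
    for metric, tol in tolerances.items():
        drop = baseline[metric] - candidate[metric]  # positive means the score got worse
        if drop > tol:
            regressions[metric] = round(drop, 4)
    return regressions


# Made-up scores, for illustration only.
baseline = {"pesq": 3.10, "stoi": 0.93, "dnsmos_ovrl": 3.40}
candidate = {"pesq": 3.12, "stoi": 0.90, "dnsmos_ovrl": 3.38}
flagged = find_regressions(baseline, candidate)  # only STOI exceeds its tolerance here
```

Checking each metric separately means a gain on one score cannot hide a loss on another, which is the point of watching several in parallel.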
Perceptual Evaluation of Speech Quality
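ITU-T Recommendation P.862 for objective speech quality, computed by comparing the processed signal against a clean reference. Raw scores run from -0.5 to 4.5; the commonly reported MOS-LQO mapping spans roughly 1.0–4.5. Correlates strongly with human MOS for telephony-band material.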
Short-Time Objective Intelligibility
Predicts how intelligible speech remains after processing. Scored 0–1. Useful for catching cases where noise removal harms word clarity.
Microsoft DNSMOS (P.835)
Non-intrusive neural MOS estimator trained on subjective ratings, so it needs no clean reference. Reports separate scores for speech signal quality (SIG), background noise quality (BAK), and overall quality (OVRL).
Test Set
A balanced set of real-world recording conditions
The benchmark set is balanced across languages and recording environments so a single condition cannot dominate aggregate scores. We refresh the set regularly.
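One way to keep a single condition from dominating an aggregate, even if file counts drift between refreshes, is to average within each condition first and then across conditions. The sketch below shows that macro-average; it is an illustration under that assumption, and the condition labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def macro_average(scores: list[tuple[str, float]]) -> float:
    """Average the per-condition means so every condition carries equal weight."""
    by_condition = defaultdict(list)
    for condition, score in scores:
        by_condition[condition].append(score)
    return mean(mean(vals) for vals in by_condition.values())


# Made-up scores: eight studio files and two outdoor files.
scores = [("studio", 4.0)] * 8 + [("outdoor", 2.0)] * 2
per_file_mean = mean(score for _, score in scores)  # pulled toward the larger group
per_condition_mean = macro_average(scores)          # both conditions weigh the same
```

With these made-up numbers the per-file mean is 3.6 while the macro-average is 3.0, which shows how an over-represented easy condition can inflate a headline score.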
🎙️
Studio voice (clean)
Professionally recorded narration across several languages. Used as a clean baseline to detect over-processing.
🏠
Home studio + light room noise
Podcast-style recordings with HVAC hum, computer fan, fridge noise, and untreated room reflections.
🌆
Outdoor + traffic
Street, café, and car-passing recordings. Tests how the model handles non-stationary noise without harming voice.
Model & Training
A denoising vocoder trained for speech
The deployed enhancer is a specialist model, not a general-purpose audio cleaner. The training mix is biased toward voice; the inference path is biased toward preserving the speaker's natural tone.
Training data
Trained on more than 180 hours of data we collected ourselves: paired clean/noisy speech across many language families, mixed at a wide range of signal-to-noise ratios. Our team captured the noise material across studios, homes, and outdoor environments. No speaker identity is retained.
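Mixing clean speech with noise at a chosen signal-to-noise ratio comes down to scaling the noise before adding it. The sketch below uses the standard RMS-based definition, SNR in dB = 20·log10(RMS of clean / RMS of noise). It is a generic illustration, not our training code, and the function names are hypothetical.

```python
import math


def rms(signal: list[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(clean: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale noise so the clean-to-noise RMS ratio equals snr_db, then add it to clean."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    target_noise_rms = rms(clean) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean: list[float], mixed: list[float]) -> float:
    """Recover the SNR of a mixture by treating (mixed - clean) as the noise."""
    residual = [m - c for m, c in zip(mixed, clean)]
    return 20 * math.log10(rms(clean) / rms(residual))


# Synthetic tones stand in for speech and noise.
clean = [math.sin(2 * math.pi * 5 * i / 100) for i in range(100)]
noise = [math.cos(2 * math.pi * 13 * i / 100) for i in range(100)]
mixed = mix_at_snr(clean, noise, snr_db=10.0)
```

Sweeping `snr_db` over a range of values for each clean/noise pair is one straightforward way to produce paired examples at many noise levels.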
Honesty Section
What this benchmark protocol does not cover
Some failure modes are real and we do not paper over them. If your file falls into one of these buckets, the results on the marketing page do not apply.
Heavily clipped peaks. We can mask the perceived harshness, not reconstruct the lost signal.
Speech recorded at a significant distance from the mic in a reverberant room. The reverb tail exceeds what current speech enhancement can cleanly invert.
Calls band-limited to narrow-band sample rates, typically 8 kHz. Everything above roughly 4 kHz is gone from the source.
Music, sound effects, and non-speech audio. The model is tuned for voice; use the general audio enhancer for those.
Mixed-speaker overlap where two voices arrive at similar levels. We do not perform speaker separation in this pipeline.
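Two of the buckets above, heavily clipped peaks and narrow-band sources, can be roughly screened before upload. The sketch below is a hypothetical pre-check, not part of our pipeline. It assumes float samples in the range -1 to 1, and the 1% clipping limit and 16 kHz minimum rate are illustrative values.

```python
def preflight_warnings(
    samples: list[float],
    sample_rate_hz: int,
    clip_level: float = 0.999,   # samples at or beyond this magnitude count as clipped
    max_clipped: float = 0.01,   # hypothetical limit: more than 1% clipped is "heavy"
    min_rate_hz: int = 16000,    # hypothetical floor below which we call it narrow-band
) -> list[str]:
    """Return warning labels for files likely to fall outside the benchmark's scope."""
    warnings = []
    if samples:
        clipped = sum(1 for s in samples if abs(s) >= clip_level) / len(samples)
        if clipped > max_clipped:
            warnings.append("heavy_clipping")
    # Note: a narrow-band call that was later upsampled passes this check,
    # because only the container sample rate is inspected.
    if sample_rate_hz < min_rate_hz:
        warnings.append("narrow_band")
    return warnings
```

A file that returns no warnings can still fall into the reverb, non-speech, or overlapping-speaker buckets, which need more than a sample-level check to detect.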
Hear what the methodology produces
Numbers only matter if the audio sounds better. Drop a file on the speech enhancer and decide for yourself before any commitment.
Try the speech enhancer →