How We Evaluate Audio Quality

Methodology

The test sets and metrics behind every claim on our speech enhancer page. We publish the methodology so the results can be checked, replicated, and updated.

Benchmark refreshed regularly

Diverse real-world test conditions

Our Approach

Objective metrics first, listener panels for the close calls

Objective metrics — PESQ, STOI, DNSMOS — let us track regressions in CI on every model change. For close calls, we add a small blind listening panel so the published winner reflects what a human actually hears.
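As a rough illustration of what counts as a close call, the sketch below treats two systems as tied when no metric separates them by more than a small margin, and routes those cases to the listening panel. The margins, function names, and example scores are hypothetical and are not our production values.

```python
# Hypothetical margins: a gap at or below this value is treated as a tie.
CLOSE_CALL_MARGINS = {"pesq": 0.10, "stoi": 0.01, "dnsmos_ovrl": 0.05}


def is_close_call(scores_a, scores_b, margins=CLOSE_CALL_MARGINS):
    """Return True when no metric separates the systems by more than its margin."""
    return all(
        abs(scores_a[metric] - scores_b[metric]) <= margin
        for metric, margin in margins.items()
    )


def pick_winner(scores_a, scores_b, margins=CLOSE_CALL_MARGINS):
    """Return 'listening_panel' for close calls, else the system winning more metrics."""
    if is_close_call(scores_a, scores_b, margins):
        return "listening_panel"
    wins_a = sum(scores_a[m] > scores_b[m] for m in margins)
    wins_b = sum(scores_b[m] > scores_a[m] for m in margins)
    if wins_a == wins_b:
        # Metrics disagree with each other, so let human listeners decide.
        return "listening_panel"
    return "a" if wins_a > wins_b else "b"


# Illustrative scores only.
system_a = {"pesq": 3.10, "stoi": 0.950, "dnsmos_ovrl": 3.40}
system_b = {"pesq": 3.05, "stoi": 0.948, "dnsmos_ovrl": 3.42}
decision = pick_winner(system_a, system_b)
```

With these example scores every gap sits inside its margin, so the decision goes to the panel rather than to a metric.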

Reproducible

Every benchmark uses fixed seeds, versioned model weights, and a documented audio chain so the same files yield the same results on re-run.
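One way to pin down a run is a small manifest that records the seed, a hash of the model weights, and the audio chain settings. The sketch below is an assumption about how such a record could look; the field names and the example audio chain are hypothetical.

```python
import hashlib
import json
import random


def sha256_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_run_manifest(seed, weights_bytes, audio_chain):
    """Collect what is needed to re-run a benchmark into one hashable record."""
    manifest = {
        "seed": seed,
        "weights_sha256": sha256_bytes(weights_bytes),
        "audio_chain": audio_chain,  # hypothetical keys, e.g. resampler and target rate
    }
    # Sorted keys give a canonical form, so equal inputs always hash the same.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["manifest_sha256"] = sha256_bytes(canonical)
    return manifest


def seeded_file_order(file_ids, seed):
    """Shuffle file ids with a private RNG so the order depends only on the seed."""
    rng = random.Random(seed)
    order = list(file_ids)
    rng.shuffle(order)
    return order


# Placeholder bytes stand in for a real weights file.
chain = {"resampler": "example-resampler", "target_rate_hz": 48000}
manifest = build_run_manifest(1234, b"placeholder-weights", chain)
```

Two runs that produce the same `manifest_sha256` used the same seed, weights, and audio chain, which is the precondition for comparing their scores.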

Dated, not eternal

Every claim on the marketing pages is dated to its last verification run. Models change; our snapshots age and get refreshed.

Tied to deployed model

Benchmarks track the exact model weights shipped to production — not a research checkpoint. If the deployed model changes, the page changes.
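The dating and model-tracking rules can be combined into one check per published claim. The sketch below is illustrative: the `Claim` record, the status labels, and the 90-day age limit are assumptions, not a description of our internal tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    text: str
    verified_on: date      # date of the last verification run
    weights_sha256: str    # hash of the weights the claim was verified against


def claim_status(claim, deployed_weights_sha256, today, max_age_days=90):
    """Classify a claim as current, stale_model, or stale_date.

    max_age_days is a hypothetical refresh window.
    """
    if claim.weights_sha256 != deployed_weights_sha256:
        return "stale_model"  # production now runs different weights
    if (today - claim.verified_on).days > max_age_days:
        return "stale_date"   # same weights, but the snapshot has aged out
    return "current"


# Illustrative record only.
example = Claim("Example claim text", date(2024, 1, 10), "abc123")
status = claim_status(example, "abc123", today=date(2024, 2, 1))
```

A model mismatch is checked first because a claim verified against other weights is out of date regardless of how recent the run was.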

Metrics

The objective metrics we track

No single metric captures speech quality. We watch several in parallel and flag any release that regresses on the core quality scores.
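A regression flag of this kind can be sketched as a comparison between the last released scores and a candidate's scores, with a small tolerance per metric to absorb run-to-run noise. The metric keys, tolerances, and scores below are hypothetical.

```python
CORE_METRICS = ("pesq", "stoi", "dnsmos_ovrl")

# Hypothetical tolerances: drops at or below these values are ignored.
TOLERANCE = {"pesq": 0.05, "stoi": 0.005, "dnsmos_ovrl": 0.03}


def find_regressions(baseline, candidate, tolerance=TOLERANCE):
    """Return {metric: drop} for every core metric that fell by more than its tolerance."""
    regressions = {}
    for metric in CORE_METRICS:
        drop = baseline[metric] - candidate[metric]
        if drop > tolerance[metric]:
            regressions[metric] = round(drop, 4)
    return regressions


# Illustrative scores only.
released = {"pesq": 3.20, "stoi": 0.940, "dnsmos_ovrl": 3.50}
candidate = {"pesq": 3.00, "stoi": 0.941, "dnsmos_ovrl": 3.49}
flagged = find_regressions(released, candidate)
```

In a CI job, a non-empty result would fail the build or block the release until someone reviews the flagged metrics.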

PESQ

Perceptual Evaluation of Speech Quality

ITU-T P.862 standard for objective speech quality. Scores are usually reported on the mapped MOS-LQO scale, roughly 1.0 to 4.5; raw P.862 output runs from -0.5 to 4.5. Correlates strongly with human MOS for telephony-band material.
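For reference, the roughly 1.0 to 4.5 range comes from the logistic mapping in ITU-T P.862.1, which converts the raw P.862 score into MOS-LQO. The sketch below reproduces that mapping with the coefficients as we understand them from the recommendation; check them against the published text before relying on them. The function name is ours.

```python
import math


def pesq_raw_to_mos_lqo(raw_score: float) -> float:
    """Map a raw P.862 score (about -0.5 to 4.5) to MOS-LQO.

    Logistic fit as given in ITU-T P.862.1 (coefficients quoted from memory of
    the recommendation; verify before use).
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


lowest = pesq_raw_to_mos_lqo(-0.5)   # bottom of the raw range
highest = pesq_raw_to_mos_lqo(4.5)   # top of the raw range
```

Because the curve flattens at both ends, equal steps in raw score do not correspond to equal steps in MOS-LQO near the extremes.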

STOI

Short-Time Objective Intelligibility

Predicts how intelligible speech remains after processing. Scored 0–1. Useful for catching cases where noise removal harms word clarity.

DNSMOS P.835

Microsoft DNSMOS (P.835)

Neural MOS estimator trained on subjective ratings. Reports separate scores for speech signal quality (SIG), background noise intrusiveness (BAK), and overall quality (OVRL).

Test Set

A balanced set of real-world recording conditions

The benchmark set is balanced across languages and recording environments so a single condition cannot dominate aggregate scores. We refresh the set regularly.
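The effect of balancing can be seen by averaging per condition before averaging across conditions. The sketch below is illustrative: the function name and the scores are invented, and it shows the general technique rather than our exact aggregation.

```python
from collections import defaultdict
from statistics import mean


def macro_average(results):
    """Average scores per condition, then average those per-condition means.

    results is an iterable of (condition, score) pairs.
    """
    by_condition = defaultdict(list)
    for condition, score in results:
        by_condition[condition].append(score)
    per_condition = {c: mean(scores) for c, scores in by_condition.items()}
    return mean(list(per_condition.values())), per_condition


# Invented scores: an unbalanced set with eight studio files and two outdoor files.
unbalanced = [("studio", 4.0)] * 8 + [("outdoor", 2.0)] * 2
pooled = mean(score for _, score in unbalanced)
balanced_score, per_condition = macro_average(unbalanced)
```

Pooling every file gives 3.6 because the studio files dominate; averaging per condition first gives 3.0, which weights each environment equally.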

🎙️

Studio voice (clean)

Professionally recorded narration across several languages. Used as a clean baseline to detect over-processing.

🏠

Home studio + light room noise

Podcast-style recordings with HVAC hum, computer fan, fridge noise, and untreated room reflections.

🌆

Outdoor + traffic

Street, café, and car-passing recordings. Tests how the model handles non-stationary noise without harming voice.

Model & Training

A custom Mel-band neural network trained for speech

The deployed enhancer is a specialist model — a custom Mel-band neural network with a LoRA adaptation layer, not a general-purpose audio cleaner. The training mix is biased toward voice; the inference path is biased toward preserving the speaker's natural tone. It refines speech with a learned per-band mask rather than re-synthesizing it.
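To make the difference between masking and re-synthesis concrete, the sketch below applies a per-band gain to band magnitudes. It is a generic illustration, not the deployed model: in a real system the mask would be predicted by the network, and clamping it to the 0 to 1 range is an assumption made here for clarity.

```python
def apply_band_mask(band_magnitudes, mask):
    """Scale each band magnitude by a mask value clamped to [0, 1].

    Both arguments are lists of frames; each frame is a list of per-band floats.
    The output is always an attenuated copy of the input, never new content.
    """
    enhanced = []
    for frame_mags, frame_mask in zip(band_magnitudes, mask):
        if len(frame_mags) != len(frame_mask):
            raise ValueError("mask and magnitudes must have the same number of bands")
        enhanced.append(
            [m * min(1.0, max(0.0, g)) for m, g in zip(frame_mags, frame_mask)]
        )
    return enhanced


# One frame with three bands: keep the first, halve the second, mute the third.
frames = [[1.0, 2.0, 4.0]]
gains = [[1.0, 0.5, 0.0]]
cleaned = apply_band_mask(frames, gains)
```

Since every output value is the input scaled down, a mask can suppress noise in a band but cannot invent speech that was never captured.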

Training data

Trained on 170 hours of data we collected ourselves: paired clean and noisy speech across many language families, mixed at a wide range of signal-to-noise ratios. The noise material was captured by our team across studios, homes, and outdoor environments. No speaker identity is retained.
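Mixing at a chosen signal-to-noise ratio is usually done by scaling the noise so that the ratio of clean power to noise power hits the target. The sketch below shows that standard calculation on plain Python lists; the function names and sample values are illustrative and do not describe our data pipeline.

```python
import math


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, target_snr_db):
    """Scale noise to the target SNR and add it to the clean signal.

    Returns (noisy, scaled_noise). Both inputs must have the same length.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean, p_noise = power(clean), power(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("clean and noise must both be non-silent")
    # SNR_dB = 10 * log10(p_clean / p_noise), solved for the required noise power.
    target_noise_power = p_clean / (10 ** (target_snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    scaled_noise = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


def measure_snr_db(clean, noise):
    """SNR in dB between a clean signal and the noise added to it."""
    return 10 * math.log10(power(clean) / power(noise))


# Toy signals standing in for real recordings.
clean = [1.0, -1.0] * 50
noise = [0.3, 0.1, -0.2, 0.05] * 25
noisy, scaled_noise = mix_at_snr(clean, noise, target_snr_db=10.0)
achieved = measure_snr_db(clean, scaled_noise)
```

Keeping the clean signal untouched and scaling only the noise means each noisy file still has an exact clean reference for PESQ and STOI.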

Honesty Section

What this benchmark protocol does not cover

Some failure modes are real and we do not paper over them. If your file falls into one of these buckets, the results on the marketing page do not apply.

Heavily clipped peaks. We can mask the perceived harshness, not reconstruct the lost signal.

Speech recorded at a significant distance from the mic in a reverberant room. The reverb tail exceeds what current speech enhancement can cleanly invert.

Calls downsampled to narrowband rates (typically 8 kHz). The high-frequency information is gone from the source.

Music, sound effects, and non-speech audio. The model is tuned for voice; use the general audio enhancer for those.

Mixed-speaker overlap where two voices arrive at similar levels. We do not perform speaker separation in this pipeline.
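Two of these buckets can be screened for automatically before a file is processed. The sketch below is a hypothetical pre-flight check for clipping and narrowband sources; the thresholds and names are assumptions, and it does not detect reverb, music, or overlapping speakers.

```python
def clipped_fraction(samples, full_scale=1.0, tolerance=1e-4):
    """Fraction of samples at or beyond full scale (float audio in [-1, 1])."""
    if not samples:
        return 0.0
    limit = full_scale - tolerance
    return sum(abs(x) >= limit for x in samples) / len(samples)


def preflight_warnings(samples, sample_rate_hz, max_clipped=0.001, min_rate_hz=16000):
    """Return warning labels for inputs the benchmark results do not cover.

    max_clipped and min_rate_hz are hypothetical thresholds.
    """
    warnings = []
    if clipped_fraction(samples) > max_clipped:
        warnings.append("heavy_clipping")
    if sample_rate_hz < min_rate_hz:
        warnings.append("narrowband_source")
    return warnings


# Toy input: one in ten samples sits at full scale, recorded at 8 kHz.
toy_samples = [1.0] + [0.2] * 9
toy_warnings = preflight_warnings(toy_samples, sample_rate_hz=8000)
```

A file that was recorded narrowband and later upsampled would pass the sample-rate check, so this screen can only catch the obvious cases.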

Hear what the methodology produces

Numbers only matter if the audio sounds better. Drop a file on the speech enhancer and decide for yourself before any commitment.

Try the speech enhancer →