
How Our Noise Reduction AI Works: Inside the Mel-Band + LoRA Model

By the Inverse.AI ML Team · Updated June 2026 · ~7 min read

Quick answer

Noise Reducer doesn't run an off-the-shelf noise gate. It uses a custom Mel-band neural network with a LoRA adaptation layer — a band-split transformer that splits the sound into 60 frequency bands and uses attention across time and frequency to decide what is voice and what is noise. It then applies a learned mask to each band, refining the voice that is already there instead of re-synthesizing it. That is why it strips wind, traffic and hum while the voice still sounds like you.

The problem with traditional noise reduction

Most classic noise reduction works like a bouncer with one rule: anything quieter than a set threshold gets cut. A spectral gate or noise-subtraction filter samples the "noise profile," then pulls that energy out of the whole signal. It removes the hiss — but it also shaves the quiet edges of speech: the breath before a word, the tail of a consonant, the air in a vowel. The result is the familiar underwater, robotic artifact. You traded the noise for damage to the voice.
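To make that failure mode concrete, here is a toy spectral gate in Python. It is an illustrative sketch and not code from any particular product: the function name, the `margin` value and the magnitudes are assumptions chosen for the example.

```python
def spectral_gate(frames, noise_profile, margin=1.5):
    """Zero every bin whose magnitude is below margin * the noise level.

    frames: list of frames, each a list of per-bin magnitudes.
    noise_profile: per-bin noise magnitude, estimated from a noise-only passage.
    """
    gated = []
    for frame in frames:
        gated.append([
            mag if mag >= margin * noise else 0.0
            for mag, noise in zip(frame, noise_profile)
        ])
    return gated


# Toy input: 3 frequency bins with a noise floor of 0.2 in each.
noise_profile = [0.2, 0.2, 0.2]
frames = [
    [0.9, 0.8, 0.2],    # loud vowel: the strong bins survive
    [0.2, 0.25, 0.28],  # quiet consonant tail, barely above the noise
    [0.2, 0.2, 0.2],    # pure noise
]
cleaned = spectral_gate(frames, noise_profile)
```

The pure-noise frame is silenced, but so is the whole consonant tail, because the single threshold cannot tell a quiet piece of speech from the hiss around it.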

The deeper issue is that noise and voice are not neatly separable by volume. A hard "s" sound and tape hiss can sit in the same loudness range and the same high frequencies. A blunt filter can't tell them apart, so it guesses with a single rule applied everywhere. To keep speech intact, you need a model that looks at the audio the way a person does — frequency by frequency — and decides per region what to keep.

What "Mel-band" means, and why we use it

Human hearing isn't linear. We're far more sensitive to differences down where voices live than up in the high treble, so a model that spaces its attention evenly across raw frequency wastes effort where our ears barely notice. The Mel scale fixes that: it spaces frequency bands the way the ear actually perceives pitch, packing more resolution into the speech range.

Our model operates on 60 mel-spaced frequency bands across full-band, 44.1 kHz stereo audio. Crucially, it does not handle those bands in isolation. It uses cross-band attention, so the bands inform one another — a strong voice signature in the mid-range helps the model judge what's happening in a noisier neighbouring band. Treating the bands as a connected system, rather than 60 separate filters, is what lets it follow a voice through messy audio.
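As a rough illustration of what mel spacing does to band widths, the sketch below places 60 band edges evenly on the mel scale between 0 Hz and the Nyquist frequency of 44.1 kHz audio. The conversion formula is one widely used definition of the mel scale; the exact band layout of the production model is not described in this post, so treat `mel_band_edges` and its non-overlapping bands as assumptions.

```python
import math


def hz_to_mel(hz):
    # One common mel formula; other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands=60, sample_rate=44100):
    """Return n_bands + 1 edge frequencies in Hz, evenly spaced in mel."""
    top = hz_to_mel(sample_rate / 2)  # Nyquist frequency on the mel scale
    return [mel_to_hz(top * i / n_bands) for i in range(n_bands + 1)]


edges = mel_band_edges()
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print(f"narrowest band: {widths[0]:.1f} Hz, widest band: {widths[-1]:.1f} Hz")
```

Every band is wider in Hz than the one below it, so the low and mid frequencies where speech carries most of its energy get many narrow bands while the top of the spectrum is covered by a few wide ones.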

The architecture, briefly

Under the hood it's a band-split, dual-path transformer with rotary position embeddings — a RoFormer-style network. "Dual-path" means it alternates its attention along two axes: across time (how the sound evolves moment to moment) and across frequency bands (how energy is distributed at each instant). By switching between those two views, the model builds a rich picture of which time-frequency regions belong to the voice and which belong to the room.
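The alternating pattern can be sketched in plain Python. This is a deliberately stripped-down illustration, assuming a single attention head with no learned projections and no rotary embeddings; `self_attention` and `dual_path_block` are hypothetical names, not functions from the real network.

```python
import math


def self_attention(seq):
    """Single-head dot-product self-attention over a list of vectors."""
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in seq]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * vec[d] for w, vec in zip(weights, seq)) / total
                    for d in range(dim)])
    return out


def dual_path_block(x):
    """x[t][b] is the feature vector for time frame t and band b."""
    n_time, n_bands = len(x), len(x[0])
    # Path 1: attend across time, separately for each band.
    time_out = [[None] * n_bands for _ in range(n_time)]
    for b in range(n_bands):
        attended = self_attention([x[t][b] for t in range(n_time)])
        for t in range(n_time):
            time_out[t][b] = [u + v for u, v in zip(x[t][b], attended[t])]
    # Path 2: attend across bands, separately for each time frame.
    out = []
    for t in range(n_time):
        attended = self_attention(time_out[t])
        out.append([[u + v for u, v in zip(vec, att)]
                    for vec, att in zip(time_out[t], attended)])
    return out


# 4 time frames, 3 bands, 2 features per cell.
x = [[[0.1 * (t + 1), 0.2 * (b + 1)] for b in range(3)] for t in range(4)]
y = dual_path_block(x)
```

After one block, every cell has seen every other frame in its own band and every other band in its own frame. Stacking blocks lets information travel across the whole time-frequency grid.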

This family of architecture is public research — it was introduced for separating vocals from music (see the citation below). The architecture itself isn't our secret; what we trained it to do is. That's the next section.

What the LoRA layer actually does — and why it's the differentiator

Here's the part that is ours. The base network was originally trained for a different job: pulling vocals out of music, that is, separating a singer from the instrumental track. That's a strong foundation, because a model that can isolate a voice from a dense mix already understands what "voice" looks like. But music separation is not noise removal.

So we used LoRA (Low-Rank Adaptation) to fine-tune it for the real world. Instead of updating every weight in the network, LoRA freezes the original weights and adds small, trainable low-rank matrices alongside them, which steer the existing model toward a new task quickly and efficiently. We trained those matrices on clean speech mixed with additive real-world noise, teaching the model to pull a clean voice out of noisy recordings rather than out of music. That adaptation, a music-trained separator retargeted onto noisy speech, is the core of what makes this our noise reducer instead of a generic, off-the-shelf separator.
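The general LoRA mechanism can be sketched in a few lines. This follows the standard formulation (a frozen weight matrix plus a scaled product of two small matrices, with one of them starting at zero); the class name, rank, `alpha` and layer size are assumptions for illustration and say nothing about the settings used in production.

```python
import random


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B (A x)."""

    def __init__(self, frozen_weight, rank, alpha=1.0, seed=0):
        out_dim, in_dim = len(frozen_weight), len(frozen_weight[0])
        rng = random.Random(seed)
        self.weight = frozen_weight  # never updated during adaptation
        self.scale = alpha / rank
        # A starts as small random values, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# An 8x8 frozen layer adapted with rank 2.
W = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(W, rank=2)
full = len(W) * len(W[0])
print(f"trainable: {layer.trainable_parameters()} of {full} weights")
```

Only A and B receive gradient updates, so the number of trained parameters grows with the rank rather than with the full size of each weight matrix. The gap is small in this 8x8 toy and becomes large for real transformer layers.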

Why the voice stays natural

This is the question people care about most, and the answer is specific. The model does not gate or subtract noise, and it does not reconstruct speech from scratch. For each of the 60 frequency bands it predicts a learned mask that reshapes both loudness and phase, trained against a multi-resolution spectral loss that measures the result at several time-frequency scales at once.
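For readers who want to see what a multi-resolution spectral loss looks like, here is a small stdlib-only sketch. The post does not specify the window sizes, the distance measure or any weighting, so the three resolutions and the mean absolute magnitude difference below are assumptions, and the naive DFT is only suitable for toy signals.

```python
import cmath
import math
import random


def stft_magnitudes(signal, window, hop):
    """Magnitude spectrogram via a Hann window and a naive DFT."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / window) for n in range(window)]
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = [signal[start + n] * hann[n] for n in range(window)]
        frames.append([
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / window)
                    for n in range(window)))
            for k in range(window // 2 + 1)
        ])
    return frames


def multi_resolution_loss(estimate, target,
                          resolutions=((16, 4), (32, 8), (64, 16))):
    """Average the spectral error over several (window, hop) settings."""
    total = 0.0
    for window, hop in resolutions:
        est = stft_magnitudes(estimate, window, hop)
        ref = stft_magnitudes(target, window, hop)
        diffs = [abs(e - r)
                 for est_frame, ref_frame in zip(est, ref)
                 for e, r in zip(est_frame, ref_frame)]
        total += sum(diffs) / len(diffs)
    return total / len(resolutions)


rng = random.Random(0)
target = [math.sin(2 * math.pi * 8 * n / 256) for n in range(256)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(256)]
light = [t + 0.05 * n for t, n in zip(target, noise)]
heavy = [t + 0.5 * n for t, n in zip(target, noise)]
print(multi_resolution_loss(light, target), multi_resolution_loss(heavy, target))
```

Short windows are sensitive to timing errors and long windows to pitch and tonal errors, so scoring the output at several sizes at once penalises both kinds of damage.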

The right way to picture it: the model refines the voice that's already in the recording, dialing noise down band by band, rather than re-synthesizing a new voice over the top. Nothing about your timbre is invented or replaced — which is exactly why the cleaned audio sounds like a quiet room, not a robot.
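The difference between masking and re-synthesis is easy to see in code. The sketch below applies one complex gain per band to a single spectrogram frame; that is a simplification (a real model may predict a finer-grained mask inside each band), and `apply_band_masks`, the band ranges and the mask values are all hypothetical.

```python
def apply_band_masks(spectrum, band_slices, masks):
    """Multiply each band of a complex spectrum by that band's complex mask.

    spectrum: complex STFT bins for one frame.
    band_slices: (start, stop) bin ranges, one per band.
    masks: one complex number per band; its magnitude rescales loudness
           and its angle rotates phase.
    """
    out = list(spectrum)
    for (start, stop), mask in zip(band_slices, masks):
        for k in range(start, stop):
            out[k] = spectrum[k] * mask
    return out


frame = [1 + 1j, 2 + 0j, 0 + 3j, 1 - 1j]
bands = [(0, 2), (2, 4)]
masks = [1 + 0j,   # band judged to be voice: left untouched
         0.5j]     # band judged noisy: halved and phase-rotated by 90 degrees
cleaned_frame = apply_band_masks(frame, bands, masks)
```

Because the output is always the recorded spectrum multiplied by something, a masked model can only turn existing content up, down or around. A bin that was silent in the recording stays silent, which is the sense in which nothing is invented.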

What it handles well — and where it honestly struggles

It's strong on the everyday wreckers of audio: wind, traffic, AC hum, fan noise, hiss, room tone, keyboard clatter and background chatter. Most of those are steady or broadband sounds that sit clearly apart from the main voice, and the model clears them while keeping that voice intact.

Because it descends from a vocal-separation model, its hardest cases are the ones that overlap with the very thing it's trained to protect: background music or singing, overlapping speakers, and severely clipped audio.

We'd rather tell you that up front. Clean capture always beats post-processing, and no model — ours included — can restore detail that was never recorded.

What it's trained on

The noise-removal adaptation was LoRA fine-tuned on 170 hours of custom data — clean speech paired with real-world noise so the model learns the difference under realistic conditions. It runs on full-band 44.1 kHz stereo, so it isn't quietly downsampling your audio to cut corners.
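Pairing clean speech with noise is usually done by scaling the noise to hit a chosen signal-to-noise ratio before adding it. The sketch below shows that standard recipe; the SNR value, the synthetic signals and the `mix_at_snr` helper are assumptions for illustration, not a description of the actual training pipeline.

```python
import math
import random


def rms(signal):
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech so the mix has the requested SNR in dB.

    Both signals must be the same length. The training pair is then
    (returned noisy mix as input, original clean signal as target).
    """
    gain = rms(clean) / (rms(noise) * 10 ** (snr_db / 20))
    return [c + gain * n for c, n in zip(clean, noise)]


rng = random.Random(1)
clean = [math.sin(2 * math.pi * 5 * n / 400) for n in range(400)]  # stand-in for speech
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]               # stand-in for wind or hiss
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```

Drawing a different noise clip and a different SNR for each example lets the same clean recording produce many distinct training pairs.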

60 Mel frequency bands · 170 hrs noise-adaptation training data · 44.1 kHz full-band stereo

Try it on your own audio

The same model runs in the browser and powers our mobile apps, which have cleaned audio for more than 5,000,000 users and hold a 4.6-star rating. Upload a clip and hear the difference on your own recording.

Remove background noise free →

Frequently asked questions

What algorithm does Noise Reducer use?

A custom Mel-band neural network with a LoRA adaptation layer — a band-split transformer that splits sound into 60 frequency bands and uses attention across time and frequency to separate voice from noise. It is not DeepFilterNet or a generic off-the-shelf noise gate.

Why does it sound more natural than a typical noise remover?

Instead of gating or subtracting whole frequency ranges, it learns a per-band mask that reshapes loudness and phase. It refines the voice that's already there rather than re-synthesizing it, so consonants and natural timbre survive.

Does it work on languages other than English?

Yes. The model works on the acoustic spectrogram, not on words or phonemes, so it removes noise around any voice regardless of the language being spoken.

What does the AI struggle with?

Background music or singing, overlapping speakers, and severely clipped audio — cases where the clean voice overlaps with, or has been destroyed by, the thing you're trying to remove.

Reference. The base architecture builds on the Mel-Band RoFormer design: Ju-Chiang Wang, Wei-Tsung Lu, Minz Won, Mel-Band RoFormer for Music Source Separation, arXiv:2310.01809 (October 2023). arxiv.org/abs/2310.01809
