Offline
Harmonic / percussive
two layersAn envelope detector traces the loudness contour of a waveform — the slow outline riding over the fast carrier inside it. Every graph on this page is drawn by the method's real algorithm, and the sliders at the top drive all of them at once.
The whole method, live
Score card
- Causality
- offline
- Signal model
- polyphonic
- Reads
- per-band
- Latency
- none (offline)
- Cost
- STFT + median
- Domain
- frequency
Scored qualitatively.
This method outputs a normalized contour (onset strength, per-band or perceptual loudness), not an amplitude in the units of the true envelope — so an amplitude error number would be meaningless. Its strength is the spectral axis: read the gallery below.
How it works
Music has two envelopes: rhythm and sustain. Median-filter the spectrogram along time to keep steady, horizontal (harmonic) energy, and along frequency to keep broadband, vertical (percussive) energy. The two soft masks split the signal into a sustained-harmonic layer and a transient layer, each with its own envelope: the percussive curve spikes on the hits and the sharp note attacks, while the harmonic curve follows the held tones.
Pulling these apart is often more useful than any single contour. The separation control is the median kernel size.
Key terms
- Harmonic / percussive separation (HPSS)
- Splitting a mix into two layers: a steady, pitched part — the held tones and sustained notes — and a transient, noisy part — the hits, clicks, and sharp note attacks. Each layer carries its own envelope, one for sustain and one for rhythm.
- Median filtering
- Replacing each spectrogram cell with the median of its neighbors. Slide the window along time and a steady tone survives as a horizontal ridge while a brief hit is voted away; slide it along frequency and a broadband burst survives as a vertical stripe while a narrow tone is voted away. The two passes isolate the harmonic and percussive content.
- Soft mask
- A per-cell weight in [0, 1] that apportions each cell's energy between the two layers rather than hard-assigning it, so ambiguous cells split smoothly instead of flipping. The median kernel size is the separation control — wider windows draw a cleaner line between sustain and rhythm.
Building the envelope, step by step
One contour can't describe a mix where held tones and percussive hits sound at once. The fix is to split the spectrogram two ways and follow each layer's energy separately — each graph below is drawn by the real algorithm on the page's polyphonic input.
- Step 1The raw mix
Start with the polyphonic input — sustained tones and percussive hits layered together, with no single carrier to demodulate. A lone amplitude follower would blur the rhythm into the sustain.
- Step 2Two separated contours
Median-filter the spectrogram along time (keeping horizontal, harmonic energy) and along frequency (keeping broadband, percussive energy), then read each soft mask's energy. The percussive curve spikes on the hits while the harmonic curve rides the held tones — the two layers are legible instead of summed.
The code
Six readable forms of the exact algorithm that draws the curves above — C, JS and Python ports, an optimized C, a fixed-coefficient version, and a user-controlled one whose parameters match the sliders.
#include <math.h>
#include <stdlib.h>
/* Assumed shared helper: an STFT magnitude spectrogram, B frequency bins by M
time frames, laid out row-major as mag[b * M + m]. (Same transform the rest of
the field guide uses; we only show the HPSS math here.)
void stft_mag(const double *x, int n, double *mag, int B, int M); */
/* Fold any index back into 0..len-1 by mirror reflection, so a median window has
well-defined neighbours at the spectrogram edges: -1->0, len->len-1, ... */
static int reflect_idx(int i, int len) {
if (len == 1) return 0;
int p = 2 * len;
i = ((i % p) + p) % p;
if (i >= len) i = p - 1 - i;
return i;
}
static int cmp_double(const void *a, const void *b) {
double d = *(const double *)a - *(const double *)b;
return (d > 0) - (d < 0);
}
/* Median of a length-win window of get(i0 + w - win/2), reflected at the edges. */
static double median_1d(int len, int i0, int win,
double (*get)(int idx, void *ctx), void *ctx) {
int half = win >> 1;
double *tmp = malloc((size_t)win * sizeof(double));
for (int w = 0; w < win; w++)
tmp[w] = get(reflect_idx(i0 + w - half, len), ctx);
qsort(tmp, win, sizeof(double), cmp_double);
double med = tmp[half];
free(tmp);
return med;
}
/* Getters that slide the median along one axis of mag[b * M + m]. */
typedef struct { const double *mag; int M; int b; } RowCtx; /* fixed bin, vary time */
typedef struct { const double *mag; int M; int m; } ColCtx; /* fixed frame, vary freq */
static double get_time(int idx, void *c) {
RowCtx *r = c; return r->mag[r->b * r->M + idx];
}
static double get_freq(int idx, void *c) {
ColCtx *k = c; return k->mag[idx * k->M + k->m];
}
/* Harmonic/percussive separation. kt = median window along time (harmonic),
kf = median window along frequency (percussive). Writes two per-frame energy
contours, each normalized to the shared peak. */
void hpss(const double *mag, int B, int M, int kt, int kf,
double *harm, double *perc) {
double *Hmed = malloc((size_t)B * M * sizeof(double));
double *Pmed = malloc((size_t)B * M * sizeof(double));
/* Harmonic estimate: median ALONG TIME within each frequency bin. */
for (int b = 0; b < B; b++) {
RowCtx ctx = { mag, M, b };
for (int m = 0; m < M; m++)
Hmed[b * M + m] = median_1d(M, m, kt, get_time, &ctx);
}
/* Percussive estimate: median ALONG FREQUENCY within each time frame. */
for (int m = 0; m < M; m++) {
ColCtx ctx = { mag, M, m };
for (int b = 0; b < B; b++)
Pmed[b * M + m] = median_1d(B, b, kf, get_freq, &ctx);
}
/* Soft mask + per-mask energy. The mask is the harmonic share of each cell;
split the real magnitude by it, then take RMS across bins. */
for (int m = 0; m < M; m++) {
double ah = 0.0, ap = 0.0;
for (int b = 0; b < B; b++) {
double hM = Hmed[b * M + m];
double pM = Pmed[b * M + m];
double hm = hM / (hM + pM + 1e-9); /* harmonic share */
double g = mag[b * M + m];
double vh = g * hm;
double vp = g * (1.0 - hm);
ah += vh * vh;
ap += vp * vp;
}
harm[m] = sqrt(ah / B);
perc[m] = sqrt(ap / B);
}
/* Normalize both contours to their shared peak. */
double mx = 0.0;
for (int m = 0; m < M; m++) {
if (harm[m] > mx) mx = harm[m];
if (perc[m] > mx) mx = perc[m];
}
if (mx == 0.0) mx = 1e-9;
for (int m = 0; m < M; m++) { harm[m] /= mx; perc[m] /= mx; }
free(Hmed);
free(Pmed);
}
// Assumed shared helper: stftMag(x) -> { mag, B, M }, where mag[b][m] is the
// magnitude of frequency bin b at time frame m. (The same STFT the rest of the
// field guide uses; HPSS only needs its magnitude.)
// Fold any index into 0..len-1 by mirror reflection, so the median windows have
// well-defined neighbours at the spectrogram edges: -1->0, len->len-1, ...
function reflectIdx(i, len) {
if (len === 1) return 0;
const p = 2 * len;
i = ((i % p) + p) % p;
if (i >= len) i = p - 1 - i;
return i;
}
// Median of a length-win window of get(i0 + w - win/2), reflected at the edges.
function median1D(len, i0, win, get) {
const half = win >> 1;
const tmp = [];
for (let w = 0; w < win; w++) tmp.push(get(reflectIdx(i0 + w - half, len)));
tmp.sort((a, b) => a - b);
return tmp[half];
}
// Harmonic/percussive separation. kt = median window along time (harmonic),
// kf = median window along frequency (percussive). Returns two normalized,
// per-frame energy contours.
function hpss(mag, B, M, kt, kf) {
const Hmed = Array.from({ length: B }, () => new Float64Array(M));
const Pmed = Array.from({ length: B }, () => new Float64Array(M));
// Harmonic estimate: median ALONG TIME within each frequency bin.
for (let b = 0; b < B; b++)
for (let m = 0; m < M; m++)
Hmed[b][m] = median1D(M, m, kt, (idx) => mag[b][idx]);
// Percussive estimate: median ALONG FREQUENCY within each time frame.
for (let m = 0; m < M; m++)
for (let b = 0; b < B; b++)
Pmed[b][m] = median1D(B, b, kf, (idx) => mag[idx][m]);
const harm = new Float64Array(M);
const perc = new Float64Array(M);
// Soft mask + per-mask RMS energy across bins.
for (let m = 0; m < M; m++) {
let ah = 0;
let ap = 0;
for (let b = 0; b < B; b++) {
const hM = Hmed[b][m];
const pM = Pmed[b][m];
const hm = hM / (hM + pM + 1e-9); // harmonic share
const g = mag[b][m];
const vh = g * hm;
const vp = g * (1 - hm);
ah += vh * vh;
ap += vp * vp;
}
harm[m] = Math.sqrt(ah / B);
perc[m] = Math.sqrt(ap / B);
}
// Normalize both contours to their shared peak.
let mx = 0;
for (let m = 0; m < M; m++) {
if (harm[m] > mx) mx = harm[m];
if (perc[m] > mx) mx = perc[m];
}
mx = mx || 1e-9;
for (let m = 0; m < M; m++) {
harm[m] /= mx;
perc[m] /= mx;
}
return { harm, perc };
}
import numpy as np
from scipy.ndimage import median_filter
# Idiomatic reference: this is essentially the librosa HPSS recipe. Median-filter
# the magnitude spectrogram along each axis, build a soft mask, and read off the
# per-mask energy. `mag` is an (B, M) array: B frequency bins by M time frames.
# (Assume a shared stft_mag(x) -> mag helper, the same STFT used elsewhere.)
def hpss(mag, kt, kf):
"""Split a magnitude spectrogram into harmonic and percussive energy contours.
kt = median window along time (axis 1) -> horizontal, harmonic energy
kf = median window along frequency (axis 0) -> vertical, percussive energy
"""
# 'reflect' matches the mirror-reflected edges of the hand-rolled version.
h_med = median_filter(mag, size=(1, kt), mode="reflect") # along time
p_med = median_filter(mag, size=(kf, 1), mode="reflect") # along frequency
# Soft mask: each cell's harmonic share. Split the real magnitude by it.
mask = h_med / (h_med + p_med + 1e-9)
vh = mag * mask
vp = mag * (1.0 - mask)
# Per-frame RMS across bins gives the two contours.
harm = np.sqrt(np.mean(vh ** 2, axis=0))
perc = np.sqrt(np.mean(vp ** 2, axis=0))
# Normalize both to their shared peak.
mx = max(harm.max(), perc.max()) or 1e-9
return harm / mx, perc / mx
The honest bottleneck is the median filtering: a naive qsort over each window costs O(win·log win) per cell across two full passes, and it dominates everything else here. The standard fix is a sliding median that keeps a running tally instead of re-sorting from scratch. Because the magnitudes are non-negative reals, quantize them into a small histogram and slide a window across each row/column: as the window advances, decrement the bin leaving and increment the one entering, then walk the histogram to the half-count to read the median — O(bins) per step, no per-window sort. The two axes stay separable (one pass along time, one along frequency), so each is an independent 1-D sliding median. The mask and energy loops are already linear and left as-is.
#include <math.h>
#include <string.h>
/* Sliding-median pass along a length-len signal, window win, with mirror-reflect
edges and a fixed-size value histogram (LEVELS buckets). Output med[i] is the
median of the window centred at i. Magnitudes are non-negative, so we map them
to buckets via a caller-supplied scale (1 / max-magnitude). */
#define LEVELS 1024
static int reflect_idx(int i, int len) {
if (len == 1) return 0;
int p = 2 * len;
i = ((i % p) + p) % p;
if (i >= len) i = p - 1 - i;
return i;
}
static void sliding_median(const double *sig, int len, int win, double inv_scale,
double *med) {
int half = win >> 1;
int target = win >> 1; /* the (win/2)-th order statistic */
int hist[LEVELS];
for (int i = 0; i < len; i++) {
/* Rebuilding the histogram per centre keeps the reflect logic simple and
is still O(win) — the win-time win·log win sort is what we removed. */
memset(hist, 0, sizeof(hist));
for (int w = 0; w < win; w++) {
double v = sig[reflect_idx(i + w - half, len)];
int b = (int)(v * inv_scale * (LEVELS - 1));
if (b < 0) b = 0; else if (b >= LEVELS) b = LEVELS - 1;
hist[b]++;
}
int count = 0, b = 0;
for (; b < LEVELS; b++) { count += hist[b]; if (count > target) break; }
med[i] = (double)b / (LEVELS - 1) / inv_scale;
}
}
/* mag[b * M + m]; the two median passes use it row-wise (time) and column-wise
(freq). Caller passes inv_scale = 1 / peak-magnitude so the histogram spans
the data range. */
void hpss_opt(const double *mag, int B, int M, int kt, int kf, double inv_scale,
double *harm, double *perc) {
double *Hmed = malloc((size_t)B * M * sizeof(double));
double *Pmed = malloc((size_t)B * M * sizeof(double));
double *row = malloc((size_t)(B > M ? B : M) * sizeof(double));
double *out = malloc((size_t)(B > M ? B : M) * sizeof(double));
/* Harmonic: sliding median along time within each bin. */
for (int b = 0; b < B; b++) {
for (int m = 0; m < M; m++) row[m] = mag[b * M + m];
sliding_median(row, M, kt, inv_scale, out);
for (int m = 0; m < M; m++) Hmed[b * M + m] = out[m];
}
/* Percussive: sliding median along frequency within each frame. */
for (int m = 0; m < M; m++) {
for (int b = 0; b < B; b++) row[b] = mag[b * M + m];
sliding_median(row, B, kf, inv_scale, out);
for (int b = 0; b < B; b++) Pmed[b * M + m] = out[b];
}
for (int m = 0; m < M; m++) {
double ah = 0.0, ap = 0.0;
for (int b = 0; b < B; b++) {
double hM = Hmed[b * M + m], pM = Pmed[b * M + m];
double hm = hM / (hM + pM + 1e-9);
double g = mag[b * M + m];
double vh = g * hm, vp = g * (1.0 - hm);
ah += vh * vh; ap += vp * vp;
}
harm[m] = sqrt(ah / B);
perc[m] = sqrt(ap / B);
}
double mx = 1e-9;
for (int m = 0; m < M; m++) {
if (harm[m] > mx) mx = harm[m];
if (perc[m] > mx) mx = perc[m];
}
for (int m = 0; m < M; m++) { harm[m] /= mx; perc[m] /= mx; }
free(Hmed); free(Pmed); free(row); free(out);
}
Both median windows hard-coded to the page default of 17 bins (the single Separation slider drives time and frequency together). No tuning knobs — a fixed kernel for a known spectrogram size.
#include <math.h>
#include <stdlib.h>
#define KT 17 /* median window along time (harmonic) */
#define KF 17 /* median window along frequency (percussive) */
static int reflect_idx(int i, int len) {
if (len == 1) return 0;
int p = 2 * len;
i = ((i % p) + p) % p;
if (i >= len) i = p - 1 - i;
return i;
}
static int cmp_double(const void *a, const void *b) {
double d = *(const double *)a - *(const double *)b;
return (d > 0) - (d < 0);
}
/* Median along time (bin b fixed) or frequency (frame m fixed) of mag[b * M + m]. */
static double median_time(const double *mag, int M, int b, int m) {
double tmp[KT];
for (int w = 0; w < KT; w++)
tmp[w] = mag[b * M + reflect_idx(m + w - (KT >> 1), M)];
qsort(tmp, KT, sizeof(double), cmp_double);
return tmp[KT >> 1];
}
static double median_freq(const double *mag, int B, int M, int b, int m) {
double tmp[KF];
for (int w = 0; w < KF; w++)
tmp[w] = mag[reflect_idx(b + w - (KF >> 1), B) * M + m];
qsort(tmp, KF, sizeof(double), cmp_double);
return tmp[KF >> 1];
}
void hpss_fixed(const double *mag, int B, int M, double *harm, double *perc) {
double mx = 1e-9;
for (int m = 0; m < M; m++) {
double ah = 0.0, ap = 0.0;
for (int b = 0; b < B; b++) {
double hM = median_time(mag, M, b, m);
double pM = median_freq(mag, B, M, b, m);
double hm = hM / (hM + pM + 1e-9);
double g = mag[b * M + m];
double vh = g * hm, vp = g * (1.0 - hm);
ah += vh * vh; ap += vp * vp;
}
harm[m] = sqrt(ah / B);
perc[m] = sqrt(ap / B);
}
for (int m = 0; m < M; m++) {
if (harm[m] > mx) mx = harm[m];
if (perc[m] > mx) mx = perc[m];
}
for (int m = 0; m < M; m++) { harm[m] /= mx; perc[m] /= mx; }
}
The page exposes one Separation slider (5..33 bins, default 17, forced odd) that drives both median windows together — but the algorithm takes them independently, so they are split out here. A longer time window (kt) demands more sustained, horizontal energy before a cell counts as harmonic, pushing more of the signal into the percussive layer's spikes; a longer frequency window (kf) demands more broadband, vertical energy before a cell counts as percussive, leaving more in the harmonic layer. Equal windows is the balanced split the page shows.
#include <math.h>
#include <stdlib.h>
static int reflect_idx(int i, int len) {
if (len == 1) return 0;
int p = 2 * len;
i = ((i % p) + p) % p;
if (i >= len) i = p - 1 - i;
return i;
}
static int cmp_double(const void *a, const void *b) {
double d = *(const double *)a - *(const double *)b;
return (d > 0) - (d < 0);
}
/* Window must be odd so there's a true centre sample; force the low bit on. */
static double median_along(const double *mag, int M, int b, int m,
int win, int axis_len, int along_time) {
int half = win >> 1;
double *tmp = malloc((size_t)win * sizeof(double));
for (int w = 0; w < win; w++) {
int idx = reflect_idx((along_time ? m : b) + w - half, axis_len);
tmp[w] = along_time ? mag[b * M + idx] : mag[idx * M + m];
}
qsort(tmp, win, sizeof(double), cmp_double);
double med = tmp[half];
free(tmp);
return med;
}
/* kt: time median window (slider; larger -> more harmonic)
kf: freq median window (slider; larger -> more percussive) */
void hpss_ctl(const double *mag, int B, int M, int kt, int kf,
double *harm, double *perc) {
kt |= 1; /* sliders are forced odd, mirroring the page control */
kf |= 1;
double mx = 1e-9;
for (int m = 0; m < M; m++) {
double ah = 0.0, ap = 0.0;
for (int b = 0; b < B; b++) {
double hM = median_along(mag, M, b, m, kt, M, 1); /* along time */
double pM = median_along(mag, M, b, m, kf, B, 0); /* along freq */
double hm = hM / (hM + pM + 1e-9);
double g = mag[b * M + m];
double vh = g * hm, vp = g * (1.0 - hm);
ah += vh * vh; ap += vp * vp;
}
harm[m] = sqrt(ah / B);
perc[m] = sqrt(ap / B);
}
for (int m = 0; m < M; m++) {
if (harm[m] > mx) mx = harm[m];
if (perc[m] > mx) mx = perc[m];
}
for (int m = 0; m < M; m++) { harm[m] /= mx; perc[m] /= mx; }
}