Offline

Harmonic / percussive

two layers
frequency-domain, offline, polyphonic, per-band

An envelope detector traces the loudness contour of a waveform — the slow outline riding over the fast carrier inside it. Every graph on this page is drawn by the method's real algorithm, and the sliders at the top drive all of them at once.

The whole method, live

[Interactive plot: harmonic and percussive contours of the polyphonic input. Separation control: 17 bins.]

Score card

Causality: offline
Signal model: polyphonic
Reads: per-band
Latency: none (offline)
Cost: STFT + median
Domain: frequency

Scored qualitatively.

This method outputs a normalized contour (onset strength, per-band or perceptual loudness), not an amplitude in the units of the true envelope — so an amplitude error number would be meaningless. Its strength is the spectral axis: read the gallery below.

How it works

Music has two envelopes: rhythm and sustain. Median-filter the spectrogram along time to keep steady, horizontal (harmonic) energy, and along frequency to keep broadband, vertical (percussive) energy. Comparing the two filtered copies cell by cell yields a pair of complementary soft masks, which split the signal into a sustained-harmonic layer and a transient layer, each with its own envelope: the percussive curve spikes on the hits and the sharp note attacks, while the harmonic curve follows the held tones.

Pulling these apart is often more useful than any single contour. The separation control is the median kernel size.
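To make the "voting" behaviour of the median concrete, here is a minimal sketch of a one-dimensional median over a single row of values. The names `median_at` and `MAX_WIN`, the fixed scratch buffer, and the clamped edge handling are assumptions of this sketch, not the page's implementation; `win` plays the role of the kernel size.

```c
#include <stdlib.h>

#define MAX_WIN 31  /* largest window this sketch supports (arbitrary limit) */

/* qsort comparison callback for doubles. */
static int cmp_dbl(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of the win values centred on row[i]. Indices that fall outside the
   row are clamped to the nearest edge (an assumption of this sketch). Returns
   row[i] unchanged if win is outside 1..MAX_WIN. */
double median_at(const double *row, int len, int i, int win) {
    double tmp[MAX_WIN];
    if (win < 1 || win > MAX_WIN) return row[i];
    int half = win / 2;
    for (int w = 0; w < win; w++) {
        int j = i + w - half;
        if (j < 0) j = 0;
        if (j >= len) j = len - 1;
        tmp[w] = row[j];
    }
    qsort(tmp, (size_t)win, sizeof tmp[0], cmp_dbl);
    return tmp[half];
}
```

Read along time, a row such as {0, 0, 0, 5, 0, 0, 0} is a brief hit in one frequency bin: a three-frame window centred on the hit sees {0, 5, 0} and returns 0, so the hit is voted away, while a steady row of ones returns 1 everywhere. Applying the same function to a column (one frame, all bins) reverses the roles: a burst that fills the column survives and a single narrow tone does not.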

Key terms

Harmonic / percussive separation (HPSS)
Splitting a mix into two layers: a steady, pitched part — the held tones and sustained notes — and a transient, noisy part — the hits, clicks, and sharp note attacks. Each layer carries its own envelope, one for sustain and one for rhythm.
Median filtering
Replacing each spectrogram cell with the median of its neighbors. Slide the window along time and a steady tone survives as a horizontal ridge while a brief hit is voted away; slide it along frequency and a broadband burst survives as a vertical stripe while a narrow tone is voted away. The two passes isolate the harmonic and percussive content.
Soft mask
A per-cell weight in [0, 1] that apportions each cell's energy between the two layers rather than hard-assigning it, so ambiguous cells split smoothly instead of flipping. The median kernel size is the separation control — wider windows draw a cleaner line between sustain and rhythm.
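As a small illustration of the soft mask, the sketch below turns the two median estimates for one cell into a harmonic share and splits the cell's magnitude accordingly. The ratio form and the 1e-9 guard follow the C listing further down this page; the function names `harmonic_share` and `split_cell` are hypothetical.

```c
/* Harmonic share of one spectrogram cell, in [0, 1).
   h_med: median along time, p_med: median along frequency (both >= 0).
   The 1e-9 term only guards against 0/0 in silent cells. */
double harmonic_share(double h_med, double p_med) {
    return h_med / (h_med + p_med + 1e-9);
}

/* Split a cell magnitude into a harmonic part and a percussive part.
   The two parts always add back up to mag. */
void split_cell(double mag, double h_med, double p_med,
                double *harm_part, double *perc_part) {
    double share = harmonic_share(h_med, p_med);
    *harm_part = mag * share;
    *perc_part = mag * (1.0 - share);
}
```

For example, a cell with magnitude 2.0 whose time median is 3.0 and whose frequency median is 1.0 gets a share of about 0.75, so roughly 1.5 goes to the harmonic layer and 0.5 to the percussive layer. When both medians are zero the share is 0 and the whole cell falls to the percussive side, which is harmless because such cells carry almost no energy.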

Building the envelope, step by step

One contour can't describe a mix where held tones and percussive hits sound at once. The fix is to split the spectrogram two ways and follow each layer's energy separately — each graph below is drawn by the real algorithm on the page's polyphonic input.

  1. Step 1: The raw mix

    Start with the polyphonic input — sustained tones and percussive hits layered together, with no single carrier to demodulate. A lone amplitude follower would blur the rhythm into the sustain.

  2. Step 2: Two separated contours

    Median-filter the spectrogram along time (keeping horizontal, harmonic energy) and along frequency (keeping broadband, percussive energy), then read each soft mask's energy. The percussive curve spikes on the hits while the harmonic curve rides the held tones — the two layers are legible instead of summed.

The code

Six readable forms of the exact algorithm that draws the curves above — C, JS and Python ports, an optimized C, a fixed-coefficient version, and a user-controlled one whose parameters match the sliders.

#include <math.h>
#include <stdlib.h>

/* Assumed shared helper: an STFT magnitude spectrogram, B frequency bins by M
   time frames, laid out row-major as mag[b * M + m]. (Same transform the rest of
   the field guide uses; we only show the HPSS math here.)

       void stft_mag(const double *x, int n, double *mag, int B, int M); */

/* Fold any index back into 0..len-1 by mirror reflection, so a median window has
   well-defined neighbours at the spectrogram edges: -1->0, len->len-1, ... */
static int reflect_idx(int i, int len) {
    if (len == 1) return 0;
    int p = 2 * len;
    i = ((i % p) + p) % p;
    if (i >= len) i = p - 1 - i;
    return i;
}

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Median of a length-win window of get(i0 + w - win/2), reflected at the edges.
   For an even win this is the upper of the two middle values. Falls back to
   the unfiltered centre value if win < 1 or the scratch buffer can't be
   allocated. */
static double median_1d(int len, int i0, int win,
                        double (*get)(int idx, void *ctx), void *ctx) {
    int half = win >> 1;
    double *tmp = win > 0 ? malloc((size_t)win * sizeof(double)) : NULL;
    if (!tmp) return get(i0, ctx);
    for (int w = 0; w < win; w++)
        tmp[w] = get(reflect_idx(i0 + w - half, len), ctx);
    qsort(tmp, (size_t)win, sizeof(double), cmp_double);
    double med = tmp[half];
    free(tmp);
    return med;
}

/* Getters that slide the median along one axis of mag[b * M + m]. */
typedef struct { const double *mag; int M; int b; } RowCtx; /* fixed bin, vary time */
typedef struct { const double *mag; int M; int m; } ColCtx; /* fixed frame, vary freq */
static double get_time(int idx, void *c) {
    RowCtx *r = c; return r->mag[r->b * r->M + idx];
}
static double get_freq(int idx, void *c) {
    ColCtx *k = c; return k->mag[idx * k->M + k->m];
}

/* Harmonic/percussive separation. kt = median window along time (harmonic),
   kf = median window along frequency (percussive). Writes two per-frame energy
   contours, each normalized to the shared peak. */
void hpss(const double *mag, int B, int M, int kt, int kf,
          double *harm, double *perc) {
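    double *Hmed = malloc((size_t)B * M * sizeof(double));
    double *Pmed = malloc((size_t)B * M * sizeof(double));
    if (!Hmed || !Pmed) {   /* out of memory: emit silent contours and bail */
        for (int m = 0; m < M; m++) harm[m] = perc[m] = 0.0;
        free(Hmed);
        free(Pmed);
        return;
    }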

    /* Harmonic estimate: median ALONG TIME within each frequency bin. */
    for (int b = 0; b < B; b++) {
        RowCtx ctx = { mag, M, b };
        for (int m = 0; m < M; m++)
            Hmed[b * M + m] = median_1d(M, m, kt, get_time, &ctx);
    }
    /* Percussive estimate: median ALONG FREQUENCY within each time frame. */
    for (int m = 0; m < M; m++) {
        ColCtx ctx = { mag, M, m };
        for (int b = 0; b < B; b++)
            Pmed[b * M + m] = median_1d(B, b, kf, get_freq, &ctx);
    }

    /* Soft mask + per-mask energy. The mask is the harmonic share of each cell;
       split the real magnitude by it, then take RMS across bins. */
    for (int m = 0; m < M; m++) {
        double ah = 0.0, ap = 0.0;
        for (int b = 0; b < B; b++) {
            double hM = Hmed[b * M + m];
            double pM = Pmed[b * M + m];
            double hm = hM / (hM + pM + 1e-9);   /* harmonic share */
            double g  = mag[b * M + m];
            double vh = g * hm;
            double vp = g * (1.0 - hm);
            ah += vh * vh;
            ap += vp * vp;
        }
        harm[m] = sqrt(ah / B);
        perc[m] = sqrt(ap / B);
    }

    /* Normalize both contours to their shared peak. */
    double mx = 0.0;
    for (int m = 0; m < M; m++) {
        if (harm[m] > mx) mx = harm[m];
        if (perc[m] > mx) mx = perc[m];
    }
    if (mx == 0.0) mx = 1e-9;
    for (int m = 0; m < M; m++) { harm[m] /= mx; perc[m] /= mx; }

    free(Hmed);
    free(Pmed);
}