Charts

Kernel Density Estimation

Smooth discrete data into a continuous curve, revealing the underlying distribution.

What is Kernel Density Estimation (KDE)?

A density curve, or Kernel Density Estimation (KDE), is a non-parametric method for estimating the probability density function (PDF) of a random variable. It is an alternative to the histogram: instead of grouping observations into discrete bins, a KDE plot produces a smooth, continuous curve that aims to better model the underlying distribution of the data. KDE works by placing a "kernel" (a small, smooth function, most often a Gaussian bell curve) over each data point, creating a small bump. These individual bumps are then summed together to form the overall estimated density curve.

Schematic illustration of kernel density estimation. Individual Gaussian (bell-shaped) curves are centered over each data point, and their summation forms a smooth, estimated probability density curve.
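The kernel-summation idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not Graphmatik's implementation; the function names are our own:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: the smooth 'bump' placed over each point."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, data, bandwidth):
    """Evaluate the density estimate on x_grid by summing one scaled
    kernel per data point. The 1/(n*h) factor normalizes the result so
    the total area under the curve is 1."""
    data = np.asarray(data, dtype=float)
    # Distance of every grid point to every data point, in bandwidth units.
    u = (x_grid[:, None] - data[None, :]) / bandwidth
    return gaussian_kernel(u).sum(axis=1) / (data.size * bandwidth)

grid = np.linspace(-4.0, 10.0, 400)
density = kde(grid, [1.0, 2.0, 2.5, 6.0], bandwidth=0.8)
```

Each column of `u` holds one data point's bump evaluated across the grid; summing across columns produces the smooth curve.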

The height of a KDE curve reflects the density of data points at any given location, and the total area under the curve always equals 1. Typically, the selection of "bandwidth" matters more than the choice of "kernel": the bandwidth controls the smoothness of the estimated curve. A smaller bandwidth results in a sharper, spikier estimate, while a larger bandwidth creates a smoother, more general representation.

Choosing the optimal bandwidth in KDE is analogous to selecting the appropriate bin width for a histogram: too small, and the results can be noisy; too large, and the data may be over-smoothed.
Default KDE plot
An estimate of the probability distribution of a continuous random variable
Density curve showing a normal distribution centered around a mean of 100.

Choice of kernel

With KDE, each individual data point contributes a small bump; stacked together, these bumps form the overall estimated probability density curve. The shape of these small, symmetric functions is called the "kernel". Graphmatik provides two popular kernel functions for generating density curves:

Gaussian kernel

The Gaussian kernel applies a normal (bell-shaped) weighting function centered at each data point. It is a widely used approach whose smoothing characteristics mimic the normal probability density function.

A single Gaussian (bell-shaped) kernel, used to estimate density.

Epanechnikov kernel

The Epanechnikov kernel, also known as the parabolic kernel, is recognized for its compact support, meaning it is non-zero only within a finite interval around each data point. It is also generally considered theoretically optimal for KDE with respect to minimizing the mean integrated squared error (MISE).

A single Epanechnikov kernel, appearing as a smooth, inverted U-shaped (parabolic) curve.
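Both kernels are simple functions of the scaled distance u from a data point. A sketch of their formulas in NumPy (illustrative only, not Graphmatik's source):

```python
import numpy as np

def gaussian(u):
    # Bell curve: positive everywhere (infinite support), total area 1.
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def epanechnikov(u):
    # Inverted parabola: exactly zero outside |u| <= 1 (compact support), area 1.
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
```

Compact support means distant points contribute exactly zero, which can make evaluating the Epanechnikov kernel cheaper on large datasets.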

Setting the bandwidth

For density curves, selecting the correct "bandwidth" is often considered far more important than the choice of "kernel". This is because the bandwidth controls the level of smoothing and has a far greater effect on the shape of the estimated curve.

It's best to select a bandwidth suited to the underlying data. This is, of course, easier said than done. Beyond manual adjustment, Graphmatik offers two automated methods for estimating the optimal bandwidth for approximately normal datasets.

Silverman's rule

Silverman's "rule of thumb" is a popular method for selecting the bandwidth (h). It aims to find a bandwidth that is "optimal" in the sense of minimizing the Asymptotic Mean Integrated Squared Error (AMISE), assuming the underlying data is normally distributed.

The equation for Silverman's rule is: h = 0.9 · min(σ, IQR / 1.35) · n^(−1/5)
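In code, the rule reads directly off the formula. This is a sketch using NumPy, not Graphmatik's internals:

```python
import numpy as np

def silverman_bandwidth(data):
    """h = 0.9 * min(sigma, IQR / 1.35) * n^(-1/5)."""
    data = np.asarray(data, dtype=float)
    sigma = data.std(ddof=1)                  # sample standard deviation
    q75, q25 = np.percentile(data, [75, 25])
    iqr = q75 - q25                           # robust measure of spread
    return 0.9 * min(sigma, iqr / 1.35) * data.size ** (-1 / 5)

rng = np.random.default_rng(0)
h = silverman_bandwidth(rng.normal(loc=100.0, scale=1.0, size=1000))
```

The n^(−1/5) factor shrinks the bandwidth as the sample grows, letting larger datasets support finer detail.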

Scott's rule

Scott's "rule of thumb" is another simple method for bandwidth selection, similar to Silverman's rule. Like Silverman's, it relies on the assumption that the underlying data is normally distributed.

The equation for Scott's rule is: h ≈ 1.06 · σ · n^(−1/5)
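Scott's rule is even simpler to express. Again, a NumPy sketch with names of our own choosing:

```python
import numpy as np

def scott_bandwidth(data):
    """h ~= 1.06 * sigma * n^(-1/5)."""
    data = np.asarray(data, dtype=float)
    return 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)

rng = np.random.default_rng(0)
h = scott_bandwidth(rng.normal(loc=100.0, scale=1.0, size=1000))
```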

Scott's rule tends to be slightly less robust than Silverman's, often producing bandwidths that are marginally larger (resulting in more smoothing) for the same dataset, especially when outliers are present or the data is skewed.

This difference stems from Silverman's method incorporating the interquartile range (IQR), a more robust measure of spread than the standard deviation (σ) that Scott's rule relies on alone. Further, Silverman's method uses a slightly more conservative constant (0.9), chosen to be about 15% smaller than the optimal constant for a perfectly normal distribution (~1.06), which keeps Silverman's rule within roughly 90% of the efficiency of that optimum.
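The robustness difference is easy to see on a skewed sample, where a long tail inflates σ much more than the IQR. A sketch (function names are ours, not Graphmatik's):

```python
import numpy as np

def silverman(x):
    # h = 0.9 * min(sigma, IQR/1.35) * n^(-1/5)
    q75, q25 = np.percentile(x, [75, 25])
    return 0.9 * min(x.std(ddof=1), (q75 - q25) / 1.35) * x.size ** -0.2

def scott(x):
    # h ~= 1.06 * sigma * n^(-1/5)
    return 1.06 * x.std(ddof=1) * x.size ** -0.2

# A right-skewed sample: the long tail inflates sigma but not the IQR,
# so Silverman's min(sigma, IQR/1.35) term damps the bandwidth while
# Scott's estimate grows with sigma and over-smooths.
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)
```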

Tips for designing impactful KDE plots

Optimal bandwidth
Too small a bandwidth will create a noisy KDE that overfits the data, whereas too large a bandwidth will oversmooth the curve, obscuring data patterns and biasing the estimate.
Density plot with kernel density estimation. The small bandwidth creates numerous sharp spikes, resulting in a noisy and fragmented curve that obscures the true underlying data distribution.
Density plot generated using kernel density estimation. The optimal bandwidth results in a smooth, clear bell-shaped curve, effectively revealing the underlying normal distribution.

Chart properties

kernel (default: gaussian)
- gaussian: Applies a Gaussian (bell-shaped) weighting function centered at each data point.
- epanechnikov: Applies a parabolic weighting function, optimal for minimizing the mean integrated squared error (MISE).

bandwidth (default: Silverman's rule)
- Silverman's rule: A widely used rule of thumb; assumes approximately normal data, but performs reasonably well for other distributions.
- Scott's rule: Another normal reference rule, similar to Silverman's, that calculates the optimal bandwidth assuming a normal distribution. Typically marginally wider than Silverman's rule of thumb.
- manual: Adjust the bandwidth manually.