Kernel Density Estimation
What is Kernel Density Estimation (KDE)?
Kernel Density Estimation (KDE) is a non-parametric method for estimating the probability density function (PDF) of a random variable; the resulting plot is often called a density curve. It is an alternative to the histogram: instead of sorting the data into discrete bins, a KDE plot produces a smooth, continuous curve intended to better model the underlying distribution of the data. KDE works by placing a "kernel" (a small, smooth function, most often a Gaussian bell curve) over each data point, creating a small bump. These individual kernels are then summed to form the overall estimated density curve.
The height of a KDE curve reflects the density of data points at any given location, and the total area under the curve is always 1. The selection of "bandwidth", which controls the smoothness of the estimated curve, is typically more important than the choice of "kernel". A smaller bandwidth produces a sharper, spikier estimate, while a larger bandwidth produces a smoother, more general representation.
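The summing of kernels described above can be sketched in a few lines of Python. This is a minimal illustration, not Graphmatik's implementation: the function names and the sample values are made up for the example, and it assumes the common textbook estimator in which each kernel is scaled by the bandwidth and the sum is divided by the number of points.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, bandwidth):
    """Estimated density at x: one scaled kernel per data point, averaged."""
    n = len(data)
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    return total / (n * bandwidth)


def area_under_curve(data, bandwidth, lo, hi, steps=2000):
    """Trapezoidal approximation of the area under the KDE on [lo, hi]."""
    width = (hi - lo) / steps
    ys = [kde(lo + i * width, data, bandwidth) for i in range(steps + 1)]
    return width * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


# Made-up illustrative values, not real measurements.
sample = [1.2, 1.9, 2.1, 2.4, 3.3, 5.0]

for h in (0.2, 1.0):
    density_at_2 = kde(2.0, sample, h)
    area = area_under_curve(sample, h, -10.0, 15.0)
    print(f"bandwidth={h}: density at x=2.0 is {density_at_2:.4f}, area is {area:.4f}")
```

With the smaller bandwidth, the estimate near the cluster of points around 2 is noticeably taller, while the area under the curve stays close to 1 for both settings.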
Choice of kernel
With KDE, each individual data point contributes a small bump, and these bumps, stacked together, form the overall estimated probability density curve. The shape of these small, symmetric functions is called the "kernel". Graphmatik provides two popular kernel functions for generating density curves:
Gaussian kernel
The Gaussian kernel applies a normal (bell-shaped) weighting function centered at each data point. It is a widely used approach whose smoothing characteristics mimic the normal probability density function.
Epanechnikov kernel
The Epanechnikov kernel, also known as the parabolic kernel, is recognized for its compact support, meaning it's non-zero only within a finite interval extending from each data point. It's also generally considered theoretically optimal for KDE, with respect to minimizing the Mean Integrated Squared Error (MISE).
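To make the difference between the two kernels concrete, here is a small sketch of both in their standard textbook forms. The exact scaling Graphmatik applies internally is not documented here, so treat the function names and formulas as assumptions for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: positive for every u, so its support is unbounded."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def epanechnikov_kernel(u):
    """Parabolic kernel 0.75 * (1 - u^2): non-zero only for |u| < 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def integrate(kernel, lo, hi, steps=20000):
    """Trapezoidal approximation of the integral of a kernel over [lo, hi]."""
    width = (hi - lo) / steps
    ys = [kernel(lo + i * width) for i in range(steps + 1)]
    return width * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


for name, kernel in (("gaussian", gaussian_kernel),
                     ("epanechnikov", epanechnikov_kernel)):
    print(name,
          "K(0) =", round(kernel(0.0), 4),
          "K(1.5) =", round(kernel(1.5), 4),
          "area =", round(integrate(kernel, -8.0, 8.0), 4))
```

Both kernels integrate to 1, but only the Epanechnikov kernel is exactly zero beyond one bandwidth from the data point, which is what "compact support" means in practice.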
Setting the bandwidth
For density curves, selecting the correct "bandwidth" is often considered far more important than the choice of "kernel". This is because the bandwidth controls the level of smoothing and has a far greater effect on the shape of the estimated curve.
It's best to select a bandwidth according to the underlying data. This is, of course, easier said than done. Beyond manual adjustment, Graphmatik offers two automated methods for choosing a bandwidth, both of which assume approximately normal data.
Silverman's rule
Silverman's "rule of thumb" is a popular method for selecting the bandwidth (h). It aims to find a bandwidth that is "optimal" in the sense of minimizing the Asymptotic Mean Integrated Squared Error (AMISE), assuming the underlying data is normally distributed.
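The equation for Silverman's rule, in its standard form, is:

$$h = 0.9 \, \min\!\left(\sigma, \frac{\mathrm{IQR}}{1.34}\right) n^{-1/5}$$

where σ is the standard deviation of the data, IQR is the interquartile range, and n is the number of data points.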
Scott's rule
Scott's "rule of thumb" is another simple method for bandwidth selection, similar to Silverman's rule. Like Silverman's, it relies on the assumption that the underlying data is normally distributed.
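The equation for Scott's rule, in its standard form, is:

$$h = 1.06 \, \sigma \, n^{-1/5}$$

where σ is the standard deviation of the data and n is the number of data points.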
Scott's rule typically yields a marginally wider bandwidth than Silverman's. This difference stems from Silverman's method incorporating the interquartile range (IQR), a more robust measure of spread than the standard deviation (σ) alone, on which Scott's rule relies. Further, the Silverman method uses a slightly more conservative constant (0.9). It is chosen to be about 15% smaller than the optimal bandwidth for perfectly normal data (~1.06σ), keeping Silverman's rule within 90% efficiency of that optimum.
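The two rules can be compared directly with a short sketch. It assumes the standard textbook forms, h = 0.9 · min(σ, IQR/1.34) · n^(-1/5) for Silverman and h = 1.06 · σ · n^(-1/5) for Scott. Whether Graphmatik uses the sample or population standard deviation, and which quantile method it uses for the IQR, is not stated here, so those choices (and the function names and sample values) are assumptions.

```python
import statistics


def silverman_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(data)
    sd = statistics.stdev(data)                  # sample standard deviation (assumption)
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles, default "exclusive" method
    iqr = q3 - q1
    # Assumption of this sketch: fall back to sd alone if the IQR is zero.
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-1 / 5)


def scott_bandwidth(data):
    """Scott's rule of thumb: 1.06 * sd * n^(-1/5)."""
    n = len(data)
    return 1.06 * statistics.stdev(data) * n ** (-1 / 5)


# Made-up illustrative values with one large observation in the right tail.
sample = [2.1, 2.4, 2.9, 3.0, 3.2, 3.5, 3.7, 4.1, 4.6, 6.8]

print("Silverman:", silverman_bandwidth(sample))
print("Scott:    ", scott_bandwidth(sample))
```

Because Silverman's rule takes the smaller of two spread measures and multiplies by 0.9 rather than 1.06, its bandwidth never exceeds Scott's under these formulas. The gap grows when outliers inflate the standard deviation more than the IQR.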
Tips for designing impactful KDE plots
Chart properties
Prop | Default | Description |
---|---|---|
kernel | gaussian | gaussian: applies a Gaussian (bell-shaped) weighting function centered at each data point. epanechnikov: applies a parabolic weighting function, theoretically optimal for minimizing the mean integrated squared error. |
bandwidth | Silverman's rule | Silverman's rule: a widely used rule of thumb that assumes approximately normal data but performs reasonably well for other distributions. Scott's rule: another normal reference rule, similar to Silverman's, that calculates the optimal bandwidth assuming a normal distribution; the result is typically marginally wider than Silverman's. manual: adjust the bandwidth manually. |