Histogram
What is a histogram?
A histogram is a type of bar graph designed to show the distribution of a dataset. It groups data into "bins" or ranges, and then shows how many data points fall into each bin using bars. The height of each bar corresponds to the frequency of data within that particular bin, and the width of the bars represents the bin interval. In short, the taller the bar, the more data points there are within that specific range.
Histograms excel at visualizing the shape and spread of a dataset, and are often a first step in identifying patterns such as skewness or multimodality. They are also a quick way to compare the distributions of different datasets.
Frequency vs Relative Frequency vs Density
When plotting a histogram, Graphmatik gives you the choice to plot the data using frequency, relative frequency, or density on the vertical axis. The three measures are related, but each presents a different view of the underlying data distribution:
Frequency
Plots the count (or frequency) of data points that fall into each bin, meaning the height of each bar directly shows how many observations lie within that specific range. Frequency is straightforward to display and easy to interpret.
Relative frequency
Plots the proportion or percentage of data points that fall into each bin. This is calculated by dividing the frequency of each bin by the total number of observations. The sum of all bar heights will equal 1.
Density
Plots the frequency density, which is the relative frequency of a bin divided by its width. With density, it's not the height, but rather the area of each bar that represents the proportion of data within that bin. The total area of all bars in a density histogram will always sum to 1.
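The relationship between the three measures can be sketched with NumPy (this is illustrative code, not Graphmatik's own API):

```python
import numpy as np

# A small sample and three bins of equal width 1.5.
data = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
bins = np.array([1.0, 2.5, 4.0, 5.5])

# Frequency: raw count of observations per bin.
freq, _ = np.histogram(data, bins=bins)

# Relative frequency: counts divided by the total number of observations;
# the bar heights sum to 1.
rel_freq = freq / data.size

# Density: relative frequency divided by bin width; the bar AREAS sum to 1.
widths = np.diff(bins)
density = rel_freq / widths

print(freq)                      # counts per bin: [3 3 4]
print(rel_freq.sum())            # 1.0
print((density * widths).sum())  # 1.0
```

With equal-width bins the three histograms have the same shape and differ only in vertical scale; the distinction matters once bin widths vary.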
Binning the data
Changing the bin size can have a dramatic effect on the shape of the histogram. Generally, it is best to select a bin width suited to the underlying data. Graphmatik provides three different ways to set the bin width.
Sturges' rule
Sturges' rule is used to determine the optimal number of bins (k) for a histogram, based on the number of observations (n) in the dataset.
Sturges' formula is: k = ⌈log₂(n)⌉ + 1. The method aims to create a visually informative histogram without over-smoothing or making it too choppy.
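As a quick sketch (illustrative code, not Graphmatik's internals), Sturges' rule is a one-liner:

```python
import math

def sturges_bins(n):
    """Sturges' rule: k = ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1

print(sturges_bins(100))   # 8 bins
print(sturges_bins(1000))  # 11 bins
```

Because it depends only on the sample size, Sturges' rule tends to suggest too few bins for large datasets or strongly skewed data.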
Scott's normal reference rule
Scott's rule provides a method for determining the optimal bin width (h) for a histogram, particularly when the data is approximately normally distributed. It aims to minimize the integrated mean squared error between the histogram and the underlying probability density function.
Scott's formula is: h = 3.49σn^(−1/3), where σ is the standard deviation of the data and n is the number of observations. This rule balances the trade-off between bias (oversmoothing) and variance (undersmoothing) in the histogram's appearance.
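A minimal sketch of Scott's rule (again illustrative, not Graphmatik's own code), using the sample standard deviation:

```python
import numpy as np

def scott_bin_width(data):
    """Scott's rule: h = 3.49 * sigma * n^(-1/3)."""
    data = np.asarray(data)
    sigma = data.std(ddof=1)  # sample standard deviation
    return 3.49 * sigma * data.size ** (-1 / 3)

# For ~1000 draws from a standard normal, the rule suggests a width near
# 3.49 * 1 * 1000^(-1/3) ≈ 0.35.
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(scott_bin_width(sample))
```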
Freedman-Diaconis rule
The Freedman-Diaconis rule is a robust method for determining histogram bin width (h) that is less sensitive to outliers than Scott's rule. It calculates the optimal bin width using the interquartile range (IQR) of the data and the number of observations (n).
The Freedman-Diaconis formula is: h = 2 · IQR · n^(−1/3).
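The rule's robustness is easy to see in a sketch (illustrative code): because the IQR ignores the tails, one extreme outlier barely changes the suggested width.

```python
import numpy as np

def fd_bin_width(data):
    """Freedman-Diaconis rule: h = 2 * IQR * n^(-1/3)."""
    data = np.asarray(data)
    q25, q75 = np.percentile(data, [25, 75])
    return 2 * (q75 - q25) * data.size ** (-1 / 3)

data = np.arange(1, 101, dtype=float)      # the values 1..100
with_outlier = np.append(data, 10_000.0)   # add one extreme outlier

print(fd_bin_width(data))          # ~21.3
print(fd_bin_width(with_outlier))  # ~21.5, nearly unchanged
```

A standard-deviation-based rule like Scott's would be stretched badly by that outlier, producing far wider bins.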
Tips for designing meaningful histograms
Chart properties
| Prop | Default | Description |
| --- | --- | --- |
| bin measure | frequency | **frequency** — plots the count of data points within each bin. **relative frequency** — plots the proportion of data points that fall into each bin. **density** — the area of each bar represents the proportion of data within that bin. |
| method | Sturges' rule | **Sturges' rule** — determines the bin size based on the number of observations. **Scott's rule** — calculates an optimal bin width using the data's standard deviation; useful for approximately normal data. **Freedman-Diaconis rule** — calculates the bin width using the interquartile range (IQR) of the data. |
| kernel density | false | **false** — no Kernel Density Estimate (KDE) overlay is applied. **true** — a KDE line plot is overlaid on the histogram, with its smoothing determined by your chosen kernel and bandwidth parameters. |
| kernel | gaussian | **gaussian** — applies a Gaussian (bell-shaped) weighting function centered at each data point. **epanechnikov** — applies a parabolic weighting function, optimal for minimizing mean squared error. |
| bandwidth | Silverman's rule | **Silverman's rule** — a widely used rule of thumb; assumes approximately normal data but performs reasonably well for other distributions. **Scott's rule** — another normal reference rule, similar to Silverman's; typically yields a marginally wider bandwidth. **manual** — adjust the bandwidth yourself. |
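To make the KDE options concrete, here is a minimal sketch of a Gaussian KDE with Silverman's rule-of-thumb bandwidth. The function names are illustrative assumptions, not Graphmatik's actual code:

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb: h = 0.9 * min(sigma, IQR/1.34) * n^(-1/5)."""
    sigma = data.std(ddof=1)
    q25, q75 = np.percentile(data, [25, 75])
    return 0.9 * min(sigma, (q75 - q25) / 1.34) * data.size ** (-1 / 5)

def gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid`: the average of one Gaussian
    kernel of width `bandwidth` centered at each data point."""
    u = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-4, 4, 200)
estimate = gaussian_kde(sample, grid, silverman_bandwidth(sample))

# Like a density histogram, the KDE curve encloses a total area of ~1.
print(estimate.sum() * (grid[1] - grid[0]))
```

This is why the KDE overlay pairs naturally with the density bin measure: both curves are on the same vertical scale, with total area 1.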