Histogram
What is a histogram?
A histogram is a type of bar graph designed to show the distribution of a dataset. It groups data into "bins" or ranges, and then shows how many data points fall into each bin using bars. The height of each bar corresponds to the frequency of data within that particular bin, and the width of the bars represents the bin interval. In short, the taller the bar, the more data points there are within that specific range.
Histograms excel at visualizing the shape and spread of a dataset, and are often a first step in identifying patterns such as skewness or multimodality. They are also a quick way to compare the distributions of different datasets.
Frequency vs Relative Frequency vs Density
When plotting a histogram, Graphmatik gives you the choice to plot the data using frequency, relative frequency, or density on the vertical axis. The three measures are related, but each presents a different view of the underlying data distribution:
Frequency
Plots the count (or frequency) of data points that fall into each bin, meaning the height of each bar directly shows how many observations lie within that specific range. Frequency is straightforward to display and easy to interpret.
Relative frequency
Plots the proportion or percentage of data points that fall into each bin. This is calculated by dividing the frequency of each bin by the total number of observations. The sum of all bar heights will equal 1.
Density
Plots the frequency density, which is the relative frequency of a bin divided by its width. With density, it's not the height, but rather the area of each bar that represents the proportion of data within that bin. The total area of all bars in a density histogram will always sum to 1.
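The relationship between the three measures can be sketched with NumPy (this is illustrative code, not Graphmatik's own API):

```python
import numpy as np

# A small sample and three bins of equal width 1.5.
data = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
bins = np.array([1.0, 2.5, 4.0, 5.5])

# Frequency: raw count of observations per bin.
freq, _ = np.histogram(data, bins=bins)

# Relative frequency: counts divided by the total number of observations;
# the bar heights sum to 1.
rel_freq = freq / data.size

# Density: relative frequency divided by bin width; the bar AREAS sum to 1.
widths = np.diff(bins)
density = rel_freq / widths

print(freq)                      # counts per bin: [3 3 4]
print(rel_freq.sum())            # 1.0
print((density * widths).sum())  # 1.0
```

With equal-width bins the three histograms have the same shape and differ only in vertical scale; the distinction matters once bin widths vary.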
Binning the data
Changing the bin size can have a dramatic effect on the shape of the histogram. Generally, it is best to select a bin width suited to the underlying data. Graphmatik provides three different ways to set the bin width.
Sturges' rule
Sturges' rule is used to determine the optimal number of bins (k) for a histogram, based on the number of observations (n) in the dataset.
Sturges' formula is: k = ⌈log₂(n)⌉ + 1. The method aims to create a visually informative histogram without over-smoothing or making it too choppy.
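As a quick sketch (illustrative code, not Graphmatik's internals), Sturges' rule is a one-liner:

```python
import math

def sturges_bins(n):
    """Sturges' rule: k = ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1

print(sturges_bins(100))   # 8 bins
print(sturges_bins(1000))  # 11 bins
```

Because it depends only on the sample size, Sturges' rule tends to suggest too few bins for large datasets or strongly skewed data.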
Scott's normal reference rule
Scott's rule provides a method for determining the optimal bin width (h) for a histogram, particularly when the data is approximately normally distributed. It aims to minimize the integrated mean squared error between the histogram and the underlying probability density function.
Scott's formula is: h = 3.49σn^(−1/3), where σ is the standard deviation of the data and n is the number of observations. This rule balances the trade-off between bias (oversmoothing) and variance (undersmoothing) in the histogram's appearance.
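A minimal sketch of Scott's rule (again illustrative, not Graphmatik's own code), using the sample standard deviation:

```python
import numpy as np

def scott_bin_width(data):
    """Scott's rule: h = 3.49 * sigma * n^(-1/3)."""
    data = np.asarray(data)
    sigma = data.std(ddof=1)  # sample standard deviation
    return 3.49 * sigma * data.size ** (-1 / 3)

# For ~1000 draws from a standard normal, the rule suggests a width near
# 3.49 * 1 * 1000^(-1/3) ≈ 0.35.
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(scott_bin_width(sample))
```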
Freedman-Diaconis rule
The Freedman-Diaconis rule is a robust method for determining histogram bin width (h) that is less sensitive to outliers than Scott's rule. It calculates the optimal bin width using the interquartile range (IQR) of the data and the number of observations (n).
The Freedman-Diaconis formula is: h = 2 · IQR · n^(−1/3).
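The rule's robustness is easy to see in a sketch (illustrative code): because the IQR ignores the tails, one extreme outlier barely changes the suggested width.

```python
import numpy as np

def fd_bin_width(data):
    """Freedman-Diaconis rule: h = 2 * IQR * n^(-1/3)."""
    data = np.asarray(data)
    q25, q75 = np.percentile(data, [25, 75])
    return 2 * (q75 - q25) * data.size ** (-1 / 3)

data = np.arange(1, 101, dtype=float)      # the values 1..100
with_outlier = np.append(data, 10_000.0)   # add one extreme outlier

print(fd_bin_width(data))          # ~21.3
print(fd_bin_width(with_outlier))  # ~21.5, nearly unchanged
```

A standard-deviation-based rule like Scott's would be stretched badly by that outlier, producing far wider bins.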
Tips for designing meaningful histograms
Chart properties
| Prop | Default | Description |
| --- | --- | --- |
| bin measure | frequency | **frequency** — plots the count of data points within each bin. **relative frequency** — plots the proportion of data points that fall into each bin. **density** — the area of each bar represents the proportion of data within that bin. |
| method | Sturges' rule | **Sturges' rule** — determines the bin size based on the number of observations. **Scott's rule** — calculates an optimal bin width using the data's standard deviation; useful for approximately normal data. **Freedman-Diaconis rule** — calculates the bin width using the interquartile range (IQR) of the data. |
| kernel density | false | **false** — no Kernel Density Estimate (KDE) overlay is applied. **true** — a KDE line plot is overlaid on the histogram, with its smoothing determined by your chosen kernel and bandwidth parameters. |
| kernel | gaussian | **gaussian** — applies a Gaussian (bell-shaped) weighting function centered at each data point. **epanechnikov** — applies a parabolic weighting function, optimal for minimizing mean squared error. |
| bandwidth | Silverman's rule | **Silverman's rule** — a widely used rule of thumb; assumes approximately normal data but performs reasonably well for other distributions. **Scott's rule** — another normal reference rule, similar to Silverman's; typically yields a marginally wider bandwidth. **manual** — adjust the bandwidth yourself. |
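To make the KDE options concrete, here is a minimal sketch of a Gaussian KDE with Silverman's rule-of-thumb bandwidth. The function names are illustrative assumptions, not Graphmatik's actual code:

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb: h = 0.9 * min(sigma, IQR/1.34) * n^(-1/5)."""
    sigma = data.std(ddof=1)
    q25, q75 = np.percentile(data, [25, 75])
    return 0.9 * min(sigma, (q75 - q25) / 1.34) * data.size ** (-1 / 5)

def gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid`: the average of one Gaussian
    kernel of width `bandwidth` centered at each data point."""
    u = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-4, 4, 200)
estimate = gaussian_kde(sample, grid, silverman_bandwidth(sample))

# Like a density histogram, the KDE curve encloses a total area of ~1.
print(estimate.sum() * (grid[1] - grid[0]))
```

This is why the KDE overlay pairs naturally with the density bin measure: both curves are on the same vertical scale, with total area 1.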