Statistics Basics

Two questions every dataset answers

Whenever you have a list of numbers, two questions come up immediately: where is the centre, and how spread out are the values? Descriptive statistics is the toolkit for answering both, and it splits cleanly into two halves: measures of centre (mean, median, mode) and measures of spread (range, variance, standard deviation, IQR).

Measures of centre

Mean. The arithmetic average. Add everything up and divide by the count. Mathematically, μ = (1/n) Σ xᵢ, where n is the number of values and the sum runs over all of them. It is the most familiar centre, but it is sensitive to outliers: a single billionaire in a sample of incomes drags the mean upward dramatically.

Median. Sort the data and pick the middle value (or average the two middle values for an even count). The median is robust to outliers: it depends only on which values sit in the middle of the sorted list, so however large that billionaire's income grows, the median does not move. For skewed distributions like income or house prices the median is usually a more representative summary of a typical value than the mean.
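A minimal sketch of the contrast between mean and median, using Python's standard statistics module. The income figures are invented for illustration, and the "billionaire" is modelled by replacing the top earner's value with a very large one.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands (illustrative values only).
incomes = [32, 38, 41, 45, 52]

# Same sample, but the top earner is replaced by an extreme value.
incomes_with_outlier = [32, 38, 41, 45, 1_000_000]

mean_before = mean(incomes)
mean_after = mean(incomes_with_outlier)
median_before = median(incomes)
median_after = median(incomes_with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean jumps by several orders of magnitude, while the median stays at the same middle value because only the ordering of the data matters to it.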

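Mode. The value that appears most often. Datasets can have no mode (everything unique), one mode (unimodal), or several (bimodal, multimodal). The mode is the only "centre" that makes sense for unordered (nominal) categorical data such as colours or country names; for ordered categories the median is also available.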
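The three cases can be sketched with statistics.multimode (Python 3.8 or later), which returns every value tied for the highest count. The sample lists are made up for illustration. Note one convention difference: when every value is unique, multimode returns all of them rather than reporting "no mode".

```python
from statistics import multimode

# Categorical data: the mode is well defined even though a mean is not.
colours = ["red", "blue", "red", "green", "blue", "red"]
unimodal = multimode(colours)

# Two values tie for the highest count, so the data is bimodal.
bimodal = multimode([1, 1, 2, 3, 3])

# Every value appears once; multimode returns them all,
# which a reader would normally interpret as "no mode".
all_unique = multimode([4, 5, 6])

print(unimodal, bimodal, all_unique)
```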

Measures of spread

Range. Max minus min. Quick but crude: it is decided entirely by the two extreme values.

Variance. The average squared distance from the mean: σ² = (1/n) Σ (xᵢ − μ)². Squaring removes the sign of each deviation so positives and negatives do not cancel.

Standard deviation. The square root of the variance. The advantage of the standard deviation over the variance is that it has the same units as the original data — if your data is in kilograms, the standard deviation is also in kilograms.
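As a worked example, the range, variance and standard deviation can be computed directly from the definitions above and checked against the standard library. The weights are invented, and the data is treated as a whole population (dividing by n).

```python
from math import sqrt
from statistics import pstdev, pvariance

# Hypothetical weights in kilograms (illustrative values only).
weights_kg = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(weights_kg)
mu = sum(weights_kg) / n                      # mean

data_range = max(weights_kg) - min(weights_kg)

# Population variance: average squared deviation from the mean.
variance = sum((x - mu) ** 2 for x in weights_kg) / n

# Standard deviation: back in the original units (kg, not kg squared).
std_dev = sqrt(variance)

# The library functions pvariance and pstdev use the same divide-by-n form.
library_variance = pvariance(weights_kg)
library_std_dev = pstdev(weights_kg)

print(data_range, variance, std_dev)
```

Here the variance is in square kilograms and the standard deviation is in kilograms, which is why the latter is easier to read alongside the data.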

Interquartile range (IQR). The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It captures the spread of the middle half of the data and ignores the extremes, which makes it the spread of choice for skewed distributions.
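A short sketch of the IQR using statistics.quantiles. One caveat: there are several conventions for locating percentiles in a finite dataset, and the text does not say which one it assumes, so the choice of method below is an assumption. The two methods offered by the standard library give different answers on the same small dataset.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# "inclusive" treats the data as containing the population's min and max.
q1_inc, q2_inc, q3_inc = quantiles(data, n=4, method="inclusive")
iqr_inclusive = q3_inc - q1_inc

# "exclusive" (the library default) assumes more extreme values may exist
# outside the sample, which pushes the quartiles outward.
q1_exc, q2_exc, q3_exc = quantiles(data, n=4, method="exclusive")
iqr_exclusive = q3_exc - q1_exc

print("inclusive:", q1_inc, q3_inc, iqr_inclusive)
print("exclusive:", q1_exc, q3_exc, iqr_exclusive)
```

On larger datasets the conventions usually agree closely, but when comparing IQR values from different tools it is worth checking which one each tool uses.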

Sample vs population: the n vs n−1 question

When the data is a random sample from a larger population, the sample variance divides the sum of squared deviations by n − 1 rather than n. This is Bessel’s correction. The reasoning is that the deviations are measured from the sample mean, which is itself estimated from the same data and therefore sits closer to the sample values than the true population mean does. As a result, dividing by n underestimates the population variance on average, while dividing by n − 1 gives an unbiased estimate of the variance. When your data is the entire population (every employee in the company, every measurement from a completed experiment), divide by n.
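The bias can be checked exactly on a toy population by enumerating every possible sample instead of simulating. This is an illustrative sketch only: the population (the faces of a die) and the sample size of 3 are arbitrary choices, and samples are drawn with replacement so that the draws are independent.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3, 4, 5, 6]

# This IS the whole population, so divide by n.
true_variance = pvariance(population)

# Every possible ordered sample of size 3, drawn with replacement.
samples = list(product(population, repeat=3))

# Average of each estimator over all possible samples.
avg_divide_by_n = mean(pvariance(s) for s in samples)          # divides by n
avg_divide_by_n_minus_1 = mean(variance(s) for s in samples)   # divides by n - 1

print("true variance:          ", true_variance)
print("average using n:        ", avg_divide_by_n)
print("average using n - 1:    ", avg_divide_by_n_minus_1)
```

Averaged over all samples, the divide-by-n estimator comes out at (n − 1)/n times the true variance, while the n − 1 version matches it up to floating-point rounding.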

Detecting outliers

Outliers, values that are unusually far from the rest, deserve special attention because they often signal data-entry errors, instrument failures, or genuinely interesting cases. A common rule of thumb, known as Tukey's fences, is that any value more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3 is an outlier worth investigating. The statistics calculator flags these automatically.
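The 1.5 × IQR rule can be sketched in a few lines. This is not the calculator's actual implementation, which is not shown here: the function name iqr_outliers, the sample readings, and the "inclusive" quartile convention are all assumptions made for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3.

    Hypothetical helper; the quartile convention ("inclusive") is an
    assumption and other tools may place Q1 and Q3 slightly differently.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# Hypothetical sensor readings with one suspicious value.
readings = [12, 14, 14, 15, 16, 17, 18, 19, 95]
flagged = iqr_outliers(readings)

print(flagged)
```

With these readings Q1 is 14 and Q3 is 18, so the fences sit at 8 and 24 and only the value 95 is flagged. A flagged value is a prompt to investigate, not proof of an error.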

Which summary numbers to report?

If your data is roughly symmetric and free of extreme outliers, the mean and the standard deviation are the right pair. If it is skewed, lean on the median and the IQR. If it is categorical or extremely peaked, the mode and the range may say everything you need. Always look at a histogram before deciding — the same five summary numbers can correspond to dramatically different distributions.