The science of collecting, analyzing, interpreting, and presenting data. Statistics and probability provide the essential toolkit for making informed decisions under uncertainty, uncovering patterns in data, and drawing reliable conclusions from evidence.
Statistics is the branch of mathematics devoted to the collection, organization, analysis, interpretation, and presentation of data. In an age overflowing with information, statistics empowers us to transform raw numbers into actionable knowledge — from predicting election outcomes and testing new medications to optimizing business strategies and training machine-learning models.
Statistics is broadly divided into two major areas: descriptive statistics, which summarizes and describes the data at hand, and inferential statistics, which uses samples to draw conclusions about larger populations.
Key terminology you will encounter throughout this page: a population is the entire group of interest; a sample is the subset of the population actually observed; a parameter is a numerical summary of a population; and a statistic is a numerical summary of a sample.
Statistics is foundational to nearly every empirical discipline — medicine, psychology, economics, physics, engineering, ecology, sports analytics, and data science all depend on statistical reasoning daily.
Descriptive statistics condense large datasets into a handful of meaningful numbers. We typically describe data using measures of central tendency (where is the center?) and measures of spread (how spread out are the values?).
The mean is the sum of all values divided by the number of values. It is the most common measure of center and is sensitive to every data point — including outliers.
Dataset: 4, 8, 6, 5, 3, 7, 9, 5
Step 1: Sum the values: 4 + 8 + 6 + 5 + 3 + 7 + 9 + 5 = 47
Step 2: Count the values: n = 8
Step 3: Divide: x̄ = 47 / 8 = 5.875
The median is the middle value when data is sorted in ascending order. If the dataset has an even number of values, the median is the average of the two middle values. The median is resistant to outliers, making it preferable for skewed distributions.
Dataset: 4, 8, 6, 5, 3, 7, 9, 5
Step 1: Sort the data: 3, 4, 5, 5, 6, 7, 8, 9
Step 2: Since n = 8 (even), the median is the average of the 4th and 5th values.
Step 3: Median = (5 + 6) / 2 = 5.5
The mode is the value that appears most frequently. A dataset can be unimodal (one mode), bimodal (two modes), multimodal (multiple modes), or have no mode (all values equally frequent).
Dataset: 4, 8, 6, 5, 3, 7, 9, 5
The value 5 appears twice; all other values appear once.
Mode = 5
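The three measures of center above can be reproduced with Python's standard-library statistics module:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9, 5]

mean = statistics.mean(data)      # 47 / 8 = 5.875
median = statistics.median(data)  # average of the 4th and 5th sorted values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # 5.875 5.5 5
```

For datasets with more than one mode, statistics.multimode returns all of them as a list.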
The range is the difference between the largest and smallest values. It is simple but highly sensitive to outliers.
Dataset: 3, 4, 5, 5, 6, 7, 8, 9
Range = 9 − 3 = 6
Variance measures the average squared deviation from the mean. Squaring ensures that deviations above and below the mean don't cancel out. A larger variance indicates data more spread out from the center.
Dataset: 4, 8, 6, 5, 3 (x̄ = 26/5 = 5.2)
Step 1: Compute deviations from the mean:
(4 − 5.2)² = 1.44, (8 − 5.2)² = 7.84, (6 − 5.2)² = 0.64, (5 − 5.2)² = 0.04, (3 − 5.2)² = 4.84
Step 2: Sum the squared deviations: 1.44 + 7.84 + 0.64 + 0.04 + 4.84 = 14.80
Step 3: Divide by (n − 1): s² = 14.80 / 4 = 3.70
The standard deviation is the square root of the variance. It is expressed in the same units as the original data, which makes it far more interpretable than variance.
s = √3.70 ≈ 1.924
Interpretation: On average, data values deviate about 1.924 units from the mean of 5.2.
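The same sample variance and standard deviation (with the n − 1 divisor) are available in the standard library:

```python
import statistics

data = [4, 8, 6, 5, 3]

s2 = statistics.variance(data)  # sample variance: divides by n - 1
s = statistics.stdev(data)      # sample standard deviation = sqrt(variance)

print(s2, round(s, 3))  # 3.7 1.924
```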
The IQR measures the spread of the middle 50% of data. It is resistant to outliers and is defined as:
IQR = Q₃ − Q₁
where Q₁ is the 25th percentile (first quartile) and Q₃ is the 75th percentile (third quartile).
Sorted dataset: 2, 4, 5, 7, 8, 10, 12, 15
Q₁: Median of the lower half (2, 4, 5, 7) = (4 + 5) / 2 = 4.5
Q₃: Median of the upper half (8, 10, 12, 15) = (10 + 12) / 2 = 11
IQR = 11 − 4.5 = 6.5
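A small helper can compute quartiles using the same median-of-halves convention as the worked example above:

```python
import statistics

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted
    data (excluding the overall median when n is odd)."""
    xs = sorted(data)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[len(xs) - half:]
    return statistics.median(lower), statistics.median(upper)

q1, q3 = quartiles([2, 4, 5, 7, 8, 10, 12, 15])
print(q1, q3, q3 - q1)  # 4.5 11.0 6.5
```

Note that several quartile conventions exist; library defaults (e.g. NumPy's linear interpolation) can give slightly different values on small samples.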
Visualizing data is essential for understanding its shape, identifying patterns, detecting outliers, and communicating findings effectively. Here we cover the most important types of statistical plots.
A histogram displays the distribution of a continuous variable by dividing the data range into non-overlapping bins (intervals) and plotting a bar whose height represents the frequency (or relative frequency) of observations in each bin.
A box plot provides a concise five-number summary of a dataset: the minimum, the first quartile (Q₁), the median, the third quartile (Q₃), and the maximum.
The "box" spans from Q₁ to Q₃ (the IQR), with a line at the median. "Whiskers" extend to the most extreme data points within 1.5 × IQR of the quartiles. Points beyond the whiskers are plotted individually as potential outliers.
Dataset (sorted): 1, 3, 5, 7, 8, 12, 14, 16, 18, 50
Q₁ = 5, Median = (8 + 12)/2 = 10, Q₃ = 16
IQR = 16 − 5 = 11
Lower fence = 5 − 1.5(11) = −11.5 → Minimum in data is 1 (within fence)
Upper fence = 16 + 1.5(11) = 32.5 → The value 50 exceeds 32.5, so 50 is an outlier
Whiskers extend from 1 to 18; the point 50 is plotted as an individual outlier dot.
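The 1.5 × IQR fence rule can be sketched directly; this uses the same median-of-halves quartile convention as the example:

```python
import statistics

def box_plot_outliers(data):
    """Return values beyond the 1.5 * IQR fences."""
    xs = sorted(data)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])          # median of lower half
    q3 = statistics.median(xs[len(xs) - half:])  # median of upper half
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(box_plot_outliers([1, 3, 5, 7, 8, 12, 14, 16, 18, 50]))  # [50]
```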
A scatter plot displays the relationship between two quantitative variables by plotting each observation as a point on a coordinate plane. Scatter plots reveal the direction of the relationship (positive or negative), its form (linear or curved), its strength, and the presence of outliers or clusters.
Probability is the mathematical framework for quantifying uncertainty. It assigns a number between 0 and 1 to events — where 0 means the event is impossible and 1 means the event is certain.
The sample space (S) is the set of all possible outcomes of a random experiment. An event (A) is any subset of the sample space.
Sample space: S = {1, 2, 3, 4, 5, 6}
Event A = "rolling an even number" = {2, 4, 6}
Event B = "rolling a number greater than 4" = {5, 6}
All of probability theory is built upon three axioms: (1) non-negativity, P(A) ≥ 0 for every event A; (2) normalization, P(S) = 1; and (3) additivity, P(A ∪ B) = P(A) + P(B) for mutually exclusive events A and B.
From these axioms, we can derive all the fundamental rules of probability.
The probability that event A does not occur equals 1 minus the probability that it does:
P(A′) = 1 − P(A)
If the probability of rain tomorrow is P(Rain) = 0.35, then:
P(No rain) = 1 − 0.35 = 0.65
For any two events A and B:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
We subtract P(A ∩ B) to avoid double-counting outcomes that belong to both events. If A and B are mutually exclusive (cannot occur simultaneously), then P(A ∩ B) = 0, and the formula simplifies to P(A ∪ B) = P(A) + P(B).
A standard deck of 52 cards. What is the probability of drawing a King or a Heart?
P(King) = 4/52, P(Heart) = 13/52, P(King ∩ Heart) = 1/52 (King of Hearts)
P(King ∪ Heart) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13 ≈ 0.3077
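Because the sample space is small, the card calculation can be verified by brute-force enumeration of the deck:

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]  # 52 cards

# Count cards that are a King OR a Heart (the King of Hearts counted once)
favorable = [c for c in deck if c[0] == "K" or c[1] == "hearts"]
p = Fraction(len(favorable), len(deck))
print(p, float(p))  # 4/13 ≈ 0.3077
```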
For any two events A and B:
P(A ∩ B) = P(A) · P(B | A)
If A and B are independent (the occurrence of one does not affect the other), this simplifies to:
P(A ∩ B) = P(A) · P(B)
A bag contains 5 red and 3 blue marbles. You draw two marbles without replacement. What is P(both red)?
P(1st red) = 5/8
P(2nd red | 1st red) = 4/7 (one red removed, 4 red remain out of 7 total)
P(both red) = (5/8)(4/7) = 20/56 = 5/14 ≈ 0.357
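The chained product above can be checked exactly with rational arithmetic:

```python
from fractions import Fraction

p_first_red = Fraction(5, 8)                 # 5 red out of 8 marbles
p_second_red_given_first = Fraction(4, 7)    # 4 red remain out of 7
p_both_red = p_first_red * p_second_red_given_first

print(p_both_red, float(p_both_red))  # 5/14 ≈ 0.357
```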
Conditional probability measures the likelihood of an event given that another event has already occurred. It is a cornerstone of probabilistic reasoning and is essential for medical testing, spam filtering, forensic analysis, and machine learning.
P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0. Read as: "the probability of A given B." We restrict the sample space to outcomes where B has occurred and ask how likely A is within that restricted space.
In a class of 40 students, 15 take French, 10 take Spanish, and 5 take both.
What is the probability a student takes French, given they take Spanish?
P(French | Spanish) = P(French ∩ Spanish) / P(Spanish) = (5/40) / (10/40) = 5/10 = 0.5
Half of the Spanish students also take French.
Events A and B are independent if and only if:
P(A ∩ B) = P(A) · P(B)
Equivalently, P(A | B) = P(A): knowing B occurred gives no information about A.
If B₁, B₂, …, Bₙ form a partition of the sample space (mutually exclusive, collectively exhaustive), then for any event A:
P(A) = P(A | B₁) · P(B₁) + P(A | B₂) · P(B₂) + … + P(A | Bₙ) · P(Bₙ)
This is invaluable when the probability of A is hard to compute directly but easy to compute within each partition piece.
Bayes' theorem allows us to reverse conditional probabilities — to update our belief about a cause after observing evidence:
P(A | B) = P(B | A) · P(A) / P(B)
In words: the posterior probability of A given B equals the likelihood P(B | A) times the prior P(A), divided by the marginal likelihood P(B). Using the law of total probability for P(B):
P(B) = P(B | A) · P(A) + P(B | A′) · P(A′)
A disease affects 1% of a population. A test has a 95% sensitivity (P(positive | disease) = 0.95) and a 90% specificity (P(negative | no disease) = 0.90). If a person tests positive, what is the probability they actually have the disease?
Step 1: Define events. D = has disease, + = tests positive.
P(D) = 0.01, P(D') = 0.99
P(+ | D) = 0.95, P(+ | D') = 1 − 0.90 = 0.10 (false positive rate)
Step 2: Compute P(+) using the law of total probability:
P(+) = P(+ | D) · P(D) + P(+ | D') · P(D') = (0.95)(0.01) + (0.10)(0.99) = 0.0095 + 0.099 = 0.1085
Step 3: Apply Bayes' Theorem:
P(D | +) = (0.95)(0.01) / 0.1085 = 0.0095 / 0.1085 ≈ 0.0876 (about 8.8%)
Interpretation: Even with a positive test, there is only an 8.8% chance the person truly has the disease. This counterintuitive result arises because the disease is rare — the large number of false positives from the healthy majority overwhelms the true positives.
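The three-step calculation generalizes to any binary diagnostic test; here it is as a small function:

```python
def bayes_posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos_given_d = sensitivity
    p_pos_given_not_d = 1 - specificity  # false positive rate
    # Law of total probability for P(+)
    p_pos = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)
    return p_pos_given_d * prior / p_pos

print(round(bayes_posterior(0.01, 0.95, 0.90), 4))  # 0.0876
```

Raising the prior (e.g. testing only symptomatic patients) or the specificity dramatically increases the posterior, which is why rare conditions are often screened in two stages.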
A random variable is a numerical quantity whose value is determined by the outcome of a random experiment. It is the bridge between probability theory and data: it assigns numbers to outcomes so we can use mathematical tools to analyze them.
For a discrete random variable X, the PMF gives the probability that X equals each possible value:
p(x) = P(X = x)
Requirements: p(x) ≥ 0 for all x, and Σ p(x) = 1.
X = number shown on a fair six-sided die.
p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
Σ p(x) = 6 × (1/6) = 1 ✓
For a continuous random variable X, the PDF f(x) describes the relative likelihood of X being near a specific value. The probability that X falls within an interval [a, b] is the area under the PDF curve over that interval:
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Requirements: f(x) ≥ 0 for all x, and ∫₋∞⁺∞ f(x) dx = 1.
The CDF applies to both discrete and continuous random variables and gives the probability that X is less than or equal to x:
F(x) = P(X ≤ x)
Properties of the CDF: it is non-decreasing; F(x) → 0 as x → −∞ and F(x) → 1 as x → +∞; and for a continuous X, the PDF is its derivative, f(x) = F′(x).
The expected value (mean) of a random variable is the long-run average value:
E(X) = Σ x · p(x) for discrete X, or E(X) = ∫ x · f(x) dx for continuous X
The variance measures how much X deviates from its expected value:
Var(X) = E[(X − μ)²] = E(X²) − [E(X)]²
X = number on a fair die. Each outcome has probability 1/6.
E(X) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21/6 = 3.5
E(X²) = (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91/6 ≈ 15.167
Var(X) = 91/6 − (21/6)² = 91/6 − 441/36 = 546/36 − 441/36 = 105/36 = 35/12 ≈ 2.917
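The die calculation can be carried out exactly with rational arithmetic, confirming the shortcut formula Var(X) = E(X²) − [E(X)]²:

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of a fair die
p = Fraction(1, 6)       # each face equally likely

ex = sum(x * p for x in outcomes)       # E(X)
ex2 = sum(x * x * p for x in outcomes)  # E(X²)
var = ex2 - ex**2                       # Var(X) = E(X²) - E(X)²

print(ex, var)  # 7/2 35/12
```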
Certain probability distributions appear so frequently in practice that they have been named and thoroughly studied. Here are the most important ones.
Models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.
Parameters: n (number of trials), p (probability of success on each trial).
P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ, for k = 0, 1, …, n
A fair coin is flipped 10 times. What is the probability of getting exactly 7 heads?
n = 10, p = 0.5, k = 7
P(X = 7) = C(10, 7) · (0.5)⁷ · (0.5)³ = 120 · (0.5)¹⁰ = 120 / 1024 ≈ 0.1172
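The binomial PMF is a one-liner using math.comb for the binomial coefficient:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binomial_pmf(7, 10, 0.5), 4))  # 0.1172
```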
Models the number of events occurring in a fixed interval of time or space, when events occur independently at a constant average rate.
Parameter: λ (the average rate of occurrence).
P(X = k) = (e^(−λ) · λᵏ) / k!, for k = 0, 1, 2, …
A call center receives an average of 4 calls per minute. What is the probability of receiving exactly 6 calls in a given minute?
λ = 4, k = 6
P(X = 6) = (e⁻⁴ · 4⁶) / 6! = (0.01832 · 4096) / 720 ≈ 75.05 / 720 ≈ 0.1042
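The same computation in code, applying the Poisson PMF directly:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

print(round(poisson_pmf(6, 4), 4))  # 0.1042
```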
The most important distribution in statistics. The famous "bell curve" arises naturally in countless phenomena and is the basis of the Central Limit Theorem.
Parameters: μ (mean), σ (standard deviation).
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Key properties of the normal distribution: it is symmetric and bell-shaped about μ; its mean, median, and mode coincide; and it is completely determined by μ and σ.
For data that follows a normal distribution, the empirical rule holds: about 68% of values fall within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.
A special case with μ = 0 and σ = 1. Any normal variable X can be standardized:
Z = (X − μ) / σ
The Z-score tells you how many standard deviations X is from the mean.
Test scores are normally distributed with μ = 75 and σ = 10. What proportion of students score above 90?
Step 1: Standardize: Z = (90 − 75) / 10 = 1.5
Step 2: Look up P(Z ≤ 1.5) in the standard normal table: ≈ 0.9332
Step 3: P(X > 90) = 1 − 0.9332 = 0.0668 (about 6.68%)
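Instead of a printed table, the standard library's statistics.NormalDist can evaluate normal probabilities directly:

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=10)
p_above_90 = 1 - scores.cdf(90)  # same as 1 - P(Z <= 1.5)

print(round(p_above_90, 4))  # 0.0668
```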
Models the time between events in a Poisson process. If events occur at a constant average rate λ, the waiting time between successive events follows an exponential distribution.
Parameter: λ (rate parameter).
PDF: f(x) = λ · e^(−λx) for x ≥ 0; CDF: P(X ≤ x) = 1 − e^(−λx)
Customers arrive at a store at an average rate of 3 per hour. What is the probability that the next customer arrives within 10 minutes (1/6 hour)?
λ = 3 per hour, x = 1/6 hour
P(X ≤ 1/6) = 1 − e^(−3 · 1/6) = 1 − e^(−0.5) ≈ 1 − 0.6065 ≈ 0.3935 (about 39.35%)
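Applying the exponential CDF in code:

```python
from math import exp

lam = 3       # average arrivals per hour
x = 1 / 6     # 10 minutes expressed in hours

p = 1 - exp(-lam * x)  # exponential CDF: P(X <= x)
print(round(p, 4))  # 0.3935
```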
Every outcome in a given range is equally likely.
Parameters: a (minimum), b (maximum).
f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise
A random number generator produces values uniformly between 0 and 10. What is P(3 ≤ X ≤ 7)?
P(3 ≤ X ≤ 7) = (7 − 3) / (10 − 0) = 4/10 = 0.4
In practice, we almost never know the true population parameters. Instead, we collect samples and use sample statistics to estimate population parameters. The reliability of these estimates depends critically on how we sample and how many observations we take.
If we repeatedly draw random samples of size n from a population with mean μ and standard deviation σ, the distribution of sample means (x̄) has special properties: its mean equals the population mean, E(x̄) = μ, and its standard deviation (called the standard error) is SE = σ/√n, which shrinks as the sample size grows.
The Central Limit Theorem is arguably the most important theorem in all of statistics. It states: for a sufficiently large sample size n (commonly n ≥ 30), the sampling distribution of the sample mean x̄ is approximately normal, x̄ ≈ N(μ, σ²/n).
This holds regardless of the shape of the original population distribution — even if the population is skewed, uniform, bimodal, or otherwise non-normal. This is what makes the CLT so powerful: it justifies using normal-distribution-based methods (like z-tests and confidence intervals) for sample means, even when the underlying data isn't normal.
A factory produces bolts whose lengths have μ = 5.00 cm and σ = 0.10 cm. The distribution of individual bolt lengths is unknown (not necessarily normal). A quality inspector measures a random sample of 36 bolts.
By the CLT: X̄ ~ N(5.00, (0.10)²/36) = N(5.00, 0.000278)
Standard error: SE = 0.10/√36 = 0.10/6 ≈ 0.01667
What is the probability the sample mean is between 4.97 and 5.03?
Z₁ = (4.97 − 5.00) / 0.01667 = −1.80
Z₂ = (5.03 − 5.00) / 0.01667 = 1.80
P(−1.80 ≤ Z ≤ 1.80) = P(Z ≤ 1.80) − P(Z ≤ −1.80) ≈ 0.9641 − 0.0359 = 0.9282
There is about a 92.8% chance the sample mean is within 0.03 cm of the true mean.
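A Monte Carlo simulation can check this answer. The individual bolt lengths are modeled here as uniform with mean 5.00 and standard deviation 0.10 — an arbitrary non-normal choice, since by the CLT the result should not depend on the population's shape:

```python
import random
import statistics

random.seed(42)
mu, sigma, n = 5.00, 0.10, 36
half_width = sigma * 3**0.5  # a uniform on [mu - w, mu + w] has sd = w / sqrt(3)

def sample_mean():
    """Mean of one random sample of n bolt lengths."""
    return statistics.fmean(
        random.uniform(mu - half_width, mu + half_width) for _ in range(n)
    )

trials = 50_000
hits = sum(4.97 <= sample_mean() <= 5.03 for _ in range(trials))
p_hat = hits / trials
print(round(p_hat, 3))  # close to the theoretical 0.928
```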
Hypothesis testing is a formal procedure for using data to decide between two competing claims about a population parameter. It is the backbone of scientific inference, clinical trials, A/B testing, and quality control.
When the population standard deviation σ is known and the sample is large (n ≥ 30), the test statistic for the population mean is:
Z = (x̄ − μ₀) / (σ/√n)
A company claims its light bulbs last μ₀ = 1000 hours on average. A consumer group tests 50 bulbs and finds x̄ = 985 hours. The known population standard deviation is σ = 40 hours. Test at α = 0.05 (two-sided).
H₀: μ = 1000 H₁: μ ≠ 1000
Test statistic: Z = (985 − 1000) / (40/√50) = −15 / 5.657 ≈ −2.65
p-value: P(|Z| ≥ 2.65) = 2 × P(Z ≤ −2.65) ≈ 2 × 0.0040 = 0.008
Decision: p-value = 0.008 < 0.05 = α → Reject H₀
Conclusion: There is statistically significant evidence that the true mean lifespan differs from 1000 hours.
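The test statistic and two-sided p-value can be computed without a Z-table, again using statistics.NormalDist:

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n = 985, 1000, 40, 50

z = (x_bar - mu0) / (sigma / sqrt(n))       # test statistic
p_value = 2 * NormalDist().cdf(-abs(z))     # two-sided p-value

print(round(z, 2), round(p_value, 3))  # -2.65 0.008
```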
When the population standard deviation is unknown (the usual case) and we use the sample standard deviation s, the test statistic follows a t-distribution with (n − 1) degrees of freedom:
t = (x̄ − μ₀) / (s/√n)
The t-distribution looks similar to the standard normal but has heavier tails, especially for small n. As n increases, the t-distribution approaches the standard normal.
A nutritionist claims that a new diet reduces cholesterol by μ₀ = 20 mg/dL. A study of 12 patients shows a mean reduction of x̄ = 24.5 mg/dL with s = 8.2 mg/dL. Test at α = 0.05 (one-sided: H₁: μ > 20).
H₀: μ = 20 H₁: μ > 20
Test statistic: t = (24.5 − 20) / (8.2/√12) = 4.5 / 2.367 ≈ 1.901
Degrees of freedom: df = 12 − 1 = 11
p-value: P(t₁₁ ≥ 1.901) ≈ 0.042
Decision: p-value = 0.042 < 0.05 = α → Reject H₀
Conclusion: There is statistically significant evidence that the mean cholesterol reduction exceeds 20 mg/dL.
Hypothesis testing can lead to two types of errors:
| | H₀ is true | H₀ is false |
|---|---|---|
| Reject H₀ | Type I error (α) — "false positive" | Correct decision (Power = 1 − β) |
| Fail to reject H₀ | Correct decision | Type II error (β) — "false negative" |
A confidence interval provides a range of plausible values for a population parameter. A 95% confidence interval means: if we repeated the sampling process many times, about 95% of our intervals would contain the true parameter.
x̄ ± z* · (σ/√n) when σ is known, or x̄ ± t* · (s/√n) when σ is unknown,
where z* and t* are the critical values for the desired confidence level.
A sample of n = 25 students has x̄ = 82 and s = 6. Construct a 95% confidence interval for the population mean score.
df = 24, t* ≈ 2.064 (from t-table for 95% CI with 24 df)
Margin of error: E = 2.064 · (6/√25) = 2.064 · 1.2 = 2.477
CI: 82 ± 2.477 = (79.52, 84.48)
We are 95% confident that the true population mean lies between 79.52 and 84.48.
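The interval arithmetic is straightforward once the critical value has been read from a t-table (2.064 here, for 95% confidence with 24 degrees of freedom):

```python
from math import sqrt

x_bar, s, n = 82, 6, 25
t_star = 2.064  # from a t-table: 95% CI, df = n - 1 = 24

margin = t_star * s / sqrt(n)  # margin of error
lo, hi = x_bar - margin, x_bar + margin

print(round(lo, 2), round(hi, 2))  # 79.52 84.48
```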
Regression and correlation are tools for exploring and quantifying relationships between variables. They are among the most widely used statistical techniques in science, business, and engineering.
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two quantitative variables, X and Y:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²)
Properties of r: it always lies between −1 and +1; its sign gives the direction of the relationship; values near ±1 indicate a strong linear relationship while values near 0 indicate a weak one; it is unitless and unaffected by changes of scale; and it measures only linear association.
Data: hours studied (X) vs. exam score (Y)
| X (hours) | Y (score) |
|---|---|
| 2 | 65 |
| 3 | 70 |
| 5 | 80 |
| 7 | 85 |
| 8 | 92 |
x̄ = 5, ȳ = 78.4
Σ(xᵢ − x̄)(yᵢ − ȳ) = (−3)(−13.4) + (−2)(−8.4) + (0)(1.6) + (2)(6.6) + (3)(13.6) = 40.2 + 16.8 + 0 + 13.2 + 40.8 = 111.0
Σ(xᵢ − x̄)² = 9 + 4 + 0 + 4 + 9 = 26
Σ(yᵢ − ȳ)² = 179.56 + 70.56 + 2.56 + 43.56 + 184.96 = 481.2
r = 111.0 / √(26 × 481.2) = 111.0 / √12,511.2 = 111.0 / 111.85 ≈ 0.992
This indicates a very strong positive linear relationship between hours studied and exam score.
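The hand calculation translates directly into Python using the deviation sums:

```python
from math import sqrt

xs = [2, 3, 5, 7, 8]       # hours studied
ys = [65, 70, 80, 85, 92]  # exam scores
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 111.0
sxx = sum((x - x_bar) ** 2 for x in xs)                       # 26
syy = sum((y - y_bar) ** 2 for y in ys)                       # 481.2

r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # 0.992
```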
Simple linear regression fits a straight line through the data to predict Y from X. The equation of the least-squares regression line (the line that minimizes the sum of squared residuals) is:
ŷ = b₀ + b₁x, where b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄
Using the hours-studied vs. exam-score data:
Slope: b₁ = 111.0 / 26 ≈ 4.269
Intercept: b₀ = 78.4 − 4.269(5) = 78.4 − 21.345 = 57.055
Regression equation: ŷ = 57.055 + 4.269x
Interpretation: For each additional hour of study, the predicted exam score increases by about 4.27 points.
Prediction: If a student studies for 6 hours: ŷ = 57.055 + 4.269(6) = 57.055 + 25.614 = 82.67
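The least-squares coefficients and the 6-hour prediction can be computed the same way (unrounded intermediate values give 57.054 rather than the hand calculation's 57.055, which rounded the slope first):

```python
xs = [2, 3, 5, 7, 8]       # hours studied
ys = [65, 70, 80, 85, 92]  # exam scores
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

def predict(x):
    return b0 + b1 * x

print(round(b1, 3), round(b0, 3), round(predict(6), 2))  # 4.269 57.054 82.67
```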
R² measures the proportion of variance in Y that is explained by the linear relationship with X. For simple linear regression, R² = r².
From our previous example, r ≈ 0.992.
R² = (0.992)² ≈ 0.984
Interpretation: About 98.4% of the variation in exam scores can be explained by the linear relationship with hours studied. Only 1.6% is due to other factors or random variation.
For the results of linear regression to be valid, several assumptions must be satisfied (often remembered by the acronym LINE): Linearity (the relationship between X and Y is linear), Independence (observations are independent of one another), Normality (residuals are approximately normally distributed), and Equal variance (residuals have constant spread across all values of X).
In practice, outcomes are rarely determined by a single predictor. Multiple linear regression extends simple linear regression to include multiple predictors:
ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ
Each coefficient bᵢ represents the effect of predictor xᵢ on Y, holding all other predictors constant. The interpretation and assumptions are analogous to simple regression, but with additional complexity around multicollinearity (predictors being correlated with each other).