The science of collecting, analyzing, interpreting, and presenting data. Statistics and probability provide the essential toolkit for making informed decisions under uncertainty, uncovering patterns in data, and drawing reliable conclusions from evidence.
Statistics is the branch of mathematics devoted to the collection, organization, analysis, interpretation, and presentation of data. In an age overflowing with information, statistics empowers us to transform raw numbers into actionable knowledge — from predicting election outcomes and testing new medications to optimizing business strategies and training machine-learning models.
Statistics is broadly divided into two major areas: descriptive statistics, which summarizes and describes the data at hand, and inferential statistics, which uses samples to draw conclusions about larger populations.
Key terminology you will encounter throughout this page: a population is the entire group of interest; a sample is the subset of the population actually observed; a parameter is a numerical summary of a population; and a statistic is a numerical summary of a sample.
Statistics is foundational to nearly every empirical discipline — medicine, psychology, economics, physics, engineering, ecology, sports analytics, and data science all depend on statistical reasoning daily.
Descriptive statistics condense large datasets into a handful of meaningful numbers. We typically describe data using measures of central tendency (where is the center?) and measures of spread (how spread out are the values?).
The mean is the sum of all values divided by the number of values. It is the most common measure of center and is sensitive to every data point — including outliers.
Dataset: 4, 8, 6, 5, 3, 7, 9, 5
Step 1: Sum the values: 4 + 8 + 6 + 5 + 3 + 7 + 9 + 5 = 47
Step 2: Count the values: n = 8
Step 3: Divide: x̄ = 47 / 8 = 5.875
The median is the middle value when data is sorted in ascending order. If the dataset has an even number of values, the median is the average of the two middle values. The median is resistant to outliers, making it preferable for skewed distributions.
Dataset: 4, 8, 6, 5, 3, 7, 9, 5
Step 1: Sort the data: 3, 4, 5, 5, 6, 7, 8, 9
Step 2: Since n = 8 (even), the median is the average of the 4th and 5th values.
Step 3: Median = (5 + 6) / 2 = 5.5
The mode is the value that appears most frequently. A dataset can be unimodal (one mode), bimodal (two modes), multimodal (multiple modes), or have no mode (all values equally frequent).
Dataset: 4, 8, 6, 5, 3, 7, 9, 5
The value 5 appears twice; all other values appear once.
Mode = 5
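The three measures of center above can be reproduced with Python's standard-library statistics module:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9, 5]

mean = statistics.mean(data)      # 47 / 8 = 5.875
median = statistics.median(data)  # average of the 4th and 5th sorted values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # 5.875 5.5 5
```

For datasets with more than one mode, statistics.multimode returns all of them as a list.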
The range is the difference between the largest and smallest values. It is simple but highly sensitive to outliers.
Dataset: 3, 4, 5, 5, 6, 7, 8, 9
Range = 9 − 3 = 6
Variance measures the average squared deviation from the mean. Squaring ensures that deviations above and below the mean don't cancel out. A larger variance indicates data more spread out from the center.
Dataset: 4, 8, 6, 5, 3 (x̄ = 26/5 = 5.2)
Step 1: Compute deviations from the mean:
(4 − 5.2)² = 1.44, (8 − 5.2)² = 7.84, (6 − 5.2)² = 0.64, (5 − 5.2)² = 0.04, (3 − 5.2)² = 4.84
Step 2: Sum the squared deviations: 1.44 + 7.84 + 0.64 + 0.04 + 4.84 = 14.80
Step 3: Divide by (n − 1): s² = 14.80 / 4 = 3.70
The standard deviation is the square root of the variance. It is expressed in the same units as the original data, which makes it far more interpretable than variance.
s = √3.70 ≈ 1.924
Interpretation: On average, data values deviate about 1.924 units from the mean of 5.2.
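The same sample variance and standard deviation (with the n − 1 divisor) are available in the standard library:

```python
import statistics

data = [4, 8, 6, 5, 3]

s2 = statistics.variance(data)  # sample variance: divides by n - 1
s = statistics.stdev(data)      # sample standard deviation = sqrt(variance)

print(s2, round(s, 3))  # 3.7 1.924
```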
The IQR measures the spread of the middle 50% of data. It is resistant to outliers and is defined as:
IQR = Q₃ − Q₁
where Q₁ is the 25th percentile (first quartile) and Q₃ is the 75th percentile (third quartile).
Sorted dataset: 2, 4, 5, 7, 8, 10, 12, 15
Q₁: Median of the lower half (2, 4, 5, 7) = (4 + 5) / 2 = 4.5
Q₃: Median of the upper half (8, 10, 12, 15) = (10 + 12) / 2 = 11
IQR = 11 − 4.5 = 6.5
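A small helper can compute quartiles using the same median-of-halves convention as the worked example above:

```python
import statistics

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted
    data (excluding the overall median when n is odd)."""
    xs = sorted(data)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[len(xs) - half:]
    return statistics.median(lower), statistics.median(upper)

q1, q3 = quartiles([2, 4, 5, 7, 8, 10, 12, 15])
print(q1, q3, q3 - q1)  # 4.5 11.0 6.5
```

Note that several quartile conventions exist; library defaults (e.g. NumPy's linear interpolation) can give slightly different values on small samples.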
Visualizing data is essential for understanding its shape, identifying patterns, detecting outliers, and communicating findings effectively. Here we cover the most important types of statistical plots.
A histogram displays the distribution of a continuous variable by dividing the data range into non-overlapping bins (intervals) and plotting a bar whose height represents the frequency (or relative frequency) of observations in each bin.
A box plot provides a concise five-number summary of a dataset: the minimum, the first quartile (Q₁), the median, the third quartile (Q₃), and the maximum.
The "box" spans from Q₁ to Q₃ (the IQR), with a line at the median. "Whiskers" extend to the most extreme data points within 1.5 × IQR of the quartiles. Points beyond the whiskers are plotted individually as potential outliers.
Dataset (sorted): 1, 3, 5, 7, 8, 12, 14, 16, 18, 50
Q₁ = 5, Median = (8 + 12)/2 = 10, Q₃ = 16
IQR = 16 − 5 = 11
Lower fence = 5 − 1.5(11) = −11.5 → Minimum in data is 1 (within fence)
Upper fence = 16 + 1.5(11) = 32.5 → The value 50 exceeds 32.5, so 50 is an outlier
Whiskers extend from 1 to 18; the point 50 is plotted as an individual outlier dot.
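The 1.5 × IQR fence rule can be sketched directly; this uses the same median-of-halves quartile convention as the example:

```python
import statistics

def box_plot_outliers(data):
    """Return values beyond the 1.5 * IQR fences."""
    xs = sorted(data)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])          # median of lower half
    q3 = statistics.median(xs[len(xs) - half:])  # median of upper half
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(box_plot_outliers([1, 3, 5, 7, 8, 12, 14, 16, 18, 50]))  # [50]
```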
A scatter plot displays the relationship between two quantitative variables by plotting each observation as a point on a coordinate plane. Scatter plots reveal the direction of the relationship (positive or negative), its form (linear or curved), its strength, and the presence of outliers or clusters.
Probability is the mathematical framework for quantifying uncertainty. It assigns a number between 0 and 1 to events — where 0 means the event is impossible and 1 means the event is certain.
The sample space (S) is the set of all possible outcomes of a random experiment. An event (A) is any subset of the sample space.
Sample space: S = {1, 2, 3, 4, 5, 6}
Event A = "rolling an even number" = {2, 4, 6}
Event B = "rolling a number greater than 4" = {5, 6}
All of probability theory is built upon three axioms: (1) non-negativity, P(A) ≥ 0 for every event A; (2) normalization, P(S) = 1; and (3) additivity, P(A ∪ B) = P(A) + P(B) for mutually exclusive events A and B.
From these axioms, we can derive all the fundamental rules of probability.
The probability that event A does not occur equals 1 minus the probability that it does:
P(A′) = 1 − P(A)
If the probability of rain tomorrow is P(Rain) = 0.35, then:
P(No rain) = 1 − 0.35 = 0.65
For any two events A and B:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
We subtract P(A ∩ B) to avoid double-counting outcomes that belong to both events. If A and B are mutually exclusive (cannot occur simultaneously), then P(A ∩ B) = 0, and the formula simplifies to P(A ∪ B) = P(A) + P(B).
A standard deck of 52 cards. What is the probability of drawing a King or a Heart?
P(King) = 4/52, P(Heart) = 13/52, P(King ∩ Heart) = 1/52 (King of Hearts)
P(King ∪ Heart) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13 ≈ 0.3077
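Because the sample space is small, the card calculation can be verified by brute-force enumeration of the deck:

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]  # 52 cards

# Count cards that are a King OR a Heart (the King of Hearts counted once)
favorable = [c for c in deck if c[0] == "K" or c[1] == "hearts"]
p = Fraction(len(favorable), len(deck))
print(p, float(p))  # 4/13 ≈ 0.3077
```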
For any two events A and B:
P(A ∩ B) = P(A) · P(B | A)
If A and B are independent (the occurrence of one does not affect the other), this simplifies to:
P(A ∩ B) = P(A) · P(B)
A bag contains 5 red and 3 blue marbles. You draw two marbles without replacement. What is P(both red)?
P(1st red) = 5/8
P(2nd red | 1st red) = 4/7 (one red removed, 4 red remain out of 7 total)
P(both red) = (5/8)(4/7) = 20/56 = 5/14 ≈ 0.357
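The chained product above can be checked exactly with rational arithmetic:

```python
from fractions import Fraction

p_first_red = Fraction(5, 8)                 # 5 red out of 8 marbles
p_second_red_given_first = Fraction(4, 7)    # 4 red remain out of 7
p_both_red = p_first_red * p_second_red_given_first

print(p_both_red, float(p_both_red))  # 5/14 ≈ 0.357
```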
Conditional probability measures the likelihood of an event given that another event has already occurred. It is a cornerstone of probabilistic reasoning and is essential for medical testing, spam filtering, forensic analysis, and machine learning.
P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0. Read as: "the probability of A given B." We restrict the sample space to outcomes where B has occurred and ask how likely A is within that restricted space.
In a class of 40 students, 15 take French, 10 take Spanish, and 5 take both.
What is the probability a student takes French, given they take Spanish?
P(French | Spanish) = P(French ∩ Spanish) / P(Spanish) = (5/40) / (10/40) = 5/10 = 0.5
Half of the Spanish students also take French.
Events A and B are independent if and only if:
P(A ∩ B) = P(A) · P(B)
Equivalently, P(A | B) = P(A): knowing B occurred gives no information about A.
If B₁, B₂, …, Bₙ form a partition of the sample space (mutually exclusive, collectively exhaustive), then for any event A:
P(A) = P(A | B₁) · P(B₁) + P(A | B₂) · P(B₂) + … + P(A | Bₙ) · P(Bₙ)
This is invaluable when the probability of A is hard to compute directly but easy to compute within each partition piece.
Bayes' theorem allows us to reverse conditional probabilities — to update our belief about a cause after observing evidence:
P(A | B) = P(B | A) · P(A) / P(B)
In words: the posterior probability of A given B equals the likelihood P(B | A) times the prior P(A), divided by the marginal likelihood P(B). Using the law of total probability for P(B):
P(B) = P(B | A) · P(A) + P(B | A′) · P(A′)
A disease affects 1% of a population. A test has a 95% sensitivity (P(positive | disease) = 0.95) and a 90% specificity (P(negative | no disease) = 0.90). If a person tests positive, what is the probability they actually have the disease?
Step 1: Define events. D = has disease, + = tests positive.
P(D) = 0.01, P(D') = 0.99
P(+ | D) = 0.95, P(+ | D') = 1 − 0.90 = 0.10 (false positive rate)
Step 2: Compute P(+) using the law of total probability:
P(+) = P(+ | D) · P(D) + P(+ | D') · P(D') = (0.95)(0.01) + (0.10)(0.99) = 0.0095 + 0.099 = 0.1085
Step 3: Apply Bayes' Theorem:
P(D | +) = (0.95)(0.01) / 0.1085 = 0.0095 / 0.1085 ≈ 0.0876 (about 8.8%)
Interpretation: Even with a positive test, there is only an 8.8% chance the person truly has the disease. This counterintuitive result arises because the disease is rare — the large number of false positives from the healthy majority overwhelms the true positives.
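The three-step calculation generalizes to any binary diagnostic test; here it is as a small function:

```python
def bayes_posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos_given_d = sensitivity
    p_pos_given_not_d = 1 - specificity  # false positive rate
    # Law of total probability for P(+)
    p_pos = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)
    return p_pos_given_d * prior / p_pos

print(round(bayes_posterior(0.01, 0.95, 0.90), 4))  # 0.0876
```

Raising the prior (e.g. testing only symptomatic patients) or the specificity dramatically increases the posterior, which is why rare conditions are often screened in two stages.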
A random variable is a numerical quantity whose value is determined by the outcome of a random experiment. It is the bridge between probability theory and data: it assigns numbers to outcomes so we can use mathematical tools to analyze them.
For a discrete random variable X, the PMF gives the probability that X equals each possible value:
p(x) = P(X = x)
Requirements: p(x) ≥ 0 for all x, and Σ p(x) = 1.
X = number shown on a fair six-sided die.
p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
Σ p(x) = 6 × (1/6) = 1 ✓
For a continuous random variable X, the PDF f(x) describes the relative likelihood of X being near a specific value. The probability that X falls within an interval [a, b] is the area under the PDF curve over that interval:
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Requirements: f(x) ≥ 0 for all x, and ∫₋∞⁺∞ f(x) dx = 1.
The CDF applies to both discrete and continuous random variables and gives the probability that X is less than or equal to x:
F(x) = P(X ≤ x)
Properties of the CDF: it is non-decreasing; F(x) → 0 as x → −∞ and F(x) → 1 as x → +∞; and for a continuous X, the PDF is its derivative, f(x) = F′(x).
The expected value (mean) of a random variable is the long-run average value:
E(X) = Σ x · p(x) for discrete X, or E(X) = ∫ x · f(x) dx for continuous X
The variance measures how much X deviates from its expected value:
Var(X) = E[(X − μ)²] = E(X²) − [E(X)]²
X = number on a fair die. Each outcome has probability 1/6.
E(X) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21/6 = 3.5
E(X²) = (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91/6 ≈ 15.167
Var(X) = 91/6 − (21/6)² = 91/6 − 441/36 = 546/36 − 441/36 = 105/36 = 35/12 ≈ 2.917
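The die calculation can be carried out exactly with rational arithmetic, confirming the shortcut formula Var(X) = E(X²) − [E(X)]²:

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of a fair die
p = Fraction(1, 6)       # each face equally likely

ex = sum(x * p for x in outcomes)       # E(X)
ex2 = sum(x * x * p for x in outcomes)  # E(X²)
var = ex2 - ex**2                       # Var(X) = E(X²) - E(X)²

print(ex, var)  # 7/2 35/12
```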
Certain probability distributions appear so frequently in practice that they have been named and thoroughly studied. Here are the most important ones.
Models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.
Parameters: n (number of trials), p (probability of success on each trial).
P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ, for k = 0, 1, …, n
A fair coin is flipped 10 times. What is the probability of getting exactly 7 heads?
n = 10, p = 0.5, k = 7
P(X = 7) = C(10, 7) · (0.5)⁷ · (0.5)³ = 120 · (0.5)¹⁰ = 120 / 1024 ≈ 0.1172
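The binomial PMF is a one-liner using math.comb for the binomial coefficient:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binomial_pmf(7, 10, 0.5), 4))  # 0.1172
```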
Models the number of events occurring in a fixed interval of time or space, when events occur independently at a constant average rate.
Parameter: λ (the average rate of occurrence).
P(X = k) = (e^(−λ) · λᵏ) / k!, for k = 0, 1, 2, …
A call center receives an average of 4 calls per minute. What is the probability of receiving exactly 6 calls in a given minute?
λ = 4, k = 6
P(X = 6) = (e⁻⁴ · 4⁶) / 6! = (0.01832 · 4096) / 720 ≈ 75.05 / 720 ≈ 0.1042
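The same computation in code, applying the Poisson PMF directly:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

print(round(poisson_pmf(6, 4), 4))  # 0.1042
```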
The most important distribution in statistics. The famous "bell curve" arises naturally in countless phenomena and is the basis of the Central Limit Theorem.
Parameters: μ (mean), σ (standard deviation).
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Key properties of the normal distribution: it is symmetric and bell-shaped about μ; its mean, median, and mode coincide; and it is completely determined by μ and σ.
For data that follows a normal distribution, the empirical rule holds: about 68% of values fall within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.
A special case with μ = 0 and σ = 1. Any normal variable X can be standardized:
Z = (X − μ) / σ
The Z-score tells you how many standard deviations X is from the mean.
Test scores are normally distributed with μ = 75 and σ = 10. What proportion of students score above 90?
Step 1: Standardize: Z = (90 − 75) / 10 = 1.5
Step 2: Look up P(Z ≤ 1.5) in the standard normal table: ≈ 0.9332
Step 3: P(X > 90) = 1 − 0.9332 = 0.0668 (about 6.68%)
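Instead of a printed table, the standard library's statistics.NormalDist can evaluate normal probabilities directly:

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=10)
p_above_90 = 1 - scores.cdf(90)  # same as 1 - P(Z <= 1.5)

print(round(p_above_90, 4))  # 0.0668
```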
Models the time between events in a Poisson process. If events occur at a constant average rate λ, the waiting time between successive events follows an exponential distribution.
Parameter: λ (rate parameter).
PDF: f(x) = λ · e^(−λx) for x ≥ 0; CDF: P(X ≤ x) = 1 − e^(−λx)
Customers arrive at a store at an average rate of 3 per hour. What is the probability that the next customer arrives within 10 minutes (1/6 hour)?
λ = 3 per hour, x = 1/6 hour
P(X ≤ 1/6) = 1 − e^(−3 · 1/6) = 1 − e^(−0.5) ≈ 1 − 0.6065 ≈ 0.3935 (about 39.35%)
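Applying the exponential CDF in code:

```python
from math import exp

lam = 3       # average arrivals per hour
x = 1 / 6     # 10 minutes expressed in hours

p = 1 - exp(-lam * x)  # exponential CDF: P(X <= x)
print(round(p, 4))  # 0.3935
```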
Every outcome in a given range is equally likely.
Parameters: a (minimum), b (maximum).
f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise
A random number generator produces values uniformly between 0 and 10. What is P(3 ≤ X ≤ 7)?
P(3 ≤ X ≤ 7) = (7 − 3) / (10 − 0) = 4/10 = 0.4
In practice, we almost never know the true population parameters. Instead, we collect samples and use sample statistics to estimate population parameters. The reliability of these estimates depends critically on how we sample and how many observations we take.
If we repeatedly draw random samples of size n from a population with mean μ and standard deviation σ, the distribution of sample means (x̄) has special properties: its mean equals the population mean, E(x̄) = μ, and its standard deviation (called the standard error) is SE = σ/√n, which shrinks as the sample size grows.
The Central Limit Theorem is arguably the most important theorem in all of statistics. It states: for a sufficiently large sample size n (commonly n ≥ 30), the sampling distribution of the sample mean x̄ is approximately normal, x̄ ≈ N(μ, σ²/n).
This holds regardless of the shape of the original population distribution — even if the population is skewed, uniform, bimodal, or otherwise non-normal. This is what makes the CLT so powerful: it justifies using normal-distribution-based methods (like z-tests and confidence intervals) for sample means, even when the underlying data isn't normal.
A factory produces bolts whose lengths have μ = 5.00 cm and σ = 0.10 cm. The distribution of individual bolt lengths is unknown (not necessarily normal). A quality inspector measures a random sample of 36 bolts.
By the CLT: X̄ ~ N(5.00, (0.10)²/36) = N(5.00, 0.000278)
Standard error: SE = 0.10/√36 = 0.10/6 ≈ 0.01667
What is the probability the sample mean is between 4.97 and 5.03?
Z₁ = (4.97 − 5.00) / 0.01667 = −1.80
Z₂ = (5.03 − 5.00) / 0.01667 = 1.80
P(−1.80 ≤ Z ≤ 1.80) = P(Z ≤ 1.80) − P(Z ≤ −1.80) ≈ 0.9641 − 0.0359 = 0.9282
There is about a 92.8% chance the sample mean is within 0.03 cm of the true mean.
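A Monte Carlo simulation can check this answer. The individual bolt lengths are modeled here as uniform with mean 5.00 and standard deviation 0.10 — an arbitrary non-normal choice, since by the CLT the result should not depend on the population's shape:

```python
import random
import statistics

random.seed(42)
mu, sigma, n = 5.00, 0.10, 36
half_width = sigma * 3**0.5  # a uniform on [mu - w, mu + w] has sd = w / sqrt(3)

def sample_mean():
    """Mean of one random sample of n bolt lengths."""
    return statistics.fmean(
        random.uniform(mu - half_width, mu + half_width) for _ in range(n)
    )

trials = 50_000
hits = sum(4.97 <= sample_mean() <= 5.03 for _ in range(trials))
p_hat = hits / trials
print(round(p_hat, 3))  # close to the theoretical 0.928
```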
Hypothesis testing is a formal procedure for using data to decide between two competing claims about a population parameter. It is the backbone of scientific inference, clinical trials, A/B testing, and quality control.
When the population standard deviation σ is known and the sample is large (n ≥ 30), the test statistic for the population mean is:
Z = (x̄ − μ₀) / (σ/√n)
A company claims its light bulbs last μ₀ = 1000 hours on average. A consumer group tests 50 bulbs and finds x̄ = 985 hours. The known population standard deviation is σ = 40 hours. Test at α = 0.05 (two-sided).
H₀: μ = 1000 H₁: μ ≠ 1000
Test statistic: Z = (985 − 1000) / (40/√50) = −15 / 5.657 ≈ −2.65
p-value: P(|Z| ≥ 2.65) = 2 × P(Z ≤ −2.65) ≈ 2 × 0.0040 = 0.008
Decision: p-value = 0.008 < 0.05 = α → Reject H₀
Conclusion: There is statistically significant evidence that the true mean lifespan differs from 1000 hours.
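The test statistic and two-sided p-value can be computed without a Z-table, again using statistics.NormalDist:

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n = 985, 1000, 40, 50

z = (x_bar - mu0) / (sigma / sqrt(n))       # test statistic
p_value = 2 * NormalDist().cdf(-abs(z))     # two-sided p-value

print(round(z, 2), round(p_value, 3))  # -2.65 0.008
```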
When the population standard deviation is unknown (the usual case) and we use the sample standard deviation s, the test statistic follows a t-distribution with (n − 1) degrees of freedom:
t = (x̄ − μ₀) / (s/√n)
The t-distribution looks similar to the standard normal but has heavier tails, especially for small n. As n increases, the t-distribution approaches the standard normal.
A nutritionist claims that a new diet reduces cholesterol by μ₀ = 20 mg/dL. A study of 12 patients shows a mean reduction of x̄ = 24.5 mg/dL with s = 8.2 mg/dL. Test at α = 0.05 (one-sided: H₁: μ > 20).
H₀: μ = 20 H₁: μ > 20
Test statistic: t = (24.5 − 20) / (8.2/√12) = 4.5 / 2.367 ≈ 1.901
Degrees of freedom: df = 12 − 1 = 11
p-value: P(t₁₁ ≥ 1.901) ≈ 0.042
Decision: p-value = 0.042 < 0.05 = α → Reject H₀
Conclusion: There is statistically significant evidence that the mean cholesterol reduction exceeds 20 mg/dL.
Hypothesis testing can lead to two types of errors:
| | H₀ is true | H₀ is false |
|---|---|---|
| Reject H₀ | Type I error (α) — "false positive" | Correct decision (Power = 1 − β) |
| Fail to reject H₀ | Correct decision | Type II error (β) — "false negative" |
A confidence interval provides a range of plausible values for a population parameter. A 95% confidence interval means: if we repeated the sampling process many times, about 95% of our intervals would contain the true parameter.
x̄ ± z* · (σ/√n) when σ is known, or x̄ ± t* · (s/√n) when σ is unknown,
where z* and t* are the critical values for the desired confidence level.
A sample of n = 25 students has x̄ = 82 and s = 6. Construct a 95% confidence interval for the population mean score.
df = 24, t* ≈ 2.064 (from t-table for 95% CI with 24 df)
Margin of error: E = 2.064 · (6/√25) = 2.064 · 1.2 = 2.477
CI: 82 ± 2.477 = (79.52, 84.48)
We are 95% confident that the true population mean lies between 79.52 and 84.48.
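The interval arithmetic is straightforward once the critical value has been read from a t-table (2.064 here, for 95% confidence with 24 degrees of freedom):

```python
from math import sqrt

x_bar, s, n = 82, 6, 25
t_star = 2.064  # from a t-table: 95% CI, df = n - 1 = 24

margin = t_star * s / sqrt(n)  # margin of error
lo, hi = x_bar - margin, x_bar + margin

print(round(lo, 2), round(hi, 2))  # 79.52 84.48
```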
Regression and correlation are tools for exploring and quantifying relationships between variables. They are among the most widely used statistical techniques in science, business, and engineering.
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two quantitative variables, X and Y:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²)
Properties of r: it always lies between −1 and +1; its sign gives the direction of the relationship; values near ±1 indicate a strong linear relationship while values near 0 indicate a weak one; it is unitless and unaffected by changes of scale; and it measures only linear association.
Data: hours studied (X) vs. exam score (Y)
| X (hours) | Y (score) |
|---|---|
| 2 | 65 |
| 3 | 70 |
| 5 | 80 |
| 7 | 85 |
| 8 | 92 |
x̄ = 5, ȳ = 78.4
Σ(xᵢ − x̄)(yᵢ − ȳ) = (−3)(−13.4) + (−2)(−8.4) + (0)(1.6) + (2)(6.6) + (3)(13.6) = 40.2 + 16.8 + 0 + 13.2 + 40.8 = 111.0
Σ(xᵢ − x̄)² = 9 + 4 + 0 + 4 + 9 = 26
Σ(yᵢ − ȳ)² = 179.56 + 70.56 + 2.56 + 43.56 + 184.96 = 481.2
r = 111.0 / √(26 × 481.2) = 111.0 / √12,511.2 = 111.0 / 111.85 ≈ 0.992
This indicates a very strong positive linear relationship between hours studied and exam score.
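The hand calculation translates directly into Python using the deviation sums:

```python
from math import sqrt

xs = [2, 3, 5, 7, 8]       # hours studied
ys = [65, 70, 80, 85, 92]  # exam scores
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 111.0
sxx = sum((x - x_bar) ** 2 for x in xs)                       # 26
syy = sum((y - y_bar) ** 2 for y in ys)                       # 481.2

r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # 0.992
```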
Simple linear regression fits a straight line through the data to predict Y from X. The equation of the least-squares regression line (the line that minimizes the sum of squared residuals) is:
ŷ = b₀ + b₁x, where b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄
Using the hours-studied vs. exam-score data:
Slope: b₁ = 111.0 / 26 ≈ 4.269
Intercept: b₀ = 78.4 − 4.269(5) = 78.4 − 21.345 = 57.055
Regression equation: ŷ = 57.055 + 4.269x
Interpretation: For each additional hour of study, the predicted exam score increases by about 4.27 points.
Prediction: If a student studies for 6 hours: ŷ = 57.055 + 4.269(6) = 57.055 + 25.614 = 82.67
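The least-squares coefficients and the 6-hour prediction can be computed the same way (unrounded intermediate values give 57.054 rather than the hand calculation's 57.055, which rounded the slope first):

```python
xs = [2, 3, 5, 7, 8]       # hours studied
ys = [65, 70, 80, 85, 92]  # exam scores
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

def predict(x):
    return b0 + b1 * x

print(round(b1, 3), round(b0, 3), round(predict(6), 2))  # 4.269 57.054 82.67
```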
R² measures the proportion of variance in Y that is explained by the linear relationship with X. For simple linear regression, R² = r².
From our previous example, r ≈ 0.992.
R² = (0.992)² ≈ 0.984
Interpretation: About 98.4% of the variation in exam scores can be explained by the linear relationship with hours studied. Only 1.6% is due to other factors or random variation.
For the results of linear regression to be valid, several assumptions must be satisfied (often remembered by the acronym LINE): Linearity (the relationship between X and Y is linear), Independence (observations are independent of one another), Normality (residuals are approximately normally distributed), and Equal variance (residuals have constant spread across all values of X).
In practice, outcomes are rarely determined by a single predictor. Multiple linear regression extends simple linear regression to include multiple predictors:
ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ
Each coefficient bᵢ represents the effect of predictor xᵢ on Y, holding all other predictors constant. The interpretation and assumptions are analogous to simple regression, but with additional complexity around multicollinearity (predictors being correlated with each other).