Why Statistics Matter for Basketball
Statistics provides the foundation for drawing reliable conclusions from basketball data. Without proper statistical reasoning, analysts risk mistaking random variation for meaningful patterns, ignoring uncertainty that renders conclusions unreliable, or misinterpreting relationships between variables. The concepts covered in this chapter will appear throughout your analytical work, whether you are evaluating players, testing strategic hypotheses, or building predictive models.
Basketball presents particular statistical challenges that reward careful thinking. Sample sizes are often smaller than analysts would like—a season contains only 82 games, and playing time further limits observations for individual players. Performance varies substantially from game to game for reasons both systematic and random. Many outcomes depend on interactions between players, making individual attribution difficult. Understanding these challenges helps you calibrate how much confidence your conclusions deserve.
At the same time, the wealth of basketball data creates opportunities for sophisticated analysis. Decades of historical records enable studies of career trajectories and era adjustments. Play-by-play data provides thousands of possessions per game for detailed examination. Tracking data captures hundreds of observations per minute of play. The statistical methods introduced here help you extract reliable insights from this abundance while avoiding the pitfalls of overconfident interpretation.
Probability and Distributions
Probability quantifies uncertainty, providing a language for expressing how likely various outcomes are. When we say a player has a 40% three-point shooting percentage, we are making a probabilistic statement about the expected frequency of makes in a sequence of attempts. Understanding probability enables you to reason about expected outcomes, assess the likelihood of observations, and quantify the uncertainty in your estimates.
The binomial distribution models the number of successes in a sequence of independent trials, each with the same probability of success. Free throw shooting fits this model—each attempt either succeeds or fails, with probability determined by the player's true shooting ability. The binomial distribution tells us how likely we are to observe various numbers of makes given a specific number of attempts and underlying success probability. This distribution underlies many basketball calculations.
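To make this concrete, here is a short sketch using SciPy's binomial functions. The 75% free-throw percentage and ten-attempt sequence are illustrative choices, not figures from the text:

```python
from scipy.stats import binom

# Probability of exactly 8 makes, and of 8 or more makes, in 10 free
# throws for a hypothetical 75% free-throw shooter
p_exactly_8 = binom.pmf(8, n=10, p=0.75)
p_at_least_8 = binom.sf(7, n=10, p=0.75)  # survival function: P(X > 7)
print(f"P(exactly 8 of 10): {p_exactly_8:.3f}")    # 0.282
print(f"P(at least 8 of 10): {p_at_least_8:.3f}")  # 0.526
```

The survival function `sf(7, ...)` gives P(X > 7), which for a discrete count equals P(X >= 8); using it avoids summing individual probabilities by hand.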
The normal distribution, the familiar bell curve, describes many quantities that arise from the combination of multiple random influences. The central limit theorem explains why: averages of many independent observations tend toward normally distributed values regardless of the distribution of individual observations. Player performance aggregated across games often follows approximately normal distributions, enabling the use of familiar statistical methods.
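A quick simulation illustrates the central limit theorem in this setting. The Poisson model for single-game point totals is purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate skewed single-game point totals (Poisson is an illustrative choice)
single_games = rng.poisson(lam=18, size=100_000)

# Average each simulated player's points over 20-game samples: the means
# cluster into a much tighter, near-normal shape
season_means = rng.poisson(lam=18, size=(100_000, 20)).mean(axis=1)

print(f"single-game sd: {single_games.std():.2f}")   # about sqrt(18) = 4.24
print(f"20-game mean sd: {season_means.std():.2f}")  # about 4.24 / sqrt(20)
```

The spread of the 20-game averages shrinks by roughly the square root of the sample size, and a histogram of `season_means` would look far more bell-shaped than one of `single_games`.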
Understanding distributions helps you recognize when observations are unusual versus expected. A player who shoots 50% from three over a ten-game stretch is not necessarily a transformed shooter—random variation could easily produce such results for a true 35% shooter. Quantifying the probability of such observations guides interpretation and prevents overreaction to small-sample results.
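How likely is such a stretch? A short calculation, assuming a hypothetical 40 attempts over the ten games. The probability for any single shooter is small, but across a few hundred rotation players league-wide, stretches like this appear every season:

```python
from scipy.stats import binom

# Chance a true 35% three-point shooter makes at least half of a
# hypothetical 40 attempts over a ten-game stretch
p_hot = binom.sf(19, n=40, p=0.35)  # P(X >= 20)
print(f"P(>=50% over 40 attempts for a true 35% shooter): {p_hot:.3f}")
```

A probability of a few percent per shooter, multiplied across a league of shooters, means hot stretches are expected somewhere every year purely by chance.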
Estimation and Confidence
Statistical estimation addresses the challenge of inferring true population values from limited samples. We observe a player making 84 of 200 three-point attempts and want to estimate their true three-point shooting ability. The sample proportion of 42% provides our best estimate, but we need to quantify how uncertain this estimate is. Confidence intervals provide this quantification.
A 95% confidence interval gives a range of values that would include the true parameter in 95% of hypothetical repetitions of the sampling process. For our shooter making 84 of 200 attempts, the 95% confidence interval for true shooting percentage might span from 35% to 49%. This wide range reflects the uncertainty inherent in 200 observations—substantial but hardly definitive evidence about true ability.
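The interval quoted above can be reproduced with the standard normal-approximation formula for a proportion (other interval constructions, such as Wilson's, give slightly different bounds):

```python
import numpy as np

makes, attempts = 84, 200
p_hat = makes / attempts
# Standard error of a sample proportion
se = np.sqrt(p_hat * (1 - p_hat) / attempts)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")  # [0.352, 0.488]
```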
Sample size critically determines confidence interval width. More observations yield more precise estimates. This relationship follows the square root rule—quadrupling your sample size halves the width of confidence intervals. A player with 800 three-point attempts provides estimates roughly twice as precise as one with 200 attempts, all else equal. Understanding this relationship helps you assess when you have enough data for reliable conclusions.
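The square root rule is easy to verify directly. Holding the sample proportion fixed at 42%, quadrupling attempts from 200 to 800 exactly halves the normal-approximation interval width:

```python
import numpy as np

def ci_width(p_hat, n, z=1.96):
    # Width of a normal-approximation 95% CI for a proportion
    return 2 * z * np.sqrt(p_hat * (1 - p_hat) / n)

w200 = ci_width(0.42, 200)
w800 = ci_width(0.42, 800)
print(f"n=200 width: {w200:.3f}, n=800 width: {w800:.3f}, "
      f"ratio: {w200 / w800:.2f}")  # ratio: 2.00
```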
Bayesian approaches offer an alternative framework for estimation that incorporates prior information. Rather than treating each player as completely unknown, Bayesian methods combine sample data with reasonable prior beliefs about typical performance. For rare events or small samples, this approach often produces more sensible estimates than treating the sample proportion as the best guess. Shrinkage estimators, which pull extreme sample values toward typical values, implement this intuition practically.
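A minimal sketch of beta-binomial shrinkage. The prior here, a Beta(70, 130) centered at a typical 35% three-point percentage and worth roughly 200 pseudo-attempts, is an assumption chosen for illustration:

```python
# Beta-binomial shrinkage: the posterior mean adds the prior's
# pseudo-makes and pseudo-attempts to the observed counts
prior_makes, prior_misses = 70, 130  # Beta(70, 130): mean 0.35

shrunk = {}
for makes, attempts in [(5, 10), (84, 200), (300, 800)]:
    raw = makes / attempts
    shrunk[(makes, attempts)] = (
        (makes + prior_makes) / (attempts + prior_makes + prior_misses)
    )
    print(f"{makes}/{attempts}: raw {raw:.3f} -> "
          f"shrunk {shrunk[(makes, attempts)]:.3f}")
```

Note how the tiny 5-of-10 sample is pulled hard toward the prior mean, while the 800-attempt sample barely moves: the data increasingly dominate the prior as evidence accumulates.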
Hypothesis Testing
Hypothesis testing provides a framework for evaluating claims based on data. The null hypothesis represents a default position—often that no effect or difference exists—while the alternative hypothesis represents what we are trying to establish. We collect data, calculate a test statistic, and assess how likely such a result would be if the null hypothesis were true. Unlikely results provide evidence against the null hypothesis.
The p-value quantifies the probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true. Small p-values suggest the null hypothesis is unlikely to have produced the observed data. By convention, p-values below 0.05 are often considered statistically significant, though this threshold is arbitrary and should not be interpreted as a bright line between truth and falsehood.
Statistical significance does not imply practical importance. With large enough samples, even tiny effects become statistically significant. A coaching intervention that increases three-point percentage by 0.1% might achieve statistical significance with enough observations while having negligible practical impact. Always consider effect size alongside statistical significance to assess whether findings matter for basketball decisions.
Multiple testing creates challenges when evaluating many hypotheses simultaneously. If you test twenty hypotheses at the 0.05 significance level, you expect about one false positive even when none of the hypotheses are true. Searching through data for significant results—sometimes called data dredging or p-hacking—inflates false discovery rates. Pre-registering hypotheses and adjusting for multiple comparisons helps maintain valid inference.
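A simulation makes the false-positive arithmetic tangible. Here every null hypothesis is true by construction (both groups are drawn from the same distribution), yet roughly one of every twenty tests comes out "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_tests = 500, 20

false_pos = 0
for _ in range(n_experiments):
    # 20 comparisons where the null is true: two identical populations
    pvals = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ]
    false_pos += sum(p < 0.05 for p in pvals)

avg_false_pos = false_pos / n_experiments
print(f"average false positives per 20 tests: {avg_false_pos:.2f}")  # about 1
```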
Regression Analysis
Regression analysis examines relationships between variables, enabling prediction and explanation. Simple linear regression models the relationship between one predictor and one outcome, fitting a line that best summarizes how the outcome changes as the predictor varies. Multiple regression extends this to many predictors simultaneously, estimating the independent contribution of each while controlling for others.
Interpreting regression coefficients requires care about causation versus correlation. A regression might find that teams who take more three-pointers tend to have better offensive efficiency. This association does not necessarily mean that taking more threes causes better efficiency—perhaps better offensive teams naturally generate more open three-point opportunities. Regression describes relationships in the data without establishing causal direction.
Model fit statistics indicate how well your regression captures variation in the outcome. The R-squared value gives the proportion of outcome variance explained by the predictors. While higher R-squared generally indicates better fit, maximizing R-squared can lead to overfitting—models that capture noise in the training data rather than true relationships. Cross-validation and out-of-sample testing help identify models that will generalize to new data.
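The overfitting point can be demonstrated on synthetic data where the true relationship is known to be linear. The variables here, hypothetical three-point attempt rates and offensive ratings, are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# True relationship is linear: offensive rating = 100 + 0.3 * 3PA + noise
x = rng.uniform(25, 45, size=60)
y = 100 + 0.3 * x + rng.normal(0, 3, size=60)
xc = x - x.mean()  # center the predictor for numerical stability

def in_sample_r2(degree):
    # Fit a polynomial of the given degree and report in-sample R-squared
    coefs = np.polyfit(xc, y, degree)
    resid = y - np.polyval(coefs, xc)
    return 1 - resid.var() / y.var()

r2 = {d: in_sample_r2(d) for d in (1, 3, 8)}
for d, val in r2.items():
    print(f"degree {d}: in-sample R^2 = {val:.3f}")
```

In-sample R-squared never decreases as the model grows, even though the extra polynomial terms are only fitting noise; out-of-sample evaluation would reveal the higher-degree fits as worse.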
Regression assumptions merit attention for reliable inference. Linear regression assumes a linear relationship, constant variance of errors, and normally distributed errors. Violations of these assumptions can bias coefficient estimates or invalidate standard errors. Diagnostic plots reveal potential violations, while robust methods or alternative models address them when necessary.
Practical Statistical Wisdom
Beyond technical methods, effective statistical practice requires judgment about when and how to apply these tools. Start with exploratory analysis before hypothesis testing—understanding your data thoroughly reduces the risk of errors and generates better hypotheses. Visualize distributions and relationships before computing summary statistics. Look for outliers and unusual patterns that might affect results.
Consider measurement quality alongside statistical analysis. Sophisticated methods cannot rescue fundamentally flawed data. Player performance statistics may be recorded inconsistently across arenas. Subjective judgments affect assist and block attribution. Tracking data may contain systematic biases from camera positioning. Understanding data limitations informs how much confidence to place in results.
Embrace uncertainty rather than seeking false precision. Reporting results as point estimates without uncertainty quantification overstates what the data supports. Confidence intervals, prediction intervals, and probabilistic forecasts communicate the limits of knowledge honestly. Audiences often appreciate this honesty more than the apparent confidence of unjustified precision.
Finally, remember that statistical analysis serves understanding, not just calculation. The goal is insight that informs decisions, not impressive numbers. A simple analysis that clearly illuminates a question often proves more valuable than a sophisticated analysis that obscures it. Keep your audience and purpose in mind throughout the analytical process.
Implementation in R
# Descriptive statistics for player data
library(tidyverse)

player_stats <- read_csv("player_stats.csv")

# Summary statistics
summary_stats <- player_stats %>%
  summarise(
    mean_pts = mean(pts, na.rm = TRUE),
    median_pts = median(pts, na.rm = TRUE),
    sd_pts = sd(pts, na.rm = TRUE),
    min_pts = min(pts, na.rm = TRUE),
    max_pts = max(pts, na.rm = TRUE),
    iqr_pts = IQR(pts, na.rm = TRUE)
  )
print(summary_stats)

# Hypothesis testing: Comparing two groups
# Are guards better shooters than forwards?
guards <- player_stats %>% filter(position %in% c("PG", "SG"))
forwards <- player_stats %>% filter(position %in% c("SF", "PF"))

# Two-sample t-test for 3P%
t_test_result <- t.test(guards$fg3_pct, forwards$fg3_pct)
print(t_test_result)

# Effect size (Cohen's d) with a pooled standard deviation weighted
# by sample size, matching the Python implementation
n1 <- sum(!is.na(guards$fg3_pct))
n2 <- sum(!is.na(forwards$fg3_pct))
pooled_sd <- sqrt(((n1 - 1) * var(guards$fg3_pct, na.rm = TRUE) +
                   (n2 - 1) * var(forwards$fg3_pct, na.rm = TRUE)) /
                  (n1 + n2 - 2))
cohens_d <- (mean(guards$fg3_pct, na.rm = TRUE) -
             mean(forwards$fg3_pct, na.rm = TRUE)) / pooled_sd
cat("Cohen's d:", round(cohens_d, 3), "\n")

# Correlation analysis: correlation matrix for key stats
cor_vars <- player_stats %>%
  select(pts, ast, reb, stl, blk, tov, min)
cor_matrix <- cor(cor_vars, use = "complete.obs")
print(round(cor_matrix, 3))

# Test significance of correlation
cor_test <- cor.test(player_stats$ast, player_stats$tov)
print(cor_test)

# Distribution analysis: check normality of scoring distribution
ggplot(player_stats, aes(x = pts)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#1d428a", alpha = 0.7) +
  geom_density(color = "#c8102e", linewidth = 1) +
  labs(title = "Distribution of Points Per Game",
       x = "Points", y = "Density") +
  theme_minimal()

# Shapiro-Wilk test for normality on a subsample of 50 observations
shapiro.test(sample(na.omit(player_stats$pts), 50))
Implementation in Python
# Descriptive statistics for player data
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

player_stats = pd.read_csv("player_stats.csv")

# Summary statistics
summary_stats = player_stats["pts"].agg([
    "mean", "median", "std", "min", "max"
])
summary_stats["iqr"] = (player_stats["pts"].quantile(0.75)
                        - player_stats["pts"].quantile(0.25))
print(summary_stats)

# Hypothesis testing: Comparing two groups
# Are guards better shooters than forwards?
guards = player_stats[player_stats["position"].isin(["PG", "SG"])]
forwards = player_stats[player_stats["position"].isin(["SF", "PF"])]

# Two-sample t-test for 3P%
t_stat, p_value = stats.ttest_ind(
    guards["fg3_pct"].dropna(),
    forwards["fg3_pct"].dropna()
)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Effect size (Cohen's d) using a pooled standard deviation
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(), group2.var()
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (group1.mean() - group2.mean()) / pooled_std

d = cohens_d(guards["fg3_pct"].dropna(), forwards["fg3_pct"].dropna())
print(f"Cohen's d: {d:.3f}")

# Correlation analysis: correlation matrix for key stats
cor_vars = player_stats[["pts", "ast", "reb", "stl", "blk", "tov", "min"]]
cor_matrix = cor_vars.corr()
print(cor_matrix.round(3))

# Test significance of correlation (drop rows missing either value
# so the two series stay aligned)
ast_tov = player_stats[["ast", "tov"]].dropna()
r, p = stats.pearsonr(ast_tov["ast"], ast_tov["tov"])
print(f"\nAST-TOV Correlation: r={r:.3f}, p={p:.4f}")

# Distribution analysis: histogram with fitted normal curve
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(player_stats["pts"], bins=30, density=True,
        alpha=0.7, color="#1d428a")
mu, std = player_stats["pts"].mean(), player_stats["pts"].std()
x = np.linspace(player_stats["pts"].min(), player_stats["pts"].max(), 100)
ax.plot(x, stats.norm.pdf(x, mu, std), color="#c8102e", linewidth=2)
ax.set_title("Distribution of Points Per Game")
ax.set_xlabel("Points")
ax.set_ylabel("Density")
plt.tight_layout()
plt.show()

# Shapiro-Wilk test for normality on a subsample of 50 observations
stat, p = stats.shapiro(player_stats["pts"].dropna().sample(50))
print(f"Shapiro-Wilk test: W={stat:.4f}, p={p:.4f}")