Chapter 56

Machine Learning in Basketball

Application of machine learning techniques to basketball analytics problems.

Machine Learning Opportunities

Machine learning offers tools for pattern recognition that may exceed human capability. Computer vision can analyze video. Neural networks can find non-linear relationships in tracking data. Random forests can identify complex interactions among features. These tools extend what basketball analytics can accomplish.

Classification Problems

Many basketball questions are classification problems: will this shot go in? Will this player become an All-Star? Which plays lead to scores? ML classification algorithms can learn from historical examples to make predictions on new cases.
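As a minimal sketch of this classification framing, the example below fits a logistic regression to predict All-Star selection. The file name season_stats.csv and the columns pts, ast, reb, and made_all_star are hypothetical placeholders, not a real dataset:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical season-level table with a binary made_all_star label
players = pd.read_csv("season_stats.csv")
X = players[["pts", "ast", "reb"]]
y = players["made_all_star"]

# Hold out 20% of players so accuracy is measured on unseen cases
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")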

Feature Engineering

The success of ML in basketball depends heavily on feature engineering: creating meaningful inputs from raw data. Domain knowledge about basketball guides which features matter. Simply throwing data at algorithms rarely produces good results without careful feature construction.
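As one illustration, raw shot-chart rows often carry only court coordinates, and the analyst must derive distance and angle features from them. The sketch below assumes LOC_X and LOC_Y columns measured in tenths of feet with the hoop at the origin (the stats.nba.com shot-chart convention); treat the column names as assumptions if your data source differs:

import numpy as np
import pandas as pd

def engineer_shot_features(shots: pd.DataFrame) -> pd.DataFrame:
    """Derive distance and angle features from raw shot coordinates."""
    out = shots.copy()
    # Euclidean distance from the hoop, converted from tenths of feet to feet
    out["shot_dist_ft"] = np.hypot(out["LOC_X"], out["LOC_Y"]) / 10.0
    # Angle off a straight-on look: 0 = dead center, 90 = along the baseline
    out["shot_angle_deg"] = np.degrees(
        np.arctan2(out["LOC_X"].abs(), out["LOC_Y"]))
    return out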

from sklearn.ensemble import RandomForestClassifier

def train_shot_model(shot_data, features):
    """Train an ML model to predict shot outcomes (made = 1, missed = 0)."""
    X = shot_data[features]
    y = shot_data['SHOT_MADE_FLAG']

    # Cap tree depth and fix the seed: shallow trees resist overfitting,
    # and random_state makes results reproducible
    model = RandomForestClassifier(n_estimators=100, max_depth=10,
                                   random_state=42)
    model.fit(X, y)

    # Map each feature name to its mean-decrease-in-impurity importance
    importance = dict(zip(features, model.feature_importances_))

    return model, importance
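Usage is then a couple of lines. The shot_log.csv file is a placeholder, and the feature names come from the hypothetical engineer_shot_features() helper sketched earlier:

import pandas as pd

shots = engineer_shot_features(pd.read_csv("shot_log.csv"))
model, importance = train_shot_model(
    shots, features=["shot_dist_ft", "shot_angle_deg"])
print(sorted(importance.items(), key=lambda kv: -kv[1]))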

Avoiding Overfitting

Basketball data is limited: a full NBA regular season provides only 1,230 games of team-level data (30 teams playing 82 games each, with every game shared by two teams). Overfitting is a constant danger. Proper validation with holdout sets or cross-validation is essential, and simple models often outperform complex ones precisely because they have less capacity to memorize noise.
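A minimal sketch of that validation habit, reusing the forest settings from train_shot_model above: cross_val_score refits the model on five separate train/test splits, so the reported accuracy never comes from rows the model saw during fitting:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def validated_accuracy(X, y, n_splits=5):
    """Estimate out-of-sample accuracy with k-fold cross-validation."""
    model = RandomForestClassifier(n_estimators=100, max_depth=10,
                                   random_state=42)
    scores = cross_val_score(model, X, y, cv=n_splits)
    # Report mean and spread across folds, not one lucky split
    return scores.mean(), scores.std()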

Interpretability Tradeoffs

Complex ML models may capture patterns invisible to simpler approaches but sacrifice interpretability. For many basketball applications, understanding why predictions are made is as important as accuracy. Teams may prefer interpretable models that enable strategic response.
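One hedged illustration of the tradeoff: a depth-limited decision tree will usually trail a large forest on accuracy, but its learned rules can be printed and debated in a film room. The feature names are whatever columns were passed in; export_text is sklearn's built-in rule printer:

from sklearn.tree import DecisionTreeClassifier, export_text

def fit_interpretable_model(X, y, feature_names):
    """Fit a shallow tree whose decision rules are human-readable."""
    tree = DecisionTreeClassifier(max_depth=3, random_state=42)
    tree.fit(X, y)
    # Print the tree as nested if/else rules over the features
    print(export_text(tree, feature_names=list(feature_names)))
    return tree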

Implementation in R

# Machine learning for player classification
library(tidyverse)

classify_players <- function(player_stats, k = 5) {
  # Feature columns: per-100-possession rates plus efficiency and usage
  feature_cols <- c("pts_100", "ast_100", "reb_100", "stl_100", "blk_100",
                    "fg3_rate", "ts_pct", "usg_pct", "ast_pct")

  # kmeans() cannot handle missing values, so drop incomplete rows first
  player_stats <- player_stats %>% drop_na(all_of(feature_cols))
  features <- player_stats %>% select(all_of(feature_cols))

  # K-means on standardized features; nstart = 25 restarts guard
  # against a poor random initialization
  set.seed(42)
  clusters <- kmeans(scale(features), centers = k, nstart = 25)

  player_stats$cluster <- as.factor(clusters$cluster)

  # Describe each cluster by its average statistical profile
  cluster_profiles <- player_stats %>%
    group_by(cluster) %>%
    summarise(across(c(pts_100, ast_100, reb_100, fg3_rate),
                     ~ mean(.x, na.rm = TRUE), .names = "avg_{.col}"))

  list(players = player_stats, profiles = cluster_profiles)
}

stats <- read_csv("player_stats.csv")
result <- classify_players(stats)
print(result$profiles)

# Random Forest for performance prediction
library(tidyverse)
library(randomForest)

predict_performance <- function(training_data, target = "next_season_ws") {
  # Keep predictors plus the target, dropping rows with missing values
  model_data <- training_data %>%
    select(age, pts, ast, reb, ws, bpm, ts_pct, usg_pct,
           all_of(target)) %>%
    na.omit()

  # Train a random forest regression (the target is continuous)
  set.seed(42)
  rf_model <- randomForest(
    as.formula(paste(target, "~ .")),
    data = model_data,
    ntree = 500,
    importance = TRUE
  )

  # Permutation importance: %IncMSE is the rise in error when a feature
  # is shuffled, so larger values mean more predictive features
  rf_importance <- importance(rf_model) %>%
    as.data.frame() %>%
    rownames_to_column("feature") %>%
    arrange(desc(`%IncMSE`))

  list(model = rf_model, importance = rf_importance)
}

training <- read_csv("historical_player_stats.csv")
pred_result <- predict_performance(training)
print(pred_result$importance)

Chapter Summary

You've completed Chapter 56: Machine Learning in Basketball.