Machine Learning Opportunities
Machine learning offers pattern-recognition tools that may exceed human capability. Computer vision can extract player and ball positions from game video. Neural networks can find non-linear relationships in tracking data. Random forests can identify complex interactions among features. These tools extend what basketball analytics can accomplish.
Classification Problems
Many basketball questions are classification problems: Will this shot go in? Will this player become an All-Star? Which plays lead to scores? ML classification algorithms can learn from historical examples to make predictions on new cases.
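As a minimal sketch (the shots data frame and its made, shot_distance, and defender_dist columns are hypothetical), even plain logistic regression frames the shot question this way:

# Logistic regression: the simplest classification baseline
shot_glm <- glm(made ~ shot_distance + defender_dist,
                data = shots, family = binomial)
# Estimated make probability for a new 24-foot shot with 6 feet of space
predict(shot_glm,
        newdata = data.frame(shot_distance = 24, defender_dist = 6),
        type = "response")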
Feature Engineering
The success of ML in basketball depends heavily on feature engineering: creating meaningful inputs from raw data. Domain knowledge about basketball guides which features matter. Simply throwing data at algorithms rarely produces good results without careful feature construction.
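As a sketch of what feature construction looks like (the loc_x, loc_y, and closest_defender_dist columns are hypothetical tracking fields), basketball knowledge turns raw logs into model-ready inputs:

library(tidyverse)

# Derived features encode basketball knowledge that raw coordinates hide
engineer_shot_features <- function(shots) {
  shots %>%
    mutate(
      shot_distance = sqrt(loc_x^2 + loc_y^2),    # straight-line distance to rim
      shot_angle    = atan2(loc_y, loc_x),        # angle relative to the rim
      contested     = closest_defender_dist < 4   # defender within 4 feet
    )
}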
With features in hand, training a classifier is straightforward, as in this scikit-learn example:

from sklearn.ensemble import RandomForestClassifier

def train_shot_model(shot_data, features):
    """Train an ML model to predict shot outcomes."""
    X = shot_data[features]
    y = shot_data['SHOT_MADE_FLAG']
    # Cap tree depth to limit overfitting; fix the seed for reproducibility
    model = RandomForestClassifier(n_estimators=100, max_depth=10,
                                   random_state=42)
    model.fit(X, y)
    # Rank features by their contribution to the model's splits
    importance = dict(zip(features, model.feature_importances_))
    return model, importance
Avoiding Overfitting
Basketball data is limited: even a full season provides only about 1,230 team-level games (30 teams playing 82 games, two teams per game). Overfitting is a constant danger. Proper validation (holdout sets, cross-validation) is essential. Simple models often outperform complex ones for exactly this reason.
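As a minimal sketch of that discipline (the shot_df data frame and its factor outcome made are hypothetical), caret makes cross-validation routine:

library(caret)
# 5-fold cross-validation: every fold is held out once, so reported
# accuracy reflects unseen data rather than memorized training examples
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)
cv_model <- train(made ~ ., data = shot_df, method = "rf",
                  trControl = ctrl, tuneLength = 3)
print(cv_model)  # accuracy averaged across held-out folds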
Interpretability Tradeoffs
Complex ML models may capture patterns invisible to simpler approaches but sacrifice interpretability. For many basketball applications, understanding why predictions are made is as important as accuracy. Teams may prefer interpretable models that enable strategic response.
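One hedged illustration: a single shallow decision tree (via rpart, on a hypothetical shots data frame with a factor outcome made) trades some accuracy for rules a coach can actually read:

library(rpart)
# A depth-3 tree yields a handful of human-readable split rules,
# unlike a 500-tree forest (all columns here are hypothetical)
tree <- rpart(made ~ shot_distance + defender_dist + shot_clock,
              data = shots, method = "class",
              control = rpart.control(maxdepth = 3))
print(tree)  # shows the split rules and class probabilities at each leaf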
Implementation in R
# Machine learning for player classification
library(tidyverse)

classify_players <- function(player_stats) {
  # Prepare features; k-means cannot handle missing values, so drop
  # incomplete rows before selecting the feature columns
  feature_cols <- c("pts_100", "ast_100", "reb_100", "stl_100", "blk_100",
                    "fg3_rate", "ts_pct", "usg_pct", "ast_pct")
  player_stats <- player_stats %>% drop_na(all_of(feature_cols))
  features <- player_stats %>% select(all_of(feature_cols))
  # K-means clustering on standardized features
  set.seed(42)
  clusters <- kmeans(scale(features), centers = 5, nstart = 25)
  player_stats$cluster <- as.factor(clusters$cluster)
  # Describe clusters by their average statistical profile
  cluster_profiles <- player_stats %>%
    group_by(cluster) %>%
    summarise(across(c(pts_100, ast_100, reb_100, fg3_rate),
                     \(x) mean(x, na.rm = TRUE), .names = "avg_{.col}"))
  list(players = player_stats, profiles = cluster_profiles)
}

stats <- read_csv("player_stats.csv")
result <- classify_players(stats)
print(result$profiles)
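Fixing centers = 5 above is a judgment call; a quick elbow check (a sketch reusing the same feature columns) shows how within-cluster variance falls as k grows:

# Within-cluster sum of squares for k = 2..10; look for the "elbow"
scaled_features <- stats %>%
  select(pts_100, ast_100, reb_100, stl_100, blk_100,
         fg3_rate, ts_pct, usg_pct, ast_pct) %>%
  drop_na() %>%
  scale()
set.seed(42)
wss <- sapply(2:10, function(k) {
  kmeans(scaled_features, centers = k, nstart = 25)$tot.withinss
})
plot(2:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Within-cluster SS")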
# Random Forest for performance prediction
library(tidyverse)
library(randomForest)

predict_performance <- function(training_data, target = "next_season_ws") {
  # Prepare data
  features <- training_data %>%
    select(age, pts, ast, reb, ws, bpm, ts_pct, usg_pct,
           all_of(target)) %>%
    na.omit()
  # Train random forest
  set.seed(42)
  rf_model <- randomForest(
    as.formula(paste(target, "~ .")),
    data = features,
    ntree = 500,
    importance = TRUE
  )
  # Rank features by % increase in MSE when each is permuted
  importance <- importance(rf_model) %>%
    as.data.frame() %>%
    rownames_to_column("feature") %>%
    arrange(desc(`%IncMSE`))
  list(model = rf_model, importance = importance)
}

training <- read_csv("historical_player_stats.csv")
result <- predict_performance(training)
print(result$importance)
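To honor the overfitting warning above, a simple holdout check (a sketch assuming the same historical_player_stats.csv columns) estimates out-of-sample error:

# Train on 80% of players, measure RMSE on the unseen 20%
set.seed(42)
train_idx <- sample(nrow(training), size = floor(0.8 * nrow(training)))
fit <- predict_performance(training[train_idx, ])
holdout <- training[-train_idx, ] %>%
  select(age, pts, ast, reb, ws, bpm, ts_pct, usg_pct, next_season_ws) %>%
  na.omit()
preds <- predict(fit$model, newdata = holdout)
rmse <- sqrt(mean((holdout$next_season_ws - preds)^2))
print(rmse)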