Chapter 16 Advanced ~55 min read

Real Plus-Minus (RPM)

Exploring ESPN's Real Plus-Minus, which combines ridge regression with box score priors.

The RPM Innovation

Real Plus-Minus (RPM) represented a significant leap forward in all-in-one metric construction when ESPN introduced it in 2014. Developed by Jeremias Engelmann and Steve Ilardi, RPM addressed fundamental limitations of both pure box score metrics and raw plus-minus approaches by combining them into a sophisticated hybrid. The metric uses regularized regression to estimate each player's impact on team scoring margin, incorporating box score data as informative priors that stabilize estimates for players with limited playing time.

The innovation of RPM lies in its treatment of the information problem that plagues plus-minus statistics. Raw plus-minus data requires large samples to produce reliable estimates. Box score statistics, conversely, stabilize quickly but miss aspects of play that don't generate statistics. RPM bridges these approaches, using box scores to inform initial estimates while allowing plus-minus data to adjust those estimates as more information accumulates.
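The idea of letting plus-minus data gradually override a box score prior can be sketched for a single player. This is a simplified illustration, not ESPN's procedure: RPM performs the adjustment inside a regression, and the function name `blend_with_prior` and the constant `k` are assumptions made for the example.

```python
def blend_with_prior(raw_pm_100, prior, possessions, k=2000.0):
    """Shrink a raw plus-minus rate toward a box score prior.

    k is a hypothetical tuning constant: the number of possessions at which
    the observed data and the prior receive equal weight.
    """
    data_weight = possessions / (possessions + k)
    return data_weight * raw_pm_100 + (1.0 - data_weight) * prior


# Same raw rate (+10.0) and prior (+2.0), evaluated at different sample sizes
for possessions in (0, 500, 2000, 10000):
    print(possessions, round(blend_with_prior(10.0, 2.0, possessions), 2))
```

With no possessions the estimate equals the prior, and it moves toward the raw rate as the sample grows.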

The Regularized Adjusted Plus-Minus Framework

RPM builds on the adjusted plus-minus (APM) framework, which attempts to isolate individual contributions from lineup-level point differential data. Standard APM regresses scoring margin on indicators for the players on the court, estimating each player's impact while controlling for teammates and opponents. However, this approach suffers from collinearity, because players who usually share the floor are difficult to separate statistically, and it requires substantial playing time for reliable estimates.
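A minimal sketch of the APM setup, assuming a toy league of four players in 2-on-2 stints. The stint data, the true impacts, and the helper name `build_design_matrix` are all made up for illustration. Each stint is one row, with +1 for home players on the court and -1 for away players, and the target is the home margin per 100 possessions.

```python
import numpy as np


def build_design_matrix(stints, n_players):
    """One row per stint: +1 for home players on court, -1 for away players."""
    X = np.zeros((len(stints), n_players))
    for row, (home_ids, away_ids) in enumerate(stints):
        X[row, list(home_ids)] = 1.0
        X[row, list(away_ids)] = -1.0
    return X


# Hypothetical stints: (home player ids, away player ids)
stints = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
X = build_design_matrix(stints, n_players=4)

# Noise-free margins generated from assumed true impacts
true_impact = np.array([3.0, 1.0, -1.0, -3.0])
y = X @ true_impact

# Ordinary least squares; lstsq returns the minimum-norm solution
apm, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("rank:", rank, "of", X.shape[1], "columns")
print("APM estimates:", np.round(apm, 2))
```

The matrix has fewer independent columns than players, a small-scale version of the collinearity problem: the data alone cannot pin down a unique set of ratings without an extra constraint or penalty.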

Ridge regression addresses these problems through regularization. Instead of seeking the single best-fitting set of coefficients, ridge regression adds a penalty term that shrinks estimates toward zero. This shrinkage prevents extreme values while producing more stable estimates.
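The shrinkage effect can be seen with the closed-form ridge solution, which adds the penalty weight to the diagonal of X'X before solving. The sketch below uses simulated data; the penalty values and noise level are arbitrary assumptions chosen only to show the direction of the effect.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Simulated stints: 200 rows, 10 players, noisy margins
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 0.0, 1.0], size=(200, 10))
true_beta = rng.normal(0.0, 2.0, size=10)
y = X @ true_beta + rng.normal(0.0, 12.0, size=200)

# Larger penalties pull the whole coefficient vector closer to zero
norms = []
for lam in (1.0, 50.0, 500.0):
    beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda={lam:>6}: coefficient norm = {norms[-1]:.2f}")
```

In practice the penalty is usually chosen by cross-validation rather than fixed by hand.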

RPM extends basic ridge regression by incorporating informative priors rather than shrinking all players toward zero. Players are shrunk toward their expected impact based on box score statistics, age, and other predictive features. A player with elite box score production begins with a positive prior; the plus-minus data then adjusts this based on observed on-court impact.
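One common way to shrink toward a prior instead of zero is to fit ridge regression on the part of the margin that the prior does not explain, then add the prior back. The sketch below follows that approach on simulated data. It is an assumption about how such a model can be built, not a reproduction of ESPN's implementation, and the prior values are invented.

```python
import numpy as np


def ridge_with_prior(X, y, prior, lam):
    """Ridge estimate shrunk toward `prior` rather than toward zero."""
    residual = y - X @ prior                      # margin the prior fails to explain
    penalty = lam * np.eye(X.shape[1])
    delta = np.linalg.solve(X.T @ X + penalty, X.T @ residual)
    return prior + delta                          # prior plus data-driven adjustment


rng = np.random.default_rng(1)
X = rng.choice([-1.0, 0.0, 1.0], size=(50, 4))
X[:, 3] = 0.0                                     # player 3 never appears in any stint
true_impact = np.array([4.0, 0.0, -2.0, 1.0])
y = X @ true_impact + rng.normal(0.0, 10.0, size=50)

prior = np.array([3.0, 1.0, -1.0, 2.0])           # hypothetical box score priors
estimate = ridge_with_prior(X, y, prior, lam=25.0)
print(np.round(estimate, 2))
```

A player with no on-court data keeps exactly the prior value, while players with data move away from their priors by an amount that depends on the penalty and the sample.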

Offensive and Defensive RPM

Like other comprehensive metrics, RPM separates into offensive and defensive components. ORPM measures a player's impact on team offensive efficiency, while DRPM quantifies the effect on opponent scoring. As with other metrics, the offensive component tends to be the more reliable of the two, because offense generates more recorded events (shots, assists, turnovers) than defense does.

Defensive RPM improves on purely box-score-based defensive estimates by incorporating actual on-court defensive outcomes. If a player consistently appears in lineups that defend well despite modest individual defensive statistics, DRPM will credit this contribution.
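A possession-level sketch of how one regression can yield separate offensive and defensive ratings. Each player gets two columns, one used when the player's team has the ball and one used when defending. The toy possessions, the sign convention (positive defensive values mean fewer points allowed), and the helper name `build_od_matrix` are assumptions for illustration.

```python
import numpy as np


def build_od_matrix(possessions, n_players):
    """Columns [0, n) are offensive slots; columns [n, 2n) are defensive slots."""
    X = np.zeros((len(possessions), 2 * n_players))
    y = np.zeros(len(possessions))
    for row, (offense_ids, defense_ids, points) in enumerate(possessions):
        X[row, list(offense_ids)] = 1.0
        # Defenders enter with -1 so a positive coefficient means points prevented
        X[row, [n_players + d for d in defense_ids]] = -1.0
        y[row] = points
    return X, y


# Hypothetical possessions: (offense ids, defense ids, points scored)
possessions = [
    ((0, 1), (2, 3), 2),
    ((2, 3), (0, 1), 0),
    ((0, 1), (2, 3), 3),
    ((2, 3), (0, 1), 2),
]
n_players = 4
X, y = build_od_matrix(possessions, n_players)

# Points per 100 possessions relative to the sample average
y_centered = (y - y.mean()) * 100.0

lam = 1.0
coef = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y_centered)
orpm_like, drpm_like = coef[:n_players], coef[n_players:]
total = orpm_like + drpm_like
print("offense:", np.round(orpm_like, 1))
print("defense:", np.round(drpm_like, 1))
```

In this toy sample, players 0 and 1 score more and allow less than players 2 and 3, so they receive the higher offensive and defensive values.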

Multi-Year Estimation

RPM incorporates multi-year data to improve estimation stability. The methodology weights recent seasons more heavily while incorporating information from previous years, creating a "three-year RPM" that balances recency with sample size.
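Recency weighting can be sketched as a weighted ridge regression in which stints from older seasons count for less. The season weights below are hypothetical, since the weights RPM actually uses are not given here, and the single-player data is invented to keep the arithmetic visible.

```python
import numpy as np


def weighted_ridge(X, y, weights, lam):
    """Ridge regression where each row (stint) carries its own weight."""
    Xw = X * weights[:, None]                     # scale each row by its weight
    penalty = lam * np.eye(X.shape[1])
    return np.linalg.solve(X.T @ Xw + penalty, Xw.T @ y)


# Hypothetical weights: the most recent season counts in full
season_weight = {2020: 0.25, 2021: 0.5, 2022: 1.0}

# One player whose on-court margin declines across three seasons
seasons = np.array([2020, 2020, 2021, 2021, 2022, 2022])
X = np.ones((6, 1))
y = np.array([10.0, 10.0, 5.0, 5.0, 0.0, 0.0])
weights = np.array([season_weight[int(s)] for s in seasons])

recency = weighted_ridge(X, y, weights, lam=1.0)[0]
equal = weighted_ridge(X, y, np.ones(6), lam=1.0)[0]
print("recency-weighted:", round(float(recency), 2))
print("equal-weighted:  ", round(float(equal), 2))
```

The recency-weighted estimate sits closer to the latest season's results, while the older seasons still contribute enough data to steady the rating.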

Interpreting RPM

RPM values express estimated impact on team point differential per 100 possessions. League average is approximately zero by construction. Historical RPM leaders typically align with consensus great players: LeBron James, Stephen Curry, and Kawhi Leonard consistently appear among the top performers.
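Because the scale is points per 100 possessions, a rating can be converted into expected net points over any span of play. The player and possession count below are hypothetical, and the function name `expected_margin` is introduced only for this example.

```python
def expected_margin(rpm, possessions):
    """Convert a per-100-possession rating into expected net points."""
    return rpm / 100.0 * possessions


# Hypothetical player: +4.0 RPM, on court for 70 possessions in a game
per_game = expected_margin(4.0, 70)
print(round(per_game, 1))
```

Under these assumptions the player is worth about 2.8 points of margin per game relative to an average player in the same minutes.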

Implementation in R

# Understanding Real Plus-Minus (RPM)
# Note: True RPM requires play-by-play data and regularized regression
library(tidyverse)

# Simulate RPM-style calculation using on/off data
estimate_rpm <- function(on_off_data, player_priors) {
  on_off_data %>%
    left_join(player_priors, by = "player_id") %>%
    mutate(
      # Raw plus-minus per 100 possessions
      raw_pm_100 = (on_court_pts - on_court_opp_pts) /
                   on_court_poss * 100,

      # Regression to prior (box score prediction)
      regressed_pm = 0.7 * raw_pm_100 + 0.3 * prior_bpm,

      # Separate offensive and defensive
      orpm = regressed_pm * off_share,
      drpm = regressed_pm * (1 - off_share),
      rpm = orpm + drpm
    )
}

on_off <- read_csv("player_on_off.csv")
priors <- read_csv("player_bpm_priors.csv")

rpm_estimates <- estimate_rpm(on_off, priors)

# Top RPM players
top_rpm <- rpm_estimates %>%
  filter(min >= 1500) %>%
  arrange(desc(rpm)) %>%
  select(player_name, rpm, orpm, drpm) %>%
  head(15)

print(top_rpm)

Implementation in Python

# Understanding Real Plus-Minus (RPM)
import pandas as pd

def estimate_rpm(on_off_data, player_priors):
    """Blend raw on-court plus-minus with a box score prior (simplified RPM-style estimate)."""
    # Left join keeps every player in the on/off table, matching the R version
    merged = on_off_data.merge(player_priors, on="player_id", how="left")

    # Raw plus-minus per 100 possessions
    merged["raw_pm_100"] = (
        (merged["on_court_pts"] - merged["on_court_opp_pts"]) /
        merged["on_court_poss"] * 100
    )

    # Regression to prior (box score prediction)
    merged["regressed_pm"] = 0.7 * merged["raw_pm_100"] + 0.3 * merged["prior_bpm"]

    # Separate offensive and defensive
    merged["orpm"] = merged["regressed_pm"] * merged["off_share"]
    merged["drpm"] = merged["regressed_pm"] * (1 - merged["off_share"])
    merged["rpm"] = merged["orpm"] + merged["drpm"]

    return merged

# Load data
on_off = pd.read_csv("player_on_off.csv")
priors = pd.read_csv("player_bpm_priors.csv")

rpm_estimates = estimate_rpm(on_off, priors)

# Top RPM players
top_rpm = rpm_estimates[rpm_estimates["min"] >= 1500].nlargest(15, "rpm")[
    ["player_name", "rpm", "orpm", "drpm"]
]
print(top_rpm)
