Chapter 2

NBA Data Sources and APIs

Data lies at the heart of any analytics project, and basketball offers an extraordinarily rich ecosystem of data sources. This chapter provides a comprehensive survey of publicly available NBA data, from traditional box scores to advanced tracking metrics. You will learn to access data programmatically through APIs and web scraping.

The NBA Data Ecosystem

The modern basketball analyst has access to more data than at any point in the sport's history. From the founding of the Basketball Association of America in 1946 through the present day, historical statistics document every game played at the professional level. This wealth of information enables analyses spanning decades, though understanding the context and limitations of data from different eras requires careful attention to how statistical recording has changed over time.

Today's data ecosystem encompasses multiple interconnected sources, each offering distinct strengths and serving different analytical purposes. The official NBA statistics portal provides comprehensive contemporary data, including the tracking metrics derived from camera systems in every arena. Third-party aggregators like Basketball-Reference offer convenient access to historical data with calculated advanced metrics. Specialized providers focus on particular aspects of the game, from shot charts to salary information. Understanding this ecosystem allows you to select appropriate sources for your analytical questions.

The distinction between official and third-party data matters more than you might initially expect. Official NBA data represents the authoritative source for contemporary statistics, but the league's historical archives contain gaps and inconsistencies. Third-party providers have often invested considerable effort in cleaning historical data and standardizing formats across eras. However, their calculated metrics may differ subtly from official versions due to methodological choices. Knowing the provenance of your data helps you interpret results appropriately.

NBA Official Statistics

The NBA official statistics portal at stats.nba.com provides the most comprehensive source of contemporary basketball data. The website offers a user-friendly interface for exploring statistics, but its real value lies in the underlying API that powers the site. This API returns structured data in JSON format, allowing programmatic access to vast amounts of information that would be tedious to collect manually.

The breadth of data available through the official API is remarkable. Traditional box score statistics cover every game back to the 1996-97 season in comprehensive detail. Player biographical information includes height, weight, draft position, and career history. League-wide leaderboards rank players across dozens of statistical categories. Detailed logs track individual game performances throughout each season.

Since the 2013-14 season, the official statistics have included tracking data derived from the camera systems installed in NBA arenas. These metrics quantify aspects of player performance invisible in traditional statistics. Speed and distance statistics measure how players move. Shooting data includes defender distance and shot quality indicators. Passing metrics track touches, assists, and ball movement. Rebounding statistics identify contested versus uncontested boards. This tracking data has enabled entirely new categories of analysis.
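To make the speed and distance metrics concrete, the sketch below computes distance traveled and average speed from a sequence of (x, y) court coordinates, assuming a SportVU-style feed sampled at 25 frames per second. The coordinate layout and frame rate here are illustrative assumptions, not the NBA's actual data schema.

```python
import numpy as np

def distance_and_speed(xy, fps=25):
    """Total distance (feet) and average speed (feet/sec) from a
    sequence of (x, y) court coordinates sampled at `fps` frames/sec."""
    xy = np.asarray(xy, dtype=float)
    steps = np.diff(xy, axis=0)                       # per-frame displacement
    dist = np.hypot(steps[:, 0], steps[:, 1]).sum()   # sum of Euclidean steps
    seconds = (len(xy) - 1) / fps                     # elapsed time
    return dist, dist / seconds

# A 30-foot sprint along the baseline over 2 seconds (50 frames at 25 fps)
path = [(x, 0.0) for x in np.linspace(0, 30, 51)]
total, avg = distance_and_speed(path, fps=25)
print(total, avg)  # 30 feet at 15 feet/sec
```

Real tracking feeds add player and ball identifiers per frame, but the geometry underneath the published speed and distance statistics is exactly this simple.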

Accessing the official API directly requires understanding its structure and handling authentication. Fortunately, community-developed packages abstract away these complications. In Python, the nba_api package provides clean functions for every API endpoint, handling the request formatting and response parsing automatically. In R, the hoopR package offers similar functionality. These packages make it straightforward to retrieve the data you need with minimal code.

Basketball-Reference

Basketball-Reference.com has established itself as the definitive historical database for professional basketball statistics. The site aggregates data from multiple sources, including official NBA records, newspaper archives, and historical research, presenting it in standardized formats that facilitate comparison across eras. For historical analysis, Basketball-Reference often provides the most convenient access to comprehensive data.

Beyond providing raw statistics, Basketball-Reference calculates and publishes a suite of advanced metrics. Their Box Plus-Minus (BPM) and Value Over Replacement Player (VORP) calculations have become industry standards, widely cited in basketball analysis. Win Shares quantifies individual contributions to team wins. These calculated metrics save analysts considerable effort while providing consistent methodology across players and seasons.

The site's data can be accessed through web scraping or through the basketball_reference_scraper package in Python. While scraping requires respecting rate limits and the site's terms of service, it provides access to data not easily available through APIs. The site's Play Index tool, since rebranded as Stathead, enables sophisticated queries across decades of data, though access requires a subscription.
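Respecting rate limits is easiest with a small helper that enforces a minimum delay between successive requests. The sketch below is a generic pattern, not part of any scraping package; the fetch function is passed in as a parameter, and the three-second spacing is a conservative default, so check the site's terms for its actual policy.

```python
import time

class RateLimiter:
    """Enforce a minimum interval (seconds) between successive calls."""
    def __init__(self, min_interval=3.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)   # pause until the interval has passed
        self._last = time.monotonic()

def polite_fetch(urls, fetch, limiter):
    """Fetch each URL in turn, pausing between requests."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results

# Hypothetical usage with the requests library:
# import requests
# pages = polite_fetch(urls, lambda u: requests.get(u).text, RateLimiter(3.0))
```

Centralizing the delay in one object means every scraping function in a project shares the same budget, rather than each sleeping on its own schedule.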

Specialized Data Sources

Several specialized providers focus on particular aspects of basketball analytics, often providing data or analysis not available elsewhere. Understanding these sources expands your analytical toolkit while helping you select appropriate data for specific questions.

Cleaning the Glass offers analytics focused on shot quality and luck adjustment. Its four factors calculations separate free throw luck from skill, providing cleaner measurements of team performance. Lineup data quantifies how different player combinations perform together. The site is subscription-based, but its data is particularly valuable for team and lineup analysis.
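Cleaning the Glass's exact adjustments are proprietary, but the underlying four factors come from Dean Oliver's standard definitions. A minimal sketch of two of them, effective field goal percentage and turnover rate (the 0.44 free throw coefficient is the conventional possession approximation):

```python
def effective_fg_pct(fgm, fg3m, fga):
    """eFG% = (FGM + 0.5 * 3PM) / FGA, crediting threes for the extra point."""
    return (fgm + 0.5 * fg3m) / fga

def turnover_pct(tov, fga, fta):
    """TOV% = TOV / (FGA + 0.44 * FTA + TOV), turnovers per play."""
    return tov / (fga + 0.44 * fta + tov)

# A team going 40-of-85 with 12 threes, 20 FTA, and 13 turnovers:
print(round(effective_fg_pct(40, 12, 85), 3))  # about 0.541
print(round(turnover_pct(13, 85, 20), 3))      # about 0.122
```

The remaining two factors, offensive rebounding percentage and free throw rate, are simple ratios in the same spirit.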

Synergy Sports provides detailed play-type classification based on video analysis. Every possession is categorized by the action that generated the shot—pick and roll, isolation, spot-up, transition, and so forth. This granular information reveals how players and teams score, enabling more sophisticated strategic analysis than aggregate statistics allow. Many NBA teams subscribe to Synergy for their detailed breakdowns.

Second Spectrum, as the official tracking data provider for the NBA, possesses the most comprehensive spatial data available. While much of their data remains proprietary, they publish some metrics through the NBA statistics portal and license additional data to teams and researchers. Their player tracking data forms the foundation for many modern analytics applications.

Draft data from sources like Tankathon and NBADraft.net provides historical draft information, combine measurements, and prospect evaluations. Contract data from Spotrac details salary information essential for understanding team construction within cap constraints. Injury reports from various sources enable analysis of player availability and workload management.

Accessing Data Programmatically

While websites provide convenient interfaces for exploration, serious analytical work requires programmatic data access. Writing code to retrieve data ensures reproducibility, enables automation, and scales to the volumes required for sophisticated analysis. Both R and Python offer excellent tools for data acquisition.

In Python, the nba_api package provides comprehensive access to official NBA statistics. The package organizes endpoints into logical modules—players, teams, games, and so forth—with consistent interfaces for making requests. For example, retrieving the current season's player statistics requires just a few lines of code, returning a pandas DataFrame ready for analysis.
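As a concrete sketch of those "few lines of code", following nba_api's documented usage (player ID 2544 is LeBron James in the NBA's ID scheme). The import is deferred into the function so the example can be read, and loaded, without the third-party package installed:

```python
def get_player_career(player_id="2544"):
    """Fetch a player's career statistics as a pandas DataFrame.

    Uses the third-party nba_api package (pip install nba_api); the
    import sits inside the function so this sketch loads without it.
    Player ID 2544 is LeBron James in the NBA's ID scheme.
    """
    from nba_api.stats.endpoints import playercareerstats
    career = playercareerstats.PlayerCareerStats(player_id=player_id)
    return career.get_data_frames()[0]  # first result set as a DataFrame

# One network call to stats.nba.com:
# df = get_player_career()
# df[["SEASON_ID", "PTS", "AST"]].head()
```

Every endpoint module in the package follows this same pattern: construct the endpoint object with its parameters, then pull the parsed result sets out as DataFrames.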

The hoopR package provides similar functionality in R, with additional access to ESPN, NBA, and college basketball data sources. The package returns tibbles compatible with tidyverse workflows, enabling immediate analysis with familiar tools. Play-by-play data, shot charts, and player tracking statistics are all accessible through straightforward function calls.

For data not available through packages, web scraping provides an alternative approach. The rvest package in R and BeautifulSoup in Python enable extracting data from web pages. While more complex than using dedicated packages, scraping extends your reach to any publicly available data. However, always respect rate limits and terms of service when scraping websites.

Data Quality and Limitations

No data source is perfect, and understanding limitations helps you avoid analytical errors. Historical statistics face particular challenges due to changes in recording practices over time. The three-point line did not exist until 1979. Blocks and steals were not officially recorded until 1973. Offensive rebounds were not separated from defensive rebounds in early records. These gaps limit the analyses possible with historical data.
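Rather than silently computing on missing values, these era boundaries can be encoded explicitly. The helper below is our own illustrative sketch, keyed to the recording changes noted above; seasons are named by their starting year, so 1979 means 1979-80, and the base column set is deliberately minimal.

```python
def recorded_stats(season_start_year):
    """Box score columns officially recorded in a given season
    (season named by its starting year, e.g. 1979 for 1979-80).
    The base set here is illustrative, not exhaustive."""
    cols = {"PTS", "FGM", "FGA", "FTM", "FTA"}  # long-recorded scoring columns
    if season_start_year >= 1973:
        cols |= {"STL", "BLK"}    # steals and blocks first recorded in 1973-74
    if season_start_year >= 1979:
        cols |= {"FG3M", "FG3A"}  # three-point line introduced in 1979-80
    return cols

print("STL" in recorded_stats(1970))   # False: steals not yet recorded
print("FG3A" in recorded_stats(1985))  # True
```

Checking a column against a function like this before computing a metric turns a silent run of NaNs into an explicit, documented decision about which eras an analysis covers.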

Even contemporary data contains quality issues. Official statistics rely on human scorekeepers making real-time judgments about assists, blocks, and other events. Studies have documented meaningful variation across arenas in how certain events are scored. Tracking data, while more objective, can suffer from camera occlusions and algorithmic errors in player identification.

Sample size limitations affect individual player statistics significantly. A player might shoot exceptionally well or poorly over a few games due to random variation rather than true skill differences. Understanding statistical reliability helps you avoid overinterpreting small samples. The exercises in later chapters will help you develop intuition for when sample sizes support reliable conclusions.
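One way to build that intuition is to treat attempts as independent binomial trials, so the standard error of an observed shooting percentage is sqrt(p(1 - p) / n). Independence is a simplification, but the arithmetic makes the point about small samples:

```python
import math

def shooting_se(p, n):
    """Binomial standard error of a shooting percentage over n attempts."""
    return math.sqrt(p * (1 - p) / n)

# A true 40% three-point shooter over 20 attempts vs. a fuller season (~400):
print(round(shooting_se(0.40, 20), 3))   # about 0.11: one SE spans 35% to 45%
print(round(shooting_se(0.40, 400), 3))  # about 0.024: far tighter
```

Over 20 attempts, a single standard error covers roughly the entire gap between a poor and an elite shooter, which is exactly why a hot or cold week tells you very little.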

Finally, remember that all statistics measure what happened, not necessarily player quality or contribution. A player may post strong statistics on a bad team, or modest statistics while contributing significantly to a good team. Context matters enormously in basketball analysis, and statistics alone rarely tell the complete story.

Implementation in R

# Accessing NBA Stats API with httr
library(httr)
library(jsonlite)

# Function to query NBA Stats API
get_nba_stats <- function(endpoint, params = list()) {
  base_url <- "https://stats.nba.com/stats/"

  headers <- c(
    "User-Agent" = "Mozilla/5.0",
    "Referer" = "https://www.nba.com/",
    "Accept" = "application/json"
  )

  response <- GET(
    paste0(base_url, endpoint),
    add_headers(.headers = headers),
    query = params
  )

  content(response, as = "parsed")
}

# Get league leaders
leaders <- get_nba_stats("leagueleaders", list(
  Season = "2023-24",
  SeasonType = "Regular Season",
  PerMode = "PerGame"
))

# Web scraping Basketball-Reference
library(rvest)
library(dplyr)  # for filter(), mutate(), across()

# Scrape player season stats
scrape_bbref_stats <- function(season = 2024) {
  url <- paste0(
    "https://www.basketball-reference.com/leagues/NBA_",
    season, "_per_game.html"
  )

  page <- read_html(url)

  table <- page %>%
    html_element("#per_game_stats") %>%
    html_table()

  # Clean data
  table <- table %>%
    filter(Player != "Player") %>%
    mutate(across(c(G, GS, MP, PTS, TRB, AST), as.numeric))

  return(table)
}

stats_2024 <- scrape_bbref_stats(2024)
head(stats_2024)

Implementation in Python

# Accessing NBA Stats API
import requests
import pandas as pd

def get_nba_stats(endpoint, params=None):
    """Query NBA Stats API"""
    base_url = "https://stats.nba.com/stats/"

    headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://www.nba.com/",
        "Accept": "application/json"
    }

    response = requests.get(
        f"{base_url}{endpoint}",
        headers=headers,
        params=params
    )

    data = response.json()

    # Convert to a DataFrame. The leagueleaders endpoint returns a single
    # "resultSet"; many other endpoints return a "resultSets" list instead.
    columns = data["resultSet"]["headers"]
    rows = data["resultSet"]["rowSet"]
    return pd.DataFrame(rows, columns=columns)

# Get league leaders
leaders = get_nba_stats("leagueleaders", {
    "Season": "2023-24",
    "SeasonType": "Regular Season",
    "PerMode": "PerGame"
})

# Web scraping Basketball-Reference
from bs4 import BeautifulSoup
import requests
import pandas as pd

def scrape_bbref_stats(season=2024):
    """Scrape player stats from Basketball Reference"""
    url = f"https://www.basketball-reference.com/leagues/NBA_{season}_per_game.html"

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    table = soup.find("table", {"id": "per_game_stats"})
    df = pd.read_html(str(table))[0]

    # Clean data
    df = df[df["Player"] != "Player"]
    numeric_cols = ["G", "GS", "MP", "PTS", "TRB", "AST"]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

    return df

stats_2024 = scrape_bbref_stats(2024)
print(stats_2024.head())


Chapter Summary

You've completed Chapter 2: NBA Data Sources and APIs.