Chapter 1

Setting Up Your Analytics Environment

Before diving into basketball analytics, you need a solid foundation in the tools of the trade. This chapter guides you through setting up a complete analytics environment, including R and RStudio or Python and Jupyter notebooks. You will learn to install essential packages for data manipulation, visualization, and statistical modeling.

Choosing Your Tools

Every craftsperson needs quality tools, and data analysis is no exception. Before you can explore the rich world of basketball analytics, you must first establish a properly configured computing environment. This chapter walks you through setting up a professional-grade analytics workspace that will serve you throughout this textbook and beyond.

The choice between R and Python generates passionate debate in the data science community, but the reality is that both languages are excellent choices for basketball analytics. Each has distinct strengths, and many professional analysts maintain proficiency in both. This textbook provides complete examples in both languages, allowing you to work in whichever environment best suits your background and goals, or to learn both as you progress through the chapters.

R emerged from the statistics community and reflects that heritage in its design. The language offers particularly elegant syntax for data manipulation and visualization, with the tidyverse collection of packages providing a cohesive and expressive toolkit for data analysis. R excels at exploratory data analysis, statistical modeling, and producing publication-quality graphics. Many academic researchers prefer R, and the sports analytics community has produced excellent R packages specifically for basketball data.

Python provides a more general-purpose programming environment with broad applications beyond data analysis. Its syntax emphasizes readability and simplicity, making it accessible to programmers coming from other languages. Python dominates machine learning applications thanks to libraries like scikit-learn, TensorFlow, and PyTorch. Many NBA analytics departments use Python for production systems that need to integrate with other software infrastructure.

Installing R and RStudio

For R users, the recommended setup begins with installing R itself from the Comprehensive R Archive Network, known as CRAN. Navigate to cran.r-project.org and download the installer appropriate for your operating system. The installation is straightforward: run the installer and accept the default options. You now have a functional R installation, though you will rarely use the console application bundled with R directly; most of your work will happen inside RStudio.

RStudio provides the integrated development environment that makes working with R pleasant and productive. Download RStudio Desktop from posit.co (formerly RStudio, PBC) and install it using the standard process for your operating system. When you launch RStudio, it will automatically detect your R installation and configure itself appropriately.

The RStudio interface divides the screen into four panes, each serving a distinct purpose. The source pane in the upper left displays your code files, allowing you to write and edit scripts. The console in the lower left provides direct interaction with R, showing output from your commands and allowing you to execute code interactively. The environment pane in the upper right shows the objects currently in memory, while the files/plots/packages/help pane in the lower right provides access to these various utilities.

Understanding this layout accelerates your productivity. You will typically write code in the source pane, execute it to see results in the console, monitor your data objects in the environment pane, and view visualizations in the plots tab. RStudio provides keyboard shortcuts for common operations, and learning them will significantly speed up your workflow. The most essential shortcut is Ctrl+Enter (Cmd+Enter on Mac), which executes the current line or selection.

Essential R Packages

Base R provides fundamental capabilities, but the real power comes from packages that extend its functionality. The tidyverse collection represents the most important set of packages for modern R data analysis. Install it by running the command install.packages("tidyverse") in your R console. This single command installs a coordinated set of packages designed to work together seamlessly.

The tidyverse includes several packages you will use constantly. The dplyr package provides functions for data manipulation—filtering rows, selecting columns, creating new variables, and summarizing data. The ggplot2 package implements the grammar of graphics for creating sophisticated visualizations. The readr package offers fast and friendly functions for reading data files. The tidyr package helps reshape data between wide and long formats. Together, these packages provide a complete toolkit for data analysis.

Beyond the tidyverse, you will need packages specifically designed for accessing basketball data. The hoopR package provides programmatic access to NBA statistics, play-by-play data, and player information. Install it with install.packages("hoopR"). This package handles the complexity of API requests and data formatting, allowing you to focus on analysis rather than data acquisition.

Additional packages extend R's capabilities in useful directions. The lubridate package simplifies working with dates and times, common in sports data. The scales package provides formatting functions for visualizations. The patchwork package enables combining multiple plots into composite figures. As you develop your analytical practice, you will discover additional packages suited to your specific needs.

Installing Python and Jupyter

Python users should install the Anaconda distribution, which bundles Python with the most commonly used data science libraries. Download Anaconda from anaconda.com and run the installer for your operating system. The installation process takes several minutes as it sets up Python and dozens of pre-configured packages. Accept the default options unless you have specific reasons to customize the installation.

Anaconda includes Jupyter Notebook, an interactive computing environment particularly well-suited to data analysis and exploration. Launch Jupyter Notebook from the Anaconda Navigator or by typing "jupyter notebook" in your terminal. This opens a browser-based interface where you can create and edit notebooks combining code, visualizations, and narrative text.

A Jupyter notebook consists of cells that can contain either code or markdown text. You execute code cells by pressing Shift+Enter, which runs the code and displays any output directly below the cell. This immediate feedback makes notebooks ideal for exploratory analysis, allowing you to develop your understanding incrementally. Markdown cells let you document your analysis with formatted text, creating reproducible research documents.

JupyterLab provides a more feature-rich alternative to the classic notebook interface. It offers a flexible layout with multiple panes, better file management, and additional capabilities. Many analysts prefer JupyterLab for its more IDE-like experience while retaining the benefits of notebook-based computing. Launch it with "jupyter lab" from your terminal.

Essential Python Libraries

The Anaconda distribution includes the core libraries for data analysis. NumPy provides the foundation for numerical computing in Python, implementing efficient array operations and mathematical functions. Pandas builds on NumPy to offer DataFrame structures and tools for data manipulation—this package is central to virtually all Python data analysis work. Matplotlib provides comprehensive visualization capabilities, while Seaborn extends it with statistical graphics and attractive default styles.
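A quick way to confirm that these core libraries are present is to ask Python for their installed versions. The sketch below uses only the standard library; the helper name installed_versions and the list CORE_LIBRARIES are illustrative choices, not part of any package.

```python
from importlib import metadata

# Distribution names as published on PyPI; adjust the list to your needs.
CORE_LIBRARIES = ["numpy", "pandas", "matplotlib", "seaborn"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(CORE_LIBRARIES).items():
        print(f"{name}: {version or 'not installed'}")
```

Any library reported as not installed can be added with conda or pip before you continue.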

For basketball data specifically, install the nba_api package using pip: run "pip install nba_api" from your terminal or command prompt. This community-maintained package is a client for the statistics endpoints behind NBA.com, providing clean Python functions for accessing a wealth of basketball data. It builds the requests and converts the responses into pandas DataFrames, making it straightforward to retrieve the data you need. The endpoints require no login, but they may throttle or drop rapid-fire requests, so pause between calls when you download data in bulk.
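When you pull several seasons or endpoints in one session, spacing out the requests is a cheap precaution against being throttled. The following is a minimal sketch of that idea, not part of nba_api: fetch_with_pauses is a hypothetical helper, and the lambdas are placeholders standing in for real API calls.

```python
import time


def fetch_with_pauses(fetchers, pause_seconds=1.0, sleep=time.sleep):
    """Call each zero-argument function in order, pausing between calls.

    `fetchers` is a list of callables standing in for API requests;
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for index, fetch in enumerate(fetchers):
        if index > 0:
            sleep(pause_seconds)  # wait before every call except the first
        results.append(fetch())
    return results


# Placeholder callables; in practice each would wrap one data request.
seasons = ["2021-22", "2022-23", "2023-24"]
fetchers = [lambda s=season: f"stats for {s}" for season in seasons]
print(fetch_with_pauses(fetchers, pause_seconds=0.1))
```

The right pause length is a judgment call; a second or so between requests is a conservative starting point, assuming you are not in a hurry.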

Additional packages extend Python's analytics capabilities. Scikit-learn provides machine learning algorithms for classification, regression, and clustering. Plotly enables interactive visualizations that can be explored dynamically. Statsmodels implements statistical models and tests for more rigorous analysis. Install these as needed with pip to expand your toolkit.

Managing package dependencies can become complex as your environment grows. Anaconda includes conda, a package manager that handles dependencies automatically and allows you to create isolated environments for different projects. While you may not need this capability initially, understanding environment management becomes important as you work on more sophisticated projects.
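Once you start using isolated environments (created with "conda create" and switched with "conda activate"), it helps to confirm which one a script or notebook is actually running in. The sketch below reads the CONDA_DEFAULT_ENV variable that conda sets on activation; describe_environment is an illustrative helper name, and the result will show None for the environment when no conda environment is active.

```python
import os
import platform
import sys


def describe_environment(environ=None):
    """Report the interpreter in use and the active conda environment, if any."""
    environ = os.environ if environ is None else environ
    return {
        "python": sys.executable,                       # path to the running interpreter
        "version": platform.python_version(),           # e.g. "3.11.4"
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),  # None outside conda
    }


print(describe_environment())
```

Running this in the first cell of a notebook is a simple guard against installing a package into one environment and then working in another.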

Configuring Your Workspace

Beyond software installation, a well-organized workspace improves your productivity and reduces errors. Create a dedicated directory for your basketball analytics projects, with subdirectories for data, scripts, and output. Consistent organization makes it easy to find files and helps you maintain reproducibility across projects.
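The directory layout described above can be created by hand, or scripted so every new project starts the same way. This is a minimal sketch using the standard library; the function name create_project is an illustrative choice, and the subdirectory names simply mirror the ones suggested in the text.

```python
from pathlib import Path

SUBDIRECTORIES = ("data", "scripts", "output")


def create_project(root):
    """Create the project root and its standard subdirectories."""
    root = Path(root)
    for name in SUBDIRECTORIES:
        # parents=True builds the root too; exist_ok=True makes reruns safe
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling create_project("basketball-analytics") from your home or documents folder would create that folder with the three subdirectories inside it; the project name here is only an example.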

Version control with Git provides essential capabilities for managing code and collaborating with others. Install Git from git-scm.com and create an account on GitHub. RStudio and many Python IDEs integrate with Git, making it easy to track changes to your code and share your work. Even for personal projects, version control protects against accidental loss and helps you understand how your analysis evolved.

Consider establishing consistent coding style early in your analytics journey. For R, the tidyverse style guide (style.tidyverse.org) provides sensible conventions. For Python, PEP 8 defines the standard style. Consistent style makes your code more readable and maintainable, especially important if you will collaborate with others or return to projects after time away.
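As a small illustration of what PEP 8 looks like in practice, the snippet below uses lowercase names with underscores, four-space indentation, spaces around operators, and a short docstring. The function and the season totals are hypothetical, chosen only to show the style.

```python
def points_per_game(total_points, games_played):
    """Return a player's scoring average, or 0.0 if no games were played."""
    if games_played == 0:
        return 0.0
    return total_points / games_played


# Hypothetical season totals, for illustration only.
season_points = 1520
season_games = 76
print(points_per_game(season_points, season_games))
```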

Finally, cultivate habits that support reproducible analysis. Write code that can be run from start to finish without manual intervention. Document your data sources and any transformations you apply. Use relative rather than absolute file paths so your projects work on different computers. These practices pay dividends as your analyses grow in complexity and importance.
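One way to keep file paths relative is to locate the project root at run time and build every path from it. The sketch below assumes the project contains a "data" subdirectory that can serve as a marker; find_project_root is a hypothetical helper, not a library function.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")
```

A script anywhere inside the project could then call find_project_root(Path.cwd()) and append "data" and a file name to the result, so the same code works regardless of where the project folder lives on a given computer.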

Implementation in R

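# Setting up your R environment for NBA analytics
# Install required packages from CRAN
install.packages(c(
  "tidyverse",    # Data manipulation and visualization
  "httr",         # API requests
  "jsonlite",     # JSON parsing
  "rvest",        # Web scraping
  "remotes",      # Installing packages from GitHub
  "hoopR"         # Modern NBA data package
))

# nbastatR (NBA Stats API wrapper) is distributed via GitHub rather than
# CRAN at the time of writing, so install it separately
remotes::install_github("abresler/nbastatR")

# Load core libraries
library(tidyverse)
library(nbastatR)
library(hoopR)

# Verify installation
packageVersion("tidyverse")
packageVersion("hoopR")

# Basic NBA data retrieval with hoopR
library(tidyverse)
library(hoopR)

# Get player stats for the 2023-24 season.
# hoopR returns a named list of data frames; pull out the stats table.
result <- nba_leaguedashplayerstats(season = "2023-24")
player_stats <- result$LeagueDashPlayerStats

# View top scorers
# Stat columns may arrive as character, so convert them before sorting
top_scorers <- player_stats %>%
  select(PLAYER_NAME, TEAM_ABBREVIATION, GP, PTS, AST, REB) %>%
  mutate(across(c(GP, PTS, AST, REB), as.numeric)) %>%
  arrange(desc(PTS)) %>%
  head(10)

print(top_scorers)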

Implementation in Python

# Setting up your Python environment for NBA analytics
# Install required packages (run in terminal)
# pip install pandas numpy matplotlib seaborn
# pip install nba_api requests beautifulsoup4
# pip install scikit-learn scipy

import pandas as pd
import numpy as np
import matplotlib

# Verify installations
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

# Basic NBA data retrieval with nba_api
from nba_api.stats.endpoints import leaguedashplayerstats

# Get per-game player stats for the 2023-24 season
player_stats = leaguedashplayerstats.LeagueDashPlayerStats(
    season="2023-24",
    per_mode_detailed="PerGame",
)

# The first result set holds the league-wide player table
df = player_stats.get_data_frames()[0]

# View top scorers
top_scorers = df.nlargest(10, "PTS")[
    ["PLAYER_NAME", "TEAM_ABBREVIATION", "GP", "PTS", "AST", "REB"]
]
print(top_scorers)
