Chapter 3

Data Wrangling with tidyverse and pandas

Raw data rarely arrives in analysis-ready form. This chapter teaches essential data wrangling skills using the tidyverse in R and pandas in Python. Through practical examples with real NBA data, you will learn to filter, transform, aggregate, and join datasets, developing the foundational skills that enable sophisticated analysis.

The Importance of Data Wrangling

Data wrangling encompasses the processes of cleaning, transforming, and restructuring data to prepare it for analysis. While perhaps less glamorous than building models or creating visualizations, these skills represent the most important practical competency for any data analyst. Studies consistently show that data preparation consumes the majority of time in most analytics projects—often seventy percent or more. Mastering these techniques dramatically increases your productivity and the reliability of your analyses.

Basketball data arrives in many formats and levels of cleanliness. Official statistics come through APIs in structured formats that nonetheless require transformation for many analyses. Web-scraped data often needs significant cleaning to standardize formatting and handle missing values. Historical data may use different conventions or contain errors that accumulated over decades. The ability to wrangle diverse data into consistent, analysis-ready form enables you to work with the full breadth of available information.

Beyond practical necessity, data wrangling skills improve your understanding of the data itself. The process of cleaning and transforming data forces careful examination of its structure, relationships, and limitations. Analysts who skip this step, relying instead on pre-processed datasets, miss opportunities to catch errors and develop intuition about their data. The time invested in careful data preparation pays dividends throughout the analytical process.

The tidyverse Approach in R

The tidyverse in R provides an elegant and consistent grammar for data manipulation. Rather than learning dozens of unrelated functions with different conventions, the tidyverse offers a coherent vocabulary of verbs that compose naturally to express complex operations. This consistency makes code more readable, easier to write, and simpler to maintain.

The pipe operator, written as %>% or |>, forms the syntactic backbone of tidyverse programming. The pipe takes the output of one function and passes it as the first argument to the next function, enabling you to chain operations into readable pipelines. Rather than nesting functions inside each other or creating intermediate variables, you can express a sequence of transformations as a linear flow that reads like a recipe.

The dplyr package provides the core verbs for data manipulation. The filter function selects rows meeting specified conditions—for example, filtering to players who averaged more than twenty points per game. The select function chooses which columns to keep, allowing you to focus on relevant variables. The mutate function creates new columns, such as calculating true shooting percentage from its components. The arrange function sorts rows by specified variables. The summarize function collapses multiple rows into summary statistics.

These five verbs, combined with grouping operations, handle the vast majority of data manipulation tasks. The group_by function partitions data by one or more variables, causing subsequent operations to apply within each group. For example, grouping by team before summarizing calculates team-level statistics from player data. This grouping mechanism provides enormous power with minimal complexity.

pandas in Python

The pandas library provides Python's equivalent toolkit for data manipulation. Its DataFrame object represents tabular data with labeled rows and columns, offering rich functionality for filtering, transformation, and aggregation. While the syntax differs from R, the underlying concepts translate directly, and proficiency in one framework accelerates learning the other.

Boolean indexing forms the primary mechanism for filtering data in pandas. You create a boolean Series indicating which rows meet your criteria, then use it to select matching rows from the DataFrame. For example, selecting players averaging more than twenty points involves comparing the points column to twenty, generating a boolean Series, then indexing with that Series. While more explicit than dplyr's filter function, this approach offers flexibility in constructing complex conditions.
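
The mechanics described above can be sketched as follows. The column names and values here are illustrative, not taken from a real dataset:

```python
import pandas as pd

# Hypothetical per-game stats; column names are illustrative
stats = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "pts": [27.1, 18.4, 22.9, 30.2],
    "ast": [8.0, 2.1, 6.5, 4.4],
})

# A comparison produces a boolean Series aligned with the rows
high_scorers_mask = stats["pts"] > 20

# Indexing with that Series keeps only the matching rows
high_scorers = stats[high_scorers_mask]

# Conditions combine with & and |, each wrapped in parentheses
scoring_playmakers = stats[(stats["pts"] > 20) & (stats["ast"] > 5)]
```

Note that compound conditions require the bitwise operators `&` and `|` rather than Python's `and`/`or`, and each comparison must be parenthesized.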

Column selection in pandas uses bracket notation with column names. Selecting a single column returns a Series, while selecting multiple columns returns a DataFrame. Creating new columns follows similar syntax, assigning values to a new column name. The assign method provides an alternative that enables method chaining more naturally.
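
A short sketch of these selection and assignment idioms, using made-up data:

```python
import pandas as pd

# Illustrative data; real column names depend on your source
stats = pd.DataFrame({
    "player": ["A", "B"],
    "pts": [25.0, 18.0],
    "min": [34.0, 30.0],
})

pts_series = stats["pts"]            # single column -> Series
subset = stats[["player", "pts"]]    # list of columns -> DataFrame

# Direct assignment creates a new column in place
stats["pts_per_min"] = stats["pts"] / stats["min"]

# assign returns a new DataFrame, which chains naturally
chained = stats.assign(pts_per_36=lambda df: df["pts"] / df["min"] * 36)
```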

Aggregation in pandas uses the groupby method followed by an aggregation function. Grouping returns a DataFrameGroupBy object that you then aggregate with methods like mean, sum, or agg for custom aggregations. The agg method accepts dictionaries specifying different aggregations for different columns, enabling sophisticated summary calculations in a single call.
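
A minimal sketch of both aggregation styles, with invented team and player values:

```python
import pandas as pd

# Hypothetical player rows with a team column
players = pd.DataFrame({
    "team": ["BOS", "BOS", "LAL", "LAL", "LAL"],
    "pts": [20.0, 15.0, 28.0, 12.0, 10.0],
    "ast": [4.0, 6.0, 8.0, 2.0, 3.0],
})

# A dict maps each column to one or more aggregation functions
team_summary = players.groupby("team").agg(
    {"pts": ["mean", "sum"], "ast": "mean"}
)

# Named aggregation gives flat, readable output column names
team_named = players.groupby("team").agg(
    total_pts=("pts", "sum"),
    avg_ast=("ast", "mean"),
)
```

The dictionary form produces hierarchical column labels; the named-aggregation form is often easier to work with downstream.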

Common Data Wrangling Tasks

Working with real NBA data presents common challenges that illustrate general data wrangling principles. Player names appear in different formats across sources—sometimes including suffixes like Jr. or III, sometimes using middle initials, sometimes using nicknames. Standardizing player names enables joining data from multiple sources. Regular expressions and lookup tables help automate this standardization.
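
One possible shape for such a standardization routine, combining a suffix-stripping regular expression with an alias lookup table (the names and the alias entry are invented for illustration):

```python
import re

# Drop generational suffixes like Jr., Sr., II, III, IV at the end of a name
SUFFIX_RE = re.compile(r"\s+(jr\.?|sr\.?|i{2,3}|iv)$", flags=re.IGNORECASE)

def standardize_name(name: str) -> str:
    """Lowercase, strip generational suffixes, then remove punctuation."""
    name = name.strip().lower()
    name = SUFFIX_RE.sub("", name)
    return re.sub(r"[^\w\s]", "", name)

# A lookup table handles reorderings and nicknames that rules can't catch;
# this entry is purely illustrative
ALIASES = {"maxey tyrese": "tyrese maxey"}

def canonical(name: str) -> str:
    key = standardize_name(name)
    return ALIASES.get(key, key)
```

Rules handle the systematic cases; the alias table absorbs the irreducible exceptions that accumulate as you join more sources.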

Missing data arises frequently and requires thoughtful handling. Players who miss games appear as missing values in per-game statistics. Statistics not recorded in earlier eras are missing for historical players. Some advanced metrics cannot be calculated for players with insufficient minutes. Understanding why data is missing informs how to handle it—sometimes imputation is appropriate, sometimes exclusion, sometimes keeping missing values explicitly.
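
The three broad strategies above can be sketched in pandas. The data and the choice of imputation here are illustrative only; the right strategy depends on why the values are missing:

```python
import pandas as pd
import numpy as np

# Illustrative data: NaN marks stats that were never recorded
stats = pd.DataFrame({
    "player": ["A", "B", "C"],
    "pts": [24.0, np.nan, 15.0],
    "three_pct": [0.38, 0.35, np.nan],
})

# Inspect first: how much is missing, and in which columns?
missing_counts = stats.isna().sum()

# Exclusion: drop rows missing a required field
complete_pts = stats.dropna(subset=["pts"])

# Imputation: fill with a neutral value when that is justified
filled = stats.fillna({"three_pct": stats["three_pct"].mean()})
```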

Statistical categories have changed over time, creating challenges for historical analysis. The three-point line was not introduced until the 1979-80 season. Blocks and steals were first recorded in 1973-74. Offensive and defensive rebounds were initially combined into a single total. League-wide player tracking data exists only since the 2013-14 season. Analyses spanning eras must account for these changes, either restricting to common measures or explicitly handling missing categories.

Data types require attention, particularly when reading data from files or APIs. Numeric values sometimes arrive as text that must be converted. Dates and times need parsing into appropriate formats. Categorical variables may need explicit type conversion for proper handling in analyses. Catching type issues early prevents confusing errors later in the analytical process.
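
These conversions look roughly as follows in pandas, using invented raw values:

```python
import pandas as pd

# Illustrative raw data where every field arrived as text
raw = pd.DataFrame({
    "pts": ["24.1", "18.3", "n/a"],
    "game_date": ["2024-01-15", "2024-01-17", "2024-01-19"],
    "position": ["PG", "SG", "PG"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
raw["pts"] = pd.to_numeric(raw["pts"], errors="coerce")

# Parse date strings into datetime values
raw["game_date"] = pd.to_datetime(raw["game_date"])

# Explicit categorical dtype for low-cardinality variables
raw["position"] = raw["position"].astype("category")
```

Checking `raw.dtypes` immediately after loading, before any analysis, is a cheap habit that surfaces these problems early.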

Joining Multiple Data Sources

Sophisticated analysis often requires combining data from multiple sources. Player statistics from one source might need enrichment with biographical information, salary data, or tracking metrics from other sources. Understanding how to join data correctly enables you to construct the integrated datasets needed for comprehensive analysis.

Both dplyr and pandas provide join functions that combine data based on matching key values. Inner joins keep only rows present in both tables. Left joins keep all rows from the left table, adding data from the right table where matches exist. Full joins keep all rows from both tables. Choosing the appropriate join type depends on your analytical needs and whether you want to preserve or exclude non-matching rows.
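
The three join types can be compared side by side with two tiny invented tables whose keys only partially overlap:

```python
import pandas as pd

# Illustrative tables: ids 2 and 3 appear in both, 1 and 4 in only one
stats = pd.DataFrame({"player_id": [1, 2, 3], "pts": [25.0, 18.0, 12.0]})
salaries = pd.DataFrame({"player_id": [2, 3, 4], "salary": [30e6, 12e6, 8e6]})

inner = stats.merge(salaries, on="player_id", how="inner")  # ids 2, 3
left = stats.merge(salaries, on="player_id", how="left")    # ids 1, 2, 3
full = stats.merge(salaries, on="player_id", how="outer")   # ids 1 through 4
```

In the left join, the row for id 1 survives but carries a missing salary; in the inner join it is dropped entirely.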

Key matching requires careful attention. Joining on player names may fail when names are formatted differently across sources. Joining on multiple keys—such as player name and season—increases precision but requires exact matches on all keys. Creating standardized identifier columns often simplifies joining, though this requires initial effort to establish consistent identifiers across your data sources.

Join quality should be verified after each operation. Check that the number of rows matches expectations. Look for unexpected missing values created by non-matching keys. Examine a sample of joined records to confirm correct alignment. These verification steps catch errors that could otherwise propagate through your analysis.
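
One way to automate these checks in pandas is the `indicator` option of `merge`, shown here on the same kind of invented tables:

```python
import pandas as pd

stats = pd.DataFrame({"player_id": [1, 2, 3], "pts": [25.0, 18.0, 12.0]})
salaries = pd.DataFrame({"player_id": [2, 3, 4], "salary": [30e6, 12e6, 8e6]})

# indicator=True adds a _merge column recording each row's provenance:
# "both", "left_only", or "right_only"
checked = stats.merge(salaries, on="player_id", how="outer", indicator=True)
match_counts = checked["_merge"].value_counts()

# Assert expectations so a silently bad join fails loudly
assert len(checked) == 4, "unexpected row count after join"
```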

Reshaping Data

Data comes in wide and long formats, and many analyses require converting between them. Wide format has one row per entity with multiple columns for repeated measurements—for example, one row per player with columns for each season's statistics. Long format has multiple rows per entity with columns identifying the measurement—for example, one row per player-season with a season column and a statistics column. Different analyses and visualizations require different formats.

The tidyr package in R provides pivot_longer and pivot_wider for reshaping. The pivot_longer function converts wide data to long format, specifying which columns to pivot and what to name the new key and value columns. The pivot_wider function does the reverse, specifying which column provides the new column names and which provides the values. These functions handle the common cases elegantly while providing options for more complex transformations.

In pandas, the melt function converts wide to long format, while pivot and pivot_table convert long to wide format. The pivot function requires unique index-column combinations, while pivot_table handles duplicates through aggregation. These functions cover similar territory to tidyr, with syntax reflecting pandas conventions.
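
The duplicate-handling difference is worth seeing concretely. In this invented example, player A has two October games, so `pivot` would raise on the duplicate player/month pair, while `pivot_table` aggregates it:

```python
import pandas as pd

# Long data with a duplicate ("A", "oct") pair: two October games
games = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "month": ["oct", "oct", "nov", "oct", "nov"],
    "pts": [20.0, 30.0, 24.0, 18.0, 22.0],
})

# pivot_table resolves duplicates by aggregating; mean is the default
wide = games.pivot_table(index="player", columns="month",
                         values="pts", aggfunc="mean")
```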

Reshaping often forms one step in a larger transformation pipeline. You might need to reshape data, perform calculations, then reshape back to the original format. Planning your data structure before beginning analysis helps you anticipate necessary reshaping operations and organize your code efficiently.

Implementation in R

# Data wrangling with tidyverse
library(tidyverse)

# Load player stats
player_stats <- read_csv("player_stats_2024.csv")

# Filter, select, and transform
guards <- player_stats %>%
  filter(position %in% c("PG", "SG")) %>%
  select(player_name, team, pts, ast, tov, min) %>%
  mutate(
    ast_to_tov = round(ast / pmax(tov, 0.1), 2),
    pts_per_min = round(pts / min, 3)
  ) %>%
  arrange(desc(ast_to_tov))

head(guards, 10)

# Joining multiple data sources
library(tidyverse)

# Combine player stats with salary data
player_stats <- read_csv("player_stats.csv")
salaries <- read_csv("player_salaries.csv")

# Join and calculate value metrics
player_value <- player_stats %>%
  inner_join(salaries, by = "player_id") %>%
  mutate(
    salary_millions = salary / 1e6,
    pts_per_million = round(pts / salary_millions, 1),
    wins_per_million = round(win_shares / salary_millions, 2)
  ) %>%
  select(player_name, team, pts, win_shares, salary_millions,
         pts_per_million, wins_per_million) %>%
  arrange(desc(wins_per_million))

head(player_value, 15)

# Reshaping data: wide to long format
library(tidyverse)

# Monthly scoring data in wide format
monthly_wide <- tibble(
  player = c("Player A", "Player B"),
  oct = c(22.5, 18.3),
  nov = c(24.1, 19.7),
  dec = c(25.8, 21.2),
  jan = c(23.4, 20.5)
)

# Pivot to long format for analysis
monthly_long <- monthly_wide %>%
  pivot_longer(
    cols = oct:jan,
    names_to = "month",
    values_to = "ppg"
  )

print(monthly_long)

Implementation in Python

# Data wrangling with pandas
import pandas as pd

# Load player stats
player_stats = pd.read_csv("player_stats_2024.csv")

# Filter, select, and transform
guards = player_stats[
    player_stats["position"].isin(["PG", "SG"])
][["player_name", "team", "pts", "ast", "tov", "min"]].copy()

guards["ast_to_tov"] = (guards["ast"] / guards["tov"].clip(lower=0.1)).round(2)
guards["pts_per_min"] = (guards["pts"] / guards["min"]).round(3)
guards = guards.sort_values("ast_to_tov", ascending=False)

print(guards.head(10))

# Joining multiple data sources
import pandas as pd

# Combine player stats with salary data
player_stats = pd.read_csv("player_stats.csv")
salaries = pd.read_csv("player_salaries.csv")

# Merge and calculate value metrics
player_value = player_stats.merge(salaries, on="player_id")
player_value["salary_millions"] = player_value["salary"] / 1e6
player_value["pts_per_million"] = (
    player_value["pts"] / player_value["salary_millions"]
).round(1)
player_value["wins_per_million"] = (
    player_value["win_shares"] / player_value["salary_millions"]
).round(2)

result = player_value[[
    "player_name", "team", "pts", "win_shares",
    "salary_millions", "pts_per_million", "wins_per_million"
]].sort_values("wins_per_million", ascending=False)

print(result.head(15))

# Reshaping data: wide to long format
import pandas as pd

# Monthly scoring data in wide format
monthly_wide = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "oct": [22.5, 18.3],
    "nov": [24.1, 19.7],
    "dec": [25.8, 21.2],
    "jan": [23.4, 20.5]
})

# Melt to long format for analysis
monthly_long = monthly_wide.melt(
    id_vars=["player"],
    var_name="month",
    value_name="ppg"
)

print(monthly_long)

Chapter Summary

You've completed Chapter 3: Data Wrangling with tidyverse and pandas.