
Predicting Shot Success in Professional Basketball 🏀

Overview

This project analyzes play-by-play data from the Baloncesto Superior Nacional (BSN) league to predict shot success based on game context and player dynamics. By examining features like defensive strategy, game timing, and player matchups, I aim to build an accurate model that provides actionable insights for optimizing team strategy and player performance.


Introduction

Key Question:

Can we predict whether a given shot attempt will be successful, using only information available “at the moment” of the shot?

Understanding the determinants of shot success helps teams refine their offensive schemes and improve player decision-making.

Dataset Overview:

The dataset was sourced from a combination of the SynergySports API and web-scraped data from the official BSN website. It contains 12,500 plays (rows) from the 2024-2025 season.

Key Features:

  1. shot_outcome (target): Whether the shot was a Make or a Miss
  2. zone: Whether the defense is in a zone defense (True/False)
  3. total_game_clock: Seconds remaining in the game (calculated from quarter + clock)
  4. gameQuarter: Quarter of the game (1 through 4)
  5. homeEvent: Whether the team on offense is the home team (True/False)
  6. sob: Whether the play is a sideline out-of-bounds play (True/False)
  7. eob: Whether the play is a baseline (end-line) out-of-bounds play (True/False)
  8. ato: Whether the play comes after a timeout (True/False)
  9. shot: Type of shot (Layup, Jumper, FreeThrow, etc.)
  10. O Player Name: Offensive player attempting the shot
  11. D Player Name: Primary defender on the play

Why This Dataset Matters

As an analyst for The Guaynabo Mets, a team in the BSN, I find this dataset provides a rich, real-world record of player performance from last season, which allows me to help the team in multiple areas.

Predicting outcomes is more than analyzing past performance. It requires understanding nuances like game location, opponent strength, and player form. This project combines these factors to predict future performance, offering actionable insights for coaches, analysts, and enthusiasts alike.


Data Cleaning and Exploratory Data Analysis

Data Cleaning

To ensure the dataset was suitable for analysis and accurately represented the underlying events, the following cleaning steps were performed:

1. Extracting Shot Outcomes from Offensive Plays

Step: Created a shot_outcome column by splitting the offensiveString column to extract the first word after the last > delimiter. Converted results to categorical values (“Made”/”Miss”).
Effect: Standardized shot outcomes from non-free-throw plays, enabling analysis of shooting accuracy.
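A minimal pandas sketch of this step, assuming the raw play-by-play lives in a DataFrame named `plays` and that the outcome word directly follows the last `>` in offensiveString (both assumptions about the raw data):

```python
import pandas as pd

# Sketch of Step 1: take the text after the last ">" delimiter and keep
# its first word, e.g. "... > Made 2Pt Jumper" -> "Made" (format assumed).
plays["shot_outcome"] = (
    plays["offensiveString"]
    .str.split(">").str[-1]   # segment after the last ">"
    .str.strip()
    .str.split().str[0]       # first word of that segment
)
```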

2. Standardizing Free Throw Outcomes

Step: Extended the shot_outcome column by parsing defensiveString for free throw results (“Make”/”Miss”), replacing inconsistent terminology (e.g., “Made” → “Make”).
Effect: Unified free throw outcomes with other shot types, ensuring consistency across all scoring attempts.
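A sketch of the free-throw standardization, assuming the result word appears in defensiveString and that `plays` already has the shot_outcome column from Step 1 (the exact wording in the raw strings is an assumption):

```python
# Rows whose defensive string describes a free throw attempt.
is_ft = plays["defensiveString"].str.contains("Free Throw", na=False)

# Pull out the result word and map inconsistent terms to "Make"/"Miss".
ft_outcome = (
    plays.loc[is_ft, "defensiveString"]
    .str.extract(r"(Made|Make|Missed|Miss)", expand=False)
    .replace({"Made": "Make", "Missed": "Miss"})
)
plays.loc[is_ft, "shot_outcome"] = ft_outcome
```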

3. Filtering Non-Shooting Events

Step: Removed rows containing “Misc Stat”, “Turnover”, or “Personal Foul” in defensiveString using regex filtering.
Effect: Reduced noise and ensured every row represents a true shot attempt.
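A sketch of the regex filter described above, using the same hypothetical `plays` DataFrame:

```python
# Drop rows flagged as non-shooting events in defensiveString.
noise = plays["defensiveString"].str.contains(
    r"Misc Stat|Turnover|Personal Foul", regex=True, na=False
)
plays = plays[~noise]
```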

4. Correcting Free Throw Labels

Step: Updated the shot column to “FreeThrow” for entries where shot=0 but offensiveString contained “Free Throw”.
Effect: Fixed misclassified free throws critical for differentiating shot types in later analysis.
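A sketch of the relabeling step, assuming `shot` holds 0 for rows that were never assigned a shot type:

```python
# Relabel misclassified free throws: shot == 0 but the offensive string says "Free Throw".
mislabeled_ft = (plays["shot"] == 0) & plays["offensiveString"].str.contains(
    "Free Throw", na=False
)
plays.loc[mislabeled_ft, "shot"] = "FreeThrow"
```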

5. Column Removal

Step: Dropped user-specified columns while gracefully handling missing columns.
Effect: Simplified the dataset by removing redundant or irrelevant features.

6. Calculating Game Clock Context

Step: Derived total_game_clock (total remaining game time in seconds) from gameQuarter and clock columns.
Effect: Created temporal context for analyzing performance trends during games.
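A sketch of the derivation, assuming 10-minute (600-second) quarters as used in the BSN and a clock column stored as "MM:SS" strings; both are assumptions about the raw format:

```python
# Seconds left in the current quarter, parsed from an assumed "MM:SS" clock string.
mins_secs = plays["clock"].str.split(":", expand=True).astype(float)
seconds_left_in_quarter = mins_secs[0] * 60 + mins_secs[1]

# Total seconds remaining in regulation (4 quarters x 600 seconds).
plays["total_game_clock"] = (4 - plays["gameQuarter"]) * 600 + seconds_left_in_quarter
```

This is consistent with the snapshot below, where a Q1 play with total_game_clock = 2371 corresponds to roughly 9:31 left in the first quarter.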

7. Validating Shot Outcomes

Step: Implemented validation checks for shot_outcome values (“Make”/”Miss”) and generated a report on nulls/invalid entries.
Effect: Quantified data quality issues and ensured reliability of the core performance metric.
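A sketch of the validation check (the exact report format is an assumption):

```python
# Count nulls and unexpected labels in the target column.
valid = {"Make", "Miss"}
n_null = plays["shot_outcome"].isna().sum()
n_invalid = (~plays["shot_outcome"].isin(valid) & plays["shot_outcome"].notna()).sum()
print(f"shot_outcome nulls: {n_null}, invalid labels: {n_invalid}")
```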

Cleaned DataFrame Snapshot:

The head of the cleaned DataFrame is shown below, highlighting the transformed and newly added columns:

| shot_outcome | zone | total_game_clock | gameQuarter | homeEvent | sob | eob | ato | shot | O Player Name | D Player Name |
|---|---|---|---|---|---|---|---|---|---|---|
| Miss | False | 2371 | 1 | False | False | False | True | Layup | Ismael Romero | NaN |
| Miss | False | 2358 | 1 | False | False | False | False | Jumper | Tony Bishop | Emmitt Williams |
| Miss | False | 2344 | 1 | True | False | False | True | Jumper | Ismael Cruz | Javier Mojica |
| Miss | False | 2299 | 1 | True | False | False | False | Jumper | Julian Torres | NaN |
| Miss | False | 2292 | 1 | False | False | False | False | Jumper | Quinn Cook | NaN |
Full Table Structure (39 columns):

```
possessionId, game, oPlayer, dPlayer, prPlayer, zone, gameQuarter, clock, shotClock, homeEvent, sob, eob, ato, turnover, offensiveString, defensiveString, shot, offensiveLineup, defensiveLineup, playByPlayId, Away Team, Home Team, Away Team Alt Name, Home Team Alt Name, Away Team Abbr, Home Team Abbr, Away Team ID, Home Team ID, Date, Season, Game ID, O Player Name, D Player Name, Pr Player Name, O Player ID, D Player ID, Pr Player ID, shot_outcome, total_game_clock
```

Univariate Analysis

Question: Which single features most distinguish Make vs. Miss?

Shot Outcome Distribution

Insight: 58% of all shot attempts are makes, establishing a baseline “always‑predict‑make” accuracy of 58%.

Shots Over Game Quarters

Insight: Make rate rises from 56% in Q1 to 61% in Q4—suggesting that either defenses fatigue late or that teams select higher‑probability shots in crunch time.

Bivariate Analysis

Question: How do two features interact to change make probability?

Zone Defense vs. Man‑to‑Man

Insight: Shooting against zone defense yields a make rate 10 percentage points lower than against man-to-man matchups (52% vs. 62%).

Shot Type vs. Outcome

Insight: Layups convert at 78%, whereas jumpers and three‑point attempts convert at only 47% and 35%, respectively. This confirms shot difficulty as a key determinant.

Interesting Aggregates

Question: Do teams shoot better at home?

Team Shooting Percentage by Home/Away:

| Team | Away (0) | Home (1) |
|---|---|---|
| Atleticos de San German | 0.532051 | 0.523843 |
| Cangrejeros de Santurce | 0.548319 | 0.530702 |
| Capitanes De Arecibo | 0.536623 | 0.584983 |
| Criollos de Caguas | 0.539492 | 0.519732 |
| Gigantes de Carolina | 0.534915 | 0.554570 |
| Indios de Mayaguez | 0.518834 | 0.532280 |
| Leones De Ponce | 0.537983 | 0.544006 |
| Mets de Guaynabo | 0.515464 | 0.540253 |
| Osos de Manati | 0.548217 | 0.568000 |
| Piratas de Quebradillas | 0.517524 | 0.562756 |
| Santeros de Aguada (ex-Cariduros de Fajardo) | 0.503263 | 0.547826 |
| Vaqueros de Bayamon | 0.520111 | 0.490336 |

Insight: For 8 of the 12 teams, the home make rate is higher (by roughly 1 to 5 percentage points), likely reflecting crowd support and court familiarity.


Imputation

Some entries in the play-by-play data are miscellaneous events that do not result in a shot (offensive rebounds, defensive rebounds, etc.). These rows have shot=0, so rather than imputing values for them we remove them, ensuring every remaining row is a properly categorized shot attempt and improving dataset validity.
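In code this is a simple filter rather than an imputation (again assuming the `plays` DataFrame, where 0 in `shot` marks non-shot events):

```python
# Keep only rows that represent actual shot attempts.
plays = plays[plays["shot"] != 0]
```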


Framing a Prediction Problem

Prediction Problem and Type

The problem at hand is a classification problem: we are predicting whether the shot outcome of a play will be a Miss or a Make. Since we are predicting one of two discrete outcomes rather than an exact value, this is a binary classification task.

Response Variable

The response variable, or target we are predicting, is the shot_outcome of a particular play. We chose shot_outcome because scoring is the goal of any possession in basketball, and understanding how different features (like game context and player matchups) influence the outcome of a possession can directly inform a team's strategy.

Evaluation Metric

Accuracy was chosen as the evaluation metric due to the dataset’s balanced class distribution (58% Make vs. 42% Miss), ensuring reliable interpretation of overall correctness without favoring either class. Unlike contexts requiring F1-score (e.g., fraud detection) or precision/recall (e.g., medical diagnosis), where error costs are asymmetric, false positives (misleading shot recommendations) and false negatives (missed scoring opportunities) are equally detrimental in basketball strategy. Both errors reduce expected points per possession—whether by inefficient shot selection or overly conservative play—and carry identical strategic weight. Accuracy’s simplicity aligns with modern offensive philosophy, which prioritizes holistic efficiency, and its intuitiveness fosters trust among coaches and players. By optimizing for total correct decisions rather than skewing toward one error type, the model supports balanced, actionable insights for maximizing scoring while maintaining stakeholder confidence.

Features Known at Prediction:

At the moment just before a shot is released, the model only has access to these inputs: the defensive scheme (zone), game timing (gameQuarter, total_game_clock), whether the offense is at home (homeEvent), special play context (sob, eob, ato), the shot type (shot), and the identities of the shooter and primary defender (O Player Name, D Player Name).


Baseline Model

Model Description

The Baseline Model uses regularized Logistic Regression, a binary classification model that helps prevent overfitting by penalizing large coefficients. This is useful because our dataset may have some multicollinearity (highly correlated independent variables), and regularization improves the generalization of the model.

The model is built in a Pipeline that includes a StandardScaler to normalize the continuous numerical features so that they have zero mean and unit variance. This normalization is important for models like Logistic Regression, which are sensitive to the scale of the data.

The baseline model then fits a Logistic Regression model to the processed features.

Model Features

The model includes the following features:

  1. Nominal Features (Categorical)
    • zone: If the defense is in a zone defense
    (Total Nominal Features: 1)
  2. Ordinal Features
    • gameQuarter: Quarter of the game
    (Total Ordinal Features: 1)
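A minimal sketch of this baseline pipeline, assuming a feature DataFrame `X` with the two columns above and a target `y` of "Make"/"Miss"; the exact encoder settings are assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Encode the binary zone flag and scale the ordinal quarter.
preprocess = ColumnTransformer([
    ("zone", OneHotEncoder(drop="if_binary"), ["zone"]),
    ("quarter", StandardScaler(), ["gameQuarter"]),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# baseline.fit(X_train, y_train)
# baseline.score(X_test, y_test)
```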

Model Performance

We evaluate the performance of this baseline model using accuracy, a standard metric for binary classification tasks. The accuracy was computed on both the training and test datasets.

These results suggest that the model is not overfitting to the training data, as the difference between training and test accuracies is small (about 1%). This is expected, since only 2 features were used in the baseline model.

Model Evaluation

The model is not performing well: it has nearly the same accuracy as a constant model that predicts Make for every play (0.536). There is clearly a lot of room for improvement, such as:

  1. Model Exploration: While Logistic Regression is helpful in reducing overfitting, we could explore more sophisticated models such as Random Forests, K Nearest Neighbors, Neural Networks, and Naive Bayes to capture non-linear relationships that Logistic Regression might miss due to its assumption of linearity.

  2. Increased Complexity: The current features carry only limited information; we may benefit from creating additional features or interactions between existing features, such as player efficiency ratings or shooting percentages adjusted for opponent defense.

  3. Hyperparameter Tuning: The regularization strength (C) in Logistic Regression was not optimized. Hyperparameter tuning via techniques such as Grid Search or Randomized Search could help find a better model configuration.

In conclusion, the current model is a rough baseline, but there is still a lot of potential for improvement. Further experimentation with more advanced models and additional feature engineering could enhance the prediction accuracy.


Final Model

Feature Engineering and Transformations

In the final model, I incorporated a blend of engineered features and transformations to better capture the nuances of basketball shot outcomes:

  1. Clock Squared (clock_squared):
    • Rationale: The relationship between game time and shot success is non-linear. For example, in the final seconds of a game, players often take rushed shots under heavy defensive pressure, leading to a steeper drop in accuracy. A quadratic term (total_game_clock²) better captures this accelerating effect compared to a linear feature.
    • Data-Generating Process: Fatigue, defensive intensity, and strategic fouling intensify as the game clock winds down. The squared term models this temporal “crunch time” effect explicitly.
  2. One-Hot Encoded Player and Shot Features:
    • shot Type: Different shot types (layup, jumper, etc.) have intrinsic difficulty levels. For instance, layups (high success rate) vs. contested jumpers (lower success rate) reflect distinct physical and skill-based challenges. Encoding these ensures the model distinguishes between shot mechanics.
    • O Player Name and D Player Name: Player skill disparities directly influence shot outcomes. For example, a star shooter (e.g., Stephen Curry) has a higher baseline make probability than a bench player. Similarly, an elite defender (e.g., Rudy Gobert) reduces opponents’ success rates. One-hot encoding allows the model to assign unique weights to each player’s offensive/defensive impact.
  3. Standard Scaling:
    • Applied to total_game_clock and gameQuarter to standardize their scales. While tree-based models are scale-invariant, scaling benefits logistic regression and neural networks by ensuring gradient descent converges faster.
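Putting these transformations together, a sketch of the final preprocessing-and-model pipeline might look like the following (column names match the feature list below; the specific transformer choices are assumptions):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

def add_clock_squared(X):
    # Non-linear "crunch time" term derived from the game clock.
    X = X.copy()
    X["clock_squared"] = X["total_game_clock"] ** 2
    return X

categorical = ["zone", "homeEvent", "sob", "eob", "ato",
               "shot", "O Player Name", "D Player Name"]
numeric = ["total_game_clock", "clock_squared", "gameQuarter"]

preprocess = ColumnTransformer([
    ("cats", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("nums", StandardScaler(), numeric),
])

final_model = Pipeline([
    ("engineer", FunctionTransformer(add_clock_squared)),
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=2000)),
])
```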

Model Features

The model includes the following features:

  1. Quantitative Features (Numerical)
    • total_game_clock: Seconds remaining in the game
    • clock_squared: Non-linear time feature (= total_game_clock²)
    (Total Quantitative Features: 2)
  2. Nominal Features (Categorical)
    • zone: If the defense is in a zone defense
    • homeEvent: If the team on offense is the home team
    • sob: If the play is a sideline out-of-bounds play
    • eob: If the play is a baseline (end-line) out-of-bounds play
    • ato: If the play comes after a timeout
    • shot: Type of shot (Layup, Jumper, FreeThrow, etc.)
    • O Player Name: Offensive player attempting the shot
    • D Player Name: Primary defender on the play
    (Total Nominal Features: 8)
  3. Ordinal Features
    • gameQuarter: Quarter of the game
    (Total Ordinal Features: 1)

Modeling Algorithm and Hyperparameters

The Logistic Regression model was selected as the final model due to its interpretability, its efficiency with high-dimensional data (after one-hot encoding), and its strong performance during hyperparameter tuning. Although more complex models were tested (e.g., Random Forests, Neural Networks), logistic regression's cross-validation accuracy was essentially tied with the best alternative (Random Forest, see below) while remaining far easier to interpret, suggesting that the relationships between features and shot outcomes are largely linear or additive.

Model Comparison

| Model | CV Accuracy |
|---|---|
| Logistic Regression | 0.647 |
| KNN | 0.555 |
| Random Forest | 0.651 |
| Neural Network | 0.539 |
| Naive Bayes | 0.535 |

Hyperparameter Tuning:
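As a hedged sketch, the regularization strength C of the final pipeline above could be searched with cross-validation; the candidate grid and CV settings here are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}  # hypothetical candidate values
search = GridSearchCV(final_model, param_grid, cv=5, scoring="accuracy")

# search.fit(X_train, y_train)
# search.best_params_, search.best_score_
```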

Performance Improvements Over Baseline

The final model’s performance was evaluated against the baseline using accuracy:

| Metric | Baseline Model | Final Model |
|---|---|---|
| Cross-Validation Accuracy | 0.541 | 0.647 |
| Test Accuracy | 0.532 | 0.643 |

The Final Model's performance gains stem from engineered features that mirror the real-world process of a shot attempt.

Together, these engineered features recreate the multifaceted nature of basketball shooting—who’s on the court, who’s defending, how much time remains, the special play context, and the shot’s inherent difficulty—enabling the model to predict shot success with significantly higher accuracy.


Conclusion & Next Steps

Future Work: There remain several ways to further improve the predictive power and practical utility of the shot-success model.

By integrating player-specific, situational, and temporal features, the final model captures the multifaceted nature of basketball shot outcomes, providing actionable insights for optimizing offensive strategies. By expanding features, refining models, and prioritizing interpretability, this analysis could evolve into a decision-support tool that bridges data science and basketball strategy. Future work should focus on closing the gap between predictive accuracy and actionable insights for players, coaches, and analysts.