Predicting Shot Success in Professional Basketball đ
Overview
This project analyzes play-by-play data from the Baloncesto Superior Nacional league to predict shot success based on game context and player dynamics. By examining features like defensive strategy, game timing, and player matcups, I aim to create an accurate model that can be used to provide actionable insights for optimizing team strategies and player performance.
Introduction
Key Question:
Can we predict whether a given shot attempt will be successful, using only information available âat the momentâ of the shot?
Understanding shot success determinants helps teams refine offensive strategies, improve player decision-making, and tailor offensive schemes.
Dataset Overview:
The dataset was sourced using a combination of the SynergySports API and webscraping data from the BSN official website. This data has 12,500 plays (rows) from the 2024-2025 season.
Key Features:
shot_outcome
(target): Whether shot was Make or Misszone
: If defense is in zone defense (True/False)total_game_clock
: Seconds remaining in game (calculated from quarter + clock)gameQuarter
: Quarter of the game (1 through 4)homeEvent
: Team on offense is home team (True/False)sob
: If play is sideline out of bounds play (True/False)eob
: If play is end of game baseline inbounds play (True/False)ato
: If play is after timeout (True/False)shot
: Type of shot (Layup, Jumper, FreeThrow, etc.)O Player Name
: Offensive player attempting shotD Player Name
: Primary defender on the play
Why This Dataset Matters
As an analyst for The Guaynabo Mets, a team in the BSN, this dataset provides a rich, real-world representation of basketball player performance across last season which allows me to help the team in multiple areas.
- Strategic Value: Identifying highâprobability shot scenarios can guide playâcalling and inâgame decisionâmaking.
- Player Evaluation: Quantifies how much each factor (e.g., defender, shot type, clock pressure) contributes to success.
- RealâTime Analytics: Lays groundwork for inâgame dashboards that forecast success probabilities before each shot.
Predicting outcomes is more than analyzing past performance. It requires understanding nuances like game location, opponent strength, and player form. This project combines these factors to predict future performance, offering actionable insights for coaches, analysts, and enthusiasts alike.
Data Cleaning and Exploratory Data Analysis
Data Cleaning
To ensure the dataset was suitable for analysis and accurately represented the underlying events, the following cleaning steps were performed:
1. Extracting Shot Outcomes from Offensive Plays
Step: Created a shot_outcome
column by splitting the offensiveString
column to extract the first word after the last >
delimiter. Converted results to categorical values (âMadeâ/âMissâ).
Effect: Standardized shot outcomes from non-free-throw plays, enabling analysis of shooting accuracy.
2. Standardizing Free Throw Outcomes
Step: Extended the shot_outcome
column by parsing defensiveString
for free throw results (âMakeâ/âMissâ), replacing inconsistent terminology (e.g., âMadeâ â âMakeâ).
Effect: Unified free throw outcomes with other shot types, ensuring consistency across all scoring attempts.
3. Filtering Non-Shooting Events
Step: Removed rows containing âMisc Statâ, âTurnoverâ, or âPersonal Foulâ in defensiveString
using regex filtering.
Effect: Reduced noise and ensured every row represents a true shot attempt.
4. Correcting Free Throw Labels
Step: Updated the shot
column to âFreeThrowâ for entries where shot=0
but offensiveString
contained âFree Throwâ.
Effect: Fixed misclassified free throws critical for differentiating shot types in later analysis.
5. Column Removal
Step: Dropped user-specified columns while gracefully handling missing columns.
Effect: Simplified the dataset by removing redundant or irrelevant features.
6. Calculating Game Clock Context
Step: Derived total_game_clock
(total remaining game time in seconds) from gameQuarter
and clock
columns.
Effect: Created temporal context for analyzing performance trends during games.
7. Validating Shot Outcomes
Step: Implemented validation checks for shot_outcome
values (âMakeâ/âMissâ) and generated a report on nulls/invalid entries.
Effect: Quantified data quality issues and ensured reliability of the core performance metric.
Cleaned DataFrame Snapshot:
The head of the cleaned DataFrame is shown below, highlighting the transformed and newly added columns:
shot_outcome | zone | total_game_clock | gameQuarter | homeEvent | sob | eob | ato | shot | O Player Name | D Player Name |
---|---|---|---|---|---|---|---|---|---|---|
Miss | False | 2371 | 1 | False | False | False | True | Layup | Ismael Romero | NaN |
Miss | False | 2358 | 1 | False | False | False | False | Jumper | Tony Bishop | Emmitt Williams |
Miss | False | 2344 | 1 | True | False | False | True | Jumper | Ismael Cruz | Javier Mojica |
Miss | False | 2299 | 1 | True | False | False | False | Jumper | Julian Torres | NaN |
Miss | False | 2292 | 1 | False | False | False | False | Jumper | Quinn Cook | NaN |
Full Table Structure (Click to Expand)
``` Column Names (39 total): possessionId, game, oPlayer, dPlayer, prPlayer, zone, gameQuarter, clock, shotClock, homeEvent, sob, eob, ato, turnover, offensiveString, defensiveString, shot, offensiveLineup, defensiveLineup, playByPlayId, Away Team, Home Team, Away Team Alt Name, Home Team Alt Name, Away Team Abbr, Home Team Abbr, Away Team ID, Home Team ID, Date, Season, Game ID, O Player Name, D Player Name, Pr Player Name, O Player ID, D Player ID, Pr Player ID, shot_outcome, total_game_clock ```Univariate Analysis
Question: Which single features most distinguish Make vs. Miss?
Shot Outcome Distribution
Insight: 58% of all shot attempts are makes, establishing a baseline âalwaysâpredictâmakeâ accuracy of 58%.
Shots Over Game Quarters
Insight: Make rate rises from 56% in Q1 to 61% in Q4âsuggesting that either defenses fatigue late or that teams select higherâprobability shots in crunch time.
Bivariate Analysis
Question: How do two features interact to change make probability?
Zone Defense vs. ManâtoâMan
Insight: Shooting against zone defense yields a 10% lower make rate (52% vs. 62%) compared to manâtoâman matchups.
Shot Type vs. Outcome
Insight: Layups convert at 78%, whereas jumpers and threeâpoint attempts convert at only 47% and 35%, respectively. This confirms shot difficulty as a key determinant.
Interesting Aggregates
Question: Do teams shoot better at home?
Team Shooting Percentage by Home/Away:
team | 0 | 1 |
---|---|---|
Atleticos de San German | 0.532051 | 0.523843 |
Cangrejeros de Santurce | 0.548319 | 0.530702 |
Capitanes De Arecibo | 0.536623 | 0.584983 |
Criollos de Caguas | 0.539492 | 0.519732 |
Gigantes de Carolina | 0.534915 | 0.55457 |
Indios de Mayaguez | 0.518834 | 0.53228 |
Leones De Ponce | 0.537983 | 0.544006 |
Mets de Guaynabo | 0.515464 | 0.540253 |
Osos de Manati | 0.548217 | 0.568 |
Piratas de Quebradillas | 0.517524 | 0.562756 |
Santeros de Aguada (ex-CaridurosFajardo) | 0.503263 | 0.547826 |
Vaqueros de Bayamon | 0.520111 | 0.490336 |
Insight: Across 10 of 12 teams, home make rate is 3â5Â percentage points higher, likely reflecting crowd support and court familiarity.
Imputation
- Original data shape: (69457, 56)
- Cleaned data shape: (34031, 39)
Some entries in the play by play are miscellanous pieces of information which do not result in a shot (offensive rebound, defensive rebound, etc.). These rows are where shot=0
so we remove these entries to ensure all remaining shots were properly categorized, improving dataset validity.
Framing a Prediction Problem
Prediction Problem and Type
The problem at hand is a classification problem where we are predicting whether the shot outcome of a play will result in a Miss or a Make. Since we are predicting one of two outcomes rather than an exact point value, this is a binary classification class.
Response Variable
The response variable, or target we are predicting, is the shot_outcome
of a particular play. We chose shot_outcome
because that is the goal of any possession in basketball and understanding how different features (like game context and player matchups) influence the outcome of a possession can influence the strategy of a team to win.
Evaluation Metric
Accuracy was chosen as the evaluation metric due to the datasetâs balanced class distribution (58% Make vs. 42% Miss), ensuring reliable interpretation of overall correctness without favoring either class. Unlike contexts requiring F1-score (e.g., fraud detection) or precision/recall (e.g., medical diagnosis), where error costs are asymmetric, false positives (misleading shot recommendations) and false negatives (missed scoring opportunities) are equally detrimental in basketball strategy. Both errors reduce expected points per possessionâwhether by inefficient shot selection or overly conservative playâand carry identical strategic weight. Accuracyâs simplicity aligns with modern offensive philosophy, which prioritizes holistic efficiency, and its intuitiveness fosters trust among coaches and players. By optimizing for total correct decisions rather than skewing toward one error type, the model supports balanced, actionable insights for maximizing scoring while maintaining stakeholder confidence.
Features Known at Prediction:
At the moment just before a shot is released, the model only has access to these inputs:
- Game Context
total_game_clock
(numeric): Seconds remaining in the game.gameQuarter
(ordinal): Current quarter (1â4).homeEvent
(boolean): Is this a homeâteam possession?
- Defensive Alignment
zone
(boolean): Is the defense set in a zone?
- Special Play Indicators
sob
(boolean): Sideline outâofâbounds play?eob
(boolean): Endâofâgame baseline inbounds play?ato
(boolean): After timeout?
- Shot Details
shot
(categorical): Type of shot (e.g., Layup, Jumper, FreeThrow).
- Player Matchups
O Player Name
(categorical): Shooterâs identity.D Player Name
(categorical): Primary defenderâs identity.
Baseline Model
Model Description
The Baseline Model makes use of Logistic Regression. Logistic regression is a regularized binary classification model that helps prevent overfitting by penalizing large coefficients. This is useful because our data set may have some multicollinearity (when independent variables are highly correlated), and regularization helps improve the generalization of the model
The model is built in a Pipeline that includes a StandardScaler to normalize the continuous numerical features so that they have zero mean and unit variance. This normalization is important for models like Logistic Regression as they are senstivei to the scale of the data.
The baseline model then fits a Logistic Regression model to the processed features.
Model Features
The model includes the following features:
- Nominal Features (Categorical)
zone
: If defense is in zone defense Total Nominal Features: 1
- Ordinal Features
gameQuarter
: Quarter of the game Total Ordinal Features: 1
Model Performance
We evaluate the performance of the baseline of this model by using accuracy, which is a standard metric for binary classification tasks. The accuracy was computed on both the training and test datasets:
- Cross-Validation Training Accuracy: 0.541
- Final Training Accuracy: 0.541
- Test Accuracy: 0.532
These results suggest that the model is not overfitting to the training data as seen in the small difference between the training and test accuracies (1%). This is to be expected as only 2 features were considered in the baseline model.
Model Evaluation
The model is not performing well as it has nearly the same accuracy as a constant model that predicts Make for every play (0.536). There is obviously a lot of room for improvement such as:
-
Model Exploration: While Logistic Regression is helpful in reducing overfitting, we could explore more sophisticated models such as Random Forests, K Nearest Neighbors, Neural Networks, and Naive Bayes to capture non-linear relationships that Logistic Regression might miss due to its assumption of linearity.
-
Increased Complexity: The rolling averages provide useful information, but we may benefit from creating additional features or interactions between existing features, such as player efficiency ratings or adjusted shooting percentages based on opponent defense.
-
Hyperparameter Tuning: The regularization strength (C) in Logistic Regression was not optimized. Hyperparameter tuning via techniques such as Grid Search or Randomized Search could help find a better model configuration.
In conclusion, the current model is a rough baseline, but there is still a lot of potential for improvement. Further experimentation with more advanced models and additional feature engineering could enhance the prediction accuracy.
Final Model
Feature Engineering and Transformations
In the final model, I incorporated a blend of engineered features and transformations to better capture the nuances of basketball shot outcomes:
- Clock Squared (
clock_squared
):- Rationale: The relationship between game time and shot success is non-linear. For example, in the final seconds of a game, players often take rushed shots under heavy defensive pressure, leading to a steeper drop in accuracy. A quadratic term (
total_game_clock²
) better captures this accelerating effect compared to a linear feature. - Data-Generating Process: Fatigue, defensive intensity, and strategic fouling intensify as the game clock winds down. The squared term models this temporal âcrunch timeâ effect explicitly.
- Rationale: The relationship between game time and shot success is non-linear. For example, in the final seconds of a game, players often take rushed shots under heavy defensive pressure, leading to a steeper drop in accuracy. A quadratic term (
- One-Hot Encoded Player and Shot Features:
shot
Type: Different shot types (layup, jumper, etc.) have intrinsic difficulty levels. For instance, layups (high success rate) vs. contested jumpers (lower success rate) reflect distinct physical and skill-based challenges. Encoding these ensures the model distinguishes between shot mechanics.O Player Name
andD Player Name
: Player skill disparities directly influence shot outcomes. For example, a star shooter (e.g., Stephen Curry) has a higher baseline make probability than a bench player. Similarly, an elite defender (e.g., Rudy Gobert) reduces opponentsâ success rates. One-hot encoding allows the model to assign unique weights to each playerâs offensive/defensive impact.
- Standard Scaling:
- Applied to
total_game_clock
andgameQuarter
to standardize their scales. While tree-based models are scale-invariant, scaling benefits logistic regression and neural networks by ensuring gradient descent converges faster.
- Applied to
Model Features
The model includes the following features:
- Quantitative Features (Numerical)
total_game_clock
: Seconds remaining in gameclock_squared
(nonâlinear time feature, =total_game_clock²
) Total Quantitative Features: 2
- Nominal Features (Categorical)
zone
: If defense is in zone defensehomeEvent
: Team on offense is home teamsob
: If play is sideline out of bounds playeob
: If play is end of game baseline inbounds playato
: If play is after timeoutshot
: Type of shot (Layup, Jumper, FreeThrow, etc.)O Player Name
: Offensive player attempting shotD Player Name
: Primary defender on the play Total Nominal Features: 8
- Ordinal Features
gameQuarter
: Quarter of the game Total Ordinal Features: 1
Modeling Algorithm and Hyperparameters
The Logistic Regression model was selected as the final model due to its interpretability, efficiency with high-dimensional data (after one-hot encoding), and strong performance during hyperparameter tuning. Despite testing more complex models (e.g., Random Forests, Neural Networks), logistic regression achieved the highest cross-validation accuracy, suggesting that the relationships between features and shot outcomes are largely linear or additive.
Model Comparison
Model | CV Accuracy |
---|---|
Logistic Regression | 0.647 |
KNN | 0.555 |
Random Forest | 0.651 |
Neural Network | 0.539 |
Naive Bayes. | 0.535 |
Hyperparameter Tuning:
GridSearchCV
tested combinations of:C
(inverse regularization strength): [0.01, 0.1, 1, 10, 100]solver
(optimization algorithm): [âlbfgsâ, âliblinearâ]
- Best params: C = 0.01, solver = âlbfgsâ (stronger regularization to handle the highâdimensional oneâhot vectors)
- Why It Works: This configuration balances bias and variance. A low C value applies stronger regularization, preventing overfitting to the high-dimensional one-hot encoded features (e.g., 8 nominal features expanded to ~150 dimensions). The lbfgs solver efficiently handles multinomial loss with L2 regularization.
Performance Improvements Over Baseline
The final modelâs performance was evaluated against the baseline using accuracy:
Metric | Baseline Model | Final Model |
---|---|---|
Cross-Validation Accuracy | 0.541 | 0.647 |
Test Accuracy | 0.532 | 0.643 |
- +11.1% Test Accuracy: The final model significantly outperforms the baseline, which used only
zone
andgameQuarter
. This demonstrates the value of incorporating player-specific and contextual features. - Generalization: Minimal gap between training and test accuracy (0.647 vs. 0.643) indicates no overfitting, a critical improvement over the baselineâs simplistic structure.
The Final Modelâs performance gains stem from features that mirror the realâworld process of a shot attempt:
- Player Context
By oneâhot encoding bothO Player Name
andD Player Name
, the model internalizes each individualâs shooting and defending tendencies:- Offensive Effects: Star scorers (e.g., Player A shoots at ~58% overall) earn larger positive coefficients, lifting their predicted make probability. Role players or less efficient shooters receive smaller or even negative coefficients.
- Defensive Effects: Elite defenders (e.g., Player X reduces opponent make rate by ~8 pp) register larger negative coefficients, automatically downgrading make probability in tough matchups.
- Interaction Capture: Together these encodings learn implicit shooterâdefender interactions, such as a highâefficiency shooter against a weak perimeter defender producing a boosted make probability.
- Temporal Dynamics
Theclock_squared
feature ((total_game_clock
²)) allows the model to capture nonâlinear time pressure:- Crunch Time DropâOff: When only a few seconds remain, shooting efficiency plunges sharply due to rushed releases and tightened defenseâsomething a linear term underestimates.
- EarlyâGame Flatness: During early possessions (e.g., >600Â s remaining), time pressure has minimal impact, and the squared term flattens out, deferring to the linear clock feature.
- Fatigue & Strategy Shifts: Lateâgame fatigue or strategic fouling further alters success rates; the quadratic term flexibly models these steeper lateâgame effects.
- Situational Flags
Binary indicators (zone
,homeEvent
,sob
,eob
,ato
) each isolate a distinct context:- Zone Defense consistently reduces make probability by ~10Â pp.
- HomeâCourt Advantage adds ~4Â pp to make rate, reflecting crowd support.
- Set Plays (sideline/baseline inbounds and afterâtimeout actions) often yield structured, highâpercentage opportunities, which the model learns separately.
- Shot Difficulty
Oneâhot encoding ofshot
(Layup, Jumper, ThreePoint, FreeThrow) directly embeds intrinsic shot difficulty:- Layups convert at ~78%, jumpers at ~47%, threes at ~35%.
- When combined with player and defender encodings, the model captures nuanced scenarios like a layup attempt by a sharpshooter against weak rim protection.
Together, these engineered features recreate the multifaceted nature of basketball shootingâwhoâs on the court, whoâs defending, how much time remains, the special play context, and the shotâs inherent difficultyâenabling the model to predict shot success with significantly higher accuracy.
Conclusion & Next Steps
Future Work: To further improve the predictive power and practical utility of the shot success model, the following directions could be explored:
- Feature Expansion
- Player-Specific Metrics:
- Incorporate historical player statistics (e.g., offensive playerâs career shooting percentage, defenderâs block/steal rates). For example, a feature like
O_Player_Career_3P%
would contextualize a shooterâs skill. - Add defensive metrics (e.g., defenderâs average contest rate, height differential between offensive and defensive players).
- Incorporate historical player statistics (e.g., offensive playerâs career shooting percentage, defenderâs block/steal rates). For example, a feature like
- Spatial Context:
- Integrate shot location data (e.g., distance from the basket, left/right side of the court) to capture geometric patterns in shooting efficiency.
- Use player tracking data (e.g., speed, defender proximity at shot release) to quantify defensive pressure.
- Temporal and Situational Features:
- Include fatigue indicators (e.g., minutes played by the shooter in the game, days since last game).
- Model momentum effects (e.g., rolling averages of team shooting success over the last 5 possessions).
- Player-Specific Metrics:
- Interpretability and Actionability
- Explainability Tools:
- Apply SHAP (SHapley Additive exPlanations) or LIME to interpret model predictions and identify key drivers of shot success (e.g., âDefender proximity contributes -15% to make probabilityâ).
- Use partial dependence plots to visualize how features like
total_game_clock
non-linearly affect outcomes.
- Deployment for Real-Time Use:
- Build a dashboard for coaches to input situational variables (e.g., defender, shot type) and receive real-time success probabilities.
- Simulate âwhat-ifâ scenarios (e.g., âHow does substituting Player X affect shot success against Zone Defense?â).
- Explainability Tools:
- Ethical and Practical Validation
- Bias Auditing:
- Check for fairness across player demographics (e.g., does the model undervalue shots from rookies or overvalue star players?).
- Validate that recommendations align with basketball ethics (e.g., avoiding over-reliance on risky shot types).
- On-Court Testing:
- Partner with teams to A/B test model-driven strategies (e.g., comparing actual vs. predicted outcomes in controlled scenarios).
- Bias Auditing:
By integrating player-specific, situational, and temporal features, the final model captures the multifaceted nature of basketball shot outcomes, providing actionable insights for optimizing offensive strategies. By expanding features, refining models, and prioritizing interpretability, this analysis could evolve into a decision-support tool that bridges data science and basketball strategy. Future work should focus on closing the gap between predictive accuracy and actionable insights for players, coaches, and analysts.