So far in our Fantasy Basketball Optimization series, we put together a relational database of CSV files with stats and other league info and did some basic exploratory analysis in Power BI. Now it’s time to address the biggest challenge in fantasy basketball: the lock-in dilemma.

What is the “lock-in dilemma”?
In fantasy basketball, players have multiple games per week, but you can only count one of them, so you have to pick one game per week to "lock in" points for. The hard part is that you have to decide before the player's next game starts, so you're constantly asking yourself: "should I take the current points, or hold out for a better score?"
The lock-in concept also throws off our scoring expectations. Most fantasy apps provide projections for individual games, but if we make good lock-in decisions, our final locked totals end up higher than the single-game projections (a gap we refer to as the "lock-in bias").
To ease these challenges, we turned to data and analytics. First, we modeled a probability distribution for single-game totals to estimate probabilities for our lock-in decisions. Then, we applied Monte Carlo simulation to create adjusted projections that account for the lock-in bias.
In this article, we'll cover the theory behind these statistical techniques along with the full implementation in Python. Note that while the final pipeline runs this code in an AWS Lambda function, we'll demo it in a Jupyter notebook for the scope of this article.
Statistical Theory
We’ll start with a little theoretical overview of how we’ll apply our statistical methods. The flow chart below showcases a quick overview of the theory:

Optimized Lock-ins
To help with the lock-in dilemma, we thought of a way to estimate probabilities associated with lock-in decisions.
To start, we model each player’s single game point totals as a normal distribution by finding the mean and standard deviation from past games. Then, using the normal CDF formula, we can estimate the probability of a player doing better (or worse) than a given score in an upcoming game.
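In symbols (reconstructed here to match the description, since the original figure isn't shown inline): if a player's fitted distribution has mean μ and standard deviation σ, the probability of beating a score x in a single game is one minus the normal CDF evaluated at x:

```latex
P(X > x) \;=\; 1 - \Phi\!\left(\frac{x - \mu}{\sigma}\right)
```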

If the player has multiple games left in the week, we derive a formula for the probability of a better score in at least one of the remaining games:
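Since we treat the remaining games as independent draws from the same fitted distribution, a better score fails to occur only if every one of the n remaining games comes in below x. A reconstructed sketch of the formula (consistent with the calc_prob_better function implemented later in this article):

```latex
P(\text{better score in at least one of } n \text{ games})
  \;=\; 1 - \left[\Phi\!\left(\frac{x - \mu}{\sigma}\right)\right]^{n}
```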

Finally, we can apply this formula to any given lock-in scenario. If the probability of scoring better is low (< 50%), we should lock in the current score. If the probability is high (> 50%), we should wait to lock.

Adjusted Projections
As mentioned earlier, good lock-ins typically result in locked totals that are higher than the single-game projections. Thus, we combined our statistical model with Monte Carlo simulation to adjust the projections for the lock-in bias.
To create an adjusted projection for a given player/week, we’ll start by simulating point totals for all of the player’s games that week (using the normal distribution that we fit). We’ll then apply our lock-in optimization logic to flag which of the totals we would’ve locked in:

Finally, we'll repeat this across many simulation repetitions and average the locked results to get our final adjusted projection.
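As a minimal sketch of this procedure (with hypothetical player numbers, not our actual pipeline): simulate each game in the week from the fitted normal, lock the first game where the probability of a better remaining score drops below 50%, and average the locked scores across repetitions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mean, sd, games, nreps = 42.3, 14.6, 3, 20_000  # hypothetical player parameters

locked = np.empty(nreps)
for rep in range(nreps):
    # simulate one week of games from the fitted normal distribution
    points = rng.normal(mean, sd, size=games)
    for i, p in enumerate(points):
        games_left = games - (i + 1)
        # chance that at least one remaining game beats this score
        prob_better = 1 - norm.cdf(p, loc=mean, scale=sd) ** games_left
        if prob_better < 0.5:  # lock in (always true for the final game)
            locked[rep] = p
            break

adjusted = locked.mean()
```

Because we only lock a score when a better one is unlikely, the averaged locked total lands above the 42.3-point single-game mean, which is exactly the lock-in bias the adjusted projections account for.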

Player Comparisons
One way that the adjusted projections help us is by providing better player comparisons.
Suppose we have one spot left in next week’s lineup, and we need to choose between two different players to fill it. We’re trying to compare the two to determine which can get us more points next week.

While the first player has a higher single-game average, the second has a higher standard deviation and more games next week, which actually results in a higher adjusted projection. Thus, we should start the second player, who is actually expected to get us more points at the end of the day.
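To make this concrete, here's a hedged sketch with made-up numbers (not the actual players above): a steady scorer with a higher single-game mean can end up with a lower adjusted projection than a more volatile player who has extra games that week.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def adjusted_projection(mean, sd, games, nreps=20_000):
    """Monte Carlo adjusted projection under the 50% lock-in rule."""
    locked = np.empty(nreps)
    for rep in range(nreps):
        points = rng.normal(mean, sd, size=games)
        for i, p in enumerate(points):
            games_left = games - (i + 1)
            # lock once a better remaining score is unlikely
            if 1 - norm.cdf(p, loc=mean, scale=sd) ** games_left < 0.5:
                locked[rep] = p
                break
    return locked.mean()

# hypothetical players: higher mean vs. higher variance plus more games
steady = adjusted_projection(mean=32.0, sd=8.0, games=2)
volatile = adjusted_projection(mean=30.0, sd=14.0, games=4)
```

The extra games and wider spread give the second player more chances at a big score to lock, which is what pushes the adjusted projection past the steadier player's.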

Matchup Projections
We can also total up the adjusted projections across all of our players to get more accurate overall projections. These matchup projections give us a running comparison against our opponent throughout the week and help us estimate our likelihood of coming out on top.
To make overall matchup projections, we’ll simulate adjusted projections for the players that haven’t been locked yet (with the total games left), and add them up with the current locked totals to calculate overall matchup projections:

We can then average out the totals across all the different repetitions to get an overall matchup projection.

Since the matchup totals can vary as the week goes on, we’ll calculate prediction intervals to estimate how precise our matchup projections are. To do this, we’ll simply find the 5th and 95th percentiles of our repetition totals, which gives us lower and upper bounds that our final total should stay within about 90% of the time.

We can also calculate a win probability from our simulation repetitions. For each rep, we’ll flag whether or not our simulated total was higher than the opponent’s (with 1 being yes and 0 being no) and average out the flag to estimate our win probability.
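A quick illustration of both ideas with simulated stand-in totals (hypothetical numbers; the real pipeline uses the repetition totals from our matchup simulation):

```python
import numpy as np

rng = np.random.default_rng(11)

# stand-in weekly totals for 1,000 simulation repetitions
my_totals = rng.normal(640, 30, size=1_000)
opp_totals = rng.normal(620, 30, size=1_000)

# ~90% prediction interval from the 5th and 95th percentiles of the reps
lo, hi = np.percentile(my_totals, [5, 95])

# win probability: flag the reps where we outscore the opponent, then average
win_prob = (my_totals > opp_totals).mean()
```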

Data Prep
With all the theory figured out, we're ready to start implementing it in Python. We'll start by reading and transforming the data from our relational database of CSV files.
Raw Stats
We'll start with the stats fact table that breaks out each player's total fantasy points by game. The only transformation needed here is making sure the game date column is formatted correctly after being read in from the CSV:
```python
import pandas as pd

# read in the stats data
df_stats = pd.read_csv(f"{folder_path}//all_stats.csv")

# make sure that the game date is formatted correctly
df_stats["GAME_DATE"] = pd.to_datetime(df_stats["GAME_DATE"])

# showcase the data
df_stats.head()
```

Players
We’ll also load in the players table, but there’s some extra stuff we want to add to it. First, we’ll join in the Sleeper players and rosters tables to identify which fantasy team each player is on. Then, we’ll add in the mean and standard deviation for each player’s single game fantasy points for our probability calculations later on:
```python
# read in the players data
df_players = pd.read_csv(f"{folder_path}//raw_players.csv")

# put together a full player name column
df_players["FULL_NAME"] = (df_players["FIRST_NAME"] + " " + df_players["LAST_NAME"]).str.upper()

# read in the Sleeper players and rosters
df_splayers = pd.read_csv(f"{folder_path}//sleeper_players.csv")
df_rosters = pd.read_csv(f"{folder_path}//sleeper_rosters.csv")

# uppercase the names to match the other data
df_splayers["FULL_NAME"] = df_splayers["PLAYER_NAME"].str.upper()

# merge the rosters into the players
df_splayers = (
    df_splayers.merge(df_rosters, how = "left", on = "PLAYER_ID_SLEEPER")
    [["FULL_NAME", "ROSTER_ID", "IS_STARTER", "SEARCH_RANK"]]
)

# join to the other players dataframe
df_players = df_splayers.merge(df_players, how = "left", on = "FULL_NAME")
df_players = df_players.loc[~df_players["PLAYER_ID"].isnull()]

# aggregate the stats data
agg_stats = df_stats.groupby("PLAYER_ID").agg(
    PTS_MEAN = ("FANTASY_POINTS", "mean"),
    PTS_SD = ("FANTASY_POINTS", "std")
)

# join in the aggregated stats
df_players = df_players.merge(agg_stats, how = "left", on = "PLAYER_ID")

# subset to just the columns we want
df_players = df_players[[
    "PLAYER_ID", "FULL_NAME", "TEAM_ID", "PTS_MEAN", "PTS_SD",
    "ROSTER_ID", "IS_STARTER", "SEARCH_RANK"
]]

# make sure that player_id is an integer
df_players["PLAYER_ID"] = df_players["PLAYER_ID"].astype(int)

# showcase the data
df_players.head()
```

Individual Player Schedules
We'll also re-format the games table to break out all the different games by each player:
```python
# read in the team schedules
df_games = pd.read_csv(f"{folder_path}//all_games.csv")

# make sure that the game date is formatted correctly
df_games["GAME_DATE"] = pd.to_datetime(df_games["GAME_DATE"])

# concatenate the home and away teams
df_pgames = pd.concat([
    (
        df_games[["HOME_ID", "GAME_ID", "WEEK_NUMBER", "GAME_DATE"]]
        .rename(columns = {"HOME_ID": "TEAM_ID"})
    ),
    (
        df_games[["GUEST_ID", "GAME_ID", "WEEK_NUMBER", "GAME_DATE"]]
        .rename(columns = {"GUEST_ID": "TEAM_ID"})
    )
])

# join in a few player attributes
df_pgames = df_pgames.merge(
    df_players[["TEAM_ID", "PLAYER_ID", "FULL_NAME"]],
    on = "TEAM_ID", how = "inner"
)

# showcase the data
df_pgames.head()
```

Locked Points
Finally, we’ll load in the locked points CSV from the last post, and we’ll split it up between players we’ve locked vs players that are still open:
```python
# read in the locked points data
df_locks = pd.read_csv(f"{folder_path}//locked_points.csv")

# filter to just the current week
df_locks = df_locks.loc[df_locks["WEEK_NUMBER"] == week_number]

# filter to the players that haven't been locked yet
df_open = df_locks.loc[df_locks["LOCKED_POINTS"].isna()]

# filter to the players that have already been locked
df_locks = df_locks.loc[~df_locks["LOCKED_POINTS"].isna()]

# showcase the data
df_locks.head()
```

Optimize Lock-ins
With all the data loaded, we’re ready to start applying our stats modeling to it. We’ll start with a little function that calculates the probability of a better game from our lock-in optimization formula:
```python
from scipy.stats import norm

# probability that there will be a better score given the parameters and number of games left
def calc_prob_better(current, games_left, mean, sd):
    prob_single = norm.cdf(current, loc = mean, scale = sd)
    prob_any = 1 - (prob_single ** games_left)
    return prob_any

# test out the function
pbetter = calc_prob_better(
    current = 48.1,
    games_left = 3,
    mean = 42.3,
    sd = 14.6
)
print(f"{pbetter:.1%}")
```


Flag Potential Locks
To actually apply this to our team, we’ll write up some code that goes through the players we haven’t locked yet, estimates the probability of them scoring better, and provides recommendations based on the probabilities:
```python
# filter to just the players that we can potentially lock-in
df_options = df_open.loc[df_open["TEAM"] == "My Team"][["PLAYER_NAME", "PLAYER_ID"]]

# get the games that haven't been played yet
dfg2 = (
    df_pgames.loc[
        (df_pgames["PLAYER_ID"].isin(df_options["PLAYER_ID"])) &
        (df_pgames["WEEK_NUMBER"] == week_number) &
        (df_pgames["GAME_DATE"] > pd.Timestamp.now())
    ][["PLAYER_ID", "GAME_ID"]]
    .groupby("PLAYER_ID").agg(
        GAMES = ("GAME_ID", "count")
    ).reset_index()
)

# get the most recent fantasy points
dfs2 = df_stats.loc[
    (df_stats["PLAYER_ID"].isin(df_options["PLAYER_ID"])) &
    (df_stats["WEEK_NUMBER"] == week_number)
][["PLAYER_ID", "GAME_ID", "FANTASY_POINTS"]]
dfs2["RECENT_RANK"] = dfs2.groupby("PLAYER_ID")["FANTASY_POINTS"].rank(ascending = False)
dfs2 = dfs2.loc[dfs2["RECENT_RANK"] == 1]
dfs2 = dfs2[["PLAYER_ID", "FANTASY_POINTS"]]

# join everything together
df_options = (
    df_options.merge(dfg2, on = "PLAYER_ID", how = "left")
    .merge(dfs2, on = "PLAYER_ID", how = "left")
    .merge(df_players[["PLAYER_ID", "PTS_MEAN", "PTS_SD"]], on = "PLAYER_ID", how = "left")
)

# loop through each player and calculate the probability of a better score
for i, row in df_options.iterrows():
    df_options.loc[i, "PROB_BETTER"] = calc_prob_better(
        current = row["FANTASY_POINTS"],
        games_left = row["GAMES"],
        mean = row["PTS_MEAN"],
        sd = row["PTS_SD"]
    )

# sort by the lock probability
df_options = df_options.sort_values("PROB_BETTER").reset_index(drop = True)

# flag whether or not we should lock
df_options["LOCK"] = df_options["PROB_BETTER"] < 0.5

# showcase the data
df_options
```

Adjusted Projections
Another thing we wanted to do was create adjusted projections (that account for lock-in bias) using Monte Carlo Simulation.

We’ll start with a function that simulates future games for a given dataframe of players across a given number of simulation repetitions:
```python
import numpy as np

def simulate_locked_points(df_players, nreps = 100, seedval = 4):
    # set the random seed for reproducibility
    np.random.seed(seedval)

    # expand by the game number
    df = df_players.copy()
    df["GAME_NUM"] = df["GAMES"].apply(lambda x: np.arange(1, x + 1))
    df = df.explode("GAME_NUM")

    # expand by the simulation repetition
    df = df.reset_index(drop = True)
    df["REP_NUM"] = df.apply(lambda x: np.arange(1, nreps + 1), axis = 1)
    df = df.explode("REP_NUM")

    # simulate the points values
    df["SIM_POINTS"] = df.apply(
        lambda x: np.random.normal(loc = x["PTS_MEAN"], scale = x["PTS_SD"]),
        axis = 1
    )

    # calculate the number of games left
    df["GAMES_LEFT"] = df["GAMES"] - df["GAME_NUM"]

    # compute which games we will lock in
    df["PROB_BETTER"] = df.apply(lambda x: calc_prob_better(
        current = x["SIM_POINTS"],
        games_left = x["GAMES_LEFT"],
        mean = x["PTS_MEAN"],
        sd = x["PTS_SD"]
    ), axis = 1)
    df = df.loc[df["PROB_BETTER"] < 0.5]
    df = df.sort_values(by = ["REP_NUM", "GAME_NUM"]).groupby(["PLAYER_ID", "REP_NUM"]).head(1)

    return df

# test out the function
simulate_locked_points(pd.DataFrame({
    "PLAYER_ID": [1, 2],
    "PTS_MEAN": [30, 30],
    "PTS_SD": [10, 10],
    "GAMES": [4, 3]
}))
```

We'll also code up a little wrapper function that calls that simulate_locked_points function and averages the locked totals across each player to get our adjusted player projections:
```python
def compute_adjusted_projections(df_players):
    # simulate the locked points
    df_sims = simulate_locked_points(df_players)

    # calculate the adjusted projections
    df_adjusted = df_sims.groupby(list(df_players.columns)).agg({
        "SIM_POINTS": "mean"
    }).reset_index()

    return df_adjusted

# test out the function
compute_adjusted_projections(pd.DataFrame({
    "PLAYER_NAME": ["Player 1", "Player 2"],
    "PLAYER_ID": [1, 2],
    "PTS_MEAN": [30, 30],
    "PTS_SD": [10, 10],
    "GAMES": [4, 3]
}))
```

Player Comparisons
One of the ways we planned on using these adjusted projections is providing better player comparisons.

The following code takes a list of players and computes their adjusted projections for a given week so that we can compare them:
```python
compare_players = ["JONAS VALANCIUNAS", "DYSON DANIELS"]
compare_week = 12

# filter down to the players we want to compare
df_players2 = df_players.loc[df_players["FULL_NAME"].isin(compare_players)]

# get the number of games in the week
agg_games = df_pgames.loc[df_pgames["WEEK_NUMBER"] == compare_week].groupby(["PLAYER_ID"]).agg(
    GAMES = ("GAME_ID", "count")
)

# join in the number of games
df_players2 = df_players2.merge(agg_games, on = "PLAYER_ID", how = "left")

# calculate the adjusted projections
df_adjusted = compute_adjusted_projections(df_players2)

# showcase each of the projections
for i, row in df_adjusted.iterrows():
    print(f"\n{row['FULL_NAME']} is projected to score {row['SIM_POINTS']:.1f} points this week")
    print(f"  ({row['GAMES']} games with mean = {row['PTS_MEAN']:.1f} and sd = {row['PTS_SD']:.1f})")
```

Simulate Matchup
We also planned on using the adjusted projections to provide better overall matchup projections.

We’ll start with a chunk of code that simulates points for all of the players who haven’t been locked in yet:
```python
# filter to the players that we need to simulate
sim_players = df_players.merge(
    df_open[["PLAYER_ID", "TEAM"]],
    on = "PLAYER_ID", how = "inner"
)

# filter to the games left this week
current_games = df_pgames.loc[
    (df_pgames["WEEK_NUMBER"] == week_number) &
    (df_pgames["GAME_DATE"] > current_date)
]

# aggregate by player
current_games = current_games.groupby("PLAYER_ID").agg(
    GAMES = ("GAME_ID", "count")
)

# join in the number of games left
sim_players = sim_players.merge(current_games, on = "PLAYER_ID", how = "left")

# simulate the locked points
df_sims = simulate_locked_points(sim_players)

# showcase the data
df_sims.head()
```

We'll then aggregate the simulated totals by team and simulation repetition. Then, we'll total up the actual locked points so far and add them to the simulated totals to get an overall matchup total for each repetition:
```python
# aggregate the simulation reps by team
df_reps = df_sims.groupby(["TEAM", "REP_NUM"]).agg(
    SIM_TOTAL = ("SIM_POINTS", "sum")
).reset_index()

# summarize the totals that are already locked in
locked_totals = df_locks.groupby("TEAM").agg(
    LOCKED_POINTS = ("LOCKED_POINTS", "sum")
)

# join in the locked points and add to the totals
df_reps = df_reps.merge(locked_totals, on = "TEAM", how = "left")
df_reps["LOCKED_POINTS"] = df_reps["LOCKED_POINTS"].fillna(0)
df_reps["TOTAL_POINTS"] = df_reps["SIM_TOTAL"] + df_reps["LOCKED_POINTS"]

# showcase the data
df_reps.head()
```

Finally, we’ll average out the matchup totals across each team to get point estimates for our total vs the opponent’s. We’ll also add in the currently locked totals and number of players left to lock to provide us a little more info about where we’re at in the matchup:
```python
# calculate the overall totals
df_totals = df_reps.groupby("TEAM").agg(
    LOCKED_POINTS = ("LOCKED_POINTS", "mean"),
    EST_TOTAL = ("TOTAL_POINTS", "mean")
).reset_index()

# calculate the players left to lock
dfr = df_open.groupby("TEAM").agg(
    PLAYERS_LEFT = ("PLAYER_ID", "count")
).reset_index()

# join in the players left to lock
df_totals = df_totals.merge(dfr, on = "TEAM", how = "left")
df_totals["PLAYERS_LEFT"] = df_totals["PLAYERS_LEFT"].fillna(0)

# showcase the data
df_totals
```


Prediction Intervals
Part of our new matchup projections are prediction intervals that showcase how uncertain each of the projections are. To create these, we’ll simply calculate percentiles of total points across the repetitions:
```python
# calculate the overall totals with 5th/95th percentile bounds
df_totals = df_reps.groupby("TEAM").agg(
    PCT05 = ("TOTAL_POINTS", lambda x: np.percentile(x, 5)),
    AVG = ("TOTAL_POINTS", "mean"),
    PCT95 = ("TOTAL_POINTS", lambda x: np.percentile(x, 95))
).sort_values("TEAM").reset_index()

# print the overall projections
for i, row in df_totals.iterrows():
    print(f"\n{row['TEAM']}: {row['AVG']:,.1f} total points")
    print(f"  (between {row['PCT05']:.1f} and {row['PCT95']:.1f})")
```


Win Probabilities
The last thing we wanted to add is our win probability (based on the simulation), which we'll calculate by flagging repetitions where our total was higher and averaging out the flag:
```python
# get the rep totals for my team
dfr1 = (
    df_reps.loc[df_reps["TEAM"] == "My Team"]
    [["REP_NUM", "TOTAL_POINTS"]]
    .rename(columns = {"TOTAL_POINTS": "MY_TEAM"})
)

# get the rep totals for the opponent
dfr2 = (
    df_reps.loc[df_reps["TEAM"] == "Opponent"]
    [["REP_NUM", "TOTAL_POINTS"]]
    .rename(columns = {"TOTAL_POINTS": "OPPONENT"})
)

# join the two together and calculate the result
df_match = dfr1.merge(dfr2, on = "REP_NUM", how = "inner")
df_match["RESULT"] = np.where(df_match["MY_TEAM"] > df_match["OPPONENT"], 1, 0)

# calculate the win probability
win_prob = df_match["RESULT"].mean()
print(f"My team has a {win_prob:.1%} chance of winning this week")
```


Conclusion
By applying some statistical analysis to the data from our relational database, we were able to address the lock-in dilemma to help optimize our team. However, everything up to this point has been done through local Jupyter notebooks, so the rest of the series covers how we moved the code into an automated data pipeline with AWS cloud services.
Until then, feel free to download the full Jupyter notebook of Python code and try it out yourself.
If you found this guide helpful, please drop a comment below with questions and/or feedback. Your input helps us improve future posts and inspire new ideas!
Thanks for joining us today, and we hope to see you next time!