Addressing the Lock-in Dilemma

So far in our Fantasy Basketball Optimization series, we put together a relational database of CSV files with stats and other league info and did some basic exploratory analysis in Power BI. Now it’s time to address the biggest challenge in fantasy basketball: the lock-in dilemma.

What is the “lock-in dilemma”?

In fantasy basketball, players have multiple games per week, but you have to pick one game each week to "lock in" points for. The hard part is that you have to decide before the player's next game starts, so you're constantly asking yourself: "should I take the current points, or hold out for a better score?"

This lock-in mechanic also throws off our scoring expectations. Most fantasy apps provide projections for individual games, but if we make good lock-in decisions, our final locked totals end up higher than the single-game projections (a gap we refer to as the "lock-in bias").

To tackle these challenges, we turned to data and analytics. First, we modeled a probability distribution for single-game totals to estimate probabilities for our lock-in decisions. Then, we applied Monte Carlo simulation to create adjusted projections that account for the lock-in bias.

In this article, we’ll cover the theory behind these statistical techniques along with the full implementation in Python. Note that while the final pipeline runs this code in an AWS Lambda function, we’ll demo it in a Jupyter notebook for the scope of this article.

Statistical Theory

We’ll start with a little theoretical overview of how we’ll apply our statistical methods. The flow chart below showcases a quick overview of the theory:

Optimized Lock-ins

To help with the lock-in dilemma, we thought of a way to estimate probabilities associated with lock-in decisions.

To start, we model each player’s single game point totals as a normal distribution by finding the mean and standard deviation from past games. Then, using the normal CDF formula, we can estimate the probability of a player doing better (or worse) than a given score in an upcoming game.
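As a quick illustration of that idea (with made-up numbers, not our actual league data), we can fit a normal distribution to a handful of past game totals and estimate the chance of beating a target score in the next game:

```python
import numpy as np
from scipy.stats import norm

# hypothetical past fantasy point totals for one player
past_points = np.array([38.5, 51.0, 29.5, 44.0, 47.5, 36.0])

# fit the normal distribution from past games
mu = past_points.mean()
sigma = past_points.std(ddof=1)  # sample standard deviation

# probability of beating 45 points in the next single game
p_better = 1 - norm.cdf(45, loc=mu, scale=sigma)
print(f"P(next game > 45) = {p_better:.1%}")
```

Here the player averages about 41 points, so beating 45 in any single game is less likely than not.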

If the player has multiple games left in the week, we can extend this to the probability of a better score in at least one of the remaining games. Assuming the games are independent, the chance that all n remaining games come in at or below the current score x is F(x)^n, which gives us:

P(at least one better score) = 1 - F(x)^n

where F is the CDF of the player's fitted normal distribution.

Finally, we can apply this formula to any given lock-in scenario. If the probability of scoring better is low (< 50%), we should lock in the current score. If the probability is high (> 50%), we should wait to lock.
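Another way to read this rule (a sketch of the same math, not part of the original pipeline): we lock whenever the current score exceeds the 0.5^(1/n) quantile of the player's fitted distribution, where n is the number of games left. Here's what that threshold looks like for a hypothetical player:

```python
from scipy.stats import norm

def lock_threshold(games_left, mean, sd):
    """Score above which the chance of a better remaining game drops below 50%.

    Derived from 1 - F(x)**games_left < 0.5, i.e. F(x) > 0.5 ** (1 / games_left).
    """
    if games_left == 0:
        return float("-inf")  # no games left, so any score should be locked
    return mean + sd * norm.ppf(0.5 ** (1 / games_left))

# hypothetical player: mean of 42.3 and standard deviation of 14.6
for k in (3, 2, 1):
    print(f"{k} game(s) left: lock anything above {lock_threshold(k, 42.3, 14.6):.1f}")
```

With one game left the threshold is just the player's mean, and it climbs as more games remain, matching the intuition that early in the week you should only settle for above-average scores.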

Adjusted Projections

As mentioned earlier, good lock-ins typically result in higher locked totals than the single-game projections. Thus, we combined our statistical model with Monte Carlo simulation to adjust the projections for the lock-in bias.

To create an adjusted projection for a given player/week, we’ll start by simulating point totals for all of the player’s games that week (using the normal distribution that we fit). We’ll then apply our lock-in optimization logic to flag which of the totals we would’ve locked in:

Finally, we’ll repeat this across many simulation repetitions and average the results to get our final adjusted projection.

Player Comparisons

One way that the adjusted projections help us is by providing better player comparisons.

Suppose we have one spot left in next week’s lineup, and we need to choose between two different players to fill it. We’re trying to compare the two to determine which can get us more points next week.

While the first player has a higher single-game average, the second has a higher standard deviation and more games next week, which results in a higher adjusted projection. Thus, we should start the second player, who is expected to get us more points by the end of the week.
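To see why that can happen, here's a small standalone simulation (with illustrative numbers, not our actual players) of the lock-in rule applied to two player profiles: Player A with a higher mean but only 2 games, and Player B with a lower mean but more variance and 4 games:

```python
import numpy as np
from scipy.stats import norm

def adjusted_projection(mean, sd, games, nreps=20000, seed=4):
    """Average locked total under the lock-in rule: lock a game's score
    when the chance of beating it in a remaining game is below 50%."""
    rng = np.random.default_rng(seed)
    sims = rng.normal(mean, sd, size=(nreps, games))
    locked = np.empty(nreps)
    for rep in range(nreps):
        for game in range(games):
            games_left = games - game - 1
            prob_better = 1 - norm.cdf(sims[rep, game], mean, sd) ** games_left
            if prob_better < 0.5:  # always true for the last game
                locked[rep] = sims[rep, game]
                break
    return locked.mean()

# Player A: higher average, fewer games; Player B: more variance, more games
proj_a = adjusted_projection(mean=35, sd=8, games=2)
proj_b = adjusted_projection(mean=30, sd=12, games=4)
print(f"Player A adjusted projection: {proj_a:.1f}")
print(f"Player B adjusted projection: {proj_b:.1f}")
```

Even though Player B's single-game average is 5 points lower, the extra games and higher variance give the lock-in rule more chances to catch an above-average score, so B's adjusted projection comes out ahead in this setup.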

Matchup Projections

We can also total up the adjusted projections across all of our players to get more accurate overall projections. These matchup projections give us a running comparison against our opponent throughout the week and help us estimate our likelihood of winning.

To make these, we’ll simulate locked points for the players that haven’t been locked yet (using their remaining games), then add the current locked totals to get an overall matchup total for each repetition:

We can then average out the totals across all the different repetitions to get an overall matchup projection.

Since the matchup totals can vary as the week goes on, we’ll calculate prediction intervals to estimate how precise our matchup projections are. To do this, we’ll simply find the 5th and 95th percentiles of our repetition totals, which gives us lower and upper bounds that our final total should stay within about 90% of the time.

We can also calculate a win probability from our simulation repetitions. For each rep, we’ll flag whether or not our simulated total was higher than the opponent’s (with 1 being yes and 0 being no) and average out the flag to estimate our win probability.

Data Prep

With all the theory figured out, we’re ready to start implementing it with Python. We’ll start by reading and transforming the data from our relational database of CSV files.

Raw Stats

We’ll start with the stats fact table that breaks out each player’s total fantasy points by game. The only transformation needed here is making sure the game date column is parsed correctly after being read in from the CSV.

# imports used throughout this notebook
import numpy as np
import pandas as pd
from scipy.stats import norm

# read in the stats data (folder_path points to our CSV database folder)
df_stats = pd.read_csv(f"{folder_path}//all_stats.csv")

# make sure that the game date is formatted correctly
df_stats["GAME_DATE"] = pd.to_datetime(df_stats["GAME_DATE"])

# showcase the data
df_stats.head()

Players

We’ll also load in the players table, but there’s some extra stuff we want to add to it. First, we’ll join in the Sleeper players and rosters tables to identify which fantasy team each player is on. Then, we’ll add in the mean and standard deviation for each player’s single game fantasy points for our probability calculations later on:

# read in the players data 
df_players = pd.read_csv(f"{folder_path}//raw_players.csv") 
 
# put together a full player name column 
df_players["FULL_NAME"] = (df_players["FIRST_NAME"] + " " + df_players["LAST_NAME"]).str.upper() 
 
# read in the Sleeper players and rosters 
df_splayers = pd.read_csv(f"{folder_path}//sleeper_players.csv") 
df_rosters = pd.read_csv(f"{folder_path}//sleeper_rosters.csv") 
 
# uppercase the names to match the other data 
df_splayers["FULL_NAME"] = df_splayers["PLAYER_NAME"].str.upper() 
 
# merge the rosters into the players 
df_splayers = (
    df_splayers.merge(df_rosters, how = "left", on = "PLAYER_ID_SLEEPER") 
    [["FULL_NAME", "ROSTER_ID", "IS_STARTER", "SEARCH_RANK"]]
) 
 
# join to the other players dataframe 
df_players = df_splayers.merge(df_players, how = "left", on = "FULL_NAME") 
df_players = df_players.loc[~df_players["PLAYER_ID"].isnull()] 
 
# aggregate the stats data 
agg_stats = df_stats.groupby("PLAYER_ID").agg(
    PTS_MEAN = ("FANTASY_POINTS", "mean"), 
    PTS_SD = ("FANTASY_POINTS", "std") 
) 
 
# join in the aggregated stats 
df_players = df_players.merge(agg_stats, how = "left", on = "PLAYER_ID") 
 
# subset to just the columns we want 
df_players = df_players[[
    "PLAYER_ID", "FULL_NAME", "TEAM_ID", "PTS_MEAN", "PTS_SD", 
    "ROSTER_ID", "IS_STARTER", "SEARCH_RANK"
]] 
 
# make sure that player_id is an integer 
df_players["PLAYER_ID"] = df_players["PLAYER_ID"].astype(int) 
 
# showcase the data 
df_players.head() 

Individual Player Schedules

We’ll also re-format the games table to break out all the different games by each player:

# read in the team schedules 
df_games = pd.read_csv(f"{folder_path}//all_games.csv") 
 
# make sure that the game date is formatted correctly 
df_games["GAME_DATE"] = pd.to_datetime(df_games["GAME_DATE"]) 
 
# concatenate the home and away teams 
df_pgames = pd.concat([
    (
        df_games[["HOME_ID", "GAME_ID", "WEEK_NUMBER", "GAME_DATE"]]
        .rename(columns = {"HOME_ID": "TEAM_ID"})
    ), 
    (
        df_games[["GUEST_ID", "GAME_ID", "WEEK_NUMBER", "GAME_DATE"]]
        .rename(columns = {"GUEST_ID": "TEAM_ID"}) 
    ) 
]) 
 
# join in a few player attributes 
df_pgames = df_pgames.merge(
    df_players[["TEAM_ID", "PLAYER_ID", "FULL_NAME"]], 
    on = "TEAM_ID", 
    how = "inner"
) 
 
# showcase the data 
df_pgames.head()

Locked Points

Finally, we’ll load in the locked points CSV from the last post, and we’ll split it up between players we’ve locked vs players that are still open:

# read in the locked points data 
df_locks = pd.read_csv(f"{folder_path}//locked_points.csv") 
 
# filter to just the current week 
df_locks = df_locks.loc[df_locks["WEEK_NUMBER"] == week_number] 
 
# filter to the players that haven't been locked yet 
df_open = df_locks.loc[df_locks["LOCKED_POINTS"].isna()]
 
# filter to the players that have already been locked 
df_locks = df_locks.loc[~df_locks["LOCKED_POINTS"].isna()] 
 
# showcase the data 
df_locks.head() 

Optimize Lock-ins

With all the data loaded, we’re ready to start applying our stats modeling to it. We’ll start with a little function that calculates the probability of a better game from our lock-in optimization formula:

# probability that there will be a better score given the parameters and number of games left 
def calc_prob_better(current, games_left, mean, sd):
 
    prob_single = norm.cdf(current, loc = mean, scale = sd) 
    prob_any = 1 - (prob_single ** games_left) 
 
    return prob_any 
 
# test out the function 
pbetter = calc_prob_better(
    current = 48.1, 
    games_left = 3, 
    mean = 42.3, 
    sd = 14.6 
) 
print(f"{pbetter:.1%}")

Flag Potential Locks

To actually apply this to our team, we’ll write up some code that goes through the players we haven’t locked yet, estimates the probability of them scoring better, and provides recommendations based on the probabilities:

# filter to just the players that we can potentially lock-in 
df_options = df_open.loc[df_open["TEAM"] == "My Team"][["PLAYER_NAME", "PLAYER_ID"]] 
 
# get the games that haven't been played yet  
dfg2 = (
    df_pgames.loc[
        (df_pgames["PLAYER_ID"].isin(df_options["PLAYER_ID"])) & 
        (df_pgames["WEEK_NUMBER"] == week_number) & 
        (df_pgames["GAME_DATE"] > pd.Timestamp.now()) 
    ][["PLAYER_ID", "GAME_ID"]] 
    .groupby("PLAYER_ID").agg(
        GAMES = ("GAME_ID", "count") 
    ).reset_index() 
) 
 
# get the most recent fantasy points 
dfs2 = df_stats.loc[
    (df_stats["PLAYER_ID"].isin(df_options["PLAYER_ID"])) & 
    (df_stats["WEEK_NUMBER"] == week_number)
][["PLAYER_ID", "GAME_DATE", "FANTASY_POINTS"]] 

# rank by game date so that rank 1 is each player's latest game this week 
dfs2["RECENT_RANK"] = dfs2.groupby("PLAYER_ID")["GAME_DATE"].rank(ascending = False) 
dfs2 = dfs2.loc[dfs2["RECENT_RANK"] == 1] 
dfs2 = dfs2[["PLAYER_ID", "FANTASY_POINTS"]]  
 
# join everything together 
df_options = (
    df_options.merge(dfg2, on = "PLAYER_ID", how = "left") 
    .merge(dfs2, on = "PLAYER_ID", how = "left") 
    .merge(df_players[["PLAYER_ID", "PTS_MEAN", "PTS_SD"]], on = "PLAYER_ID", how = "left") 
) 
 
# loop through each player and calculate the probability of a better score 
for i, row in df_options.iterrows(): 
    df_options.loc[i, "PROB_BETTER"] = calc_prob_better(
        current = row["FANTASY_POINTS"], 
        games_left = row["GAMES"], 
        mean = row["PTS_MEAN"], 
        sd = row["PTS_SD"]
    ) 
 
# sort by the lock probability 
df_options = df_options.sort_values("PROB_BETTER").reset_index(drop = True) 
 
# flag whether or not we should lock 
df_options["LOCK"] = df_options["PROB_BETTER"] < 0.5
 
# showcase the data 
df_options 

Adjusted Projections

Another thing we wanted to do was create adjusted projections (that account for lock-in bias) using Monte Carlo Simulation.

We’ll start with a function that simulates future games for a given dataframe of players across a given number of simulation repetitions:

def simulate_locked_points(df_players, nreps = 100, seedval = 4): 
 
    # set the random seed for reproducibility 
    np.random.seed(seedval) 
 
    # expand by the game number 
    df = df_players.copy() 
    df["GAME_NUM"] = df["GAMES"].apply(lambda x: np.arange(1, x + 1)) 
    df = df.explode("GAME_NUM") 
 
    # expand by the simulation repetition 
    df = df.reset_index(drop = True) 
    df["REP_NUM"] = df.apply(lambda x: np.arange(1, nreps + 1), axis = 1) 
    df = df.explode("REP_NUM") 
 
    # simulate the points values 
    df["SIM_POINTS"] = df.apply(
        lambda x: np.random.normal(loc = x["PTS_MEAN"], scale = x["PTS_SD"]), axis = 1
    ) 
 
    # calculate the number of games left 
    df["GAMES_LEFT"] = df["GAMES"] - df["GAME_NUM"] 
 
    # compute which games we will lock in 
    df["PROB_BETTER"] = df.apply(lambda x: calc_prob_better(
        current = x["SIM_POINTS"], 
        games_left = x["GAMES_LEFT"], 
        mean = x["PTS_MEAN"], 
        sd = x["PTS_SD"] 
    ), axis = 1) 
    df = df.loc[df["PROB_BETTER"] < 0.5]  
    df = df.sort_values(by = ["REP_NUM", "GAME_NUM"]).groupby(["PLAYER_ID", "REP_NUM"]).head(1) 
 
    return df 
 
# test out the function 
simulate_locked_points(pd.DataFrame({
    "PLAYER_ID": [1, 2], 
    "PTS_MEAN": [30, 30], 
    "PTS_SD": [10, 10], 
    "GAMES": [4, 3] 
})) 

We’ll also code up a little wrapper function that calls that simulate_locked_points function and averages the locked totals across each player to get our adjusted player projections:

def compute_adjusted_projections(df_players):
 
    # simulate the locked points 
    df_sims = simulate_locked_points(df_players) 
 
    # calculate the adjusted projections 
    df_adjusted = df_sims.groupby(list(df_players.columns)).agg({
        "SIM_POINTS": "mean"
    }).reset_index() 
 
    return df_adjusted 
 
# test out the function 
compute_adjusted_projections(pd.DataFrame({
    "PLAYER_NAME": ["Player 1", "Player 2"], 
    "PLAYER_ID": [1, 2], 
    "PTS_MEAN": [30, 30], 
    "PTS_SD": [10, 10], 
    "GAMES": [4, 3] 
}))

Player Comparisons

One of the ways we planned on using these adjusted projections is providing better player comparisons.

The following code takes a list of players and computes their adjusted projections for a given week so that we can compare them:

compare_players = ["JONAS VALANCIUNAS", "DYSON DANIELS"] 
compare_week = 12
 
# filter down to the players we want to compare 
df_players2 = df_players.loc[df_players["FULL_NAME"].isin(compare_players)] 
 
# get the number of games in the week 
agg_games = df_pgames.loc[df_pgames["WEEK_NUMBER"] == compare_week].groupby(["PLAYER_ID"]).agg(
    GAMES = ("GAME_ID", "count")
) 
 
# join in the number of games 
df_players2 = df_players2.merge(agg_games, on = "PLAYER_ID", how = "left") 
 
# calculate the adjusted projections 
df_adjusted = compute_adjusted_projections(df_players2) 
 
# showcase each of the projections 
for i, row in df_adjusted.iterrows():
    print(f"\n{row['FULL_NAME']} is projected to score {row['SIM_POINTS']:.1f} points this week ")
    print(f"  ({row['GAMES']} games with mean = {row['PTS_MEAN']:.1f} and sd = {row['PTS_SD']:.1f})") 

Simulate Matchup

We also planned on using the adjusted projections to provide better overall matchup projections.

We’ll start with a chunk of code that simulates points for all of the players who haven’t been locked in yet:

# filter to the players that we need to simulate 
sim_players = df_players.merge(
    df_open[["PLAYER_ID", "TEAM"]], 
    on = "PLAYER_ID", 
    how = "inner"
) 
 
# filter to the games left this week 
current_games = df_pgames.loc[
    (df_pgames["WEEK_NUMBER"] == week_number) & 
    (df_pgames["GAME_DATE"] > pd.Timestamp.now())
] 
 
# aggregate by player 
current_games = current_games.groupby("PLAYER_ID").agg(
    GAMES = ("GAME_ID", "count")
) 
 
# join in the number of games left 
sim_players = sim_players.merge(current_games, on = "PLAYER_ID", how = "left") 
 
# simulate the locked points 
df_sims = simulate_locked_points(sim_players) 
 
# showcase the data 
df_sims.head() 

We’ll then aggregate the simulated totals by team and simulation repetition. Then, we’ll total up all the points locked in so far and add them to the simulated totals to get the overall matchup total for each repetition:

# aggregate the simulation reps by team 
df_reps = df_sims.groupby(["TEAM", "REP_NUM"]).agg( 
    SIM_TOTAL = ("SIM_POINTS", "sum")
).reset_index() 
 
# summarize the totals that are already locked in 
locked_totals = df_locks.groupby("TEAM").agg(
    LOCKED_POINTS = ("LOCKED_POINTS", "sum")
) 
 
# join in the locked points and add to the totals 
df_reps = df_reps.merge(locked_totals, on = "TEAM", how = "left") 
df_reps["LOCKED_POINTS"] = df_reps["LOCKED_POINTS"].fillna(0) 
df_reps["TOTAL_POINTS"] = df_reps["SIM_TOTAL"] + df_reps["LOCKED_POINTS"] 
 
# showcase the data 
df_reps.head() 

Finally, we’ll average the matchup totals across each team to get point estimates for our total vs. the opponent’s. We’ll also add in the currently locked totals and the number of players left to lock to give us a little more context on where we stand in the matchup:

# calculate the overall totals 
df_totals = df_reps.groupby("TEAM").agg(
    LOCKED_POINTS = ("LOCKED_POINTS", "mean"), 
    EST_TOTAL = ("TOTAL_POINTS", "mean") 
).reset_index() 
 
# calculate the players left to lock 
dfr = df_open.groupby("TEAM").agg(
    PLAYERS_LEFT = ("PLAYER_ID", "count")
).reset_index() 
 
# join in the players left to lock 
df_totals = df_totals.merge(dfr, on = "TEAM", how = "left") 
df_totals["PLAYERS_LEFT"] = df_totals["PLAYERS_LEFT"].fillna(0) 
 
# showcase the data 
df_totals 

Prediction Intervals

Our matchup projections also come with prediction intervals that show how uncertain each projection is. To create these, we’ll simply calculate percentiles of the total points across the repetitions:

# calculate the overall totals 
df_totals = df_reps.groupby("TEAM").agg(
    PCT05 = ("TOTAL_POINTS", lambda x: np.percentile(x, 5)), 
    AVG = ("TOTAL_POINTS", "mean"),
    PCT95 = ("TOTAL_POINTS", lambda x: np.percentile(x, 95))
).sort_values("TEAM").reset_index() 
 
# print the overall projections 
for i, row in df_totals.iterrows():
    print(f"\n{row['TEAM']}: {row['AVG']:,.1f} total points" ) 
    print(f"    (between {row['PCT05']:.1f} and {row['PCT95']:.1f})") 

Win Probabilities

The last thing we wanted to add is our win probability (based on the simulation), which we’ll calculate by flagging the repetitions where our total was higher and averaging out the flags:

# get the rep totals for my team 
dfr1 = (
    df_reps.loc[df_reps["TEAM"] == "My Team"] 
    [["REP_NUM", "TOTAL_POINTS"]]
    .rename(columns = {"TOTAL_POINTS": "MY_TEAM"}) 
) 
 
# get the rep totals for the opponent 
dfr2 = (
    df_reps.loc[df_reps["TEAM"] == "Opponent"] 
    [["REP_NUM", "TOTAL_POINTS"]] 
    .rename(columns = {"TOTAL_POINTS": "OPPONENT"}) 
) 
 
# join the two together and calculate the result 
df_match = dfr1.merge(dfr2, on = "REP_NUM", how = "inner") 
df_match["RESULT"] = np.where(df_match["MY_TEAM"] > df_match["OPPONENT"], 1, 0) 
 
# calculate the win probability 
win_prob = df_match["RESULT"].mean() 
print(f"My team has a {win_prob:.1%} chance of winning this week") 

Conclusion

By applying some statistical analysis to the data from our relational database, we were able to address the lock-in dilemma to help optimize our team. However, everything up to this point has been done through local Jupyter notebooks, so the rest of the series covers how we moved the code into an automated data pipeline with AWS cloud services.

Until then, feel free to download the full Jupyter notebook of Python code and try it out yourself:

If you found this guide helpful, please drop a comment below with questions and/or feedback. Your input helps us improve future posts and inspire new ideas!

Thanks for joining us today, and we hope to see you next time!
