The Cracker Barrel golf tee game is a classic puzzle where you jump tees over each other to remove them, aiming to leave as few tees on the board as possible. Leaving exactly one tee wins the game and earns you the board's "genius" rating. But how do you win?
We took a brute-force approach to solving it with a Python algorithm that evaluates every single potential move and finds all of the winning paths. That's a lot to evaluate, so we used a technique called "parallelization" to run it efficiently with Apache Spark in Microsoft Fabric.
This article showcases how we did this, including the basics of our algorithm, the implementation with PySpark, and a real-world use case that mirrors it.
The Setup
Base Notation
First, we’ll define notation for moves and board positions. Each hole is labeled in the format “[row number].[column number]”. For instance, “3.1” for the first hole in the third row.
With this notation, we can specify any board position as a list of all the holes containing tees. This allows us to clearly describe the current board layout at any given time.

We also use notation to indicate which directions each tee can potentially move. This may include straight left (L) or right (R), or diagonally (top-left (TL), top-right (TR), bottom-left (BL), or bottom-right (BR)). The full breakdown with abbreviations is shown in the visual below.
We use a combination of the position and direction identifiers to define a given move. For instance, “3.2R” specifies that we’ll move the tee from position 3.2 straight right to position 3.4 (thus jumping over 3.3 and eliminating the tee there).

With that notation for each move, we can combine them into a single string to represent the entire movement path (with each move separated by a “-“). Note that you can select your starting point by choosing which hole begins without a tee, so we’ll specify the empty position at the start of the path string (i.e. “E1.1” means that the very top position starts empty).
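A tiny illustration of this path-string convention in plain Python (the helper name is ours, not the article's). The sample path is the opening jump into 1.1 followed by a jump into the hole that move left behind:

```python
# Build a path string: the empty starting hole ("E" prefix),
# then each move, joined with "-".
def path_string(empty_hole, moves):
    return "-".join([f"E{empty_hole}"] + moves)

path = path_string("1.1", ["3.1TR", "5.3TL"])   # -> "E1.1-3.1TR-5.3TL"
empty, *moves = path.split("-")                 # and it splits back apart
```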
The visual below illustrates a sample path string for a few initial moves.

The Algorithm
With our notation defined, we can now talk through our algorithm for finding all the potential moves. To start, we made a little function that takes the current state of the board along with a potential position and move direction, and it determines whether or not the move is valid.
If the move isn't valid, the function returns nothing; if it is, it returns the full move definition along with the new state of the board after the move is done. This lets us check each potential move and gives us the new board position for the next iteration.
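A minimal sketch of that validity check in plain Python (function and variable names are illustrative, not the article's exact code). The board is a set of occupied holes like `{"2.1", "3.3", ...}`, where hole "r.c" sits in row r (1-5), column c (1 to r); the direction deltas are one consistent reading of the move diagram:

```python
DIRS = {  # direction -> (step to the jumped hole, step to the landing hole)
    "L":  ((0, -1), (0, -2)),   "R":  ((0, 1), (0, 2)),
    "TL": ((-1, -1), (-2, -2)), "TR": ((-1, 0), (-2, 0)),
    "BL": ((1, 0), (2, 0)),     "BR": ((1, 1), (2, 2)),
}

def on_board(r, c):
    # Row r of the triangle has holes in columns 1..r.
    return 1 <= r <= 5 and 1 <= c <= r

def check_move(board, pos, direction):
    """Return (move_string, new_board) if the jump is valid, else None."""
    r, c = (int(x) for x in pos.split("."))
    (jr, jc), (lr, lc) = DIRS[direction]
    over = f"{r + jr}.{c + jc}"
    land_r, land_c = r + lr, c + lc
    land = f"{land_r}.{land_c}"
    # Valid only if the landing hole exists and is empty, and we jump
    # from an occupied hole over an occupied hole.
    if (on_board(land_r, land_c) and pos in board
            and over in board and land not in board):
        return pos + direction, (set(board) - {pos, over}) | {land}
    return None
```

For example, with only hole 1.1 empty, `check_move(board, "3.1", "TR")` returns the move "3.1TR" plus a board where 3.1 and 2.1 are now empty and 1.1 is filled.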

When checking for all the potential moves, there are some rules that can help reduce the number of checks.
For instance, to move downward, there needs to be at least 2 rows below the current position. Thus, the BL and BR moves will never be valid for the bottom 2 rows, and we only have to check the top triangle. Similarly, we only have to check the left triangle for R and TR moves, and only the right triangle for L and TL moves.
The visual below showcases the potential positions for each of the move directions.

To apply this, we define a function that takes a given board layout and checks all the potential moves.
For each move direction, it loops through all the corresponding triangle positions (that currently have tees) and compiles a dataframe of all the valid moves (with the new board layout after each).

With our function to check each board layout, we can apply it iteratively to find potential move paths. Since we can choose where the empty hole starts, there are 15 initial layouts to start from. We'll apply our check_plist function to each of these layouts to get a list of all the valid moves with the resulting new layouts.
In the next iteration, we'll apply the function to the new list of layouts to find the next round of potential moves (and resulting layouts), and we'll continue this iteratively until all of the paths have run out of valid moves.
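Stripped of the Spark machinery, that iteration can be sketched in compact, plain Python (names are illustrative). A full run explores millions of paths across all 15 starts, so the snippet below only expands the first few iterations from the E1.1 start:

```python
from itertools import product

# Direction deltas: (step to the jumped hole, step to the landing hole).
DIRS = {"L": ((0, -1), (0, -2)), "R": ((0, 1), (0, 2)),
        "TL": ((-1, -1), (-2, -2)), "TR": ((-1, 0), (-2, 0)),
        "BL": ((1, 0), (2, 0)), "BR": ((1, 1), (2, 2))}
HOLES = [f"{r}.{c}" for r in range(1, 6) for c in range(1, r + 1)]

def valid_moves(board):
    """Yield (move_string, new_board) for every legal jump from a layout."""
    for pos, d in product(sorted(board), DIRS):
        r, c = (int(x) for x in pos.split("."))
        (jr, jc), (lr, lc) = DIRS[d]
        over, land = f"{r + jr}.{c + jc}", f"{r + lr}.{c + lc}"
        if (1 <= r + lr <= 5 and 1 <= c + lc <= r + lr
                and over in board and land not in board):
            yield pos + d, (board - {pos, over}) | {land}

# Expand a few iterations from the E1.1 start (the full run keeps going,
# from every start, until each path has run out of valid moves):
paths = [("E1.1", frozenset(h for h in HOLES if h != "1.1"))]
for _ in range(3):
    paths = [(path + "-" + move, new_board)
             for path, board in paths
             for move, new_board in valid_moves(board)]
```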
This visual showcases a truncated version of this:

Parallelization
We tracked the total checks and final paths across each iteration. While it begins with only 504 checks on the 1st iteration, it grows exponentially and peaks at 53 million checks on the 11th iteration.

Across all the iterations, we evaluated 164 million potential moves to find a total of 7.3 million possible paths.

As you can see, there are quite a few checks to process. If we tried running this in a local Python script, it would probably take a while to run, and it might even cause an “Out of Memory” error. Thus, we chose to run this in the cloud to take advantage of parallelization.
Parallelization is basically like a “divide-and-conquer” method that splits up the work into smaller tasks that can be done separately. Rather than having one processor do all the work, it splits up the tasks across a group of processors (i.e. “worker nodes”) that process the data simultaneously to reduce the compute time.
This allows you to reduce the compute time for big data workflows by scaling horizontally. For instance, with two worker nodes you can cut the processing time roughly in half (compared to a single node), and with ten worker nodes down to roughly a tenth (in practice, coordination overhead keeps the speedup from being perfectly linear). This is why parallelization is widely used in big data workflows to process large datasets efficiently.

There are various ways to implement this concept, but we chose to use a tool called Apache Spark.
Apache Spark
Apache Spark is a powerful distributed computing framework that enables parallel processing of large datasets. It allows you to run commands in common languages like SQL and Python (i.e. Spark SQL and PySpark) and handles the parallelization automatically.

Spark is often integrated with cloud computing platforms such as Databricks or Microsoft Fabric, and we decided to use Fabric to implement our algorithm.
Fabric Implementation
Spark can be used in Microsoft Fabric through lakehouses and notebooks. A lakehouse is a storage option in Fabric that stores data in Delta tables, which can be accessed and queried using SQL. They can also be accessed through notebooks, which are code files where you can run Spark SQL and PySpark commands in a nice readable format.
With notebooks, you can read raw data from Delta tables, process it in parallel with Spark, and write the final results to new Delta tables.

To efficiently run our algorithm in Microsoft Fabric, we used a few Delta tables to store results after each iteration.
One table (current_level) stores all possible board layouts for the current iteration. We then apply our check_data_frame() function in parallel using the applyInPandas method. With this approach, we write a function that takes a normal Pandas dataframe as input and outputs a single Pandas dataframe, and Spark applies it to a PySpark dataframe in chunks based on a group-by column.
The notebook reads previous states, performs checks in parallel via applyInPandas, and outputs the new layouts to a different table. We also append the progress of each path to one final dataframe that showcases the full path steps across all iterations.
At the end of each iteration, we overwrite the prior dataframe with the new results, and we repeat this process until we run out of new moves.
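In pandas terms, the per-chunk function looks roughly like the sketch below (the column names, the `batch_id` grouping key, and `expand_layouts` are illustrative; the real notebook's code lives in the GitHub repo). In the notebook, it would be attached with something like `current_df.groupBy("batch_id").applyInPandas(expand_layouts, schema="path string, layout string")`, and Spark runs one pandas chunk per group in parallel:

```python
import pandas as pd

# Compact validity check so this sketch stands alone: direction deltas are
# (step to the jumped hole, step to the landing hole).
DIRS = {"L": ((0, -1), (0, -2)), "R": ((0, 1), (0, 2)),
        "TL": ((-1, -1), (-2, -2)), "TR": ((-1, 0), (-2, 0)),
        "BL": ((1, 0), (2, 0)), "BR": ((1, 1), (2, 2))}

def jumps(board):
    """Yield (move_string, new_board) for every legal jump from one layout."""
    for pos in sorted(board):
        r, c = (int(x) for x in pos.split("."))
        for d, ((jr, jc), (lr, lc)) in DIRS.items():
            over, land = f"{r + jr}.{c + jc}", f"{r + lr}.{c + lc}"
            if (1 <= r + lr <= 5 and 1 <= c + lc <= r + lr
                    and over in board and land not in board):
                yield pos + d, (board - {pos, over}) | {land}

def expand_layouts(pdf: pd.DataFrame) -> pd.DataFrame:
    """One pandas chunk in, one pandas chunk out: every valid next move for
    every layout in the chunk. This is the shape applyInPandas expects."""
    rows = []
    for _, row in pdf.iterrows():
        board = set(row["layout"].split(","))
        for move, new_board in jumps(board):
            rows.append({"path": row["path"] + "-" + move,
                         "layout": ",".join(sorted(new_board))})
    return pd.DataFrame(rows, columns=["path", "layout"])
```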

This setup works pretty well. It allows us to process the large volume of checks without running out of memory, and actually does so relatively quickly. With just a basic free trial of Microsoft Fabric, we were able to run the full algorithm in less than an hour.
Results
Since we tracked all the paths at each iteration, we can find all the paths that hit a dead end by filtering to the ones that have 0 next moves. When querying the final paths, we can use the iteration number to determine the result. The paths that made it all the way to the 13th iteration are the winning ones (i.e. only one tee left).

In addition to simply finding the paths, we also found a way to compute probabilities for the different outcomes. To do this, we have to calculate the probability of the path at each given iteration.
Assuming that moves are chosen randomly, we can find the overall probability of each path by dividing the previous iteration probability by the number of new options.

We can then summarize the data to find our overall probability of winning. We’ll simply sum up the probabilities of the final paths by each outcome to calculate the probability of each.
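As a toy illustration of that bookkeeping (made-up numbers, not our real results):

```python
import pandas as pd

# Each child path inherits its parent's probability divided by the number of
# options at that step (assuming moves are chosen uniformly at random); the
# win rate is then the total probability mass of paths ending with 1 tee.
def child_prob(parent_prob, n_options):
    return parent_prob / n_options

# Toy set of finished paths (illustrative numbers only):
finished = pd.DataFrame({
    "tees_left": [1, 2, 3, 2],
    "prob": [child_prob(0.25, 2), 0.25, 0.25, 0.375],
})
by_outcome = finished.groupby("tees_left")["prob"].sum()
win_rate = by_outcome.get(1, 0.0)  # probability of leaving exactly 1 tee
```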

Overall, we found that there were 7.3 million total potential paths, with only 439 thousand of those being winners. Choosing each move at random gives only a 0.45% chance of winning the game (lower than the raw share of winning paths, since not all paths are equally likely).

We can also sort the results to find the worst possible paths. We see here that the worst possible outcome is to leave 10 tees.

For example, here is one of the worst possible paths:

We won’t give away a full winning path here (to find one, you’ll have to download the code from our GitHub repo and run it yourself), but we’ll give you a start that increases your odds.
If you start with the path below, it will increase your odds of winning from 0.45% to 38%, giving you a huge advantage.

Real-world Example
To go beyond our little example with the Cracker Barrel game, let’s look at a real-world scenario where PySpark can help us with real data. Suppose you manage a retail chain with hundreds of stores across the nation, and you’re trying to use data science to optimize inventory across thousands of products.
You get daily reports of inventory levels at each location, and you have an optimization algorithm to apply to each store/product. After a few years of collecting data, you end up with over 734 million rows in the full dataset – which is too much data to process in a local script.
Instead, you ingest the raw data into a Fabric Lakehouse and use the same applyInPandas technique to apply your optimization algorithm to each store/product.
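The per-group function might look like the hypothetical sketch below (all names are made up, and the "mean plus two standard deviations" reorder point is a stand-in for the real optimization algorithm). In Fabric it would be applied with something like `inventory_df.groupBy("store_id", "product_id").applyInPandas(reorder_point, schema="store_id long, product_id string, reorder_point double")`:

```python
import pandas as pd

def reorder_point(pdf: pd.DataFrame) -> pd.DataFrame:
    """One store/product group in, one summary row out. The rule below
    (mean daily demand + 2 standard deviations) is a placeholder for the
    real optimization algorithm."""
    daily = pdf["units_sold"]
    return pd.DataFrame({
        "store_id": [pdf["store_id"].iloc[0]],
        "product_id": [pdf["product_id"].iloc[0]],
        "reorder_point": [daily.mean() + 2 * daily.std()],
    })
```

Because Spark only ever hands each worker one group's rows at a time, the 734 million rows never need to fit in any single machine's memory.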

Though the use case requires us to process a huge amount of data, this is achievable through the parallelization capabilities of Spark in Microsoft Fabric.
Conclusion
Solving the Cracker Barrel golf tee game with brute force isn’t easy, but PySpark and Microsoft Fabric made it manageable. Beyond this simple game, PySpark enables powerful big data processing for real-world applications (like the inventory optimization example from the last section).
We shared the full notebook of Python code in our GitHub repo, so feel free to check that out. Note that we ran this using a free trial of Microsoft Fabric, so even if you don’t currently have access to a Fabric subscription, you can try running this yourself by signing up for your own Fabric trial.
As always, we hope you learned something today, and we’ll see you next time!
