Grading My 2024 NFL Forecast Model
Turns out, the model did better than I expected—and better than ESPN’s "experts."
Note: This post was written before yesterday’s Week 18 games, so any developments from those games are not reflected in the analysis. Consider this a time capsule from two days ago—possibly the shortest and least exciting time capsule ever.
This past week[1], my NFL forecast model correctly “predicted” 14 out of 16 NFL games. That’s an impressive 87.5 percent accuracy!
I put “predicted” in quotes because, while the model picks a favorite (the team it considers more likely to win), it’s still a probabilistic forecast: alongside each point spread, it reports a win probability that conveys the margin of error. For instance, for Monday night’s game between the Lions and 49ers, the model estimated that the Lions were 4.5-point favorites with a 66 percent chance to win. While it technically predicted a Lions victory, it acknowledged about a one-in-three chance they could lose.
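The model’s internals aren’t the focus of this post, but if you’re curious how a point spread and a win probability can line up like that, a common trick is to treat the final margin as roughly normal around the spread. Here’s a minimal sketch in Python; the 14-point standard deviation is an assumption borrowed from the historical figure in footnote 3, not the model’s actual parameter.

```python
from statistics import NormalDist

# Assumed, not the model's actual parameter: historical NFL margins
# have a standard deviation of roughly 14 points (see footnote 3).
SIGMA = 14.0

def win_probability(spread: float, sigma: float = SIGMA) -> float:
    """P(favorite wins), treating the final margin as Normal(spread, sigma)."""
    return 1 - NormalDist(mu=spread, sigma=sigma).cdf(0)

# A 4.5-point favorite comes out around 63 percent, in the same
# ballpark as the 66 percent the model gave the Lions.
print(f"{win_probability(4.5):.1%}")
```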
Still, I thought it was a solid showing. Of course, it was just one week—maybe I got lucky.
With the season winding down, I wanted to see how the model performed over a larger sample size and reflect on its performance across the entire year. There’s still one week left in the regular season, but with some teams likely benching their starters this weekend, I figured now was as good a time as any.
Overall, I was pleasantly surprised by how the model performed. This was my first year running it, and although I built it more hastily than I would have liked, things turned out well.
Below are a few highlights. If you're interested, more details are in the sections that follow.
- The model “predicted” the winner in 177 of 256 games (69 percent accuracy), slightly outperforming ESPN’s expert panel.
- On average, the model’s favored team had a 67 percent chance of winning, which aligns closely with the 69 percent accuracy, right where we’d expect it to be.
- The model was well-calibrated: its predicted probabilities aligned closely with actual outcomes over time.
- The point spreads proved reliable: on average, they deviated from the actual result by less than a point, favoring the home team by just 0.986 points.
How the Model Fared at Picking Winners
As I mentioned earlier, the model correctly “predicted” the winner in 177 of 256 games, achieving a 69 percent accuracy rate—not bad!
For context, this accuracy rate was comparable to ESPN’s Expert Picks. Of the ten experts who made predictions for all 256 games, six had lower accuracy rates than the model, while four outperformed it. On average, the experts picked correctly 68 percent of the time—slightly less frequently than the model.
Additionally, the 69 percent accuracy rate aligns well with what we’d expect, given that the model, on average, estimated the favorite had a 67 percent chance of winning.
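As a quick sanity check, comparing accuracy to average stated confidence takes only a couple of lines if the results are in a table. The sketch below assumes a hypothetical games.csv with fav_prob and fav_won columns; it isn’t the model’s actual output format.

```python
import pandas as pd

# Hypothetical file and column names, not the model's actual output:
# one row per game, with the favorite's predicted win probability
# and whether the favorite actually won.
games = pd.read_csv("games.csv")  # columns: fav_prob (float), fav_won (0/1)

accuracy = games["fav_won"].mean()         # favorites won 69% of games
avg_confidence = games["fav_prob"].mean()  # model's average confidence, ~67%

print(f"accuracy: {accuracy:.1%}, average favorite probability: {avg_confidence:.1%}")
```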
The 69 percent accuracy rate would be less satisfying if the model had consistently assigned probabilities far from that mark, like 95 percent or 55 percent, to its favorites. In the former case, the model would have been overly bullish; in the latter, too bearish.
This is a probabilistic forecast, so it is important that the predicted probabilities align closely with actual outcomes. Fortunately, that was the case here.
How the Model Fared With Probabilistic Predictions
A key measure of a probabilistic forecast model’s reliability is its calibration—how well its predicted probabilities match actual outcomes over the long run.
Throughout the season, my model offered a wide range of probabilities for favorites, from near-certainties to coin flips. At the high end, the model gave the Bills a 94.1 percent chance of beating the lackluster Patriots in Week 16. At the other end, the Week 16[2] matchup between the Chargers and Broncos was effectively a toss-up, with the model giving the Chargers a slim 50.1 percent edge.
In practice, that means teams with a 65 percent chance of winning should win about 65 percent of the time, and teams with a 50 percent chance, like the Chargers against the Broncos, should win about half the time.
Overall, my model demonstrated solid calibration, with predictions aligning closely to actual outcomes in most cases.
To assess calibration, I grouped all 256 games into bins based on the favored team’s probability of winning, with each bin representing a five percentage point range. This approach allows us to see how well the model’s predicted probabilities align with actual win rates across different levels of confidence.
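Here’s a minimal sketch of that binning, reusing the hypothetical games table from above rather than the model’s actual data pipeline.

```python
import numpy as np
import pandas as pd

games = pd.read_csv("games.csv")  # hypothetical: fav_prob (float), fav_won (0/1)

# Five-percentage-point bins covering 50% to 100%.
edges = np.arange(0.50, 1.0001, 0.05)
labels = [f"{lo:.0%}-{hi:.0%}" for lo, hi in zip(edges[:-1], edges[1:])]

calibration = (
    games.assign(bin=pd.cut(games["fav_prob"], bins=edges, labels=labels))
    .groupby("bin", observed=True)["fav_won"]
    .agg(n_games="size", win_rate="mean")
)
print(calibration)  # actual win rate per predicted-probability bin
```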
The table and graph below show how often teams with a given probability of winning actually won.
For the most part, the win rates align closely with the model’s predictions. Teams’ win rates fell within their expected ranges or very close to them. For example, when teams were predicted to have a 65–70% chance of winning, they won 69 percent of the time.
The most notable outlier is the 75–80% bin, where teams won just 62 percent of the time, noticeably less often than predicted. With just 24 games in this bin, however, the small sample size may explain the discrepancy: if the true win probability really were in that range, there would be about a 1-in-14 chance of seeing a win rate this low.
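For the curious, that 1-in-14 figure can be approximated with a binomial tail probability. The sketch below assumes the bin midpoint as the true win probability and infers the win count from the observed 62 percent rate; both are simplifications.

```python
from scipy.stats import binom

# Simplifying assumptions: use the bin midpoint as the "true" win
# probability, and infer the win count from the observed 62% rate.
n_games = 24
p_true = 0.775                # midpoint of the 75-80% bin
wins = round(0.62 * n_games)  # about 15 wins

tail = binom.cdf(wins, n_games, p_true)   # P(15 or fewer wins in 24 games)
print(f"P(<= {wins} wins) = {tail:.3f}")  # roughly 0.07, about 1 in 14
```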
This is something I’ll investigate further when I refine the model for next season.
How the Model Fared With Point Spreads
The model doesn’t just predict favorites and their probabilities of winning; it also generates point spreads—estimates of how much it expects each team to win or lose by.
For example, as I mentioned earlier, the model considered the Lions to be 4.5-point favorites against the 49ers last week. The Lions won by 6, so the model’s prediction was within 1.5 points—an impressive result given the high variance[3] in NFL point differentials. While individual games like this highlight the model’s accuracy, the overall performance is just as encouraging.
On average, the model’s point spreads fell within one point of the actual result: across the season, they favored the home team by just 0.986 points relative to the final margin.
It’s worth noting that large errors in opposite directions can cancel each other out. For instance, if one game misses by 21 points in one direction and another by 21 in the opposite direction, the average error would be zero—even though both predictions were far off.
Even so, the point spreads performed quite well. They were within a field goal (3 points) of the actual result about a quarter of the time and within a touchdown (7 points[4]) about half the time. Additionally, the differences between the model’s point spreads and actual results closely followed a normal distribution, with a mean of -0.986 points and a standard deviation of 12.8 points. This suggests that the model’s errors are random rather than systematically biased in one direction.
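Those summary statistics are easy to reproduce if the spreads and results live in a table. The sketch below again uses hypothetical column names and assumes spreads are stated from the home team’s perspective, with negative values favoring the home team.

```python
import pandas as pd

# Hypothetical columns, both in spread convention from the home team's
# perspective (negative favors home, e.g. -4.5 means a 4.5-point home
# favorite): pred_spread is the model's line, actual_spread is the
# negative of the home team's final margin.
games = pd.read_csv("games.csv")
errors = games["pred_spread"] - games["actual_spread"]

print(f"mean error: {errors.mean():.3f}")       # approx -0.986 in the post
print(f"std dev:    {errors.std(ddof=1):.1f}")  # approx 12.8 in the post
print(f"within a field goal: {(errors.abs() <= 3).mean():.0%}")  # ~25%
print(f"within a touchdown:  {(errors.abs() <= 7).mean():.0%}")  # ~50%
```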
To better understand how well the model’s predictions aligned with actual results, the following histogram visualizes the distribution of these differences.
The Model’s Still Runnin’
Overall, I think the model proved its value this season: It picked winners accurately, showed reliable calibration, and produced point spreads that aligned closely with actual results.
Looking ahead, I plan to make a few tweaks in the offseason to fine-tune its accuracy and address some of the quirks uncovered this year. Stay tuned—2025 should be even better!
I’ll continue running the model for Week 18 and throughout the Playoffs, so check the NFL Forecast landing page for the latest predictions, point spreads, and probabilities. Here’s to seeing if it can finish the season on a high note!
[1] Week 17, that is.
[2] It’s a coincidence that both the highest and lowest probability given to a favorite were in the same week.
[3] In my analysis of NFL games from 2000–2023, I found that, on average, the home team won by 2.7 points. However, the standard deviation was a whopping 14 points, illustrating the high variability in NFL game outcomes.
[4] Yeah, yeah, yeah — technically a touchdown is six points, but … whatever.