This is my third year building a matchup guide for the NCAA tournament. In this post I’ll take you through the numbers in my analysis, and later I’ll go through the process for scraping, shaping, and visualizing our college basketball data set.

There are many places you can go for data on college basketball, and more often than not they will have done the difficult work for you. KenPom.com, for example, offers a sophisticated model for ranking teams. ESPN's BPI system was developed by Dean Oliver, who previously authored the influential book Basketball on Paper. More recently, FiveThirtyEight has used weighted averages of polling systems and a variation of the standard chess rating system (Wikipedia: Elo) to rank basketball teams and assess probabilities. At the risk of reinventing the wheel, I went a different route and developed a variation of Pomeroy's metrics leveraging data scraped from sports-reference.com. There are a few reasons why I wanted a custom system, but mainly I wanted the capability to drill down to deeper levels of granularity in the data, and I wanted to automate more of the process with code.

The core of Pomeroy's system is a Bill James construct called Pythagorean Expectation. Pomeroy expanded on earlier work by Dean Oliver to show that you can accurately estimate a team's win probability against a statistically average Division I opponent using the following formula:
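The formula, as Pomeroy publishes it for college basketball, raises the efficiency ratings to the 11.5 power (the exact exponent is a tuning choice and has varied across versions of the system):

```
Pyth = OE^11.5 / (OE^11.5 + DE^11.5)
```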

OE and DE here are the Adjusted Offensive and Defensive Efficiency ratings, or the number of points a team would likely score/allow per 100 possessions against an average D1 opponent. Once the Pythagorean percentage is calculated for two teams, the probability of one beating the other in a game can be calculated using another formula called Log5:
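Both formulas are simple to implement. Here's an illustrative sketch: the 11.5 exponent is Pomeroy's published college tuning, and the efficiency numbers in the example are made up.

```python
def pythag(oe: float, de: float, exponent: float = 11.5) -> float:
    """Pythagorean expectation: win probability vs. an average D1 opponent."""
    return oe**exponent / (oe**exponent + de**exponent)

def log5(p_a: float, p_b: float) -> float:
    """Log5: probability that team A beats team B, given each team's
    Pythagorean percentage."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

# A strong team (scores 115, allows 95 per 100 possessions) against a
# below-average one (100 scored, 105 allowed) -- invented numbers:
p_strong = pythag(115.0, 95.0)   # roughly 0.90
p_weak = pythag(100.0, 105.0)    # roughly 0.36
matchup = log5(p_strong, p_weak) # strong team favored even more heavily
```

Note that Log5 is symmetric in the way you'd expect: two teams with identical Pythagorean percentages come out at 50/50.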

If you incorporate the tempos at which each team plays (i.e., possessions per game) in conjunction with the D1 average tempo metric, you can use efficiency metrics to project a likely outcome of any game in points scored and allowed. Pomeroy provides these projections every day for subscribers to his site.
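Here is one way that projection can be sketched in Python. This is my reading of the approach, not Pomeroy's exact formula: the league-average tempo and efficiency constants are placeholders, the expected possessions combine each team's deviation from the D1 average additively, and each offense is scaled by how the opposing defense compares to the league average.

```python
D1_AVG_TEMPO = 68.0        # placeholder: league-average possessions per game
D1_AVG_EFFICIENCY = 103.0  # placeholder: league-average points per 100 possessions

def project_score(oe_a, de_a, tempo_a, oe_b, de_b, tempo_b,
                  avg_tempo=D1_AVG_TEMPO, avg_eff=D1_AVG_EFFICIENCY):
    """Project points scored by each team in a head-to-head game.

    Possessions: both teams' deviations from the D1 average tempo stack.
    Efficiency: A's offense is scaled by B's defense relative to the
    league average, and vice versa.
    """
    possessions = tempo_a + tempo_b - avg_tempo
    eff_a = oe_a * de_b / avg_eff  # A's offense vs. B's defense
    eff_b = oe_b * de_a / avg_eff  # B's offense vs. A's defense
    return possessions * eff_a / 100.0, possessions * eff_b / 100.0

# Invented numbers: an efficient, fast team vs. a weaker, slower one.
pts_a, pts_b = project_score(115.0, 95.0, 70.0, 100.0, 105.0, 66.0)
```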

For my own analysis, I went to sports-reference.com's gamelogs, which provide an offensive and defensive efficiency rating at the game level for all 351 teams in Division I. From there, I developed a small Python program that could crawl the gamelogs for every team, writing the game summary values to a CSV file. This is only part of the solution, though, because it doesn't tell us much when a team racks up strong numbers crushing a lower-quality opponent. In order to make the data useful, we have to normalize the efficiency metrics based on strength of schedule. For that, I used SRS, a simple rating based on margin of victory and quality of opponent. There's a detailed explanation here, but in effect this is what it gets you:

Let's say that Indiana has an SRS of 19.8 while Miami of Ohio has an SRS of -5.7. Indiana is thus about 20 points better than the average D1 team, while Miami of Ohio is about 6 points worse than average. If Indiana and Miami played, you'd likely see a rout of roughly 25 to 26 points, the delta between the two teams' ratings.

Because SRS is normalized to the average D1 team, it gives us an ideal method for adjusting offensive and defensive efficiency to the opponent. If the opponent's SRS is positive, we add tempo-adjusted points to the raw offensive efficiency and subtract tempo-adjusted points from the raw defensive efficiency. Running the numbers this way, in a hypothetical game between Indiana and Miami, we end up penalizing Indiana if they fail to beat Miami by a sufficient margin. I also weighted the most recent four games played more heavily, as these most accurately reflect a team's performance heading into the tournament.
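A sketch of the adjustment and the recency weighting might look like the following. The details here are assumptions on my part: I scale the opponent's SRS from points per game to points per 100 possessions at the game's tempo, split the credit evenly between the offensive and defensive sides, and double-weight the last four games; the real system may apportion things differently.

```python
def adjust_game(raw_oe, raw_de, opp_srs, tempo):
    """Adjust one game's raw efficiencies (points per 100 possessions)
    for opponent quality, using the opponent's SRS (points per game
    relative to an average D1 team)."""
    srs_per_100 = opp_srs * 100.0 / tempo  # per-game margin -> per-100-possessions
    credit = srs_per_100 / 2.0             # assumption: split evenly across both sides
    # Good opponent (positive SRS): boost the offense, forgive the defense.
    return raw_oe + credit, raw_de - credit

def weighted_average(values, recent_n=4, recent_weight=2.0):
    """Season average that counts the most recent `recent_n` games
    `recent_weight` times as much as earlier games."""
    cutoff = max(len(values) - recent_n, 0)
    weights = [recent_weight if i >= cutoff else 1.0 for i in range(len(values))]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

With these two pieces, a season rating is just the weighted average of each game's adjusted efficiencies.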

In the next post I’ll walk through the process of scraping the data with Python.