Hard in the Paint: Using AI to make my March Madness bracket picks

K Hodges
4 min read · Mar 13, 2018


Being a Data Science newbie, I’ve recently been trying to brush up on my ML skills, especially with a focus on practical applications. Luckily, my timing was great, because Kaggle is running its annual March Madness competition.

The goal of the competition is to predict the outcome of each tournament game as accurately as possible. My particular goal was a bit different: use the same techniques to make my office bracket picks.

You can find the whole repository on my GitHub, including each iteration of the process. The dataset will need to be grabbed from Kaggle if you want to play along.


The initial starter example on Kaggle uses Logistic Regression to make predictions based on the seed difference between the two teams. This makes sense: it’s a safe bet that the higher-seeded team will likely win. Additionally, by subtracting one seed from the other, we get the distance between the two seeds. This allows us to say things like “The bigger the difference, the less likely the lower-ranked team is to win”.
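To make the idea concrete, here is a minimal sketch of logistic regression on a single seed-difference feature, written in plain Python rather than with the library the Kaggle starter uses. The toy data, the learning rate, and the convention that the difference is "first team's seed minus second team's seed" are all assumptions for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(seed_diffs, labels, lr=0.01, epochs=2000):
    """Fit p(first team wins) = sigmoid(w * seed_diff + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(seed_diffs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(seed_diffs, labels):
            error = sigmoid(w * x + b) - y  # gradient of log loss w.r.t. the logit
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Made-up toy data (not from the Kaggle dataset).
# seed_diff = first team's seed minus second team's seed, so -15 is a 1 seed vs a 16 seed.
toy_diffs = [-15, -12, -8, -5, -3, -1, 1, 3, 5, 8, 12, 15]
toy_labels = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

w, b = fit_logistic(toy_diffs, toy_labels)
p_favorite = sigmoid(w * -15 + b)  # probability that a 1 seed beats a 16 seed
p_underdog = sigmoid(w * 15 + b)   # probability that a 16 seed beats a 1 seed
```

With this convention the fitted weight comes out negative: the worse the first team's seed is relative to its opponent, the lower its predicted chance of winning.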

Using Logistic Regression alone, we were able to score 0.55 on Kaggle (lower is better).

Unfortunately, I just finished a Deep Learning class, and have a storied history of over-complicating things. I decided to build out a deepish neural network using Keras and TensorFlow.

Here’s what the data we are feeding in looks like:

[Year, Winning Team ID, Losing Team ID, Seed Diff] [Team Did Win]

I grabbed the results of each game from 2015 forward and determined the seed difference (using 17 if the team didn’t make it to the championship that year). I then flipped the Winners and Losers and re-calculated the Seed Diff. The ‘true’ results got labeled 1, the ‘false’ results got labeled 0.
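The flipping step described above can be sketched roughly as follows. This is not the code from the repository: the function name, the input shapes, the team IDs, and the direction of the subtraction (first team's seed minus second team's seed) are assumptions made for illustration.

```python
UNSEEDED = 17  # placeholder seed for a team with no seed that year

def build_rows(games, seeds):
    """Turn game results into feature rows and labels.

    games: list of (year, winner_id, loser_id)
    seeds: dict mapping (year, team_id) to that team's seed
    """
    features, labels = [], []
    for year, winner, loser in games:
        w_seed = seeds.get((year, winner), UNSEEDED)
        l_seed = seeds.get((year, loser), UNSEEDED)
        # 'True' row: the first team listed actually won
        features.append([year, winner, loser, w_seed - l_seed])
        labels.append(1)
        # Flipped row: same game with the teams swapped, so the first team listed lost
        features.append([year, loser, winner, l_seed - w_seed])
        labels.append(0)
    return features, labels

# Hypothetical team IDs and seeds, not real tournament data
games = [(2017, 1101, 1202), (2017, 1303, 1101)]
seeds = {(2017, 1101): 3, (2017, 1202): 14}
X, y = build_rows(games, seeds)
```

Every game produces two rows, so the labels are perfectly balanced between 1 and 0, and the model cannot learn that "the first team listed always wins".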

Initially, I had used ALL the data, going back to the 80s. The results were a bit less accurate. I also tried tying the team ID to the year to get a unique Team-per-year, but this seemed to reduce accuracy as well, probably due to lack of unique data.

The Neural Network

I went with Dense layers, mainly using ReLU (rectified linear unit) activation functions. Dropout layers were added, and the result is passed through a Sigmoid output layer. Finally, we fit the model using the rmsprop optimizer with the binary_crossentropy loss function.

The predictions were then passed through Standard Deviation to normalize a bit more.

from keras.models import Sequential
from keras.layers import Dense

# ds_width is the number of feature columns; X_train and y_train are built earlier
classifier = Sequential()
classifier.add(Dense(6, input_shape=(ds_width,), kernel_initializer='normal', activation='relu'))
classifier.add(Dense(13, kernel_initializer='normal', activation='relu'))
classifier.add(Dense(13, kernel_initializer='normal', activation='relu'))
# Single sigmoid unit: probability that the first team listed wins
classifier.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
classifier.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['binary_accuracy'])
print("Cooking... Please wait")
classifier.fit(X_train, y_train, batch_size=250, epochs=500, verbose=1)

Here is an explanation of each piece, and why I chose it.

The Activation Function determines whether, and how strongly, a neuron fires for a given input. It also plays a role in backpropagation: its gradient determines how much each weight gets adjusted in response to the error in the results.

ReLU activation functions are not dissimilar from electrical rectifiers. The math looks like this: R(x) = max(0, x). R(x) returns 0 if x is less than 0, and x if it’s greater than 0. ReLUs have a constant gradient for positive inputs, which makes it easier to adjust our model to account for loss.
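As a small illustration of the formula, here is ReLU and its gradient in plain Python. Treating the gradient at exactly 0 as 0 is a common convention, assumed here.

```python
def relu(x):
    """R(x) = max(0, x): pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for positive input, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

inputs = (-2.0, 0.0, 3.5)
outputs = [relu(v) for v in inputs]
grads = [relu_grad(v) for v in inputs]
```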

Dropout layers randomly block some of the previous layer’s neurons from firing to the next during training. This technique helps prevent overfitting, and can help the network pick up on patterns it would otherwise have missed by leaning too heavily on a few neurons.
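A rough sketch of what a dropout layer does during training, in plain Python. This uses the common "inverted dropout" variant, where surviving activations are scaled up so their expected total stays the same. The rate of 0.5 and the activation values are made up for illustration.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training.

    Survivors are divided by the keep probability so the expected sum is unchanged.
    At prediction time (training=False) the values pass through untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded so the sketch is repeatable
hidden = [0.5, 1.2, 0.0, 2.0, 0.7, 1.1]
dropped = dropout(hidden, rate=0.5, rng=rng)
unchanged = dropout(hidden, rate=0.5, rng=rng, training=False)
```

Because a different random subset of neurons is silenced on every pass, no single neuron can be relied on too heavily.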

The Output Layer receives the input from the last hidden layer and transforms it into a single result (in our case).

Sigmoid output layers will return something between 0 and 1. In our data model, a 1 is “The First Team listed is the winner”, and 0 is “The First Team listed is not the winner.”
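For illustration, here is the sigmoid function and one way to turn its output into a pick. The 0.5 threshold and the team names are assumptions, not something taken from the repository.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def pick_winner(probability, first_team, second_team, threshold=0.5):
    """Read the network output as 'probability that the first team listed wins'."""
    return first_team if probability >= threshold else second_team

p = sigmoid(1.2)  # a positive pre-activation value leans toward the first team
winner = pick_winner(p, "Team A", "Team B")
```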

To determine our Loss Function and Optimizer, I ran scikit-learn’s Grid Search (GridSearchCV) against the Keras model to help make that decision.

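from sklearn.model_selection import GridSearchCV

# model is assumed to be a scikit-learn-compatible wrapper around the Keras network
# whose build function accepts an `optimizer` argument
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)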

Grid Search came back and told me that RMSProp was the best choice of optimizer. The optimizer is responsible for handling gradient descent and adjusting the weights. RMSProp (Root Mean Square Propagation) is an extension of Stochastic Gradient Descent that scales each weight’s step size by a running average of its recent squared gradients.
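Here is a minimal sketch of the RMSProp update rule for a single weight, in plain Python. It is not Keras’s implementation. The toy objective and the hyperparameter values (`lr`, `rho`, `eps`) are illustrative assumptions.

```python
import math

def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update for a single weight.

    avg_sq is a running average of squared gradients. Dividing by its square
    root shrinks steps where gradients have been large and grows them where
    gradients have been small.
    """
    avg_sq = rho * avg_sq + (1.0 - rho) * grad * grad
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

# Toy objective: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, avg_sq = 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * (w - 3.0)
    w, avg_sq = rmsprop_step(w, grad, avg_sq, lr=0.01)
```

After the loop, `w` sits close to the minimum at 3. In a real network the same update is applied to every weight, each with its own running average.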

For the Loss Function (to determine, with each iteration, how close we are to being accurate), we use Binary Cross-entropy. The loss increases the further the predicted probability is from the correct answer (0 or 1).


The way that Kaggle determines your score uses an algorithm (log loss) that disincentivizes being “Incorrect and Confident”. For reference, just answering 0.5 for every prediction scores about 0.69, so a model has to be both correct and sensibly confident to beat that. Ironically, my neural network model actually scored a bit worse than the LogReg model (around 0.57 on its best day); however, when running it through a bunch of test data, it predicts the correct winner around 70% of the time.
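A small worked example of that scoring, assuming the competition metric is log loss (the same quantity as binary cross-entropy). The outcomes and predictions below are made up, and the clipping value `eps` is a common safeguard rather than something taken from the competition rules.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

outcomes = [1, 0, 1, 1]  # made-up results: 1 means the first team listed won

coin_flip = log_loss(outcomes, [0.5, 0.5, 0.5, 0.5])              # always answer 0.5
confident_right = log_loss(outcomes, [0.95, 0.05, 0.95, 0.95])    # confident and correct
confident_wrong = log_loss(outcomes, [0.95, 0.05, 0.95, 0.05])    # one confident miss
```

Answering 0.5 everywhere gives ln 2, roughly 0.693. A single confident miss out of four games is enough to score worse than that, which is why a model can pick most winners correctly and still post a mediocre log loss.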

I’m looking forward to adding more columns to the data model, possibly including things like metrics for offensive and defensive capabilities of a team, as well as other details.

We’ll see the results of the March Madness bracket. I actually don’t follow NCAA basketball too closely, so I was surprised to find out that Loyola was considered a very weak pick. The model was extremely confident in its decision, ultimately giving our Chicago underdogs a 95% likelihood to beat Villanova (one of the favorites).

Anybody know a good bookie that takes DogeCoin?


