Image Recognition Series Part 2:
XGBoost from Scratch
Updated Feb 24, 2025
Project Duration: Feb 2025 - Feb 2025
Introduction
When starting my data analytics journey, I knew I wanted to expand on my small facial recognition project, with the objective of analyzing which algorithm was the most accurate and efficient. However, with my limited knowledge of machine learning and artificial intelligence, I didn't quite know where to start or even which models to use.
Earlier this year, I took the Salesforce CRM Analytics and Einstein Discovery certificate exam. One of the questions mentioned several different algorithms that could be used in prediction models within Salesforce CRM Analytics, including GLM (generalized linear model), k-Nearest Neighbors, and XGBoost. After some research, XGBoost was the one that piqued my interest the most.
Background
XGBoost, short for eXtreme Gradient Boosting, builds ensembles of decision trees for classification, regression, and ranking. My main goal for this project is classification, specifically recognition. Several of my past courses used Dijkstra's algorithm, which searches for the shortest path to a known final goal. During my research, I learned that XGBoost instead starts with an initial prediction and updates it over a specified number of iterations, and I wanted to compare the final prediction with the actual data.
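For background, the additive update at the heart of gradient boosting (standard textbook material rather than code from this project) is

F_m(x) = F_{m-1}(x) + \eta \, h_m(x)

where F_0(x) is the initial prediction, h_m is the tree fit to the negative gradient of the loss at round m, and \eta is the learning rate. The from-scratch method below follows the same pattern of starting from an initial prediction and repeatedly nudging it.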
Brainstorming
Although there is already an open-source library for XGBoost that can be used in Python, I wanted to implement a script from scratch to better understand how the algorithm worked. The main thing I left out was the loss function. Instead, I created a small method that loops through n iterations representing the number of trees. While putting together the method, I realized that taking the mean squared error (MSE) of an array would give me only a single value, but I wanted to keep using array operations. To work around this, I omitted the loss function and instead computed a step size each iteration to scale the update and help prevent overfitting the model:
import numpy as np

def update_predictions(prediction, data, num_iterations):
    # for each iteration, compute gradient and hessian, update predictions
    for i in range(num_iterations):
        # compute gradient and hessian
        grad, hess = gradient_hessian_helper(data, prediction)

        # compute optimal step size to update predictions
        step = np.sum(grad) / np.sum(hess)
        # update predictions
        prediction = prediction - step * grad
    return prediction
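To see how the method gets wired up, here is a hedged usage sketch; the variable names, shapes, and random values are placeholders standing in for the project's real images rather than code from the repository:

# hypothetical usage sketch (names, shapes, and data are illustrative only)
train_data = np.random.rand(60, 90000)            # stand-in for 60 flattened training images
initial_prediction = np.zeros_like(train_data)    # the choice of starting value is revisited under "Errors" below
train_prediction = update_predictions(initial_prediction, train_data, 5)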
The loop also calls a helper method that computes the gradient and hessian, which will then be used to update the prediction:
def gradient_hessian_helper(data, pred):
    # gradient of the squared-error loss: prediction minus actual data
    grad = pred - data
    # hessian is always 1 for mean squared error (2nd derivative)
    hess = np.ones_like(pred)
    return grad, hess
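For reference, the two values the helper returns fall straight out of calculus on the squared-error loss (standard math, not project-specific code). With the conventional 1/2 factor,

L(\hat{y}, y) = \tfrac{1}{2}(\hat{y} - y)^2, \qquad \frac{\partial L}{\partial \hat{y}} = \hat{y} - y, \qquad \frac{\partial^2 L}{\partial \hat{y}^2} = 1

which matches grad = pred - data and the array of ones used for hess.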
Because the point of this project is recognition, I copied the recognition and accuracy checking methods from my eigenvalue/SVD version, with the exception of code snippets related to eigenvalues.
Errors
Because I had already split up the data prior to creating this script, I ended up initializing the prediction with the average of my training set. I had originally started with a value of 0, but that choice resulted in an onslaught of errors.
First, I attempted to initialize the prediction with the value 0:
prediction = np.full_like(data, 0)
This caused an overflow error once enough iterations ran: each update scales the gradient by a step that itself grows with the prediction, so the repeated multiplication of step and grad pushed the values toward infinity. This was a problem because I had originally planned to test whether more or fewer iterations improved the prediction.
RuntimeWarning: overflow encountered in multiply
prediction = prediction - step * grad

Iterations Error with Initial Prediction of 0
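To see why the zero start misbehaves, here is a minimal sketch with made-up numbers (none of these values come from the project's data):

import numpy as np

# toy stand-in for the real data
data = np.array([1.0, 2.0, 3.0, 4.0])
prediction = np.full_like(data, 0)

for i in range(5):
    grad = prediction - data
    step = np.sum(grad) / np.sum(np.ones_like(grad))
    prediction = prediction - step * grad
    print(i, np.abs(prediction).max())   # grows explosively; enough iterations overflow float64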
As a result, I initialized the prediction with the average of the training data, which also resolved the iterations issue. While debugging the training method, I found that the model converged after only one iteration; however, I kept the loop variable for testing purposes. This makes sense in hindsight: when the prediction starts at the mean, the gradient (prediction minus data) sums to zero, so the computed step is zero and the prediction never moves. The average of the training data was 0.7215643015045017, and the printed output shows that the final training prediction remained the average no matter how many iterations were used.

Training for 1 Iteration

Training for 5 Iterations
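And the flip side, a quick illustration (again with made-up numbers) of why starting from the mean leaves the prediction untouched:

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
prediction = np.full_like(data, data.mean())

grad = prediction - data
print(np.sum(grad))   # 0.0, so the computed step is 0 and the prediction never changes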
After fixing the issue with the training method, I only needed a one-liner for testing, which used the training prediction as the initial prediction for testing.
def test_scores(pred, data):
    model_prediction = update_predictions(pred, data, 10)
    return model_prediction
However, my training and test datasets were different sizes, 60 versus 20 pictures, meaning that NumPy could not broadcast the arrays together for element-wise operations.
ValueError: operands could not be broadcast together with shapes (60,90000) (20,90000)
As previously mentioned, both the initial and final predictions were the average of the training data, meaning that I could simply slice the prediction down to the same size as the testing data. Once I included this line, the code ran with no issues.
def test_scores(pred, data):
    # slice the training prediction down to the size of the test set
    temp = pred[:data.shape[0]]
    model_prediction = update_predictions(temp, data, 10)
    return model_prediction
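For completeness, a hedged sketch of how the test helper would be called; the names and shapes are assumptions based on the broadcast error above, reusing train_prediction from the earlier sketch:

test_data = np.random.rand(20, 90000)   # stand-in for 20 flattened test images
test_prediction = test_scores(train_prediction, test_data)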
Takeaway
Although I still have much to learn about machine learning algorithms, this was a great first step in continuing my facial recognition project. Before starting this algorithm, I had also added a method to the eigenvalue script that exports the accuracy values so they can be imported into a separate graphing script. That graph will be the final piece of the analysis, and with the XGBoost script complete, I have another piece of data to include in it. Here is a sneak peek:

Graph Analysis in Progress
Click here to view the data and scripts.