ML on FPGAs - Exercise 1
Becoming familiar with InAccel tools and accelerating our first application. Naive Bayes classification acceleration and hyperparameter tuning on Logistic Regression Training.
Introduction
In this lab you are going to create your first accelerated application and then use scikit-learn to measure the speedup you get when running the Naive Bayes algorithm with the original (CPU) and the FPGA implementation. Further on, you are going to use hyperparameter tuning to get the best model for a Logistic Regression application.
You will work in a JupyterHub environment deployed on Kubernetes, with access to two Xilinx Alveo (U250 and U280) and two Intel PAC A10 FPGA boards. To connect, navigate to https://studio.inaccel.com. You can also use InAccel's monitoring tool at any time to get an overview of the running FPGA tasks: http://monitor.inaccel.com.
Log in to your workspace using your GitHub/Google credentials or by creating a new account (sign up).
Your home directory is your persistent volume. There you can store any files that you want to save for later use.
Open a Terminal.
Copy the `shared/lab1` folder to your home directory (e.g. `cp -r ~/shared/lab1 ~`).
Navigate to your own copy of the lab1 folder.
Running the first FPGA accelerated application
Press `Ctrl + Shift + L` to open a "New Launcher".
Create a new Python notebook.
First of all, import the InAccel Coral API and `numpy`.
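A minimal sketch of the imports, assuming the `inaccel-coral` Python package is available in the environment:

```python
import numpy as np

import inaccel.coral as inaccel
```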
Next you are going to create four vectors. Vectors a and b will be the operands to be added and subtracted; the other two will hold the results.
Note that the inaccel array extends numpy's ndarray using a custom allocator for its buffers, so you instantiate an array the same way you would with numpy.
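As a sketch, assuming the Coral API's `inaccel.allocator` context manager (your notebook may use a slightly different allocation idiom):

```python
size = np.int32(1024)

# Allocate all buffers through the InAccel allocator so the FPGA can access them
with inaccel.allocator:
    a = np.random.rand(size).astype(np.float32)
    b = np.random.rand(size).astype(np.float32)
    add_out = np.ndarray(size, dtype=np.float32)
    sub_out = np.ndarray(size, dtype=np.float32)
```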
Now it is time to create a request for addition:
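A hedged sketch of the request; the accelerator identifier 'vector.addition' is a placeholder here, so use the one provided in the lab materials:

```python
# Build a request for the vector-addition accelerator and bind its arguments
vadd = inaccel.request('vector.addition')
vadd.arg(a).arg(b).arg(add_out).arg(size)

# Submit the request and block until the FPGA finishes
inaccel.submit(vadd).result()
```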
Finally, check the output vector to validate that the FPGA computed and returned the right results.
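For example, a simple check against the CPU result:

```python
# Compare the FPGA output with the result computed on the CPU
assert np.allclose(add_out, a + b), 'FPGA addition result mismatch'
print('Addition results are correct')
```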
Did you get the right results? Can you modify the code to measure the time elapsed for each request?
You can use the `%timeit` magic command to measure the execution time of a statement in a Python notebook.
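For instance (assuming the `vadd` request from the sketch above; note that `%timeit` resubmits it several times):

```python
%timeit inaccel.submit(vadd).result()
```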
Alternatively, you can always use Python's `time()` function to measure the elapsed time.
Example:
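A sketch of timing an FPGA request with `time.time()` (adapt the variable names to your own code):

```python
import time

start = time.time()
inaccel.submit(vadd).result()
elapsed = time.time() - start
print(f'FPGA addition took {elapsed:.6f} seconds')
```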
Modify the code to also measure the time for adding and subtracting the two vectors using the CPU. You can store the results in new arrays, e.g.
cpu_add = a + b
Measure the speedup using the known formula Speedup = Tb / Ta, where Ta is the time of the FPGA execution and Tb the time of the CPU execution. Do you actually get a speedup on each operation (addition or subtraction)? If not, what could be the reason?
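A sketch of the computation, assuming you stored the two measurements in hypothetical variables `cpu_time` and `fpga_time`:

```python
# cpu_time and fpga_time hold the elapsed seconds measured above
speedup = cpu_time / fpga_time
print(f'Speedup: {speedup:.2f}x')
```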
Input data needs to be transferred to the FPGA board's DDR memory before it can be processed, and the results then need to be transferred back to the host, since the host and the FPGA don't share the same memory.
The first time you send a request for an accelerator, if the FPGA isn't already configured with that bitstream, it needs to be reconfigured, so you may observe a much higher execution time.
Scikit-Learn on FPGAs
Naive Bayes Example
In this section you are going to compare the execution of the original scikit-learn Naive Bayes algorithm with the FPGA accelerated one. From the JupyterHub dashboard navigate to the `lab1/notebooks` folder and open `NaiveBayes.ipynb`. The administrator of the lab has already installed all the necessary bitstreams publicly, so that all of the teams have access to the same accelerator for running the scikit-learn examples. You can verify that by issuing an `inaccel bitstream list` command.
The accelerator has some limitations as described below:
Max number of classes = 64
Max number of features = 2048
Run all the cells and inspect the outputs.
Did the FPGA classifier return the same results as the original (CPU) one?
Run the cells again and calculate the speedup you get for all the possible combinations of the following configurations (9 configurations in total):
samples: 100000
features: 500, 1000, 2000
classes: 10, 35, 60
In which of the above cases did you get the highest speedup? Can you explain why?
In your report create two charts:
one for the speedup you get depending on the number of features for 100000 samples and 60 classes
and a second one for the speedup you get depending on the number of classes for 100000 samples and 2000 features
Does it make more sense to use this specific FPGA accelerator on a dataset containing a lot of classes or on one containing a lot of features?
Logistic Regression Example
In the previous section we accelerated the classification part of the Naive Bayes algorithm using scikit-learn and the InAccel Python API and took measurements of the observed speedup.
In this section we are going to use scikit-learn again, but this time we will focus on accelerating the training part of another widely used machine learning algorithm, Logistic Regression. We have created a notebook that is used to train and apply many accelerated sklearn models, with a k-fold cross validation and hyperparameter tuning step. In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.
Before you proceed to the next steps, and to avoid any errors, make sure you have scikit-learn v0.24.1 installed. To do so, open a terminal and execute the following:
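```bash
pip install scikit-learn==0.24.1
```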
Alternatively, you can issue the following command inside a Notebook cell:
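```python
!pip install scikit-learn==0.24.1
```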
From the JupyterHub dashboard open the `LogisticRegression.ipynb` file. The administrator of the lab has already installed all the necessary bitstreams publicly, so issuing an `inaccel bitstream list` command will reveal any installed Logistic Regression bitstreams. Logistic Regression (LR) training takes several hyperparameters which can affect the accuracy of the learned model. There is no single "best" configuration for all datasets. To get the optimal accuracy, we need to tune these hyperparameters based on our data.
The accelerator has some limitations as described below:
Max number of classes: 64
Max number of features: 2047
Objectives:
Run all the cells and observe the speedup you get compared to the software execution.
Focus only on the FPGA accelerated part. Change the parameter grid and add the following (see the sketch after this list):
max_iter: 50, 100
l1_ratio: 0.3, 0.9
Re-run the cells related to the FPGA accelerated training.
Which combination of max_iter and l1_ratio provides the best model in terms of accuracy?
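A minimal sketch of the extended parameter grid, assuming the notebook uses a scikit-learn style grid dictionary (the grid's variable name and its existing entries in the notebook may differ):

```python
# Hypothetical grid; merge these entries into the notebook's existing parameter grid
param_grid = {
    'max_iter': [50, 100],
    'l1_ratio': [0.3, 0.9],
}
```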
Notes
For more information on the InAccel framework you can visit InAccel's website.
For a full documentation of the APIs used please refer to InAccel's Documentation.