ML on FPGAs - Exercise 1
Becoming familiar with InAccel tools and accelerating our first application. Naive Bayes classification acceleration and hyperparameter tuning on Logistic Regression Training.

Introduction

In this lab you are going to create your first accelerated application and then use scikit-learn to measure the speedup of the Naive Bayes algorithm when running the FPGA implementation instead of the original (CPU) one. Later on, you are going to use hyperparameter tuning to find the best model for a Logistic Regression application.
A JupyterHub environment deployed on Kubernetes is going to be used, with access to 2 Xilinx Alveo (U250 and U280) and 2 Intel PAC A10 FPGA boards. Each team will connect to https://studio.inaccel.com. The first time only, one member of the team will have to sign up using the team's credentials. You can also use InAccel's monitoring tool at any time to get an overview of the tasks running on the FPGAs: http://monitor.inaccel.com.
  • Log in to your workspace using your team's username and password.
  • Your home directory is your persistent volume. There you can store any files that you want to save for later use.
  • Open a Terminal.
  • Copy the shared/lab1 folder to your home directory (e.g. cp -r ~/shared/lab1 ~ ).
  • Navigate to your own copy of the lab1 folder.

Running the first FPGA accelerated application

  • Press Ctrl + Shift + L to open a "New Launcher".
  • Create a new Python notebook.
  • First of all, import the InAccel Coral API and numpy:
    import inaccel.coral as inaccel
    import numpy as np
  • Next you are going to create four vectors. Vectors a and b are the inputs to be added and subtracted, while c_add and c_sub will hold the respective results.
Note that an inaccel array extends the numpy ndarray, using its own custom allocator for buffers, so you instantiate an array the same way you would with numpy.
    # Make sure that you create the scalar with numpy
    size = np.int32(1024 * 1024)

    # Allocate four vectors and initialize the input vectors with random values
    with inaccel.allocator:
        a = np.random.rand(size).astype(np.float32)
        b = np.random.rand(size).astype(np.float32)
        c_add = np.ndarray(size, dtype=np.float32)
        c_sub = np.ndarray(size, dtype=np.float32)
  • Now it is time to create a request for addition:
    # Send a request for the "addition" accelerator to the Coral FPGA Resource Manager
    # Request arguments must comply with the accelerator's specific argument list
    vadd = inaccel.request("com.inaccel.math.vector.addition")
    vadd.arg(a).arg(b).arg(c_add).arg(size)
    inaccel.submit(vadd).result()
  • And another one for subtraction:
    # Send a request for the "subtraction" accelerator to the Coral FPGA Resource Manager
    # Request arguments must comply with the accelerator's specific argument list
    vsub = inaccel.request("com.inaccel.math.vector.subtraction")
    vsub.arg(a).arg(b).arg(c_sub).arg(size)
    inaccel.submit(vsub).result()
  • Finally, check the output vectors to validate that the FPGA computed and returned the correct results.
    # Check output vectors against the CPU results
    valid = True

    if not np.array_equal(c_add, a + b):
        valid = False

    if not np.array_equal(c_sub, a - b):
        valid = False

    if valid:
        print('Results: RIGHT!')
    else:
        print('Results: WRONG!')
  • Did you get the right results? Can you modify the code to measure the time taken for each request?
You can use the %timeit magic command to measure the execution time of a statement in a Python notebook.
Alternatively, you can always use Python's time() function to measure the elapsed time.
  • Example using %timeit:
    %timeit i = 0
  • Example using time():
    from time import time

    start_time = time()
    ...
    elapsed_time = int((time() - start_time) * 1000) / 1000

    print(elapsed_time)
  • Modify the code to also measure the time for adding and subtracting the two vectors on the CPU. You can store the results in new arrays, e.g. cpu_add = a + b.
  • Measure the speedup using the well-known formula
    $S = T / T_a$
    where $T$ is the time of the CPU execution and $T_a$ is the time of the FPGA execution. Do you actually get a speedup for each operation (addition or subtraction)? If not, what could be the reason?
  • Since the host and the FPGA don't share the same memory, input data has to be transferred to the FPGA board's DDR before the computation, and the results have to be transferred back to the host afterwards.
  • The first time you send a request for an accelerator, the FPGA has to be reconfigured if it isn't already programmed with that bitstream, so you may observe a much higher execution time.
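The timing and speedup steps above can be sketched as follows. This is a minimal sketch and not part of the lab material: the helper names `best_time`, `speedup` and `transfer_bytes` are hypothetical, and the CPU-only workload at the bottom merely stands in for the real CPU and FPGA calls. One warm-up call is made before timing so that a one-off cost, such as an FPGA reconfiguration, does not distort the measurement.

```python
from time import perf_counter


def best_time(fn, repeats=5, warmup=1):
    """Return the shortest wall-clock time (in seconds) of fn() over several runs."""
    for _ in range(warmup):
        fn()  # discard warm-up runs (e.g. a first-request reconfiguration)
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)
    return min(timings)


def speedup(t_cpu, t_fpga):
    """S = T / Ta, where T is the CPU time and Ta the accelerated time."""
    return t_cpu / t_fpga


def transfer_bytes(n_elements, n_vectors=3, itemsize=4):
    """Bytes moved between host and board: two inputs in, one output out."""
    return n_elements * n_vectors * itemsize


if __name__ == "__main__":
    # Stand-in workload; in the notebook, fn would wrap the CPU or FPGA call.
    t = best_time(lambda: sum(range(100_000)))
    print(f"best of 5 runs: {t:.6f} s")
    # float32 vectors of 1024 * 1024 elements, as in the example above
    print(transfer_bytes(1024 * 1024) / 2**20, "MiB moved per request")
```

Taking the minimum over several runs is one common way to reduce noise from other processes; reporting the mean is an equally valid choice. Comparing the amount of data moved per request against the small amount of arithmetic per element gives a hint about why a simple vector operation may not show any speedup.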

Scikit-Learn on FPGAs

Naive Bayes Example

In this section you are going to compare the execution of the original Scikit-Learn Naive Bayes algorithm with the FPGA accelerated one. From the JupyterHub dashboard, navigate to the lab1/notebooks folder and open NaiveBayes.ipynb. The administrator of the lab has already installed the necessary bitstreams publicly, so that all teams have access to the same accelerator for running the scikit-learn examples. You can verify that by issuing an inaccel bitstream list command.
The accelerator has some limitations as described below:
  1. Max number of classes = 64
  2. Max number of features = 2048
Objectives:
  1. Run all the cells and inspect the outputs.
  2. Did the FPGA classifier return the same results as the original (CPU) one?
  3. Run the cells again and calculate the speedup you get for all possible combinations of the following configurations (total configurations: 9):
    1. samples: 100000
    2. features: 500, 1000, 2000
    3. classes: 10, 35, 60
  4. In which of the above cases did you get the highest speedup? Can you explain why?
  5. In your report, create two charts:
    1. one for the speedup as a function of the number of features, for 100000 samples and 60 classes
    2. a second one for the speedup as a function of the number of classes, for 100000 samples and 2000 features
  6. Does it make more sense to use this specific FPGA accelerator on a dataset with many classes or on one with many features?
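The nine configurations of step 3 can be enumerated programmatically rather than edited by hand. The following is a minimal sketch under stated assumptions: `MAX_CLASSES` and `MAX_FEATURES` mirror the accelerator limits listed above, while `configurations`, `fits_accelerator` and `speedup_table` are hypothetical helpers, and the timings passed to `speedup_table` are expected to be your own measurements from the notebook.

```python
from itertools import product

MAX_CLASSES = 64     # accelerator limit stated in the lab text
MAX_FEATURES = 2048  # accelerator limit stated in the lab text

SAMPLES = [100000]
FEATURES = [500, 1000, 2000]
CLASSES = [10, 35, 60]


def fits_accelerator(features, classes):
    """Check a dataset shape against the accelerator limits."""
    return features <= MAX_FEATURES and classes <= MAX_CLASSES


def configurations():
    """Return every (samples, features, classes) combination that fits."""
    return [
        (s, f, c)
        for s, f, c in product(SAMPLES, FEATURES, CLASSES)
        if fits_accelerator(f, c)
    ]


def speedup_table(timings):
    """Map each configuration to S = T_cpu / T_fpga.

    timings: dict {(samples, features, classes): (t_cpu, t_fpga)} in seconds,
    filled in with your own measurements.
    """
    return {cfg: t_cpu / t_fpga for cfg, (t_cpu, t_fpga) in timings.items()}


if __name__ == "__main__":
    for cfg in configurations():
        print(cfg)
```

Keeping the results keyed by configuration makes it straightforward to slice them afterwards, for example by fixing the number of classes to 60 for the first chart or the number of features to 2000 for the second.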

Logistic Regression Example

In the previous section we accelerated the classification part of the Naive Bayes algorithm using scikit-learn and the InAccel Python API, and measured the speedup observed.
In this section we are going to use scikit-learn again, but this time we will focus on accelerating the training part of another widely used machine learning algorithm, Logistic Regression. We have created a notebook that is used to train and apply many accelerated sklearn models, with a k-fold cross validation and hyperparameter tuning step. In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.
From the JupyterHub dashboard, open the LogisticRegression.ipynb file. The administrator of the lab has already installed the necessary bitstreams publicly, so issuing an inaccel bitstream list command will reveal the installed Logistic Regression bitstreams. Logistic Regression (LR) training takes several hyperparameters which can affect the accuracy of the learned model. There is no "best" configuration for all datasets. To get the optimal accuracy, we need to tune these hyperparameters based on our data.
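To make the k-fold cross validation step concrete, here is a minimal, library-free sketch of how the sample indices of a dataset can be split into k folds, with each fold serving once as the validation set. The function name `k_fold_indices` is hypothetical; the notebook presumably relies on scikit-learn's own utilities for this step, and unlike many practical setups this sketch does not shuffle the data.

```python
def k_fold_indices(n_samples, k):
    """Yield (training, validation) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample each.
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        training = list(range(start)) + list(range(stop, n_samples))
        yield training, validation
        start = stop


if __name__ == "__main__":
    for training, validation in k_fold_indices(10, 3):
        print("train:", training, "validate:", validation)
```

Each hyperparameter combination is then trained k times, once per fold, and its scores are averaged, which is why the tuning step multiplies the number of training runs.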
The accelerator has some limitations as described below:
  1. Max number of classes: 64
  2. Max number of features: 2047
Objectives:
  1. Run all the cells and observe the speedup you get compared to the software execution.
  2. Focus only on the FPGA accelerated part. Change the parameters grid and add the following:
    1. max_iter: 50, 100
    2. l1_ratio: 0.3, 0.9
  3. Re-run the cells related to the FPGA accelerated training.
  4. Which combination of max_iter and l1_ratio provides the best model in terms of accuracy?
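The parameter grid in step 2 expands into four combinations. The sketch below shows one way to enumerate them and to pick the most accurate one; it is only an illustration. `expand_grid` and `best_combination` are hypothetical helpers, not functions of the notebook or of scikit-learn, and the accuracies given to `best_combination` must come from your own runs.

```python
from itertools import product

# Values taken from the objectives above
PARAM_GRID = {
    "max_iter": [50, 100],
    "l1_ratio": [0.3, 0.9],
}


def expand_grid(grid):
    """Return one dict per combination of the grid's values."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


def best_combination(results):
    """Pick the (params, accuracy) pair with the highest accuracy.

    results: list of (params_dict, accuracy) pairs that you measured.
    """
    return max(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    for params in expand_grid(PARAM_GRID):
        print(params)
```

Because the number of combinations is the product of the list lengths, adding a third hyperparameter with two values would double the number of models to train.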
