# ML on FPGAs - Exercise 1

Becoming familiar with InAccel tools and accelerating our first application. Naive Bayes classification acceleration and hyperparameter tuning on Logistic Regression Training.

In this lab you are going to create your first accelerated application and then use scikit-learn to find out the speedup you get when running the Naive Bayes algorithm with the original (CPU) and the FPGA implementation. Further on, you are going to use hyperparameter tuning to get the best model for a Logistic Regression application.

A jupyterhub environment deployed on Kubernetes is going to be used, with access to 2 Xilinx Alveo (U250 and U280) and 2 Intel PAC A10 FPGA boards. Each team will connect to https://studio.inaccel.com. Just for the first time, one member of the team will have to sign up using the team's credentials. You can also use InAccel's monitoring tool at any time to get an overview of the running FPGA tasks: http://monitor.inaccel.com.

- Log in to your workspace using your team's username and password.
- Your home directory is your persistent volume. There you can store any files that you want to save for later use.
- Open a **Terminal**.
- Copy the `shared/lab1` folder to your home directory (e.g. `cp -r ~/shared/lab1 ~`).
- Navigate to your own copy of the lab1 folder.
- Press **Ctrl + Shift + L** to open a "**New Launcher**".
- Create a new Python notebook.
- First of all, import the InAccel `coral` API and `numpy`:

```python
import inaccel.coral as inaccel
import numpy as np
```

- Next you are going to create four vectors. Vectors **a** and **b** will be the ones to be added and subtracted, respectively, while **c_add** and **c_sub** will hold the results.

Note that an InAccel array extends the numpy ndarray, using a custom allocator for its buffers, so you instantiate an array the same way you would with numpy:

```python
# Make sure that you create the scalar with numpy
size = np.int32(1024 * 1024)

# Allocate four vectors & initialize the input vectors with random values
with inaccel.allocator:
    a = np.random.rand(size).astype(np.float32)
    b = np.random.rand(size).astype(np.float32)
    c_add = np.ndarray(size, dtype=np.float32)
    c_sub = np.ndarray(size, dtype=np.float32)
```

- Now it is time to create a request for **addition**:

```python
# Send a request for "addition" accelerator to the Coral FPGA Resource Manager
# Request arguments must comply with the accelerator's specific argument list
vadd = inaccel.request("com.inaccel.math.vector.addition")
vadd.arg(a).arg(b).arg(c_add).arg(size)

inaccel.submit(vadd).result()
```

- And another one for **subtraction**:

```python
# Send a request for "subtraction" accelerator to the Coral FPGA Resource Manager
# Request arguments must comply with the accelerator's specific argument list
vsub = inaccel.request("com.inaccel.math.vector.subtraction")
vsub.arg(a).arg(b).arg(c_sub).arg(size)

inaccel.submit(vsub).result()
```
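Since `inaccel.submit()` returns a handle that you wait on with `result()`, you could also, as a minimal sketch (assuming both requests can be in flight at the same time), submit both before waiting on either:

```python
# Submit both requests, then wait on both handles;
# the Coral resource manager schedules the tasks on the available FPGAs
add_response = inaccel.submit(vadd)
sub_response = inaccel.submit(vsub)

add_response.result()
sub_response.result()
```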

- Finally, check the output vectors to validate that the FPGA computed and returned the right results.

```python
# Check output vectors
valid = True

if not np.array_equal(c_add, a + b):
    valid = False
if not np.array_equal(c_sub, a - b):
    valid = False

if valid:
    print('Results: RIGHT!')
else:
    print('Results: WRONG!')
```

- Did you get the right results? Can you modify the code to measure the time taken for each request?

You can use the `%timeit` magic command to measure the execution time of a statement in a Python notebook. Alternatively, you can always use Python's `time()` function to measure the elapsed time.

- Examples of `%timeit` and `time()`:

```python
%timeit i = 0
```

```python
from time import time

start_time = time()
...
elapsed_time = int((time() - start_time) * 1000) / 1000
print(elapsed_time)
```

- Modify the code to also measure the time for adding and subtracting the two vectors using the **CPU**. You can store the results in new arrays, e.g. `cpu_add = a + b`.

- Measure the speedup using the known formula $S = T / T_a$, where $T_a$ is the time of the FPGA execution and $T$ the time of the CPU execution (see the sketch below). Do you actually get a speedup on each operation (addition or subtraction)? What could be the reason in case you don't?
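As a minimal sketch of these two measurements (assuming the vectors and the `vadd` request from the previous steps are already set up):

```python
from time import time

# Time the FPGA request (submit + wait for the result)
start_time = time()
inaccel.submit(vadd).result()
fpga_time = time() - start_time

# Time the equivalent CPU computation with numpy
start_time = time()
cpu_add = a + b
cpu_time = time() - start_time

# Speedup S = T / Ta, where Ta is the FPGA execution time
print('Speedup (addition):', cpu_time / fpga_time)
```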

- Input data needs to be transferred to the FPGA board's DDR to be computed, and the results then need to be returned to the host, since the host OS and the FPGA don't share the same memory.
- The first time that you send a request for an accelerator, if the FPGA isn't configured with that bitstream, it needs to be reconfigured, so you may observe a much higher execution time; the sketch below lets you observe this.
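To see the reconfiguration overhead, you can time the same request twice; a minimal sketch:

```python
from time import time

# The first submission may include the bitstream reconfiguration time
start_time = time()
inaccel.submit(vadd).result()
print('1st request:', time() - start_time, 's')

# Subsequent submissions reuse the already-configured FPGA
start_time = time()
inaccel.submit(vadd).result()
print('2nd request:', time() - start_time, 's')
```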

## Naive Bayes Classification

In this section you are going to compare the execution of the original scikit-learn Naive Bayes algorithm with the FPGA-accelerated one. From the **jupyterhub** dashboard navigate to the `lab1/notebooks` folder and open **NaiveBayes.ipynb**. The administrator of the lab has already installed all the necessary bitstreams publicly, so that all of the teams have access to the same accelerator for running the scikit-learn examples. You can verify that by issuing an `inaccel bitstream list` command.

The accelerator has some limitations, as described below:

1. Max number of classes: 64
2. Max number of features: 2048
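As a reference for the measurements below, here is a minimal sketch of generating a synthetic dataset within these limits and timing the CPU classifier. The notebook's actual code may differ; `make_classification` and `GaussianNB` are assumptions here, not necessarily what the notebook uses:

```python
from time import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Generate a dataset within the accelerator limits (<= 64 classes, <= 2048 features)
X, y = make_classification(n_samples=100000, n_features=500, n_informative=250,
                           n_classes=10, random_state=42)
X = X.astype(np.float32)

# Train once on the CPU, then time the classification (predict) step
model = GaussianNB().fit(X, y)

start_time = time()
predictions = model.predict(X)
print('CPU prediction time:', time() - start_time, 's')
```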

Objectives:

1. Run all the cells and inspect the outputs.
2. Did the FPGA classifier return the same results as the original (CPU) one?
3. Run the cells again and calculate the speedup you get for all the possible combinations of the following configurations (total configurations: 9):
   1. **samples**: 100000
   2. **features**: 500, 1000, 2000
   3. **classes**: 10, 35, 60
4. In which of the above cases did you get the highest speedup? Can you explain why?
5. In your report create two charts (see the plotting sketch after this list):
   1. one for the speedup you get depending on the number of features, for 100000 samples and 60 classes,
   2. and a second one for the speedup you get depending on the number of classes, for 100000 samples and 2000 features.
6. Does it make more sense to use this specific FPGA accelerator on a dataset containing a lot of classes or a lot of features?
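For the charts in item 5, a plotting sketch along these lines could work, assuming `matplotlib` is available in the workspace (the speedup values below are placeholders for your own measurements):

```python
import matplotlib.pyplot as plt

features = [500, 1000, 2000]
# Replace with your measured speedups for 100000 samples and 60 classes
speedup_features = [0.0, 0.0, 0.0]

plt.plot(features, speedup_features, marker='o')
plt.xlabel('Number of features')
plt.ylabel('Speedup (CPU time / FPGA time)')
plt.title('Speedup vs. features (100000 samples, 60 classes)')
plt.show()
```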

## Logistic Regression Training

In the previous section we accelerated the classification part of the Naive Bayes algorithm using scikit-learn and the InAccel Python API, and took measurements of the speedup observed.

In this section we are going to use scikit-learn again, but this time we will focus on accelerating the training part of another widely used machine learning algorithm, Logistic Regression. We have created a notebook that is used to train and apply many accelerated sklearn models, with a k-fold cross-validation and hyperparameter-tuning step. In machine learning, **hyperparameter optimization** (or tuning) is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.

From the **jupyterhub** dashboard open the **LogisticRegression.ipynb** file. The administrator of the lab has already installed all the necessary bitstreams publicly, so issuing an `inaccel bitstream list` command will reveal any installed Logistic Regression bitstreams. Logistic Regression (*LR*) training takes several hyperparameters which can affect the accuracy of the learned model. There is no "best" configuration for all datasets. To get the optimal accuracy, we need to tune these hyperparameters based on our data.

The accelerator has some limitations, as described below:

1. Max number of classes: 64
2. Max number of features: 2047

Objectives:

1. Run all the cells and observe the speedup you get compared to the software execution.
2. Focus only on the FPGA-accelerated part. Change the parameter grid and add the following (see the sketch after this list):
   1. **max_iter**: 50, 100
   2. **l1_ratio**: 0.3, 0.9
3. Re-run the cells related to the FPGA-accelerated training.
4. Which combination of **max_iter** and **l1_ratio** provides the best model in terms of accuracy?
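For item 2, here is a hedged sketch of what such a grid search could look like with plain scikit-learn. The notebook's FPGA-accelerated estimator may expose a different class, but the parameter grid itself is the same; note that in stock scikit-learn `l1_ratio` requires `penalty='elasticnet'` with `solver='saga'`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset, stand-in for the notebook's data
X, y = make_classification(n_samples=10000, n_features=100, n_informative=50,
                           n_classes=10, random_state=42)

# Parameter grid with the values from the objectives above
param_grid = {'max_iter': [50, 100], 'l1_ratio': [0.3, 0.9]}

# k-fold cross-validation over every combination in the grid
grid = GridSearchCV(LogisticRegression(penalty='elasticnet', solver='saga'),
                    param_grid, cv=5)
grid.fit(X, y)

print('Best parameters:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)
```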
