Accelerating Machine Learning Model Training and Inference with Scikit-Learn


Intelligent systems are becoming more widely used in today’s world. Machine learning engineers and data scientists, however, continue to struggle with minimizing the time spent training and tuning models, as well as the latency in returning predictions. As the amount of data processed grows, this becomes even more important. The difficulty of accelerating model training and inference is especially clear with scikit-learn algorithms, which lack direct GPU support.

In this article, you will learn about the Intel extension for scikit-learn and how to use it to accelerate model training and inference. You will also build a musical instrument classification system and see the speedup that the Intel extension brings to scikit-learn.

Intel Extension for Scikit-Learn

The Intel extension for scikit-learn is a part of Intel’s AI Analytics Toolkit. This toolkit accelerates machine learning model training and inference on Intel devices. In particular, the Intel extension for scikit-learn accelerates compute-intensive tasks frequently performed with scikit-learn. As a result, models take less time to train and to finish inference, which speeds up the machine learning process.

The Intel extension accelerates the underlying computations of some scikit-learn algorithms partly by using vector processing, in which a single instruction operates on several elements of a data array at once. It also uses multithreading and memory optimizations tailored to Intel architecture-based hardware. In addition, it integrates directly into your existing code with minimal changes.

Using the Intel Extension to Accelerate Model Training and Inference with Scikit-Learn

You will be using the IRMAS dataset to build a simple system that identifies the musical instruments in a given audio signal. We will use part 1 of the dataset in this article.

Downloading the Audio Files

To download the data, open a terminal and run the command below. This downloads a zipped file named IRMAS-TestingData-Part1.zip into your current working directory.

$ wget https://zenodo.org/record/1290750/files/IRMAS-TestingData-Part1.zip

Next, to uncompress this file and get access to the audio data, run the commands below. This saves the extracted files in the data/IRMAS-TestingData-Part1/Part1/ directory.

$ mkdir data
$ unzip IRMAS-TestingData-Part1.zip -d data/

Creating the Melspectogram Extraction Function

Melspectograms are essentially a visual representation of how the frequency content of a signal, in this case an audio signal, changes over time. The frequencies, however, are converted to the mel scale, a roughly logarithmic scale that spaces frequencies similarly to how the human hearing system perceives them. Preprocessing audio signals in this manner provides the models with a representation that makes learning easier.
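To make the mel scale concrete, here is a minimal sketch of one common Hz-to-mel conversion, the HTK-style formula. The helper name hz_to_mel is hypothetical, and librosa uses a slightly different (Slaney-style) variant by default, so treat this as an illustration rather than what the rest of this article computes.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert a frequency in Hz to mels using the HTK-style formula.

    Illustrative only: librosa's default mel scale is a different variant.
    """
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


# Equal steps in Hz shrink into smaller and smaller steps in mels
# as the frequency increases.
for hz in (0, 440, 1000, 4000, 8000):
    print(f"{hz:>5} Hz -> {hz_to_mel(hz):8.1f} mel")
```

Note how the gap between 4000 Hz and 8000 Hz is much smaller in mels than the gap between 0 Hz and 4000 Hz, which mirrors how pitch differences become harder to hear at high frequencies.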

Before moving on, install numpy, tqdm, and librosa into your environment using the pip command below with the quiet flag. These are the packages we need to run the code in this section.

$ pip install numpy tqdm librosa -q

Next, create a function named extract_melspectogram_batch to extract melspectograms from the audio files.

import os
from tqdm.notebook import tqdm
import numpy as np
import librosa

def extract_melspectogram_batch(filepath, n_fft=1024, len_segment = 10):
    signal, sample_rate  = librosa.load(filepath)
    duration = librosa.get_duration(y=signal, sr=sample_rate)
    n_samples_per_segment = sample_rate * len_segment
    n_segments = int(duration / len_segment)
    melspectograms = []
    for i in range(n_segments):
        start = i * n_samples_per_segment
        stop = start + n_samples_per_segment
        melspectogram = librosa.feature.melspectrogram(y=signal[start:stop],
                                                       sr=sample_rate,
                                                       n_fft=n_fft,
                                                       )
        melspectograms.append(melspectogram.tolist())
    return melspectograms, n_segments

The audio file path, the length of the FFT window in samples (n_fft), and the duration of each melspectogram in seconds (len_segment) are all inputs for this function. The latter is crucial because the lengths of the music files vary, and in order to learn from them, one needs features with equal dimensions.

In the function’s body, first load the file using librosa.load. This returns the signal and the audio sampling rate. To determine the length of the audio file, pass these values into the librosa.get_duration function. By multiplying the sample rate by the segment length, you can calculate the number of samples per segment. Additionally, determine how many full 10-second segments the file has by dividing the duration by the length of a segment and discarding the remainder.

Next, make a list to store the spectrograms, and then loop through all of the segments. In each iteration, define the start and stop points, which determine what part of the signal is processed. Following that, pass the signal slice, the sample rate, and the FFT window length into the librosa.feature.melspectrogram function to generate the melspectogram. Then, convert it to a list using the numpy.ndarray.tolist method and append it to the list created earlier.

Finally, return the extracted spectrograms and the number of segments. 
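The segmentation arithmetic described above can be sketched on its own, without loading any audio. The helper segment_bounds below is hypothetical and only mirrors the index calculation in extract_melspectogram_batch; it is not part of librosa.

```python
def segment_bounds(n_samples, sample_rate, len_segment=10):
    """Return (start, stop) sample indices for each full segment.

    Any trailing audio shorter than one full segment is dropped,
    as in the extraction function above.
    """
    samples_per_segment = sample_rate * len_segment
    n_segments = n_samples // samples_per_segment
    return [
        (i * samples_per_segment, (i + 1) * samples_per_segment)
        for i in range(n_segments)
    ]


# A 25-second signal sampled at 100 Hz yields two full 10-second
# segments; the last 5 seconds are discarded.
print(segment_bounds(n_samples=2500, sample_rate=100))
```

Because every segment covers the same number of samples, every melspectogram ends up with the same shape, which is what makes the features usable by a model later on.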

Extracting the Melspectograms From the Audio Files

Next, use the code below to extract spectrograms from all the files.

pbar = tqdm(total = 807)
data = {'melspectograms': [], 'filename': []}
for filename in os.listdir('data/IRMAS-TestingData-Part1/Part1/'):
    if filename[-3:] == 'wav':
        melspectograms, n_segments = extract_melspectogram_batch(f'data/IRMAS-TestingData-Part1/Part1/{filename}')
        data['melspectograms'].extend(melspectograms)
        data['filename'].extend([filename[:-4]] * n_segments)
        pbar.update(1)

First, instantiate a progress bar object and pass in the total number of audio files. After that, make a dictionary to hold the melspectograms and the filename associated with each melspectogram. The following step involves looping through the list of all the filenames in your data directory. Then, use a condition to make sure that you are dealing with an audio file, because this directory also contains text files that list the instruments that go with each audio file.

Once this is certain, call the extract_melspectogram_batch function to generate the melspectograms. Then, add the generated spectrograms to the melspectograms list in the dictionary you previously created. Do the same for the filenames: since one file can have multiple segments, create a list that contains the filename n_segments times and extend the filename list with it. Finally, update the progress bar by 1 and run the code.
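The progress bar total above is hard-coded to 807. If your copy of the data differs, one option is to count the audio files first. The helper count_wav_files below is a hypothetical convenience, not part of the original workflow.

```python
import os


def count_wav_files(directory):
    """Count the files in `directory` whose names end with '.wav'."""
    return sum(1 for name in os.listdir(directory) if name.endswith(".wav"))


# Possible usage with the directory from this article (assumes the
# data has already been downloaded and extracted):
# pbar = tqdm(total=count_wav_files('data/IRMAS-TestingData-Part1/Part1/'))
```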

Prepare the Data

After we have extracted the melspectograms, the next step is to prepare the target values. These values represent the instruments in a given musical piece and we will get them from the text files in the data directory. Since the text files are named like the audio files, to access them we simply have to change the extension. Look below to see how you can do this.

labels = []
pbar = tqdm(total=len(data['filename']))
for filename in data['filename']:
    with open(f'data/IRMAS-TestingData-Part1/Part1/{filename}.txt') as f:
        labels.append([i.strip() for i in f.read().strip().split('\n')])

    pbar.update(1)

First, create a list called labels to store the instrument identifiers. Also instantiate the progress bar and pass in the length of the filenames list. Next, loop through the filenames. In each iteration, build the file path by appending the ".txt" extension to the filename. Then open the file and read its contents. Since some audio might contain multiple instruments, split the file’s content by newline and strip whitespace from each entry. This creates a list of instrument identifiers. Append this list to the labels list and update the progress bar.
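The parsing step can be tried in isolation on a string. In the sketch below, parse_labels is a hypothetical helper that applies the same split-and-strip logic as the loop above, and the sample text is made up for illustration rather than copied from a dataset file.

```python
def parse_labels(text):
    """Split an annotation file's content into a list of instrument labels."""
    return [line.strip() for line in text.strip().split("\n")]


# Illustrative content: two labels, each followed by trailing whitespace.
sample = "gel\t\nvoi\t\n"
print(parse_labels(sample))
```

Stripping the whole text first removes the final newline, so no empty label is produced for the last line.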

Inspecting the labels list confirms that some audio contains multiple instruments. This means that multiple instrument labels can be assigned to each audio file, making this a multi-label classification problem. However, since the labels list contains sublists of uneven length, it cannot be converted into a regular NumPy array and used directly. To fix this, we will binarize the labels.

Before moving on, run pip install pandas scikit-learn scikit-learn-intelex in a terminal to install the final set of modules to be used.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

labels = pd.Series(labels)

multi_label_binarizer = MultiLabelBinarizer()

labels = pd.DataFrame(data=multi_label_binarizer.fit_transform(labels),
                   columns=multi_label_binarizer.classes_,
                   index=labels.index)

Import pandas and the MultiLabelBinarizer. Next, convert the labels list to a pandas Series object and create an instance of the MultiLabelBinarizer. Then you can instantiate a pandas.DataFrame object. This takes in three parameters. The first is data, which is the result of fitting and transforming the labels object with the multi-label binarizer. The next is columns, which holds the instrument names obtained from the multi_label_binarizer.classes_ attribute. The last is index, which is set to the index of the pandas.Series object created earlier.

This converts the labels into a binary indicator (multi-hot) table with one column per unique instrument class. Each row contains a 1 for every instrument present in that segment and a 0 everywhere else.
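To see what this encoding produces, here is a plain-Python sketch of the same idea. The function binarize_labels is hypothetical and only approximates what MultiLabelBinarizer does for simple string labels; the label values are illustrative.

```python
def binarize_labels(label_lists):
    """Turn lists of labels into sorted class names plus 0/1 indicator rows."""
    classes = sorted({label for labels in label_lists for label in labels})
    rows = [
        [1 if cls in labels else 0 for cls in classes]
        for labels in label_lists
    ]
    return classes, rows


classes, rows = binarize_labels([["gel", "voi"], ["pia"], ["pia", "voi"]])
print(classes)
for row in rows:
    print(row)
```

Every row now has the same length regardless of how many instruments the segment contains, which is what makes the labels usable as a regular array.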

Finally, convert the melspectograms into an array and split the data into training and testing portions in the ratio of 4:1 as shown below.

from sklearn.model_selection import train_test_split

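# Flatten each 2-D melspectogram into one feature vector per segment.
# The original run had 1258 segments; using len() avoids hard-coding that count.
melspectograms = np.array(data['melspectograms']).reshape(len(data['melspectograms']), -1)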

X_train, X_test, y_train, y_test = train_test_split(
    melspectograms, labels, test_size=0.2, random_state=0
)

In this snippet, you import the train_test_split function from sklearn.model_selection. The melspectograms are converted into an array and reshaped so that each 2-D spectrogram is flattened into a single feature vector. The melspectograms, the labels, a test size of 0.2, and a random state of 0 are then passed into train_test_split.
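The flattening step is easier to picture on a small placeholder array. The shape below is made up for illustration and is much smaller than a real batch of melspectograms.

```python
import numpy as np

# Placeholder batch: 5 "spectrograms", each with 3 frequency bins and 4 frames.
batch = np.arange(5 * 3 * 4).reshape(5, 3, 4)

# Keep the first axis (one row per spectrogram) and merge the rest.
flat = batch.reshape(len(batch), -1)

print(batch.shape)
print(flat.shape)
```

Each row of flat holds all the values of one spectrogram laid out end to end, which is the 2-D samples-by-features layout that scikit-learn estimators expect.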

Build Model with Intel Extension and Sklearn

At this point, you have obtained raw audio files, extracted melspectograms, and processed the instrument labels. You have also split this data into training and testing portions. The next step is building the model and calculating the run time. 

First, import the required modules, which include the timer and the patch_sklearn function from sklearnex. Make sure to call patch_sklearn before importing the support vector classifier (SVC) from scikit-learn; otherwise, the original, unaccelerated implementation is imported. The OneVsRestClassifier is also imported to handle the multi-label classification, as the SVC does not accept multi-label targets on its own. Finally, hamming_loss is imported as a metric.

from timeit import default_timer as timer
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss

start = timer()

params = {'gamma': 0.0699504493741883, 'C': 3}
model = OneVsRestClassifier(SVC(**params))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = hamming_loss(y_test, y_pred)

sklearnex_runtime = round(timer() - start, 2)
print(f"Intel(r) extension for Scikit-learn training and inference time: {sklearnex_runtime} s")

Start the timer after the import statements. Also declare the required parameters, which are the gamma (kernel coefficient) and C (regularization parameter) in this case. Then, create the SVC model, unpack these parameters into it, and wrap a OneVsRestClassifier around the SVC instance. 

Following this, pass the training spectrograms and instrument labels into the model.fit method. Then, call the model.predict method on the test melspectograms and calculate the loss using hamming_loss. The total run time is computed, rounded off to 2 decimal places, and printed. This gives you a value of 36.24 seconds.
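For intuition about the metric, the Hamming loss is the fraction of individual label slots that are predicted incorrectly. The function hamming_loss_manual below is a hypothetical plain-Python sketch of that definition for binary indicator rows; it is not scikit-learn’s implementation.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of label positions where prediction and truth disagree."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


# Two samples with two labels each; one of the four slots is wrong.
y_true = [[1, 0], [0, 1]]
y_pred = [[1, 1], [0, 1]]
print(hamming_loss_manual(y_true, y_pred))
```

A value of 0 means every label of every sample was predicted correctly, so lower is better.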

Build Model With Sklearn Only and Compare

A similar process is carried out using Scikit-Learn’s original setup. To achieve this, the sklearnex.unpatch_sklearn method is called and then the support vector model is imported again alongside the OneVsRestClassifier and hamming_loss metric.

import sklearnex
sklearnex.unpatch_sklearn()

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss

start = timer()

params = {'gamma': 0.0699504493741883, 'C': 3}
model = OneVsRestClassifier(SVC(**params)).fit(X_train, y_train)
y_pred = model.predict(X_test)
score = hamming_loss(y_test, y_pred)

original_sklearn_runtime = round(timer() - start, 2)
print(f"Original Scikit-learn training and inference time: {original_sklearn_runtime} s")

The rest of the code remains unchanged. In this case, a run time of 666.83 seconds is recorded. This means that using the Intel extension for scikit-learn led to a roughly 18x speedup compared to using scikit-learn as it is. This demonstrates the extension’s ability to accelerate model training and inference.
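The speedup figure is simply the ratio of the two run times reported above. The helper speedup below is a hypothetical name used only for this calculation.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run was."""
    return baseline_seconds / accelerated_seconds


# Run times reported in this article, in seconds.
original_sklearn_runtime = 666.83
sklearnex_runtime = 36.24

print(f"Speedup: {speedup(original_sklearn_runtime, sklearnex_runtime):.1f}x")
```

Run times depend on hardware, data size, and model parameters, so expect a different ratio on your own machine.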

Conclusion

At this point, you know about the Intel extension for scikit-learn. You also know how to process audio data, extract useful features from it, and process target values for a multi-label classification problem. You compared the performance of scikit-learn with and without the Intel extension and saw that using the extension gave a significant speed advantage.

Thus, you can conclude that the Intel extension helps you experiment quickly and build efficiently. And this is only a part of Intel’s AI Analytics Toolkit. The kit includes support for even more libraries and models, which can speed up your day-to-day operations. Make sure to try it out and keep building.


Fortune is a Python developer with a knack for building intelligent systems with data. He works as a machine learning engineer at the NITDA Hub. He is also a technical writer and a process engineer with a focus on sustainable energy systems.

