Fine-Tuning NLP Models With Hugging Face
Written by Tigran Avetisyan
This is Part 2 of our 3-part series on Hugging Face.
See Part 1 here.
Contents
- Why Fine-Tune Pre-trained Hugging Face Models On Language Tasks
- Fine-Tuning NLP Models With Hugging Face
- Step 1 — Preparing Our Data, Model, And Tokenizer
- Step 2 — Data Preprocessing
- Step 3 — Setting Up Model Hyperparameters
- Step 4 — Training, Validation, and Testing
- Step 5 — Inference
Introduction
We recently had a look at Hugging Face — an NLP framework that allows you to get started with language-related tasks without training your own neural network. We explored how to do inference with Hugging Face. More specifically, we learned how to perform API requests to solve language problems and how to access pre-trained models to do inference.
Today, we are going to take one step further by trying transfer learning with Hugging Face models! The Hugging Face platform has thousands of pre-trained models for you to use — if you don’t have much time on your hands for training your own neural net, picking a model and fine-tuning it may be the right option for you!
Let’s get going!
Why Fine-Tune Pre-trained Hugging Face Models On Language Tasks?
So before getting started, let’s try to understand why you would even want to use a framework like Hugging Face for your language tasks. Wouldn’t it be better to train your own model on your own data from scratch?
Well, unless you have a lot of time and compute resources on your hands, you are better off leveraging transfer learning. There are two reasons to employ transfer learning with Hugging Face instead of training a model from scratch:
- Hugging Face models have been trained on huge corpora of data — larger than you may ever be able to collect for your own use cases. This potentially allows you to achieve excellent results with comparatively little data.
- Since you don’t need to fully retrain Hugging Face models with enormous datasets, there will be no need for you to wait days or perhaps even weeks for training to complete. Hugging Face models can have tens and even hundreds of millions of parameters — can you afford to wait for weeks while your model is training?
Hugging Face vividly demonstrates how transfer learning accelerates this process. It takes a large amount of resources to turn a bare model into something that can produce valid results; thanks to transfer learning, you can skip that time-consuming pre-training step and spend comparatively little time fine-tuning the model to your own dataset.
In fact, Hugging Face’s pre-trained models are capable of excelling at tasks in various areas without any additional fine-tuning. And with extra training on small but targeted datasets, you will likely be able to adapt these models to a wide range of situations.
So essentially, we can view transfer learning as a kind of shortcut in training. You can save thousands of hours and tens of thousands of compute dollars just by using pre-trained language models! Unless your tasks are highly specific and can’t be solved with existing models, you should stick to transfer learning.
Fine-Tuning NLP Models With Hugging Face
Now that we understand the uses and benefits of transfer learning, we can proceed to our fine-tuning guide with Hugging Face!
Generally, fine-tuning involves the following steps:
- Collecting data for training.
- Selecting a model that has been pre-trained on data that’s similar to your dataset.
- Preprocessing data in accordance with the model’s expected input format.
- Fine-tuning and training our model until we achieve sufficient performance.
Note that with transfer learning, pre-trained models aren’t fully retrained — only the output layers are trained on the new dataset, while the rest is left as-is.
This is the formula that we will be basing our fine-tuning efforts on!
With that, we can now begin transfer learning with Hugging Face! Note that we will be using pre-trained tokenizers and Hugging Face datasets to simplify the guide. But if you want, you could train your own tokenizer from scratch.
Step 1 — Preparing Our Data, Model, And Tokenizer
To get started, we need to:
- Prepare our data. For this tutorial, we are going to be using the TweetEval dataset intended for multi-class classification.
- Load a pre-trained model and its corresponding tokenizer. We will be using DistilBERT base uncased, which has about 67 million parameters.
When using Hugging Face datasets, loading data is very easy. First, if you haven’t already done so, you need to install the Datasets package with the following command:
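The package is published on PyPI under the name datasets:

```bash
pip install datasets
```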
If you are using Conda, the installation command would be this:
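At the time of writing, Hugging Face distributes the package through the huggingface and conda-forge channels, so the Conda command should be:

```bash
conda install -c huggingface -c conda-forge datasets
```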
Once you have everything set up, import the necessary libraries and load the dataset, like so:
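A minimal version of this step looks something like the sketch below; the variable name dataset is our own choice, and the path and config arguments are explained right after:

```python
from datasets import load_dataset

# Load the "emotion" configuration of the TweetEval dataset
dataset = load_dataset("tweet_eval", "emotion")
```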
We use the load_dataset function to load our data, passing the following arguments:
- path, which is the URL or the directory of the dataset.
- name, the dataset configuration we want to load (if applicable).
Consult the Datasets documentation for more information about its loading methods.
You can find the path of the dataset you intend to use at the very top of its webpage.
As for the configs, if your dataset has any, they should be described on the dataset’s webpage. For this tutorial, we’ve selected the “emotion” config — it’s intended to help us identify emotions based on tweet content. This particular configuration has four labels: anger, joy, optimism, and sadness.
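If you’d rather check the labels programmatically, the Datasets library exposes them through the features attribute of a split (a quick sketch, assuming the dataset variable from above):

```python
# Prints the class names, e.g. ['anger', 'joy', 'optimism', 'sadness']
print(dataset["train"].features["label"].names)
```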
Next, we instantiate our DistilBERT model along with its tokenizer:
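Something along these lines should do it. We assume the distilbert-base-uncased checkpoint here, which is the Hub name of DistilBERT base uncased:

```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=4 gives us an output layer with one unit per emotion class
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)
```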
Like in our inference guide, we are using auto classes to instantiate the tokenizer and model. We simply need to pass our model’s name to the respective auto classes to get the right tokenizer and model.
Notice that we passed 4 to the num_labels parameter of the auto model’s from_pretrained method. This is so that we get the right output shape for our dataset.
The preparatory steps end here — now, we can start preprocessing our data!
Step 2 — Data Preprocessing
Inspecting the Dataset
Let’s have a look at the contents of our dataset:
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 3257
})
test: Dataset({
features: ['text', 'label'],
num_rows: 1421
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 374
})
})
Our dataset is a DatasetDict object — that is, a Datasets dictionary object that contains our Dataset data splits. The Datasets library uses the DatasetDict and Dataset classes to represent data.
As we can see from the output above, our dataset contains training, validation, and test sets. We can access these sets via their corresponding keys (“train”, “validation”, and “test”). Overall, there are 5,052 samples in the dataset, about 64.5% of which is in the training set, 7.4% in the validation set, and 28.1% in the test set.
Let’s inspect the training set to understand what our dataset contains:
Dataset({
features: ['text', 'label'],
num_rows: 3257
})
Our train set contains features “text” and “label.” If we inspect these, we can see that they contain tweets and their corresponding labels:
Sequence samples:
["“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer. #motivation #leadership #worry", "My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs"]
Label samples:
[2, 0]
Labels are already in numeric form, so we won’t have to transform them for training. However, we will need to transform numeric predictions to their corresponding string labels during inference to understand the output of our model. For this purpose, let’s make the following dictionary:
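The mapping below assumes the label order of TweetEval’s “emotion” configuration (you can verify it with the features attribute we showed earlier):

```python
# Maps numeric labels to human-readable class names
class_names = {0: "anger", 1: "joy", 2: "optimism", 3: "sadness"}
```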
Another thing we should do is figure out how many words the longest sequence in the dataset contains. This is important because we need to make sure that our sequences all have the same length — otherwise, we won’t be able to pack our data into TF Tensors. Besides, there is no point in padding beyond the longest sequence in the dataset.
We can use this function to find out the length of the longest sequence in each of the data splits:
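Here is one way to write it (a sketch; the function name is our own):

```python
def longest_sequence_length(split):
    # Find the sample whose "text" field contains the most words,
    # then return that word count
    longest = max(split["text"], key=lambda text: len(text.split()))
    return len(longest.split())
```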
Here, we pass a custom key function to Python’s max function. Overall, the helper:
- Splits each text sequence in the dataset into a list of words via the split method of the str class. By default, split uses whitespace as a separator, which allows us to break strings down into words.
- Retrieves the lengths of the resulting word lists.
- Obtains the sequence that has the most words.
- Splits the longest sequence into a list of words again.
- Returns the length of the longest sequence.
And here’s what we get if we call this function on our data splits:
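For example, using the longest_sequence_length helper defined above:

```python
print(f"Longest sequence in train set has {longest_sequence_length(dataset['train'])} words")
print(f"Longest sequence in val set has {longest_sequence_length(dataset['validation'])} words")
print(f"Longest sequence in test set has {longest_sequence_length(dataset['test'])} words")
```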
Longest sequence in train set has 33 words
Longest sequence in val set has 32 words
Longest sequence in test set has 36 words
So our longest sequence is in the test set and has 36 words!
Filtering, Padding, and Tokenizing Our Dataset
Our next step is to preprocess our dataset and make it ready for training.
We’ve just determined that the longest sequence in the dataset has 36 words. To be able to transform our data into TF Tensors, we need to make sure that our sequences all have the same length. In the case of our dataset, that would mean that our sequences must all be 36 words long.
However, long sequences may slow training down, and if you have little RAM or VRAM, you may not even be able to use the full dataset for training! Fortunately, we can work around this issue by discarding sequences beyond a certain length from our dataset. This can be done with a function like this:
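A sketch of such a function (the name and argument order are our own choices):

```python
def filter_dataset(dataset, max_words):
    # Keep only the samples whose "text" field has at most max_words words
    return dataset.filter(lambda sample: len(sample["text"].split()) <= max_words)
```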
In this function, we:
- Apply the filter method of the DatasetDict class to the elements under the key “text” in each of our sets.
- Split each string into a list containing words via the split method.
- Obtain the length of the word list and compare it against our specified number of words.

The filter method returns a new DatasetDict object that keeps only the sequences that are no longer than specified.
Then, we specify our word limit and call our function to shrink the dataset. We’ve set the limit to 36 words — you could try plugging in another number to see how it affects training!
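With the helper above, that boils down to:

```python
max_words = 36

filtered_dataset = filter_dataset(dataset, max_words)
```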
If we inspect our filtered dataset, we’ll see that it still contains all the samples from the original dataset; no sequence exceeds 36 words, so nothing actually gets filtered out:
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 3257
})
test: Dataset({
features: ['text', 'label'],
num_rows: 1421
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 374
})
})
Next, to be able to use sequences for training, we need to turn them into numeric form. That’s where the tokenizer we’ve loaded earlier comes into play!
We can tokenize our entire dataset in one go by just mapping this function to it:
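The tokenizing function can be as simple as the sketch below; we assume padding (and truncating) to a fixed length of 36 tokens so that every encoded sequence ends up the same size:

```python
def tokenize_dataset(examples):
    # Encode the tweets, padding everything to 36 tokens
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=36,
    )
```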
The tokenizer will encode our text sequences and will ensure that all of our sequences have the same number of components. This will be achieved via padding.
We map tokenize_dataset to the dataset by calling the dataset’s map method:
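Assuming the filtered_dataset from the previous step:

```python
# Apply the tokenizer to every split; batched=True processes samples in chunks for speed
tokenized_dataset = filtered_dataset.map(tokenize_dataset, batched=True)
```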
We pass the tokenize_dataset function to map, which then applies our function to each of the splits in the dataset. The result of the mapping is as follows:
DatasetDict({
train: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 3257
})
test: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 1421
})
validation: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 374
})
})
As you can see, we have two new features in the dataset:
- “input_ids” — our sequences converted to numerical form.
- “attention_mask” — an array that shows which IDs (or words) in the sequences the model should and should not pay attention to.
To see these features in action, let’s have a look at one training sample:
{'text': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer. #motivation #leadership #worry", 'label': 2, 'input_ids': [101, 1523, 4737, 2003, 1037, 2091, 7909, 2006, 1037, 3291, 2017, 2089, 2196, 2031, 1005, 1012, 11830, 11527, 1012, 1001, 14354, 1001, 4105, 1001, 4737, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
This particular training sample was padded since it had originally been shorter than our specified length of 36. In “input_ids”, the ID 0 corresponds to padding.
We can see that the attention mask assigned 0 importance to padding tokens, effectively meaning that the model won’t be basing its outputs on them. This makes sense since the purpose of padding tokens is only to make sure that our sequences are of the same length.
Preparing Features and Labels
Our next step is breaking our DatasetDict down into separate splits for training and then converting them into TF Tensors. To do this, we first extract the features (i.e. input IDs and attention masks) from the dataset by removing unnecessary data columns — “text” and “label”. We’ll assign labels to different Tensors later.
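A sketch of this step; the helper and variable names are our own, and we simply loop over the field names the tokenizer reports and convert each column to a TF tensor:

```python
import tensorflow as tf

def extract_features(split):
    # Keep only the fields the model expects ("input_ids" and "attention_mask"),
    # dropping "text" and "label" in the process
    return {name: tf.convert_to_tensor(split[name]) for name in tokenizer.model_input_names}

train_features = extract_features(tokenized_dataset["train"])
val_features = extract_features(tokenized_dataset["validation"])
test_features = extract_features(tokenized_dataset["test"])
```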
For your reference, tokenizer.model_input_names contains the names of the input data fields that the model expects:
['input_ids', 'attention_mask']
So what we’ve done above is essentially loop through these names and extract values located under the corresponding key in each of our data splits. This gives us features in TF Tensors:
{'input_ids': <tf.Tensor: shape=(3257, 36), dtype=int64, numpy=
array([[ 101, 1523, 4737, ..., 0, 0, 0],
[ 101, 2026, 18328, ..., 0, 0, 0],
[ 101, 2053, 2021, ..., 0, 0, 0],
...,
[ 101, 1030, 5310, ..., 0, 0, 0],
[ 101, 2017, 2031, ..., 0, 0, 0],
[ 101, 1030, 5310, ..., 0, 0, 0]], dtype=int64)>, 'attention_mask': <tf.Tensor: shape=(3257, 36), dtype=int64, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]], dtype=int64)>}
Now, let’s make our labels. We’ll be one-hot encoding them via Keras’s to_categorical function:
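One way to do this, assuming the tokenized_dataset from earlier:

```python
from tensorflow.keras.utils import to_categorical

# One-hot encode the integer labels of each split (4 emotion classes)
train_labels = to_categorical(tokenized_dataset["train"]["label"], num_classes=4)
val_labels = to_categorical(tokenized_dataset["validation"]["label"], num_classes=4)
test_labels = to_categorical(tokenized_dataset["test"]["label"], num_classes=4)
```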
Our labels now take the following form:
[[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]]
Creating Datasets for Training, Validation, and Testing
And finally, let’s use our features and labels to create TensorFlow Datasets:
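A sketch using tf.data; we batch everything into groups of 8 samples and shuffle only the training set:

```python
train_dataset = (
    tf.data.Dataset.from_tensor_slices((train_features, train_labels))
    .shuffle(len(train_labels))
    .batch(8)
)
val_dataset = tf.data.Dataset.from_tensor_slices((val_features, val_labels)).batch(8)
test_dataset = tf.data.Dataset.from_tensor_slices((test_features, test_labels)).batch(8)
```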
When iterated over, our Dataset objects will supply the model with batches of features and labels for training, validation, and testing.
You may play around with batch sizes if you want — the choice of batches with 8 samples is arbitrary. Note that if you increase the batch size, you may need to increase the model’s learning rate, and vice versa. Larger batches will require more RAM/VRAM as well.
Step 3 — Setting Up Model Hyperparameters
Our next step would be to optimize hyperparameters. This isn’t going to be a huge focus for us today since the goal of this guide is to introduce you to the basics of transfer learning with Hugging Face.
But if you’ve got a good grasp of hyperparameter tuning, know that the same rules apply here. If you’ve ever worked with Keras, you already know how this is done — Hugging Face models in TF format are handled in the same way as Keras models.
Freezing DistilBERT Weights
Let’s first inspect our model architecture to get a general idea of what we’re dealing with. Since TF models in Hugging Face are compatible with the Keras API, we can inspect our DistilBERT model with the help of the summary method:
Model: "tf_distil_bert_for_sequence_classification" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= distilbert (TFDistilBertMain multiple 66362880 _________________________________________________________________ pre_classifier (Dense) multiple 590592 _________________________________________________________________ classifier (Dense) multiple 3076 _________________________________________________________________ dropout_19 (Dropout) multiple 0 ================================================================= Total params: 66,956,548 Trainable params: 66,956,548 Non-trainable params: 0 _________________________________________________________________
We can see that our model has about 67 million parameters and consists of four “blocks”:
- A large block that represents the DistilBERT model.
- A pre-classifier dense layer.
- A classifier dense layer.
- A dropout layer.
At the moment, all the layers are set to be trainable. We could leave the model as-is and train it from scratch, but if we want to leverage all the data that the model has been trained on (i.e. do transfer learning), we should freeze the DistilBERT block’s weights.
We can access the DistilBERT block via the layers attribute of the model. layers is a list that contains the blocks/layers we’ve had a look at earlier.
And here’s how we freeze the DistilBERT block:
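Judging by the summary above, the DistilBERT block is the first entry in model.layers, so freezing it takes one line:

```python
# Freeze the pre-trained DistilBERT weights; only the classification head stays trainable
model.layers[0].trainable = False
```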
If we call summary again, we can see the difference that freezing the weights made:
Model: "tf_distil_bert_for_sequence_classification" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= distilbert (TFDistilBertMain multiple 66362880 _________________________________________________________________ pre_classifier (Dense) multiple 590592 _________________________________________________________________ classifier (Dense) multiple 3076 _________________________________________________________________ dropout_19 (Dropout) multiple 0 ================================================================= Total params: 66,956,548 Trainable params: 593,668 Non-trainable params: 66,362,880 _________________________________________________________________
Only the parameters in the pre-classifier and classifier layers are now trainable.
Making A Learning Rate Schedule
Another thing we are going to use to hopefully improve training results is a learning rate scheduler. The purpose of this scheduler will be to gradually reduce the learning rate as training goes on.
When the learning rate is kept unchanged throughout training, models often fail to converge to their best results. A reduced learning rate isn’t necessarily the key to achieving peak accuracy, but it often does improve training results.
We’ll be using Keras’s LearningRateScheduler class to set up a schedule. LearningRateScheduler requires us to define a function for the learning rate schedule — we can use something like this:
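The sketch below keeps the initial learning rate for the first ten epochs and then decays it exponentially; it is consistent with the learning rates printed in the training log further down, but the exact decay factor is a tunable choice:

```python
import math

def lr_schedule(epoch, lr):
    # Keep the initial learning rate for the first 10 epochs,
    # then shrink it more and more aggressively
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1 * epoch)
```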
We then instantiate our callback object as follows:
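Passing verbose=1 makes the callback print the learning rate at the start of every epoch, which is what you will see in the training log:

```python
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)
```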
Selecting Performance Metrics and Compiling Our Model
As our next step, we need to select performance metrics for our model and then compile it.
We’ll be using the following classes to compile our model:
- tf.keras.optimizers.Adam
- tf.keras.losses.CategoricalCrossentropy
- tf.keras.metrics.CategoricalAccuracy
We’re using categorical cross-entropy and categorical accuracy as our loss and accuracy metrics respectively because our labels are one-hot encoded.
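The compile call might look like this (we keep Adam’s default learning rate):

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.CategoricalAccuracy()],
)
```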
Notice the from_logits parameter for CategoricalCrossentropy. We’ve set this to True because our model outputs logits rather than probabilities.
For our Adam optimizer, we’ve used its default learning rate, but you could adjust it to see how it affects training!
Step 4 — Training, Validation, and Testing
To commence training with Hugging Face TF models, we just call the fit method of our model, passing our training and validation data, the desired number of epochs, and our training callbacks.
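With everything defined above, the call is a standard Keras fit; we train for 15 epochs to match the log below:

```python
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=15,
    callbacks=[lr_scheduler],
)
```

Training produces a log like the following: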
Epoch 1/15
Epoch 00001: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 28s 44ms/step - loss: 0.9866 - categorical_accuracy: 0.6088 - val_loss: 0.8948 - val_categorical_accuracy: 0.6176
Epoch 2/15
Epoch 00002: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 17s 41ms/step - loss: 0.8314 - categorical_accuracy: 0.6742 - val_loss: 0.8540 - val_categorical_accuracy: 0.6444
Epoch 3/15
Epoch 00003: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 16s 39ms/step - loss: 0.8004 - categorical_accuracy: 0.6874 - val_loss: 0.8355 - val_categorical_accuracy: 0.6390
Epoch 4/15
Epoch 00004: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 16s 39ms/step - loss: 0.7789 - categorical_accuracy: 0.6933 - val_loss: 0.8153 - val_categorical_accuracy: 0.6524
Epoch 5/15
Epoch 00005: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 16s 39ms/step - loss: 0.7630 - categorical_accuracy: 0.6976 - val_loss: 0.8000 - val_categorical_accuracy: 0.6658
Epoch 6/15
Epoch 00006: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 17s 42ms/step - loss: 0.7465 - categorical_accuracy: 0.7065 - val_loss: 0.8149 - val_categorical_accuracy: 0.6471
Epoch 7/15
Epoch 00007: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 17s 41ms/step - loss: 0.7303 - categorical_accuracy: 0.7086 - val_loss: 0.8147 - val_categorical_accuracy: 0.6444
Epoch 8/15
Epoch 00008: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 16s 38ms/step - loss: 0.7182 - categorical_accuracy: 0.7163 - val_loss: 0.8223 - val_categorical_accuracy: 0.6390
Epoch 9/15
Epoch 00009: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 16s 38ms/step - loss: 0.7012 - categorical_accuracy: 0.7215 - val_loss: 0.8358 - val_categorical_accuracy: 0.6337
Epoch 10/15
Epoch 00010: LearningRateScheduler reducing learning rate to 0.0010000000474974513.
408/408 [==============================] - 16s 39ms/step - loss: 0.6956 - categorical_accuracy: 0.7249 - val_loss: 0.8391 - val_categorical_accuracy: 0.6497
Epoch 11/15
Epoch 00011: LearningRateScheduler reducing learning rate to 0.0003678794586447782.
408/408 [==============================] - 16s 39ms/step - loss: 0.6354 - categorical_accuracy: 0.7525 - val_loss: 0.7942 - val_categorical_accuracy: 0.6738
Epoch 12/15
Epoch 00012: LearningRateScheduler reducing learning rate to 0.00012245643455377955.
408/408 [==============================] - 16s 40ms/step - loss: 0.6201 - categorical_accuracy: 0.7593 - val_loss: 0.7920 - val_categorical_accuracy: 0.6898
Epoch 13/15
Epoch 00013: LearningRateScheduler reducing learning rate to 3.688316883663751e-05.
408/408 [==============================] - 16s 39ms/step - loss: 0.6052 - categorical_accuracy: 0.7617 - val_loss: 0.7870 - val_categorical_accuracy: 0.6791
Epoch 14/15
Epoch 00014: LearningRateScheduler reducing learning rate to 1.0051835886692629e-05.
408/408 [==============================] - 17s 40ms/step - loss: 0.6055 - categorical_accuracy: 0.7673 - val_loss: 0.7874 - val_categorical_accuracy: 0.6765
Epoch 15/15
Epoch 00015: LearningRateScheduler reducing learning rate to 2.4787521134827945e-06.
408/408 [==============================] - 16s 39ms/step - loss: 0.5919 - categorical_accuracy: 0.7654 - val_loss: 0.7874 - val_categorical_accuracy: 0.6791
Our model is evidently overfitting since our training results are considerably better than validation results. Try to play around with some parameters to see if you can reduce overfitting (we’ll give you some ideas at the end of the guide).
After training, we use our test set to evaluate the performance of the model:
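Evaluation is the standard Keras call, assuming the test_dataset created earlier:

```python
model.evaluate(test_dataset)
```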
[==============================] - 6s 33ms/step - loss: 0.7024 - categorical_accuracy: 0.7277
[0.7024242877960205, 0.7276566028594971]
If, after fine-tuning, you are satisfied with your model, you can save it by calling its save_pretrained method:
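For example, with an arbitrary directory name:

```python
model.save_pretrained("fine_tuned_distilbert_emotion")
```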
You can later reload the model by passing the save directory to the auto class’s from_pretrained method — exactly like we did in the very beginning.
Step 5 — Inference
And as the last step in this guide, let’s have a look at inference.
There’s nothing too difficult about inference with Hugging Face — we just need to obtain our predictions and then convert them into text labels.
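First, we get raw predictions for the test set; on Hugging Face TF models, predict returns an output object whose logits field holds the raw scores:

```python
predictions = model.predict(test_dataset)
print(predictions)
```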
TFSequenceClassifierOutput(loss=None, logits=array([[-1.370903 , -4.7255187 , -0.5999131 , 3.622602 ], [ 1.3311445 , -1.4883113 , -0.37181637, -0.26223233], [-1.1938188 , -1.2004251 , -4.824672 , 4.268196 ], ..., [ 0.6109057 , -1.8429809 , -2.5267117 , 1.897653 ], [ 3.3085601 , -2.588659 , -2.983421 , -0.5218434 ], [-3.7007992 , 3.6551635 , -0.12156612, -1.4982461 ]], dtype=float32), hidden_states=None, attentions=None)
Model outputs are logits, so we could convert them into probabilities and then extract the indices of the classes with the highest probabilities.
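A sketch of that conversion (softmax for probabilities, argmax for the winning class index):

```python
probabilities = tf.nn.softmax(predictions.logits, axis=-1)
predicted_ids = tf.argmax(probabilities, axis=-1).numpy()
```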
We then convert these class indices to class names by using the class_names dictionary we’ve defined when inspecting our data:
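For instance, for the first ten predictions:

```python
predicted_classes = [class_names[int(i)] for i in predicted_ids[:10]]
print(predicted_classes)
```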
The result is as follows:
['sadness', 'anger', 'sadness', 'joy', 'sadness', 'anger', 'sadness', 'sadness', 'sadness', 'anger']
And to make sure that the model’s outputs make sense, let’s examine the predictions against input sequences.
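A quick way to do that is to decode the padded input IDs back into text and print each tweet next to its predicted class; this sketch looks at a handful of test samples:

```python
for input_ids, predicted_class in zip(tokenized_dataset["test"]["input_ids"][:8], predicted_classes):
    print("Tweet:", tokenizer.decode(input_ids))
    print("Predicted class:", predicted_class)
```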
Tweet: [CLS] @ user interesting choice of words... are you confirming that governments fund # terrorism? bit of an open door, but still... [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]
Predicted class: anger
Tweet: [CLS] my visit to hospital for care triggered # trauma from accident 20 + yrs ago and image of my dead brother in it. feeling symptoms of # depression [SEP] [PAD] [PAD] [PAD] [PAD]
Predicted class: sadness
Tweet: [CLS] @ user welcome to # mpsvt! we are delighted to have you! # grateful # mpsvt # relationships [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Predicted class: joy
Tweet: [CLS] what makes you feel # joyful? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Predicted class: sadness
Tweet: [CLS] # deppression is real. partners w / # depressed people truly dont understand the depth in which they affect us. add in # anxiety & amp ; makes [SEP]
Predicted class: sadness
Tweet: [CLS] i am revolting. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Predicted class: anger
Tweet: [CLS] rin might ever appeared gloomy but to be a melodramatic person was not her thing. \ n \ nbut honestly, she missed her old friend [SEP]
Predicted class: sadness
Tweet: [CLS] in need of a change! # restless [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Predicted class: sadness
The model could definitely be improved, but the point of this article was to highlight the main steps to get you started.
Stay tuned for part 3, coming soon!
Next Steps
And this is basically how you fine-tune Hugging Face models on your own specific datasets!
Note that our model was overfitting on the data and wasn’t producing particularly strong results. To improve it, you could try the following:
- Using a different sequence classification model.
- Freezing or unfreezing other model layers.
- Adding new layers on top of the model and/or removing existing ones.
- Creating your own data splits.
- Adjusting the learning rate.
- Adjusting the batch size.
Code
You can find all code for this article in the Jupyter notebook here.