Creating a Feature Store with Feast

Part 1: Building a Local Feature Store for ML Training and Prediction

Kedion
17 min read · Mar 16, 2022

Written by Tigran Avetisyan.

In today’s remote and distributed environments, it can be challenging to maintain a correct and consistent stream of data for machine learning training and prediction. When data is spread across several locations, keeping tabs on everything can be a total nightmare.

Here’s where feature storage frameworks come in. They can greatly simplify data management and improve consistency by gathering all your feature data into a single repository. You can then use this repository to analyze your data, train models, and fetch features for prediction.

Many feature store frameworks exist, but perhaps the most famous and widely used of them is Feast. Although still in early development, Feast supports a wide range of data sources and can vastly simplify the management of features stored across different locations.

In this 3-part series on the Feast framework, we are going to explore its capabilities for feature storage and data validation. In this guide — PART 1 in the series — we are going to focus on feature storage in a local environment.

Let’s get started!

What is Feast?

Feast is an open-source framework for storing and serving features to machine learning models. It aims to facilitate the retrieval of feature data from different sources and to this end provides a unified environment for feature management.

Here’s how the Feast docs position the framework in a machine learning data pipeline:

(Diagram from the Feast docs: https://docs.feast.dev/)

With Feast, machine learning and data science teams can:

· Store their features in offline or online repositories (more on the differences between offline and online repositories later).

· Combine features from different sources for training, analysis, and feature engineering.

· Retrieve fresh feature data for inference.

· Reuse features across different projects and models.

· Validate features to detect changes in their statistical makeup (limited support).

As of Feast version 0.19.3, Feast developers planned to add support for lightweight feature engineering and feature discovery, as well as to improve feature validation.

The Feast docs point out that the framework is not:

· An ETL (extract, transform, and load) or ELT (extract, load, and transform) system. Feast is not designed for general-purpose data transformation and pipelining.

· A data warehouse. Feast is not designed as a data warehouse solution or a source of truth for information.

· A data catalog. Feast is not designed for general-purpose data cataloging. It is specifically focused on features for machine learning.

How Does Feast Work?

In this section, we will explain to you how Feast works and how its infrastructure is built.

If you don’t understand something, don’t worry: we’ll show you how everything works in code after this section. That said, make sure to check out the Feast docs, which contain specific details that are beyond the scope of this guide.

Key concepts in Feast

Feast implements hierarchical feature storage to help you store and manage features. As of version 0.19.3, a Feast feature repository has the following structure:

(Diagram from the Feast docs: https://docs.feast.dev/getting-started/concepts/overview)

Here’s what each of the components in a Feast feature repo means:

· Project. A project is a collection of related features and their data sources. Projects are isolated from each other, and you can’t reuse features from one project in another. As of version 0.19.3, projects were supported to ensure backward compatibility with previous versions of Feast. The concept of projects might change as Feast developers simplify the framework.

· Feature view. A feature view is a group of feature data from a specific data source. Feature views allow you to consistently define features and their data sources, enabling the reuse of feature groups across a project. If your features are stored in more than one location, you can specify a feature view for each location and later join all the features together. Feature views make the addition of new features to your existing data very easy as well — as you gather new groups of features, you can create separate feature views for them and then merge them with your old data.

· Data source. In Feast, each feature view has a data source. A data source is where the raw feature data is stored, like a local .parquet file or a GCP (Google Cloud Platform) bucket. You can have as many data sources as you want, but you can’t mix different types of sources together.

· Feature service. A feature service is an object that contains features from one or more feature views. You can use feature services to create logically related groups of feature views.

· Entity. The Feast docs describe entities as a collection of semantically related features. In practical terms, entities can be the individuals or objects that your feature data relates to. As an example, if you have a pneumonia dataset, you can set your patients as your entities and assign unique IDs to them for identification. You can then use the IDs to store and retrieve specific feature values. Feast also uses entities to correctly join data from different feature views.

· Timestamps. Feast uses timestamps to ensure that features from different sources are joined in the correct chronological order. Primarily, this is so that you can avoid using very old data for training or prediction.

· Dataset. A dataset is a group of feature views and entities. Feast datasets allow engineers to combine data from different feature views for analysis and training.

Feast infrastructure

Feast’s concepts are pretty simple, but how do they work in practice? To help you understand this, let’s take a look at this diagram from the Feast docs:

(Diagram from the Feast docs: https://docs.feast.dev/getting-started/architecture-and-components/overview)

This diagram displays the architecture of a Feast project. Before we describe how Feast handles data, let’s first understand the concepts in the diagram:

· Offline store. The offline store is where you store your features. In the terminology of Feast, the offline store contains historical features that you can use for analysis or training. Feast can both retrieve and write data to the offline store. This type of store is “offline” because it is located outside a Feast environment.

· Online store. The online store is where Feast stores features for low-latency access. The online store is designed to accelerate inference.

· Feature repo. A feature repository contains the definitions of a Feast feature store. It defines where features are stored, how they should be retrieved, and what they contain. If your offline store is on your local machine, a feature repository can also contain the raw feature data.

· Registry. A registry is a catalog of feature definitions and their metadata. It defines the infrastructure of your feature repository and where feature data comes from.

· Provider. A provider is the implementation of a specific offline or online feature store. Feast has specialized providers for AWS, GCP, and local environments.

With these components in mind, here’s how you use Feast at a high level:

· First, you define your entities, feature views, feature services, and data sources and register them in your feature store. Feast will register feature and data source definitions in your feature repository’s registry.

· Then, you use your feature views to fetch feature data from your offline stores (data sources). In the language of Feast, this is called historical retrieval. You can join different feature views together to create a dataset that you can then analyze, save, or use for training.

· After you train and deploy a model, you can fetch (materialize) the latest feature values from the offline store for inference. When you materialize features, they are stored in the online store for performance reasons.

· As you add new features to your offline stores, you can continuously materialize them to keep your online store fresh. Additionally, if necessary, you can do historical retrieval of your new and old features, join them all together, and retrain your model on the new dataset.

Prerequisites for Using Feast

For this guide, you’ll need to install the following Python packages on your machine:

· Feast.

· Scikit-learn.

· Pandas.

· Joblib.

You might have some of these packages already. If you don’t, use the pip install [package-name] command to install missing packages.

If you are using conda, use the command conda install -c basnijholt feast to install Feast. For pandas, use conda install pandas. For scikit-learn, use conda install -c intel scikit-learn. And for joblib, use conda install -c anaconda joblib.

Using Feast

Your environment should be ready for Feast! Let’s now see how Feast works in action.

You can find the feature store that we built for this post here. Use the repo as a reference if you want.

Note that the data files in breast_cancer/data will very likely be outdated by the time you read this post. This could be an issue because you might not be able to retrieve online features for inference because of their old age. You will understand exactly why this could be a problem when we do inference.

After cloning the repo and before using the code, make sure to:

· Run the Jupyter notebook feast_data_preparation.ipynb in the repo’s root directory to generate a toy dataset for the feature store. We’ll take a look at the code in this notebook in the next section.

· Move the five generated files — data_df1.parquet, data_df2.parquet, data_df3.parquet, data_df4.parquet, and target_df.parquet — to breast_cancer/data.

Preparing data for the feature repository

Before we start using Feast, let’s create a toy dataset to better understand the data structure that Feast expects.

Ideally, your data should contain the following fields:

· Feature rows with feature values.

· Unique identifiers for the entities that the feature values belong to, like IDs.

· Timestamps to indicate when each feature value row was created or recorded.

To help you better visualize what kind of data Feast expects, let’s use the breast cancer dataset from scikit-learn. We load the dataset as follows:

Next, let’s split our DataFrame into arbitrary sets of features:

Then, let’s generate arbitrary timestamps for each of the feature rows in the dataset:

The function pd.date_range returns a pd.DatetimeIndex object that contains dates and times. We generate as many dates as we have feature rows (periods=len(data_df)) with daily frequency (freq='D'), ending at the current time (end=pd.Timestamp.now()). Then, we convert the index to a DataFrame and place the dates into the column event_timestamp. We pass index=False so that the new DataFrame gets a default integer index instead of using the dates as its index.

Let’s then join the timestamps with our feature DataFrames:

As our next step, let’s create arbitrary IDs for the patients to serve as entity keys. For the purposes of this post, we assume that each feature row is associated with a different patient.

If you now inspect the DataFrames, you should see something like this:

We can see the columns event_timestamp and patient_id at the end of the DataFrame.

Finally, we can write our DataFrames to .parquet files:
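The preparation steps in this section can be sketched end to end as follows. This is a reconstruction under assumptions rather than the exact notebook code: the column slices used to split the features, the helper names build_frames and write_frames, and the output directory are illustrative choices.

```python
import pandas as pd
from sklearn import datasets


def build_frames():
    """Return the four feature DataFrames and the target DataFrame."""
    data = datasets.load_breast_cancer()
    data_df = pd.DataFrame(data=data.data, columns=data.feature_names)
    target_df = pd.DataFrame(data=data.target, columns=["target"])

    # Arbitrary split of the 30 feature columns into four groups (assumed slices).
    names = list(data.feature_names)
    frames = {
        "data_df1": data_df[names[:5]],
        "data_df2": data_df[names[5:10]],
        "data_df3": data_df[names[10:17]],
        "data_df4": data_df[names[17:30]],
        "target_df": target_df,
    }

    # One timestamp per row, daily frequency, ending at the current time.
    timestamps = pd.date_range(
        end=pd.Timestamp.now(), periods=len(data_df), freq="D"
    ).to_frame(name="event_timestamp", index=False)

    # One arbitrary ID per row: every row is treated as a different patient.
    patient_ids = pd.DataFrame(
        data=list(range(len(data_df))), columns=["patient_id"]
    )

    # Append the timestamp and ID columns to every DataFrame.
    return {
        name: pd.concat(objs=[df, timestamps, patient_ids], axis=1)
        for name, df in frames.items()
    }


def write_frames(frames, out_dir="."):
    """Write every DataFrame to <out_dir>/<name>.parquet (needs pyarrow or fastparquet)."""
    for name, df in frames.items():
        df.to_parquet(path=f"{out_dir}/{name}.parquet")
```

Calling write_frames(build_frames()) would produce data_df1.parquet through data_df4.parquet and target_df.parquet in the chosen directory.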

Creating a Feast feature repository

Now that we have toy data, let’s create a Feast feature repository. To help you visualize what we will be doing, here’s what the repo structure will look like at the very end:

We will have a directory feast that will contain our feature repository breast_cancer. breast_cancer will contain our raw data under data and configuration files. Outside breast_cancer, the structure is arbitrary — we are just storing everything in a directory named feast to keep things neat. breast_cancer, in contrast, follows rules defined by Feast.

For now, create a directory feast and run the following in the terminal inside it:

feast init -m breast_cancer

-m indicates that Feast should create an empty repository. breast_cancer in the command indicates the name of the directory that Feast will create for feature storage. You can name the directory anything you want, but its name should clearly indicate which dataset the feature repository is for.

After you run the command, Feast will create a directory breast_cancer with a configuration file named feature_store.yaml. At this point, your working directory should look like this:

After you create the repo, navigate to it and open feature_store.yaml. Its contents will be like this:

This file contains the following parameters:

· project — the name of the current project.

· registry — the path of the registry file where Feast will store your feature definitions.

· provider — the target environment where the features are stored — in our case, a local environment.

· online_store — the environment that Feast will use to store features for low-latency inference.

To make the feature store work, we need to edit the values for registry and online_store as follows:
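Based on the parameters described above and the paths used in this guide, the edited feature_store.yaml would look roughly like this. It is a sketch that assumes the minimal template generated by Feast 0.19.3; the project value is whatever name Feast generated for your repository.

```yaml
project: breast_cancer
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
```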

After you make these changes, Feast will store online data and the feature registry in the directory data of your feature repository, in data/online_store.db and data/registry.db respectively.

Next, make a directory named data in the feature repository and move all the .parquet files we created earlier there. You should have four .parquet files for features and one .parquet file for targets, like so:

Creating a definition script for features

Next, create a Python script file in breast_cancer. Name the script anything you want — in our case, the filename is definitions.py. But make sure that there are no other Python scripts in the directory — otherwise, Feast might be unable to correctly register your features.

After you create the script, your working directory should look something like this:

The script will contain the definition of our Feast feature store. Feast will use the script to register our entities, file sources, and features in a feature repository.

The full contents of the script are as follows:

On lines 2 and 3, we import a number of dependencies — several classes from Feast and google.protobuf.duration_pb2.Duration. You’ll better understand what Duration is for when we use features stored in the repository for inference.

After importing dependencies, we define a Feast entity:

Our entity corresponds to the column patient_id in our dataset and is of the datatype INT64. The parameter description simply stores the description of the entity for reference purposes.

After defining an entity, we define the file source for the first set of features:

Local file sources are handled by the class FileSource. To define a file source, we pass the path of the .parquet file that contains the desired features, and we also define the name of the event timestamp column in the source dataset. If necessary, change the path you pass to FileSource.

Finally, to define a group of features, we instantiate a FeatureView object for it, like so:

In the code block above, we define:

· The name of the feature view (name="df1_feature_view").

· The time that the features in the feature view should be cached for (ttl=Duration(seconds=86400 * 3)). In our case, ttl is set to three days. Feast uses ttl to make sure that only new features are served to the model for inference — you’ll understand how a little bit later.

· The features that we want to include in the feature view (features=[ … ]).

· The source of the features (batch_source=f_source1).

We define file sources and feature views for the remaining sets of features in a similar way.
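A minimal sketch of what definitions.py might contain for the first feature group, assuming the Feast 0.19.3 Python API that this guide targets (later Feast releases changed several of these class and parameter names). The five feature names assume that data_df1.parquet holds the first five columns of the scikit-learn breast cancer dataset, and resolving the data path via __file__ is an illustrative choice, not something Feast requires.

```python
from pathlib import Path
from google.protobuf.duration_pb2 import Duration
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# Resolve data paths against this file so they work from any working directory
# (an illustrative choice; an absolute path string works as well).
DATA_DIR = Path(__file__).resolve().parent / "data"

# The entity: feature rows are keyed by the patient_id column.
patient = Entity(
    name="patient_id",
    value_type=ValueType.INT64,
    description="The ID of the patient",
)

# File source for the first group of features.
f_source1 = FileSource(
    path=str(DATA_DIR / "data_df1.parquet"),
    event_timestamp_column="event_timestamp",
)

# Feature view for the first group (feature names are assumed, see above).
df1_fv = FeatureView(
    name="df1_feature_view",
    ttl=Duration(seconds=86400 * 3),  # three days
    entities=["patient_id"],
    features=[
        Feature(name="mean radius", dtype=ValueType.FLOAT),
        Feature(name="mean texture", dtype=ValueType.FLOAT),
        Feature(name="mean perimeter", dtype=ValueType.FLOAT),
        Feature(name="mean area", dtype=ValueType.FLOAT),
        Feature(name="mean smoothness", dtype=ValueType.FLOAT),
    ],
    batch_source=f_source1,
)
```

The remaining feature groups and the targets would follow the same pattern, each with its own FileSource and FeatureView.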

Registering features

After you define your feature views, navigate to the root directory of your feature repository and run the following command in the terminal:

feast apply

The output of this operation will be as follows:

Created entity patient_id
Created feature view df2_feature_view
Created feature view df1_feature_view
Created feature view df3_feature_view
Created feature view target_feature_view
Created feature view df4_feature_view
Created sqlite table breast_cancer_df2_feature_view
Created sqlite table breast_cancer_df1_feature_view
Created sqlite table breast_cancer_df3_feature_view
Created sqlite table breast_cancer_target_feature_view
Created sqlite table breast_cancer_df4_feature_view

feast apply will register feature definitions and build the infrastructure for feature storage and serving. According to the definitions in feature_store.yaml, Feast will store the repo configuration in the file registry.db under breast_cancer/data. It will also set up the file online_store.db in the same directory for online storage. So your feature repo directory will look like this:

The opposite operation of feast apply is feast teardown — if you for some reason need to delete the existing registry, use this command.

Using the terminal, you can now check the feature views and entities registered in the repository.

feast entities list shows registered entities:

NAME         DESCRIPTION             TYPE
patient_id   The ID of the patient   ValueType.INT64

While feast feature-views list shows registered feature views:

NAME                 ENTITIES          TYPE
df2_feature_view     {'patient_id'}    FeatureView
df1_feature_view     {'patient_id'}    FeatureView
df3_feature_view     {'patient_id'}    FeatureView
target_feature_view  {'patient_id'}    FeatureView
df4_feature_view     {'patient_id'}    FeatureView

You can find out more about the available terminal commands in the Feast CLI reference.

Retrieving features and creating a training dataset

After the feature repository is initialized, you can use your feature views for exploratory analysis, feature engineering, training, and prediction. In this guide, we’re only going to show how you can use the saved features to do training and inference. Let’s start by creating a dataset for training.

Navigate to feast — the parent directory of your Feast feature store — and create a Python script named create_dataset.py. This script will fetch the features from the feature views, join them, and save them as a dataset in our feature store.

After you create the script, your working directory will look like this:

The first thing we do in create_dataset.py is import dependencies:

Then, we get our feature store by using the class FeatureStore from Feast:

We pass the path of the feature repository to the parameter repo_path. A relative path is resolved against the directory from which you run the script.

Next, we load our targets into a pandas DataFrame, creating a so-called entity DataFrame:

Feast uses entity DataFrames to join different feature views together. To ensure point-in-time correctness, Feast matches the entity names and event timestamps in feature views and the entity DataFrame.

As an example, consider the following feature view:

And the following entity DataFrame:

To join these two DataFrames together, Feast will match their timestamps and entity keys, like so:

(Diagram from the Feast docs: https://docs.feast.dev/getting-started/concepts/point-in-time-joins)

The resulting DataFrame would look like this:

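For entity rows whose timestamps have matching data in the feature view, Feast would successfully join the features. In Feast’s point-in-time join, a match is the most recent feature row at or before the entity timestamp that is no older than the feature view’s ttl. For the remaining timestamps, Feast wouldn’t fill in the feature values because there would be no data corresponding to them.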
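To make the join logic concrete, here is a small pandas-only illustration. It is not Feast's implementation: it uses pandas.merge_asof to mimic the behavior described in the Feast docs, where each entity row receives the most recent feature row at or before its timestamp, provided that row is no older than ttl. All values and the three-day ttl are made up for the example.

```python
import pandas as pd

# A toy feature view: one feature, keyed by patient_id and event_timestamp.
features = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2022-03-01", "2022-03-05", "2022-03-02"]),
    "mean_radius": [14.2, 15.0, 11.8],
})

# A toy entity DataFrame: the rows (and labels) we want features for.
entity_df = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2022-03-02", "2022-03-06", "2022-03-10"]),
    "target": [0, 1, 1],
})

ttl = pd.Timedelta(days=3)

# For every entity row, take the latest feature row of the same patient
# that is at or before the entity timestamp and at most ttl older.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="patient_id",
    tolerance=ttl,
    direction="backward",
)
print(joined)
```

In this toy data, the third entity row gets no feature value: the only feature row for patient 2 is eight days older than the entity timestamp, which exceeds the three-day ttl.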

With all that in mind, in our code, Feast will use the entity keys and timestamps in our entity_df to correctly join the targets with their corresponding feature values. In code, this is done as follows:

The method get_historical_features takes our entity_df with the labels, joins it with the features passed to the parameter features, and returns a RetrievalJob object that acts as a handle to the data. We obtain features from existing feature views by using the feature_view_name:feature_name format.

After you get the RetrievalJob object, you can save it for later use, like training or exploratory analysis. If you want to explore the features before saving, just call the method to_df of your RetrievalJob object — this will return a pandas DataFrame that you can inspect.

To save the joined dataset to your feature repo, run the following piece of code:

We use the method create_saved_dataset to save our dataset to a .parquet file and register it under the name breast_cancer_dataset. Using this name, we can later fetch our dataset for training.
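The steps in create_dataset.py could be sketched as below, assuming the Feast 0.19.3 Python API (FeatureStore.get_historical_features, FeatureStore.create_saved_dataset, and SavedDatasetFileStorage). The helper names feature_refs and create_training_dataset, the columns_by_view argument, and the output file name are hypothetical. Feast is imported inside the function so that the small helper stays usable without it.

```python
import pandas as pd


def feature_refs(columns_by_view):
    """Build 'feature_view_name:feature_name' strings for every listed column."""
    return [
        f"{view}:{column}"
        for view, columns in columns_by_view.items()
        for column in columns
    ]


def create_training_dataset(columns_by_view, repo_path="breast_cancer"):
    """Join the requested features with the targets and save the result."""
    # Imported lazily; the calls below assume the Feast 0.19.3 API.
    from feast import FeatureStore
    from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

    store = FeatureStore(repo_path=repo_path)

    # The targets serve as the entity DataFrame: they carry patient_id,
    # event_timestamp, and the label column.
    entity_df = pd.read_parquet(f"{repo_path}/data/target_df.parquet")

    # Returns a RetrievalJob; call .to_df() on it to inspect the joined data.
    job = store.get_historical_features(
        entity_df=entity_df,
        features=feature_refs(columns_by_view),
    )

    # Persist the joined data and register it under a name for later use.
    return store.create_saved_dataset(
        from_=job,
        name="breast_cancer_dataset",
        storage=SavedDatasetFileStorage(
            path=f"{repo_path}/data/breast_cancer_dataset.parquet"
        ),
    )
```

For example, feature_refs({"df1_feature_view": ["mean radius"]}) yields ["df1_feature_view:mean radius"], which is the format get_historical_features expects.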

Using the dataset to train a model

To start training, create a Python script named train.py in the same directory where you created create_dataset.py. train.py will handle training for us.

As usual, we first load dependencies:

Then get our feature store:

After that, we get our training dataset, extract the labels and features from it, and split them into train and test sets — just like you would do without a feature store:

Note that we also drop the patient IDs and event timestamps because we don’t need them for training.

Finally, we train a logistic regression model and save it using joblib:

Here, note the X=X_train[sorted(X_train)] in the call to reg.fit. What we do here is provide the features to our model in alphabetical order. This is necessary because when loading features from feature views, Feast may not preserve the order from the source data.
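A sketch of how train.py might look, with the Feast retrieval and the scikit-learn training split into two functions. The function names, max_iter, random_state, and the model file name are assumptions made for illustration; get_saved_dataset is assumed to behave as in Feast 0.19.3.

```python
import pandas as pd
from joblib import dump
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def load_training_df(repo_path="breast_cancer"):
    """Fetch the saved dataset from the feature store as a pandas DataFrame."""
    from feast import FeatureStore  # imported lazily; Feast 0.19.3 API assumed

    store = FeatureStore(repo_path=repo_path)
    return store.get_saved_dataset(name="breast_cancer_dataset").to_df()


def train(training_df, model_path=None):
    """Train a logistic regression model and return it with its test accuracy."""
    labels = training_df["target"]
    # IDs and timestamps are bookkeeping columns, not model inputs.
    features = training_df.drop(
        labels=["target", "event_timestamp", "patient_id"], axis=1
    )
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, stratify=labels, random_state=0
    )

    reg = LogisticRegression(max_iter=10000)
    # Feed the columns in alphabetical order so that inference can rely on
    # the same, reproducible ordering.
    reg.fit(X=X_train[sorted(X_train)], y=y_train)

    if model_path is not None:
        dump(value=reg, filename=model_path)
    return reg, reg.score(X_test[sorted(X_test)], y_test)
```

A typical call would be train(load_training_df(), model_path="model.joblib"), where model.joblib is a placeholder file name.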

Doing inference on the trained model

Once we have our model, we can use it for inference. But rather than load the inference data from our .parquet files on every request, we can fetch the latest features from them and save them in the online store of our feature repository. This enables prediction with very low latency.

In a local environment, the performance differences between online and offline inference may be very small. But if the source data is stored in a GCP bucket or in AWS cloud storage, the differences might be very noticeable.

There are two commands that you can use to load features to your online store — materialize and materialize-incremental. You can use these commands either in the terminal or from Python code.

materialize loads the latest features between two dates. A terminal call would look like this:

feast materialize 2020-01-01T00:00:00 2022-01-01T00:00:00

In contrast, materialize-incremental loads features up to the provided end date:

feast materialize-incremental 2022-01-01T00:00:00

With feast materialize-incremental, the start time is either now - ttl (using the ttl that we defined in our feature views) or the time of the most recent materialization. If you’ve materialized features at least once, subsequent materializations will only fetch features that weren’t present in the store at the time of the previous materializations.

One thing to note here — if you have several feature rows per entity, Feast will only load the latest values per entity key. As an example, if you have two entries on separate days for the patient ID 11, only the latest entry will get materialized.
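The "latest value per entity key" rule can be illustrated with plain pandas. This is only a model of the outcome, not how Feast implements materialization, and the rows below are made up.

```python
import pandas as pd

# Two rows for patient 11 on separate days, one row for patient 12.
offline_rows = pd.DataFrame({
    "patient_id": [11, 11, 12],
    "event_timestamp": pd.to_datetime(["2022-03-14", "2022-03-15", "2022-03-15"]),
    "mean_radius": [13.9, 14.4, 12.1],
})

# Keep only the most recent row for every entity key.
latest = (
    offline_rows.sort_values("event_timestamp")
    .groupby("patient_id")
    .tail(1)
    .sort_values("patient_id")
    .reset_index(drop=True)
)
print(latest)
```

Only the 2022-03-15 row survives for patient 11, which mirrors what ends up in the online store after materialization.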

Now that we understand how materialization works in Feast, let’s materialize the latest feature values up to the current time. We’ll materialize features using Python code.

Create a script called materialize.py in the directory feast, insert the code from below into it, and run it:
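A sketch of what materialize.py might contain, assuming the materialize_incremental and materialize methods of FeatureStore as they exist in Feast 0.19.3. The function name materialize_latest and the three-day window in the commented alternative are illustrative.

```python
from datetime import datetime, timedelta  # timedelta is used by the commented alternative


def materialize_latest(repo_path="breast_cancer"):
    """Load the latest feature values into the online store up to the current time."""
    # Feast is imported lazily so this module can be imported without it installed.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    store.materialize_incremental(end_date=datetime.now())

    # Alternative with an explicit window (here: the last three days):
    # store.materialize(
    #     start_date=datetime.now() - timedelta(days=3),
    #     end_date=datetime.now(),
    # )
```

To run it as a script from the directory feast, add a call to materialize_latest() at the bottom of the file.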

We use store.materialize_incremental, but we also show the usage of store.materialize in the commented piece of code.

After you run the code, Feast will fetch the latest feature values and store them in breast_cancer/data/online_store.db.

Create another Python script under the directory feast and name it predict.py. The first thing to do in the script is to get our feature store:

Next, we can get the online features that we materialized earlier. We use the method store.get_online_features for this:

We pass the names of the features that we want to retrieve to the parameter features of the method get_online_features. We also pass the entity IDs that we want to get the features for to the parameter entity_rows. entity_rows expects a list of dictionaries with a pair of entity names and entity values. In our case, we requested the features for IDs 567 and 568.

The method get_online_features returns an OnlineResponse object — we can use its method to_dict to convert the feature values to a Python dictionary. We can then convert the dictionary to a pandas DataFrame, like so:

Finally, we can load our logistic regressor and do inference with it, remembering to drop the patient_id column:
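The inference steps in predict.py could be sketched like this, assuming the Feast 0.19.3 get_online_features API and a model saved with joblib. The function names are hypothetical, and Feast is imported inside the fetching function so that the prediction helper can be used on its own.

```python
import pandas as pd


def fetch_online_features(feature_names, patient_ids, repo_path="breast_cancer"):
    """Read materialized feature values for the given patients as a dictionary."""
    from feast import FeatureStore  # imported lazily; Feast 0.19.3 API assumed

    store = FeatureStore(repo_path=repo_path)
    response = store.get_online_features(
        features=feature_names,
        entity_rows=[{"patient_id": pid} for pid in patient_ids],
    )
    return response.to_dict()


def predict_from_features(features_dict, model):
    """Turn the feature dictionary into a DataFrame and run the model on it."""
    features_df = pd.DataFrame.from_dict(data=features_dict)
    # The entity key is not a model input.
    X = features_df.drop(labels=["patient_id"], axis=1)
    # Use the same alphabetical column order as during training.
    return model.predict(X[sorted(X)])
```

With a model loaded through joblib.load (for example from a placeholder file such as model.joblib), predict_from_features(fetch_online_features(refs, [567, 568]), model) would return one prediction per requested patient, where refs is a list of feature_view_name:feature_name strings.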

An important thing to remember with get_online_features is this: if the timestamp of a feature row is older than now - ttl, no feature values will be returned for it. This is done so that no old features are served to the model for inference.

Also, note that the data we used to demonstrate inference comes from our training data. In reality, you obviously wouldn’t do this.

Next Steps

You should now have a firm grasp of the basics of Feast! Things don’t stop here — there are many things that you might want to explore further! In particular, you may want to take a look at how Feast works with cloud storage services like GCP or AWS.

In PART 2 of this series, we are going to have a look at the data validation capabilities of Feast. As of Feast version 0.19.3, support for data validation was limited, but there were still many things worth exploring!

Until next time!
