Creating a Feature Store with Feast
Written by Tigran Avetisyan.
This is PART 3 in our 3 PART series about the feature storage framework Feast.
See PART 1 here and PART 2 here.
As a reminder, in PART 1, we explored feature store setup with Feast. In PART 2, we explored Feast’s data validation capabilities.
In PART 3, we are going to show you how you could integrate Feast into a Python API and a React web application. We’ll wrap Feast code in a UI to hopefully make its use simpler and easier.
Let’s get started with our tutorial!
Overview of our Application
Our React application will allow you to do the following:
· Clone local feature store repositories from GitHub.
· Do light exploration on the repositories — more specifically, find out registered entities, feature views, and features.
· Create entity DataFrames and use them to save feature views to a dataset.
· Materialize features between specific date ranges.
The application doesn’t allow you to do an in-depth analysis of features or train ML/DL models. It’s not a complete product, but you could use it as a basis for a more functional tool.
Why do we need to deal with feature stores on GitHub, you might wonder? It’s because one of the ways to share Feast feature stores in production is to host them on GitHub. Data scientists and ML engineers can clone feature store repos to their machines and use them to access feature data. They can fetch features, explore them, make changes to feature definitions, and push feature store updates to GitHub for others to use.
Overview of Our Toy Feature Store Repository
To help you test the application’s functionality, we’ve created a toy feature store and uploaded it to GitHub. You can find it here.
The repository has the structure expected by our React app. At a high level, the feature store repo looks like this:
The most important thing here is that the feature store registry file (registry.db), the online store (online_store.db), and the data sources (driver_stats_1.parquet and driver_stats_2.parquet) are located in a directory named data in the feature store folder, which in our case is driver_stats. The feature store folder name can be anything you want, but the registry, the online store, and the data files should be in a directory named data.
The files in the root directory of the repo (README.md, data_exploration.ipynb, data_preparation.ipynb, and driver_stats_with_string.parquet) are not part of the feature store and are there for your reference.
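If you want to check that a cloned repository follows this layout, a small helper along the following lines could do it. This is only a sketch: the name check_store_layout is hypothetical and the app itself does not perform this check. The file names are the ones from our toy repository.

```python
import tempfile
from pathlib import Path

# File names taken from the toy repository described above.
REQUIRED_DATA_FILES = ("registry.db", "online_store.db")


def check_store_layout(repo_root, store_folder="driver_stats"):
    """Return the expected files that are missing, relative to the repo root.

    Hypothetical helper, not part of the app.
    """
    root = Path(repo_root)
    store_dir = root / store_folder
    expected = [store_dir / "feature_store.yaml"]
    expected += [store_dir / "data" / name for name in REQUIRED_DATA_FILES]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


# Demo on a throwaway directory that only contains the registry file.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "driver_stats" / "data"
    data_dir.mkdir(parents=True)
    (data_dir / "registry.db").touch()
    missing = check_store_layout(tmp)
```

An empty list means that the configuration file, the registry, and the online store are all where the app expects them.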
Our toy feature store is based on the driver stats dataset — the one we used in PART 2. However, it’s not the exact same dataset — we’ve made some changes to it so that we could better showcase how our app works.
You can find the original dataset here. We’ve also included it in the root directory of our toy feature store repo (driver_stats_with_string.parquet).
We’ve split the original dataset into two separate files:
· driver_stats/data/driver_stats_1.parquet: contains timestamps, driver IDs, and the feature column conv_rate.
· driver_stats/data/driver_stats_2.parquet: contains timestamps, driver IDs, and the feature columns acc_rate and avg_daily_trips.
The definition file (definitions.py) for our feature store looks like this:
Aside from the three feature columns, the original dataset also had the columns event_timestamp, driver_id, created, and string_feature. We’ve copied these columns to both of our splits.
Other changes we’ve made to the original dataset include the following:
· We’ve dropped the feature rows for the event timestamp 2021-04-12 07:00:00+00:00. This is because it is the only timestamp for 2021-04-12. After this timestamp, event timestamps jump to 2021-08-31 18:00:00+00:00.
· We’ve dropped the feature rows for the day 2021-08-31 because they don’t start at 00:00:00+00:00; rather, they start at 18:00:00+00:00. Similarly, we dropped the timestamps after 2021-09-15 00:00:00+00:00. This is because the app only allows you to select timestamp ranges by entering the year, month, and day. Hours, minutes, and seconds are not supported.
You can find the code used to modify the original dataset in the Jupyter notebook data_preparation.ipynb in the root directory of the toy repository. In data_exploration.ipynb (featured in PART 2), you can learn how we explored the timestamp ranges in the dataset.
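To give a rough idea of what these changes amount to, here is a minimal pandas sketch. It is not the code from data_preparation.ipynb: the function name split_driver_stats and the tiny sample frame are made up for illustration, and we assume the timestamps are timezone-aware (UTC), as the +00:00 offsets above suggest.

```python
import pandas as pd

# Columns copied to both splits, as described above.
SHARED_COLUMNS = ["event_timestamp", "driver_id", "created", "string_feature"]


def split_driver_stats(df, start="2021-09-01", end="2021-09-15"):
    """Keep rows with timestamps inside [start, end] and split the features in two."""
    start_ts = pd.Timestamp(start, tz="UTC")
    end_ts = pd.Timestamp(end, tz="UTC")
    in_range = df[(df["event_timestamp"] >= start_ts) & (df["event_timestamp"] <= end_ts)]
    part_1 = in_range[SHARED_COLUMNS + ["conv_rate"]]
    part_2 = in_range[SHARED_COLUMNS + ["acc_rate", "avg_daily_trips"]]
    return part_1, part_2


# Made-up sample with two rows that should be dropped and two that should stay.
sample = pd.DataFrame({
    "event_timestamp": pd.to_datetime(
        ["2021-04-12 07:00", "2021-08-31 18:00", "2021-09-01 00:00", "2021-09-15 00:00"],
        utc=True,
    ),
    "driver_id": [1001, 1001, 1001, 1002],
    "created": pd.to_datetime(["2021-09-16"] * 4, utc=True),
    "string_feature": ["a", "b", "c", "d"],
    "conv_rate": [0.1, 0.2, 0.3, 0.4],
    "acc_rate": [0.5, 0.6, 0.7, 0.8],
    "avg_daily_trips": [10, 20, 30, 40],
})
part_1, part_2 = split_driver_stats(sample)
```

Each part could then be written out with DataFrame.to_parquet to produce the two source files.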
After all the changes, here are a few things to keep in mind with the toy dataset:
· The timestamps in the dataset range from 2021-09-01 00:00:00+00:00 to 2021-09-15 00:00:00+00:00.
· The entity column name for the dataset is driver_id. It represents the unique drivers that the data was collected for.
· There are two feature views in the feature store: driver_stats_fv_1 and driver_stats_fv_2. driver_stats_fv_1 contains the feature column conv_rate, while driver_stats_fv_2 contains the feature columns acc_rate and avg_daily_trips.
· The feature view driver_stats_fv_1 corresponds to the source file driver_stats_1.parquet, while driver_stats_fv_2 corresponds to driver_stats_2.parquet.
· There is one entity registered in the feature store: driver_id. It represents the unique drivers that the data was collected for.
· There are five driver IDs in the dataset: 1001, 1002, 1003, 1004, and 1005.
· For each hour in the indicated timestamp range, there are feature values for each driver ID. This means that you have five feature rows (one per driver ID) for every hour.
Prerequisites for the Application
Here are the dependencies that you will need to follow along with this guide:
· Node.js, which you can download from here.
· Docker Desktop to run Docker containers.
· Several Python packages to help us build the API.
Windows users can also install Git to be able to run shell scripts and manage GitHub repositories.
On the Python side of things, you will need the following packages:
· Feast.
· FastAPI.
· Uvicorn.
· Pydantic.
· Pandas.
· GitPython.
If you are using the pip package manager, use the command pip install [package-name] to install the packages.
If you are using conda, use this command to install Feast:
conda install -c basnijholt feast
Use this command to install FastAPI, Uvicorn, Pydantic, and GitPython:
conda install -c conda-forge fastapi uvicorn pydantic gitpython
And to install pandas, use this command:
conda install pandas
Building the API
Now, let’s build the Python API for the React app, using the FastAPI framework!
Importing dependencies
As always, we start by importing dependencies:
Setting up the API
If you’ve read Explainable AI Framework Comparison PART 3, you’ll be familiar with how to set up a FastAPI API. For more technical details, make sure to read that post.
To start setting up our API, we need to instantiate a FastAPI object and add a CORS policy to it:
In origins (lines 5 to 10), http://localhost:3000 is the origin of the React app, while http://localhost:5000 is the origin of the API. We also provide http://127.0.0.1:3000 and http://127.0.0.1:5000 as alternatives to the first two URLs. If you run into issues with CORS, changing the origins might work.
After this, we need to set up data models for some of the arguments that we will be passing to the API. First, we have GitRepo, which we will use to clone feature store repositories from GitHub:
We have two parameters here:
· repo_url: the URL of the feature store repo on GitHub that we want to clone.
· to_path: the directory that we want to save the cloned repo to on our web server. to_path is relative to api/git_repos on the web server.
Next, we have EntityDF, a data model that we will use to create entity DataFrames:
The parameters here are as follows:
· entity_keys: the IDs of the entities that we want to create an entity DataFrame for.
· entity_name: the name of the column under which the entity IDs are located in the source files.
· timestamps: the timestamps for which we want to create an entity DataFrame.
· frequency: the periodicity of the timestamps.
Finally, we have SaveDatasetInfo, used to help us save datasets:
The parameters in this data model are as follows:
· dataset_name: the name under which we want to register the dataset in our feature store. dataset_name also serves as the filename for the dataset.
· feature_view_names: the feature views whose features we want to save in the dataset.
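The three data models could be declared with Pydantic roughly as shown below. The field names come from the descriptions above, but the field types are our assumptions (for example, integer entity keys and a two-element list of date strings for timestamps).

```python
from typing import List

from pydantic import BaseModel


class GitRepo(BaseModel):
    repo_url: str  # URL of the feature store repo on GitHub
    to_path: str   # target directory, relative to api/git_repos


class EntityDF(BaseModel):
    entity_keys: List[int]  # assumed integer IDs such as 1001
    entity_name: str        # entity column name, e.g. driver_id
    timestamps: List[str]   # assumed [start date, end date] as YYYY-MM-DD strings
    frequency: str          # frequency alias understood by pandas.date_range


class SaveDatasetInfo(BaseModel):
    dataset_name: str
    feature_view_names: List[str]


# Example payload for the toy feature store.
example = EntityDF(
    entity_keys=[1001, 1002],
    entity_name="driver_id",
    timestamps=["2021-09-01", "2021-09-15"],
    frequency="H",
)
```

When FastAPI receives a request body for an endpoint that takes one of these models as a parameter, it parses and validates the JSON against the declared fields.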
Setting up API endpoints
The endpoints in our API are as follows:
· /clone_repo
· /get_store
· /get_feature_views
· /get_feature_names
· /get_entities
· /register_entity_df
· /save_dataset
· /materialize
· /materialize_incremental
Let’s take a look at our endpoints one by one!
/clone_repo
/clone_repo is the endpoint that handles the cloning of feature store repositories from GitHub. Its code is as follows:
The function clone_repo accepts the parameter repo_params, which is an instance of our data model GitRepo.
On line 5, the API checks if the requested to_path already exists on the web server. If it doesn’t, the API clones the repo from repo_url to api/git_repos/{to_path} (lines 6 to 9). If the path already exists, the API skips the cloning step.
On line 12, the API saves to_path to an internal variable app.target_path. We will use app.target_path later to construct the target directory for dataset saving.
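The check-then-clone logic can be sketched as below. This is not the endpoint's actual code: clone_if_missing is a hypothetical helper, the cloning function is passed in as an argument so that the sketch runs without Git, and the URL in the demo is a placeholder. The line numbers in the walkthrough refer to the original listing.

```python
import os
import tempfile


def clone_if_missing(repo_url, to_path, clone_fn, base_dir="api/git_repos"):
    """Clone repo_url into base_dir/to_path unless that directory already exists.

    Returns True when a clone was performed and False when it was skipped.
    """
    target = os.path.join(base_dir, to_path)
    if os.path.exists(target):
        return False
    clone_fn(repo_url, target)
    return True


# With GitPython installed, the real cloning function would be passed in:
#     from git import Repo
#     clone_if_missing(repo_url, "repo", Repo.clone_from)

# Demo with a stand-in that only creates the target directory.
calls = []


def fake_clone(url, path):
    calls.append(url)
    os.makedirs(path)


with tempfile.TemporaryDirectory() as base:
    first = clone_if_missing("https://example.com/repo.git", "repo", fake_clone, base_dir=base)
    second = clone_if_missing("https://example.com/repo.git", "repo", fake_clone, base_dir=base)
```

The second call is skipped because the target directory already exists, which mirrors the behavior described above.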
/get_store
The endpoint /get_store loads the feature store from the cloned directory. This endpoint looks like this:
On lines 5 to 7, we instantiate a FeatureStore object, passing to it the path of our feature store relative to api/git_repos. Notice that we use the variable app.target_path from the previous endpoint.
On line 10, we save path to the internal variable app.repo_path for later use.
On line 13, we navigate to the directory where the feature_store.yaml file for our feature store is located on the web server. On lines 17 and 18, we tear down the feature store’s existing infrastructure and rebuild it. This is so that Feast updates the paths to the data source files relative to the working directory on your machine. We use os.system to run feast teardown and feast apply as shell commands.
And on line 21, we navigate back to the original working directory.
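The navigate, run, and navigate-back sequence can be sketched with a small context manager. Note that this is a variation on what is described above rather than the endpoint's code: the helper names are hypothetical, and the try/finally block is our addition so that the working directory is restored even if a command fails. The demo uses a stand-in runner instead of os.system, so it doesn't need Feast to be installed.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring it even on errors."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


def rebuild_store(store_dir, run=os.system):
    """Run feast teardown and feast apply from inside store_dir."""
    with working_directory(store_dir):
        run("feast teardown")
        run("feast apply")


# Demo: record each command together with the directory it was run from.
seen = []
start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as store_dir:
    expected_dir = os.path.realpath(store_dir)
    rebuild_store(store_dir, run=lambda cmd: seen.append((cmd, os.path.realpath(os.getcwd()))))
end_dir = os.getcwd()
```

In the API, store_dir would be the folder that contains feature_store.yaml, and the store itself would be loaded with FeatureStore(repo_path=store_dir).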
/get_feature_views
The endpoint /get_feature_views allows us to retrieve the feature views registered in our feature store.
The code for this endpoint looks like this:
In this endpoint, we:
1. Get the feature views registered in the feature store (line 5).
2. Initialize an empty list for feature view names (line 8).
3. Iterate over the existing feature views (lines 11 to 14), appending their names to the list feature_view_names (line 14).
4. Return the feature view names (line 17).
/get_feature_names
/get_feature_names allows us to get the feature names for a specific feature view. The code for this endpoint is as follows:
This endpoint is pretty simple — it does the following:
1. Creates an empty list to store the feature names (line 5).
2. Iterates over the features registered under feature_view_name and appends the name of each feature to the list (lines 9 and 10).
3. Returns the feature names (line 13).
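The core of /get_feature_views and /get_feature_names can be sketched as two small functions. Here, store is assumed to be a feast.FeatureStore; the sketch relies only on its list_feature_views and get_feature_view methods and on the name and features attributes of feature views. To keep the example runnable without Feast, the demo uses a stand-in object filled with the toy store's feature views.

```python
from types import SimpleNamespace


def get_feature_view_names(store):
    """Return the names of all feature views registered in the store."""
    return [feature_view.name for feature_view in store.list_feature_views()]


def get_feature_names(store, feature_view_name):
    """Return the names of the features registered under one feature view."""
    feature_view = store.get_feature_view(feature_view_name)
    return [feature.name for feature in feature_view.features]


# Stand-in for a FeatureStore, mirroring the toy feature store.
_views = {
    "driver_stats_fv_1": ["conv_rate"],
    "driver_stats_fv_2": ["acc_rate", "avg_daily_trips"],
}


def _make_view(name):
    features = [SimpleNamespace(name=feature) for feature in _views[name]]
    return SimpleNamespace(name=name, features=features)


fake_store = SimpleNamespace(
    list_feature_views=lambda: [_make_view(name) for name in _views],
    get_feature_view=_make_view,
)

view_names = get_feature_view_names(fake_store)
feature_names = get_feature_names(fake_store, "driver_stats_fv_2")
```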
/get_entities
The endpoint /get_entities allows us to get the entities registered in the feature store. More specifically, the endpoint returns the names of the entities and their descriptions.
The code for this endpoint is as follows:
In this endpoint, we:
1. Retrieve the registered entities (line 5).
2. Create empty lists for entity names and descriptions (lines 8 and 9).
3. Iterate over the entities (lines 12 to 16) and append each entity’s name and description to our lists (lines 15 and 16 respectively).
4. Return the entity names and their corresponding descriptions (lines 19 to 22).
This endpoint can handle feature stores with either one or multiple entities.
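A minimal sketch of this logic is shown below. It assumes that store is a feast.FeatureStore and uses its list_entities method; the keys of the returned dictionary and the entity description in the demo are made up for illustration. The line numbers above refer to the original listing, not to this sketch.

```python
from types import SimpleNamespace


def get_entities(store):
    """Collect the names and descriptions of all registered entities."""
    entity_names = []
    entity_descriptions = []
    for entity in store.list_entities():
        entity_names.append(entity.name)
        entity_descriptions.append(entity.description)
    # The response keys are assumptions, not the API's confirmed schema.
    return {"entity_names": entity_names, "entity_descriptions": entity_descriptions}


# Stand-in for a FeatureStore with a single entity and a made-up description.
fake_store = SimpleNamespace(
    list_entities=lambda: [SimpleNamespace(name="driver_id", description="ID of the driver")]
)
entities = get_entities(fake_store)
```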
/register_entity_df
We use the endpoint /register_entity_df to create and save entity DataFrames in our API. Here’s the code for this endpoint:
In this endpoint, we:
1. Generate timestamps between the provided dates with the required frequency and save them into a DataFrame under the column event_timestamp (lines 6 to 13).
2. Use the provided entity keys and the entity column name to create a DataFrame with entity IDs (lines 16 to 19).
3. Merge the timestamps with the entity IDs, using the merge method cross, which pairs every timestamp with every entity ID (lines 22 to 25).
4. Save the entity DataFrame in the API (line 28).
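The first three steps can be sketched with pandas as follows. The function name build_entity_df is hypothetical, and we assume that timestamps holds a start date and an end date. The demo uses a daily frequency and two driver IDs to keep the output small.

```python
import pandas as pd


def build_entity_df(entity_keys, entity_name, timestamps, frequency):
    """Pair every generated timestamp with every entity key."""
    start, end = timestamps  # assumed to be [start date, end date]
    timestamp_df = pd.DataFrame(
        {"event_timestamp": pd.date_range(start=start, end=end, freq=frequency)}
    )
    entity_ids = pd.DataFrame({entity_name: entity_keys})
    # how="cross" builds the Cartesian product of the two frames.
    return timestamp_df.merge(entity_ids, how="cross")


# Three daily timestamps times two drivers gives six rows.
entity_df = build_entity_df([1001, 1002], "driver_id", ["2021-09-01", "2021-09-03"], "D")
```

The cross merge needs pandas 1.2 or newer. Also, as far as we know, recent pandas releases deprecate the uppercase hourly alias H in favor of the lowercase h, so the accepted frequency values may depend on the pandas version installed.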
/save_dataset
With the endpoint /save_dataset, we can use the created entity DataFrame to save one or more feature views to a dataset. The code for this endpoint is as follows:
In this endpoint, we:
1. Create an empty list that will store the feature names for the feature views that we want to get (line 5).
2. Iterate over the requested feature_view_names (line 10) and their features (line 11), generate feature names in the format feature_view_name:feature_name, and append them to our list (line 12).
3. Get historical values for the features, using our entity DataFrame (lines 15 to 17).
4. Build the target path for the dataset (lines 19 to 27). Notice that we are using the path values saved earlier to construct the target path.
5. Save the dataset to a .parquet file, using the provided dataset_name and our target path (lines 30 to 34).
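Steps 2 and 4 are plain string manipulation and can be sketched without Feast. Both helper names below are hypothetical, store is assumed to be a feast.FeatureStore (replaced by a stand-in in the demo), and the path layout follows the default location api/git_repos/repo/driver_stats/data/dataset.parquet mentioned later in this post.

```python
import posixpath
from types import SimpleNamespace


def build_feature_refs(store, feature_view_names):
    """Build references in the format feature_view_name:feature_name."""
    refs = []
    for view_name in feature_view_names:
        for feature in store.get_feature_view(view_name).features:
            refs.append(f"{view_name}:{feature.name}")
    return refs


def build_dataset_path(target_path, repo_path, dataset_name, base_dir="api/git_repos"):
    """Combine the saved path values into the dataset's target path."""
    return posixpath.join(base_dir, target_path, repo_path, "data", f"{dataset_name}.parquet")


# Stand-in for a FeatureStore, mirroring the toy feature store.
_views = {
    "driver_stats_fv_1": ["conv_rate"],
    "driver_stats_fv_2": ["acc_rate", "avg_daily_trips"],
}
fake_store = SimpleNamespace(
    get_feature_view=lambda name: SimpleNamespace(
        features=[SimpleNamespace(name=feature) for feature in _views[name]]
    )
)

refs = build_feature_refs(fake_store, ["driver_stats_fv_1", "driver_stats_fv_2"])
path = build_dataset_path("repo", "driver_stats", "dataset")
```

With a real store, the references would then be passed to store.get_historical_features together with the entity DataFrame (step 3).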
/materialize and /materialize_incremental
Finally, we have two endpoints for materialization: /materialize for standard materialization and /materialize_incremental for incremental materialization.
The code for /materialize looks like this:
In this endpoint, we take string dates, convert them to datetime, and use them to specify the start and end dates for materialization.
As for /materialize_incremental, its code looks like this:
This endpoint works similarly to /materialize, but it doesn’t require a start date because incremental materialization calculates it automatically. As a reminder, when you materialize incrementally, the start date is the difference between end_date and ttl (defined in the feature store definition script) if end_date is provided. Otherwise, the start date is now - ttl.
Note that the API requires end_date, even though it is an optional argument for the method materialize_incremental.
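Both endpoints boil down to parsing date strings and calling the corresponding FeatureStore method. The sketch below assumes that store is a feast.FeatureStore and that dates arrive as YYYY-MM-DD strings; the wrapper functions are hypothetical, and the demo uses a stand-in object that only records the arguments it receives.

```python
from datetime import datetime
from types import SimpleNamespace

DATE_FORMAT = "%Y-%m-%d"


def materialize(store, start_date, end_date):
    """Standard materialization between two YYYY-MM-DD dates."""
    store.materialize(
        start_date=datetime.strptime(start_date, DATE_FORMAT),
        end_date=datetime.strptime(end_date, DATE_FORMAT),
    )


def materialize_incremental(store, end_date):
    """Incremental materialization up to a YYYY-MM-DD end date."""
    store.materialize_incremental(end_date=datetime.strptime(end_date, DATE_FORMAT))


# Stand-in for a FeatureStore that records what it was called with.
recorded = {}
fake_store = SimpleNamespace(
    materialize=lambda start_date, end_date: recorded.update(full=(start_date, end_date)),
    materialize_incremental=lambda end_date: recorded.update(incremental=end_date),
)

materialize(fake_store, "2021-09-01", "2021-09-15")
materialize_incremental(fake_store, "2021-09-15")
```

A date string in any other format makes datetime.strptime raise a ValueError, which is one reason why the app only supports YYYY-MM-DD.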
Launching the API
Finally, to launch the API, conclude the API script with the following piece of code:
As a reminder, you can visit http://localhost:5000/docs to view the dynamic documentation of your API.
Unless necessary, do not enable reload. With reload=True, the API will reload after every change to its code or working directory. In practice, this means that the API might reload every time you clone a repository from GitHub. Reloading wipes all variables from the API, meaning that after cloning a repository, its target path wouldn’t remain in the API. This would make saving the dataset impossible because /save_dataset requires the cloned repo’s target_path to construct the target path of the saved dataset.
Using Our API in a React Application
We’ve integrated our API into a React application to give you an idea of how you can build a UI around Feast. You can find the code for the app in this GitHub repository. The app has been dockerized so that you can launch it across a wide range of environments.
In this section, let’s take a look at what our app can do and how you can use it!
Cloning the app repository
To get started, clone the repository of our React web app to your local machine. Run the following command in the terminal (Git Bash if you are on Windows) in the desired target directory:
git clone https://github.com/tavetisyan95/feast_web_app.git
You do not need to clone the toy feature store repository to use the app.
Editing endpoint URLs and ports (if necessary)
The default target URLs and ports for HTTP requests are defined in app/feast_web_app/src/config.js:
Unless necessary, do not edit these values.
Keep in mind that editing the endpoints and ports in config.js doesn’t actually change the endpoints or ports that the API or the React app work on. config.js is only used to help the app send HTTP requests to the proper endpoints. To change the actual ports and endpoints of the API, you will need to make changes to its Python code.
Starting the application
To start the application’s Docker container, launch Docker Desktop. Then, navigate to the root directory of the application, launch the terminal, and run the following command:
docker-compose -f docker-compose.yaml up -d --build
It may take some time for the app to spin up. Once you see terminal messages that the containers are up, navigate to http://localhost:3000 in your web browser to open the application’s webpage.
Alternatively, you can run the start.sh shell script to start the web app without Docker. Run the command bash start.sh in the terminal to launch the shell script. If you are on Windows, you can use Git Bash to run shell scripts.
Using the application
Now, let’s take a look at how you can use the application step-by-step!
Cloning a feature store repository
The first thing to do once the app is up is to clone a feature store repository from GitHub. This is done in the section GITHUB REPO CLONING.
The URL of the GitHub repo that you want to clone goes into the box GitHub repo URL. The default URL points to our toy feature store repository, so you don’t need to change it unless you want to try another repo.
target_path determines the directory that the repository will be cloned to. target_path is relative to api/git_repos on the web server. By default, repositories are cloned to api/git_repos/repo.
Note that you should not add any slashes or backslashes to the target path; this is handled by the API.
After you have entered the desired values, press the button Get Repo. If everything is alright, you will see the success message Repo cloned! under the button.
After you press the button, the requested repository will be cloned to the provided target directory. The target directory path will then be saved to construct the target path that your dataset should be saved to later on.
NOTE: when running the app via Docker, you might not see the repository appearing in api because all the operations will be happening in the container. But if you run the app from the shell script instead, you should see the repository in the directory api.
Loading the feature store from the cloned repo
After you’ve cloned the repository, you need to load it in your Python API in the section FEATURE STORE RETRIEVAL.
To do this, plug the location of the feature_store.yaml into the box Store path and press Get Store.
Keep in mind that the store path should be relative to the root directory of the cloned repository. To help you figure out Store path, consider the structure of our toy repository:
Notice that the configuration file feature_store.yaml is located in driver_stats/ relative to the root path of the repository. With that in mind, driver_stats is what you need to enter under Store path. Again, note that no slashes or backslashes are necessary.
By default, the value under Store path points to the correct location of feature_store.yaml in our toy repository. So there’s no need to change it unless you are using a different repo.
Like with Target path for the cloned feature repository, the API saves the value of Store path for later use.
Exploring the feature store
Our application has light functionality for Feast feature store exploration. In the section FEATURE STORE EXPLORATION, you can get the names and descriptions of the entities registered in the feature store, as well as the registered feature views and their features.
Press Get Entity List to see the registered entities and their descriptions in the box Entity list. Press Get Feature View Names to see feature view names in the box Feature Views.
Note that after you retrieve the feature view names, you will be able to see the features registered under them. To do this, select the feature view of interest in the box Feature Views.
Registering an entity DataFrame
In the section ENTITY DATAFRAME CREATION, you can create an entity DataFrame and save it on the web server as a Python variable.
You can set the following parameters for entity DataFrame creation:
· Timestamp range. This field accepts two values, Start date and End date. The API will generate timestamps between these two dates, using the function pandas.date_range. The expected format for dates is YYYY-MM-DD (or %Y-%m-%d in Python). The default values correspond to the oldest and newest timestamps in the toy feature repo.
· Frequency. This parameter defines the gap between each generated timestamp. The default value is H for hourly frequency. You can enter any other value that is supported by pandas.date_range, but keep in mind that some values might not work for your data!
· Entity keys. These are the entity IDs that you want to get feature values for. You can enter more than one key. Just make sure to separate your keys with a comma, e.g. 1001, 1002, 1003, 1005.
· Entity column name. This is the name of the column under which the entity IDs are stored in your dataset. The default value is driver_id.
After you are done playing around with the values, press the button Create Entity DF. It might take some time for the API to create and save the DataFrame.
Saving a dataset
After you create an entity DataFrame, you can use it to save your features to a dataset! Scroll down to the section DATASET SAVING to do this.
Here, you can type a name for the dataset (Dataset name), as well as select the feature views whose feature values should be saved to the dataset (Available feature views).
As a reminder, Dataset name is used as:
· The name for the dataset in the feature repository.
· The target path of the saved dataset.
When you press Save Dataset, the API retrieves historical features for the selected feature views, using the entity DataFrame you created earlier. The API also combines Dataset name with Target path and Store path from previous sections in the app to construct the target path for the dataset file.
By default, the app will save the dataset to api/git_repos/repo/driver_stats/data/dataset.parquet.
As a note, keep in mind that Available feature views gets populated with feature view names after you press the button Get Feature View Names in the section FEATURE STORE EXPLORATION. This is important to know because if you retrieve feature view names and then refresh the webpage, all the values retrieved from your API calls will vanish from it. You will need to get the feature views again to be able to save them to a dataset.
Materializing features
And as the last step, you can materialize features in the section FEATURE MATERIALIZATION at the very bottom of the webpage.
To do standard, non-incremental materialization, use the box Materialization interval (standard materialization). You can select start and end dates for materialization in this box. The format for the dates is again YYYY-MM-DD.
To do incremental materialization, use the box Materialization interval (incremental materialization). Here, you can select the end date for materialization. Again, the expected date format is YYYY-MM-DD.
Note that depending on the entered end date and the ttl of your feature views, you may not get any feature values after incremental materialization!
Limitations of the Application
In its current implementation, the app has some notable limitations, including:
· After you clone a repository, you cannot pull changes made after cloning. The only way to pull changes is to delete the repo on the web server and clone it again.
· The API expects timestamps in the format %Y-%m-%d (YYYY-MM-DD), e.g. 2021-09-01. Other formats are not supported.
· Invalid inputs for parameters aren’t handled at all. No error messages are shown in the web browser. The only way to know that something has gone wrong is through terminal logs. There is one exception: if you don’t select any feature views before saving a dataset, you will see Please select at least one feature view under the button Save Dataset.
Also, while this should be obvious, our app isn’t ready for production use. It’s likely inefficient and might have security vulnerabilities, so definitely don’t use it as-is in your ML workflows!
Next Steps
Our application isn’t an end-to-end tool, but it could help you set the foundations for a full-fledged feature store management solution! Feast isn’t difficult to use in code, but a UI could help you reduce the number of repeating steps in your workflow.
As a next step, you could try to integrate a simple training and prediction UI in the app. Another thing you could try is adding functionality for actually creating feature stores.
No matter what you decide to do, feel free to use our app and API as a launchpad for your own projects!
Until next time!