Managing Machine Learning Lifecycles with MLflow
Written by Tigran Avetisyan
Model development and experimentation are part of any machine learning lifecycle. However, without careful planning, keeping track of experiments can become tedious and challenging, especially given the number of configurations we typically deal with.
MLflow is a machine learning lifecycle framework that allows ML engineers and teams to keep track of their experiments. Not only that, MLflow also facilitates the sharing of code, and even the deployment of models in diverse environments.
In this 3-part series we are going to have a look at how to use MLflow to:
- Track experiments.
- Share code across various environments.
- Package models in a reproducible, platform-agnostic format.
In PART 1 of the series, we are going to focus on the first two steps — tracking experiments and sharing code. More specifically, we will show you how to:
- Track experiments using MLflow’s Python API.
- Package Python code to reproduce it in conda environments.
PART 2 will be dedicated to model packaging, while PART 3 will show how the concepts outlined in the previous parts can be used in a React web application.
For now, let’s try to understand what MLflow is, and what it can do for us!
What is MLflow?
MLflow is an open-source platform for managing machine learning lifecycles. It is designed to help data scientists and ML engineers facilitate the tracking of experiments and the deployment of code to a wide range of environments.
MLflow consists of the following four main components:
- MLflow Tracking — facilitates the recording of experiments, including the tracking of used models, hyperparameters, and artifacts.
- MLflow Projects — allows teams to package data science code in a reproducible format.
- MLflow Models — allows teams to export and deploy machine learning models in various environments.
- MLflow Registry — enables model storage, versioning, staging, and annotation.
Within the scope of this series, we are going to be focusing on the first three components. And as mentioned at the beginning, in PART 1, we will introduce you to the basics of Tracking and Projects.
The utility of MLflow lies in the fact that it is compatible with many mainstream, industry-accepted tools for model training and deployment. Some of MLflow’s integrations are showcased on the framework’s website.
MLflow’s integrations span ML/DL frameworks like TensorFlow and PyTorch, cloud platforms like Amazon SageMaker, Microsoft Azure, and Google Cloud, and containerization platforms like Kubernetes and Docker.
We will be using scikit-learn in our guide to showcase the capabilities of MLflow.
Note: MLflow also states that its framework is used and contributed to by the likes of Facebook, Microsoft, Toyota, Booking.com, and Databricks.
Prerequisites for MLflow
To get started with MLflow, we will need the following:
- The MLflow Python package. We used MLflow version 1.21.0 for this guide.
- The packages of the ML tools that we want to use with MLflow. In our case, we will need scikit-learn.
- Conda, the package manager of the Anaconda distribution. You can install either the standard Anaconda distribution or, if you don’t need the 7,500+ packages that ship with it, the lighter Miniconda. To find out more about setting up conda, consult its installation guide.
You can install MLflow with the pip package manager (pip install mlflow), but we need conda to be able to work with MLflow Projects. Conda is one of the tools Projects relies on for dependency management, which is why you should ideally have it on your machine.
Once you install Anaconda on your machine, launch the Anaconda Prompt and run the following command:
pip install mlflow
We are using the pip package manager inside the Anaconda Prompt because MLflow isn’t available from default conda package channels.
In addition to MLflow, we need scikit-learn. However, you should not need to install scikit-learn separately because it comes preinstalled with Anaconda.
If you want to try another ML library with MLflow, make sure to install it in your conda environment. Depending on the package, you may need to use either pip or conda.
Running and Tracking Experiments with MLflow
After you set up Anaconda or Miniconda on your machine, you can start using MLflow Tracking! Let’s see how it works!
What is MLflow Tracking?
MLflow Tracking is a toolset for running and tracking machine learning experiments. Tracking relies on the concept of runs to organize and store tracking data. Each run records the following information:
- The code version (the Git commit hash, if the run was launched from an MLflow Project).
- The start and end times of the run.
- The source file or project entry point that launched the run.
- Key-value parameters and metrics.
- Output artifacts, such as model files, images, and data files.
Runs can be recorded using MLflow in Python, R, Java, or via MLflow’s REST API.
MLflow also allows runs to be grouped into experiments. If you are going to be performing many tracking runs, you should break them up into experiments to keep everything neat and organized.
MLflow Tracking allows runs and artifacts to be recorded on:
- A local machine. This is the option we’ll be exploring in the guide below.
- A local machine with SQLite.
- A local machine with Tracking Server to listen to REST calls.
- Remote machines with Tracking Server.
You may view tracked data in MLflow’s user interface. We’ll check out the UI once we do a tracking run.
Tracking can be done either manually or automatically. With manual tracking, you log parameters, metrics, and artifacts on your own by calling the associated functions and passing the values of interest to them. Alternatively, MLflow has built-in automatic loggers that record a number of predefined pieces of data for each of the supported packages.
We will have a look at both of these methods below.
Tracking Experiments with MLflow Tracking
Importing Dependencies and Setting up an Environment
To get started with MLflow Tracking, we need to import a number of dependencies:
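```python
# A minimal set of imports for the scikit-learn examples in this guide;
# your imports will differ if you use other ML tools with MLflow.
import mlflow
import mlflow.sklearn

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
```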
By default, Tracking records runs in a local mlruns directory under the path where you run the project. We’re going to be using the default directory to keep things simple. You can check the current tracking directory by calling mlflow.tracking.get_tracking_uri().
If you want to change the tracking directory, use mlflow.set_tracking_uri(), passing a local or remote path that you want to use to store the data. You will need to prefix the path with file:/. Learn more about setting tracking paths in MLflow’s API reference.
For this guide, we won’t be changing the tracking path.
Breaking Runs Down into Experiments
To keep dozens of runs organized and easily scannable, we can group them into experiments. This isn’t necessary, but organizing runs into experiments can keep everything tidy.
By default, all runs are grouped into an experiment named “Default.” To change this, use mlflow.create_experiment(), like so:
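```python
# Create an experiment named "test"; the returned string ID can be
# used to refer to the experiment later on
experiment_id = mlflow.create_experiment("test")
```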
This will create an experiment named “test” and will return its string ID. The ID can be used as a handle to access the experiment later on.
To record runs to this experiment, we need to select it by using mlflow.set_experiment():
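```python
# Record subsequent runs under the "test" experiment
mlflow.set_experiment("test")
```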
If you set an experiment that doesn’t exist, MLflow will create the experiment for you and then set it.
You can also delete an experiment by passing its ID to mlflow.delete_experiment():
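```python
# Delete the experiment by its ID (it is moved to mlruns/.trash)
mlflow.delete_experiment(experiment_id)
```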
Keep in mind that deleted experiments are stored in mlruns/.trash, so you can recover them if necessary.
Also, note that runs performed under an experiment are stored in the directory of that experiment under mlruns. And even though we assign names to experiments when creating them, the experiment directory names reflect their IDs, not the names.
When you launch MLflow, it creates a default experiment with the ID of 0. Runs for this experiment are stored in mlruns/0.
As we create experiments, their IDs increase incrementally. For example, the ID of the experiment “test” we’ve just created and deleted was 1, so its directory was mlruns/1.
Logging experiments manually
As mentioned earlier, MLflow allows you to track experiments either manually or automatically. Let’s start with manual tracking, using scikit-learn’s LogisticRegression.
Set a dedicated experiment for manual tracking, like so:
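```python
# Group the manual-tracking runs under a dedicated experiment
mlflow.set_experiment("manual_logging")
```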
To track data with MLflow Tracking, we can use the following functions:
- mlflow.start_run(): starts a new run and returns an mlflow.ActiveRun object, which can be used as a context manager within a Python with block. If a run is currently active, mlflow.start_run() returns it instead. You don’t need to start runs explicitly: calling any of the logging functions (listed below) starts a run automatically when no run is active.
- mlflow.end_run(): ends the currently active run. If you aren’t using a with block to leverage mlflow.ActiveRun as a context manager, you must call mlflow.end_run() after you are done with logging to terminate the run.
- mlflow.log_param(): logs a single key-value param, both stored as strings. For batch logging, use mlflow.log_params().
- mlflow.log_metric(): logs a single key-value metric, where the value must be a number. For batch logging, use mlflow.log_metrics().
- mlflow.set_tag(): sets a single key-value tag. For batch tagging, use mlflow.set_tags().
- mlflow.log_artifact(): logs a local file or directory as an artifact. Batch logging is done with mlflow.log_artifacts().
You can learn more about MLflow’s tracking functions in the API reference.
Here’s how we can put all this together.
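A sketch of the full script might look like this (the wine dataset and the exact C values are placeholders; any classification dataset and hyperparameter grid will do):

```python
import mlflow
import mlflow.sklearn
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    # Load a toy classification dataset
    X, y = datasets.load_wine(return_X_y=True)

    # Split the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Regularization strengths to try
    C_values = [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]

    # Parent run that groups the individual child runs
    with mlflow.start_run(run_name="PARENT_RUN"):
        for C in C_values:
            # One child run per hyperparameter value
            with mlflow.start_run(run_name="CHILD_RUN", nested=True):
                # Fit a model with the current value of C
                lr = LogisticRegression(C=C, max_iter=100)
                lr.fit(X_train, y_train)

                # Log the hyperparameter and the test score
                mlflow.log_param("C", C)
                mlflow.log_metric("test_score", lr.score(X_test, y_test))

                # Save the fitted model as an artifact under artifacts/model
                mlflow.sklearn.log_model(lr, "model")
```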
In this script, we:
- Check whether the script is executed directly.
- Load a scikit-learn dataset for training.
- Split the dataset into train and test sets.
- Create a list of hyperparameters to try, in our case the regularization parameter C.
- Start a tracking run named PARENT_RUN, scoping it with the Python with statement.
- Iterate over our values of C.
- Start child runs named CHILD_RUN under our parent run. nested=True indicates that the initiated run is a child run. You don’t necessarily need to launch child runs, but they can help keep data organized if you are evaluating a range of hyperparameters.
- Instantiate LogisticRegression with the current value of C and fit the model on the train set.
- Log the current value of C.
- Log the performance of the model on the test set.
- Save the fitted model as an artifact.
Once you run the code, experiment results will be logged to the mlruns path in your project directory. However, you don’t have to check out the results manually in the logging directory: you can leverage MLflow’s UI to view your experiments more conveniently.
To launch MLflow’s user interface, navigate to the directory above mlruns in the terminal and run the following:
mlflow ui
This will launch the tracking UI on your local machine, using port 5000 by default. You can change the port by adding -p <port> or --port <port> to the command.
Once the UI is up, navigate to http://localhost:5000 in your web browser to view it. Here’s what the UI looks like:
On the left-hand side of the UI, you can see the list of tracked experiments. Since our runs were done under manual_logging, we will be able to view the recorded data under it.
In the screenshot above, all the metrics and parameters — namely, C and the test score — logged during the child runs are visible.
You can also view the recorded files locally if you navigate to mlruns/2. As a reminder, the experiment ID is 2 in our case because there is the default experiment (0), and we also created and deleted a test experiment earlier (1).
In our case, the experiment directory contains the nine child runs with their logged data, their parent run, and the meta.yaml file that describes the state of the experiment.
If we navigate to one of the run folders, we’ll find the logged data in the corresponding subdirectories: artifacts, metrics, and params in our case. The tags directory is empty because we didn’t log any tags.
Under artifacts/model, you will find the files of the model trained with the C value corresponding to the current run.
These files can be used later to transfer and deploy models — more about this in PART 2.
Logging experiments automatically
Depending on what you are looking to accomplish with experiment tracking, setting up runs can be quite time-consuming. Fortunately, MLflow tracking has built-in functionality for automatic logging with popular data science and ML libraries — more precisely:
- Scikit-learn.
- TensorFlow and Keras.
- Gluon.
- XGBoost.
- LightGBM.
- Statsmodels.
- Spark.
- Fastai.
- PyTorch.
Note that automatic logging was experimental as of MLflow 1.21.0, so its functionality will likely change as time goes on.
To demonstrate what automatic logging can do in MLflow Tracking, let’s keep using scikit-learn.
The MLflow documentation lists the variables that are tracked and logged during automatic tracking with each of the supported packages. For scikit-learn, the autologger defines distinct sets of data to be tracked for standalone estimators/pipelines and for parameter search estimators like GridSearchCV. With the autologger, there’s no need to explicitly log metrics, parameters, or artifacts: everything is handled by MLflow automatically. But you can additionally log variables manually if you want to.
To demonstrate the autologger, let’s use scikit-learn’s GridSearchCV algorithm on LogisticRegression. Let’s create another experiment:
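```python
# The experiment name is arbitrary; we'll use "auto_logging" here
mlflow.set_experiment("auto_logging")
```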
And run the autologger:
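```python
# A sketch with a placeholder dataset and hyperparameter grid
import mlflow
import mlflow.sklearn
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Enable automatic logging for scikit-learn estimators
mlflow.sklearn.autolog()

# Load a toy dataset
X, y = datasets.load_wine(return_X_y=True)

# Hyperparameter values to search over
params = {"C": [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]}

# Grid search over LogisticRegression
lr = LogisticRegression(max_iter=100)
clf = GridSearchCV(lr, params)

# The autologger records parameters, metrics, and artifacts for us
with mlflow.start_run(run_name="AUTO_LOG_RUN"):
    clf.fit(X, y)

# Stop autologging so it doesn't affect subsequent runs
mlflow.sklearn.autolog(disable=True)
```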
In this code block, we:
- Enable autologging for scikit-learn.
- Load some data for training.
- Set some hyperparameter values to try.
- Instantiate a LogisticRegression object and pass it to our GridSearchCV object.
- Start an MLflow run and fit the grid search algorithm. In this particular example, we don’t need to launch child runs because the autologger will organize the runs for us.
- Disable autologging. If we don’t do this, MLflow will continue automatically logging subsequent runs.
If you want, you could also manually log any other data inside the with block, just like we did previously.
Once training is complete, we will be able to view the logged results in the MLflow UI:
Compared to our manual logger, the scikit-learn autologger captures more data points, including mean fit time and mean score time. With GridSearchCV, it also shows the best achieved cross-validation score and the best C value.
The directory structure for autologger runs is similar to that of manual runs. However, if you navigate to the artifacts directory of the parent run, you will notice that MLflow has also saved:
- The best fitted estimator.
- The fitted parameter search estimator.
- A CSV file with the search results.
- Plots for the training confusion matrix, the precision-recall curve, and the ROC curve.
Note that the plots will not be generated if you don’t have matplotlib installed.
You can log all this stuff manually too, but if you’ve got no time and/or don’t need fine-grained control, the autologger should be enough.
Packaging Projects with MLflow Projects
Now that we have a basic understanding of MLflow Tracking, let’s have a look at Projects!
What are MLflow Projects?
MLflow Projects is a component of MLflow that allows users to turn data science and machine learning code into packages reproducible in various environments.
As of MLflow version 1.21.0, Projects could be used to make packages for the following two environments:
- Conda. If conda is chosen as the target environment, it will be used to handle the dependencies for the packaged code.
- Docker container. MLflow supports Docker for code containerization. If your project contains non-Python code or dependencies, you should use Docker.
We will be focusing on making packages for conda environments in this post. Note that the target environment where the code is intended to be executed should have conda or Docker installed (depending on which method you use).
You can read more about MLflow Projects and its capabilities in the MLflow documentation.
Packaging Python Code for Conda Environments with Projects
In MLflow, a project is a Git repository or a local path containing your files. To create an MLflow Project, we need the following components:
- The Python scripts we want to execute.
- A conda environment YAML file to handle dependencies.
- An MLproject file to control the flow of the application.
Note that MLflow can use any local directory or GitHub repository as a project even without MLproject or conda env files. However, configuring them gives you more fine-grained control over the project’s behavior. Read more about specifying projects in the documentation of Projects.
To package Python projects, we need to:
- Create a directory to store our Python scripts, the configuration files listed above, and files that we need to run the scripts (e.g. datasets).
- Create a conda environment YAML file.
- Create an MLproject file.
- Optional: modify our Python scripts to accept command line arguments.
Let’s go over these steps one-by-one.
Creating a Directory for the Project
To package our project, we need to place our scripts and conda/MLflow configuration files in the same local directory. For the purposes of this post, we’ll name the directory experiment. To give you a better idea, here’s what the directory of a packaged project may look like:
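```
experiment/
├── MLproject
├── conda.yaml
├── auto_log.py
└── manual_log.py
```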
Here, you can clearly see all of the components we listed earlier — our Python scripts (auto_log.py and manual_log.py), the conda environment file (conda.yaml), and the MLproject file (MLproject).
We’ve broken down our Python code from above into two scripts to run the autologger and manual logger separately. This isn’t necessary, but we’ve split the code to better show what MLflow Projects can do with more complex projects.
Creating a Conda Environment YAML file
As our next step, we need to create an environment YAML file to help conda handle the dependencies of our project. We’ll create a basic env file to show you how it’s done — you can learn more about creating conda environment files on the conda website.
In the project directory, create a file named conda.yaml. Our name for the YAML file is arbitrary — you may choose any other name that makes sense to you.
Open the file in a text or code editor and insert the following in YAML syntax:
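```yaml
# A minimal environment sketch; the Python version and packages listed
# here are examples and can be adjusted for your setup.
name: experiment
channels:
  - defaults
dependencies:
  - python=3.8
  - scikit-learn
  - pip
  - pip:
      - mlflow
```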
This particular file defines the following:
- The name of the project.
- The channels to install dependencies from.
- The dependencies to install.
Note the last two lines in the file: they specify that MLflow should be installed using pip. This is because conda’s default channels do not have MLflow, as mentioned earlier.
Creating the MLproject File
Next, we need to create an MLproject file to control the project’s flow.
Create a file named MLproject. The file should NOT have an extension, like .txt or .sh. Next, open the file in an editor and insert the following:
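```yaml
# A sketch of the MLproject file; the project name, parameter type,
# and default value are illustrative and can be adjusted.
name: experiment

conda_env: conda.yaml

entry_points:
  manual_logger:
    parameters:
      max_iter: {type: string, default: 100}
    command: "python manual_log.py {max_iter}"
  auto_logger:
    parameters:
      max_iter: {type: string, default: 100}
    command: "python auto_log.py {max_iter}"
```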
MLproject is again written in YAML syntax. It specifies the following:
- The name of the project.
- The conda_env file to handle dependencies. If your conda environment file is named differently, you should type in its name instead.
- The entry_points for the project: manual_logger and auto_logger. If you want to run a particular script from your project, you can use its corresponding entry point.
- The parameters that we want to pass to our Python scripts. Parameters are specified using the parameter_name: data_type signature; you can also optionally specify a default value for parameters, like we’ve done.
- The command that we want to execute via the corresponding entry point. command specifies the terminal command that will be run upon calling an entry point. In our case, we run a Python script, passing the parameter max_iter to it.
Note that you can also call shell scripts via command. MLflow uses Python to execute .py files and bash to execute .sh files.
Modifying Python Scripts to Accept Command Line Arguments
The .py files in our project are just regular Python scripts — you can transfer your scripts to your project directly with little to no editing. However, you may need to make a few changes to your code to be able to access arguments passed via the terminal.
We’ve adapted our code to accept command line arguments to help you get started. As an example, here’s what auto_log.py looks like:
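```python
# A sketch of auto_log.py; the dataset, hyperparameter grid, and the
# "auto_logging" experiment name are placeholders.
import sys

import mlflow
import mlflow.sklearn
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

if __name__ == "__main__":
    mlflow.set_experiment("auto_logging")
    mlflow.sklearn.autolog()

    X, y = datasets.load_wine(return_X_y=True)
    params = {"C": [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]}

    # sys.argv[0] is the script name; sys.argv[1] is the max_iter value
    # that MLflow passes in via the entry point's command
    lr = LogisticRegression(max_iter=int(sys.argv[1]))
    clf = GridSearchCV(lr, params)

    with mlflow.start_run(run_name="AUTO_LOG_RUN"):
        clf.fit(X, y)

    mlflow.sklearn.autolog(disable=True)
```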
We’ve edited a few lines to:
- Import the sys module.
- Access the provided command line arguments and pass them to our LogisticRegression object. Command line arguments are stored in the list sys.argv; sys.argv[0] contains the name of the script, while the actual arguments are stored in sys.argv[1] and beyond.
The same was done in manual_log.py. Other than that, there are no major changes to the code.
Running Packaged MLflow Projects
Once you’ve done the steps from above, your packaged project is ready to be executed!
To run an MLflow project, navigate to the directory above it and open the terminal. In the terminal, type the following:
mlflow run experiment -e manual_logger --experiment-name manual_logging -P max_iter=1000
Let’s break the command down into parts to clarify what we are doing:
- mlflow run is the command used to run MLflow projects.
- experiment is the local path that contains our MLproject file, the conda environment file, and Python scripts.
- -e manual_logger specifies which of the entry points from MLproject to use to run the project. To run the autologger entry point, you would use -e auto_logger instead.
- --experiment-name manual_logging specifies the experiment name under which the project will be run. The experiment name must match the name specified in the script we want to run.
- -P max_iter=1000 specifies the value to pass to our parameter max_iter. In our case, if we don’t pass any arguments, the default value of 100 will be passed instead.
You can find out more about mlflow run in MLflow docs.
Note that you may also run projects by calling the mlflow.projects.run() function from a Python script. The Python script should again be located in the directory above the project and should contain something like this:
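```python
import mlflow

# Equivalent of the mlflow run command above; the values are examples
mlflow.projects.run(
    "experiment",                      # local path (or Git URI) of the project
    entry_point="manual_logger",
    experiment_name="manual_logging",
    parameters={"max_iter": 1000},
)
```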
There are similar commands in other programming languages supported by MLflow — make sure to check the documentation for more info.
Next Steps
This was just the surface of what MLflow Tracking and Projects can do! To find out more about the framework’s capabilities, read:
- The MLflow documentation. The quickstart guide will quickly introduce you to the framework’s core functionality.
- The command line interface documentation.
- The Python API reference.
In the next post, we’re going to be focusing on MLflow Models. Models uses some of the elements of Tracking and Projects, so once you are done with this post, you’ll be well-equipped to handle model packaging and deployment!