AutoML Framework Comparison

Part 2: AutoML With Genetic Programming In TPOT

Kedion
16 min read · Dec 1, 2021

Written by Tigran Avetisyan

In PART 1 of our “AutoML Framework Comparison” series, we took a look at the AutoML capabilities of the machine learning and data analytics platform H2O.

This post is going to be centered around one of H2O’s competitors — TPOT. More specifically, in this post, we will:

· Check out TPOT’s AutoML capabilities.

· Draw parallels between TPOT and H2O in terms of implementation, ease of use, performance, and flexibility.

We’ve found a lot of big differences between H2O and TPOT, so this will definitely be an interesting read! Let’s get going!

What Is TPOT?

TPOT is an automated machine learning package built on top of scikit-learn. The package relies on genetic programming to help users identify the best pipeline for the task at hand.

Unlike many other AutoML frameworks (like H2O), TPOT can find not only the best models but also the best pipelines composed of feature selectors, preprocessors, and ML/DL models.

Graphically, here are the areas that TPOT is intended to automate:

Redrawn from original at http://epistasislab.github.io/tpot/

On a more technical level, TPOT pipelines look something like this:

Redrawn from original at http://epistasislab.github.io/tpot/

So essentially, given clean, high-quality data, TPOT can automatically perform:

· Feature selection, preprocessing, and construction.

· ML model selection.

· Hyperparameter optimization for candidate models.

Implementation of AutoML in TPOT

At its core, TPOT performs automated hyperparameter optimization — i.e. it traverses a range of hyperparameters and models to pick the best combination. However, unlike H2O, TPOT implements some techniques from genetic programming — more specifically:

· Mutation, where random changes are applied to candidate pipelines.

· Crossover, where candidate pipelines randomly swap their parts.

Thanks to these techniques, a certain degree of randomness is introduced to the pipeline optimization process. This randomness can be useful when, for example, a model gets stuck at some performance level and can’t improve through hyperparameter tweaking alone.

Since TPOT is built on scikit-learn, it provides access to most algorithms implemented in that package. Aside from scikit-learn, TPOT also implements classifier neural networks in PyTorch and uses GPU-accelerated estimators in RAPIDS cuML and DMLC XGBoost.

The algorithms tested by TPOT AutoML are broken down into configurations that the TPOT team believes to work well for ML pipeline optimization. The built-in configurations include the default configuration, "TPOT light", "TPOT MDR", "TPOT sparse", "TPOT NN", and "TPOT cuML".

If you check out the source code of these configurations, you will see that they are essentially dictionaries containing candidate algorithms along with their hyperparameter ranges. As an example, here’s a snippet from the default config dictionary:
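To give a feel for the structure, here is a shortened, illustrative sketch in the spirit of the default classifier configuration (the exact operators and value ranges differ between TPOT versions, so treat this as an approximation rather than the real dictionary):

# A shortened, illustrative sketch of a TPOT config dictionary
classifier_config_snippet = {
    # Candidate classifiers and the hyperparameter values TPOT may try
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": range(1, 11),
        "min_samples_split": range(2, 21),
    },

    # Candidate preprocessors
    "sklearn.preprocessing.MinMaxScaler": {},

    # Candidate feature selectors
    "sklearn.feature_selection.VarianceThreshold": {
        "threshold": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2],
    },
}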

Each configuration contains the following classes of algorithms:

· Classifiers (DecisionTreeClassifier, KNeighborsClassifier, LinearSVC, and others) or regressors (ElasticNetCV, AdaBoostRegressor, LinearSVR, and others).

· Preprocessors (like Normalizer, MaxAbsScaler, MinMaxScaler, or PCA).

· Feature selectors (like SelectFwe, SelectPercentile, or VarianceThreshold).

Collectively, these are called “operators” in TPOT.

The built-in configurations should be fine for many use cases, but if necessary, you may define custom configurations with custom hyperparameter ranges.

Prerequisites for Using TPOT

Unlike the H2O framework that we explored in the previous post, TPOT doesn’t require you to configure a non-Python environment. You set up and use TPOT pretty much like any other standard Python package — you only need to install dependencies and TPOT itself.

To be able to fully utilize TPOT’s capabilities, you need NumPy, SciPy, scikit-learn, pandas, joblib, and PyTorch. You can install these via pip:

pip install numpy scipy scikit-learn pandas joblib torch

Or in Anaconda:

conda install numpy scipy scikit-learn pandas joblib pytorch

You also need DEAP, update_checker, tqdm, stopit, and xgboost:

pip install deap update_checker tqdm stopit xgboost

Once you install the dependencies, install TPOT with pip install tpot or conda install -c conda-forge tpot (if using Anaconda).

Using TPOT for AutoML

Now that we know what to expect from TPOT, let’s check out what its AutoML algorithm can do for us!

Building the Pipeline Optimizer

To get started with TPOT AutoML, we need to import a number of tools and packages:
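A minimal set of imports for this walkthrough could look like the following (the original snippet is not shown here, so treat this as a sketch):

from tpot import TPOTClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split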

Right after this, we can initialize our pipeline optimizer. TPOT has two classes for AutoML:

· TPOTClassifier.

· TPOTRegressor.

We will be using TPOTClassifier for this guide. Most of the steps you’ll see will carry over to TPOTRegressor.

Here’s how we initialize TPOTClassifier:
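A sketch of the initialization, matching the arguments discussed below:

pipeline_optimizer = TPOTClassifier(
    config_dict="TPOT light",  # use the reduced "light" operator set
    generations=5,             # run 5 optimization generations instead of the default 100
    verbosity=2,               # show per-generation scores and a progress bar
)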

Notice the arguments we passed to TPOTClassifier:

· config_dict="TPOT light": instructs the algorithm to use only a limited set of operators, speeding up training.

· generations=5: sets the number of iterations to 5 (default is 100). More generations will likely yield better results, but we’re using a small number to reduce training time.

· verbosity=2: allows us to see a simple training progress bar.

Optimization in AutoML can take a very long time — from hours to days, depending on the selected TPOT configuration and the dataset. TPOTClassifier and TPOTRegressor have two parameters that allow you to limit optimization time:

· generations.

· max_time_mins: sets the optimization time limit in minutes. By default, max_time_mins is not set. If you set generations to None, you must set max_time_mins.

If both of these parameters are set, optimization will stop whenever one of the conditions is met.

Some other important parameters to know about include:

· population_size: the number of individuals (pipelines) to retain for optimization each generation.

· offspring_size: the number of offspring pipelines to produce in each iteration.

· mutation_rate: determines how many pipelines will be randomly changed in each generation. mutation_rate has a range [0.0, 1.0] and a default value of 0.9.

· crossover_rate: determines how many pipelines should interchange their parts (“breed”) each generation. crossover_rate has a range [0.0, 1.0] and a default value of 0.1.

· scoring: the string name of the function to be used to assess the quality of each pipeline. Default is “accuracy” for classification and “neg_mean_squared_error” for regression. You may also pass a custom callable object with the signature scorer(estimator, X, y).

· cv: specifies the cross-validation strategy used when evaluating pipelines. By default, cv is set to perform 5-fold cross-validation.

· config_dict: a dictionary or string that specifies the set of operators and their hyperparameters for optimization. By default, no configurations are specified, and TPOT uses its default configuration. You may pass the string name of a built-in TPOT configuration you want to use or, alternatively, a custom configuration dictionary.

· template: a predefined pipeline structure that the AutoML algorithm should use.

· use_dask: indicates whether the optimizer should use Dask for distributed computing.

· log_file: the path to which the optimizer should save progress logs.

For a full description of these and other parameters, check out TPOT’s API reference.

One thing to note here — the number of pipelines evaluated depends on generations, population_size, and offspring_size and equals:

population_size + generations x offspring_size
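For example, with generations=5 and TPOT's defaults of population_size=100 and offspring_size equal to population_size, the optimizer would evaluate 100 + 5 × 100 = 600 pipelines.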

Preparing Data for Training

We will be using the breast cancer dataset from scikit-learn to demonstrate TPOT AutoML in action. This dataset is ready for use out of the box and will allow us to jump straight into AutoML after a few preparation steps.

Let’s load the data and split it into train and test sets:
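A sketch of the preparation steps (the test size and random seed below are our own choices, not values from the original run):

# Load the breast cancer dataset as NumPy arrays
X, y = load_breast_cancer(return_X_y=True)

# Hold out part of the data for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)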

This is pretty much all we need to do to start AutoML with TPOT!

Training and Evaluating Our AutoML Algorithm

To start pipeline optimization, we just need to call the fit method of pipeline_optimizer, passing X_train and y_train as features and labels.
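pipeline_optimizer.fit(X_train, y_train)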

With our verbosity settings, we will see the best internal CV score achieved during each generation. In the end, we will also see the best pipeline:

Generation 1 - Current best internal CV score: 0.9648351648351647
Generation 2 - Current best internal CV score: 0.9824175824175825
Generation 3 - Current best internal CV score: 0.9824175824175825
Generation 4 - Current best internal CV score: 0.9824175824175825
Generation 5 - Current best internal CV score: 0.9846153846153847

Best pipeline: LogisticRegression(RobustScaler(CombineDFs(input_matrix, input_matrix)), C=0.1, dual=False, penalty=l2)

TPOTClassifier(config_dict='TPOT light', generations=5, verbosity=2)

We can assess the best pipeline’s performance on the test set by calling pipeline_optimizer.score:
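print(pipeline_optimizer.score(X_test, y_test))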

0.9736842105263158

You may also access the best pipeline via the optimizer’s fitted_pipeline_ attribute:
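print(pipeline_optimizer.fitted_pipeline_)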

Pipeline(steps=[('featureunion',
                 FeatureUnion(transformer_list=[('functiontransformer-1',
                                                 FunctionTransformer(func=<function copy at 0x000001897760FCA0>)),
                                                ('functiontransformer-2',
                                                 FunctionTransformer(func=<function copy at 0x000001897760FCA0>))])),
                ('robustscaler', RobustScaler()),
                ('logisticregression', LogisticRegression(C=0.1))])

As you can see, pipeline_optimizer.fitted_pipeline_ contains the names of the selected operators along with their hyperparameters.

If you want to check out all the trained pipelines, use pipeline_optimizer.evaluated_individuals_. This attribute contains a dictionary of all evaluated pipelines along with the tested parameters.

Exporting Best Pipeline Code

If the training results are adequate for you, you can export the pipeline as Python code via this statement:
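The export call looks like this (the output file name is arbitrary):

pipeline_optimizer.export("best_pipeline.py")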

Once you run this statement, you will have a Python script that looks something like this:
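Here is a sketch of what the exported script typically contains. TPOT's exported scripts follow a fixed template with placeholder data-loading code, while the pipeline itself corresponds to the best pipeline found during optimization (in our case, the feature union, RobustScaler, and LogisticRegression from above):

import numpy as np
import pandas as pd
from copy import copy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, RobustScaler

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.9846153846153847
exported_pipeline = make_pipeline(
    make_union(FunctionTransformer(copy), FunctionTransformer(copy)),
    RobustScaler(),
    LogisticRegression(C=0.1, dual=False, penalty="l2"),
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)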

This is the actual code that we can build the best pipeline with! So by exporting the best fitted pipeline, you get a template for rebuilding and retraining the pipeline in a production environment!

Automatic code generation is an excellent feature since it eliminates guesswork and lets you deploy pipelines right after training!

Doing Inference with the TPOT Pipeline

Prediction with TPOT pipelines is done in the same way as in scikit-learn — you just use the predict method of your pipeline or model:
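For the first ten test samples, for instance:

print(pipeline_optimizer.predict(X_test[:10]))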

[1 0 1 0 0 1 0 1 1 0]

Note that with classification, predict will give you the predicted classes of your samples. If you want to obtain probabilities, use predict_proba instead, but keep in mind that some pipelines may not have this method.

Creating Custom TPOT Configurations

If you want to test a specific set of operators, you can set up a custom configuration dictionary and then pass it to TPOTClassifier or TPOTRegressor. Here’s how custom TPOT config dictionaries are expected to be structured:
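Here is a small illustrative config; the operators and hyperparameter values are our own picks, not TPOT defaults:

custom_config = {
    # Candidate classifier and the hyperparameter values to evaluate
    "sklearn.linear_model.LogisticRegression": {
        "C": [0.01, 0.1, 1.0, 10.0],
        "penalty": ["l2"],
    },

    # Candidate preprocessor (no hyperparameters to tune)
    "sklearn.preprocessing.RobustScaler": {},

    # Candidate feature selector
    "sklearn.feature_selection.SelectPercentile": {
        "percentile": [25, 50, 75, 100],
    },
}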

As you can see, custom configs are nested dictionaries where:

  • The first-level key indicates the path and name of the operator.
  • Second-level keys list the desired hyperparameters along with their values to be evaluated.

Note that hyperparameter values that you want to evaluate should be inside Python lists.

You can also evaluate PyTorch neural networks with TPOT. As of TPOT version 0.11.7, only PyTorch classifiers are implemented, so keep that in mind.

Once you set up your custom configuration, pass it to the config_dict parameter of TPOTClassifier or TPOTRegressor:
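pipeline_optimizer = TPOTClassifier(
    config_dict=custom_config,  # evaluate only the operators defined above
    generations=5,
    verbosity=2,
)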

And then train the optimizer just like before:
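pipeline_optimizer.fit(X_train, y_train)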

Setting Custom Pipeline Templates

As mentioned earlier, TPOT evaluates three types of operators while training:

  • Feature selectors.
  • Feature preprocessors.
  • Classifiers or regressors.

By default, TPOT decides on its own which of these components to include in evaluated pipelines (though classifiers or regressors are always present). However, you can instruct TPOT to build pipelines of a specific structure via templates.

Custom templates are created as follows:
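# One feature selector, one preprocessor, and one classifier, in that order
template = "Selector-Transformer-Classifier"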

Note that “Transformer” is the keyword for preprocessor operators in TPOT.

This is just one example of a custom template — you may try to add, remove, or change the order of the steps. However, remember that the template MUST end with “Classifier” or “Regressor”, based on the model type you want.

Instead of a generic key (like “Transformer” or “Selector”), you can also plug in the name of the specific operator you want to evaluate:
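# Use SelectFwe specifically as the feature selection step
template = "SelectFwe-Transformer-Classifier"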

Here, we specified “SelectFwe” instead of the generic “Selector.”

Once you’ve defined your template, you pass it to the template parameter of TPOTClassifier or TPOTRegressor:
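pipeline_optimizer = TPOTClassifier(
    template=template,  # force the pipeline structure defined above
    generations=5,
    verbosity=2,
)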

AutoML with TPOT vs H2O — How Do the Frameworks Compare?

Now that we know the basics of TPOT, we can finally stack it up against H2O.

Both frameworks have strong and weak sides, many of which you may have noticed already. In this section, we’ll outline the general differences in these areas:

  • Implementation and scalability.
  • Ease of environment setup.
  • Performance.
  • Ease of AutoML setup and use.
  • AutoML pipeline fine-tuning.
  • Model explainability.

Implementation and scalability

TPOT and H2O are markedly different under the hood. More concretely:

  • TPOT is built on top of the scikit-learn machine learning library.
  • H2O, at its core, is written in Java and has an emphasis on distributed computing. H2O can interact with scikit-learn as well.

When it comes to the technical details, the most important point of distinction between H2O and TPOT is their approach to scalability. At a high level, here’s how the frameworks handle scalability and parallel computing:

  • H2O, as pointed out in PART 1 of this series, is geared toward enterprise users and scalability. The framework is built on top of the Map/Reduce model for large-scale data processing and uses Java Fork/Join for multithreading. H2O is also compatible with Hadoop and Spark — established and trusted frameworks in the world of big data.
  • TPOT relies on the third-party package Dask for parallel computing, and it also inherits multiprocessing from scikit-learn.

All in all, H2O is seemingly stronger in the area of distributed computing since it leverages more established big data frameworks. However, Dask is no slouch either since it is built to scale to thousands of nodes.

If you are interested in how Dask compares to Spark, check out this comparison written by the Dask team.

Ease of environment setup

Since H2O is written in Java, you need to have a Java environment set up on your machine, even if you will be using the R or Python packages of H2O. Additionally, if you want to leverage Hadoop and/or Spark, you will need to have them on your machine as well.

In contrast, TPOT follows the setup procedure of standard Python packages — no extra stuff required. For distributed training, you’ll also need Dask, which, again, is set up like other Python packages.

Performance

We’ve tested the performance differences between H2O and TPOT. Since these frameworks have major differences in implementation, we’ve attempted to standardize testing by using the same parameters for the AutoML algorithms whenever possible.

This section outlines our testing methodology and comparison results.

Testing methodology

We performed a number of tests:

  • Classification and regression in a fixed time frame (3 minutes). This allowed us to see the results the AutoML algorithms achieved in the same time frame.
  • Classification and regression with early stopping and within 60 minutes (stopping tolerance of 2 generations/training rounds). This allowed us to see how quickly the AutoML algorithms found an optimal model/pipeline for the given dataset. We determined that 60 minutes is enough for the models to reach the optimal solutions with our particular setup.
  • Classification and regression with a limited number of pipelines/models (50 pipelines/models). This allowed us to see the results achieved after training only a limited number of pipelines or models. Note that H2O actually trained over 50 models because stacked models don’t count toward the model limit.

The following datasets were used in the comparison:

  • Scikit-learn’s breast cancer dataset for classification.
  • Scikit-learn’s diabetes dataset for regression.

We compared TPOT and H2O performance in two areas:

  • Performance in classification and regression on the holdout test set.
  • Time to complete training. This was measured using the %%time magic command. TPOT and H2O do measure training times, but we used an external tool to negate the possible differences in how the frameworks track time.

The performance metrics used for these tests were as follows:

  • Classification: area under the ROC curve.
  • Regression: root mean squared error, or RMSE. Note that TPOT reports negative RMSE, since it always maximizes its scoring metric; this doesn’t change the comparison, but it’s important to know when reading TPOT’s scores.

For TPOT, we ran each of the three tests above with two templates:

  • Default, with feature selectors and preprocessors allowed.
  • “Classifier” or “Regressor” to only allow classifiers or regressors to be used. These templates were tested because H2O doesn’t have preprocessing and feature selection like TPOT does.

Additionally, we’ve made sure that the frameworks used the same number of threads (i.e. as many threads as available) and that TPOT used Dask for parallel computing.

Note that we only did one test run to obtain the results. It would have been ideal to run several tests and average the results, but this would have been too time-consuming.

While building the tests, we noticed that performance varied very little between runs, though the time to train did sometimes vary noticeably.

Here’s a summary of the parameters we used for these tests:

You can find the code for performance testing in this GitHub repo.

Testing results

The results achieved were as follows. Remember that AUC and RMSE were calculated on the test set.

And below is a graphical comparison of these results. Note that we’ve used the absolute values of TPOT RMSEs to make more sensible plots.

From these results, we can notice the following:

  • Overall, H2O and TPOT achieved roughly the same performance, with the differences between their scores being negligible (though TPOT was generally ahead). Either of these frameworks works well for classification and regression tasks.
  • When time limits are specified, TPOT tends to overshoot the time limit, while H2O tends not to use all the provided time. This may be because TPOT prioritizes doing complete generations, while H2O stops training when there is not enough time left to train another model.
  • TPOT’s pipelines take less time to train than H2O models. As a result, in the same amount of time, TPOT evaluates way more models than H2O.
  • Within the same amount of time, both H2O and TPOT achieve similar levels of performance, even though TPOT manages to evaluate many more pipelines.

Note that with other datasets, you may see larger differences in training times and scores. Data preprocessing may also reveal performance gaps.

Ease of AutoML setup and use

The next area where we should compare TPOT and H2O is the setup of the AutoML algorithm.

Overall, TPOT will allow you to set up basic pipelines much quicker and with much less code. To demonstrate this, here’s what you would need to do to set up an AutoML optimizer with the breast cancer dataset:
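As a rough sketch (the split proportions, seeds, and optimizer settings are our own choices):

from tpot import TPOTClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the data and hold out a test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up and run the optimizer
pipeline_optimizer = TPOTClassifier(generations=5, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))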

Short and neat, right?

In contrast, basic AutoML with H2O would require you to write something like this:
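Here is a hedged sketch of the equivalent H2O workflow (column names, split proportions, and AutoML parameters are assumptions, not the exact code from our tests):

import h2o
from h2o.automl import H2OAutoML
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Start (or connect to) a local H2O cluster
h2o.init()

# Prepare the data as pandas DataFrames first
data = load_breast_cancer(as_frame=True)
df = data.frame  # features plus a "target" column
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Load the preprocessed data into H2OFrame objects
train_frame = h2o.H2OFrame(train_df)
test_frame = h2o.H2OFrame(test_df)

# For classification, the target column must be converted to a categorical
train_frame["target"] = train_frame["target"].asfactor()
test_frame["target"] = test_frame["target"].asfactor()

features = [col for col in train_frame.columns if col != "target"]

# Set up and run AutoML
aml = H2OAutoML(max_runtime_secs=180, seed=42)
aml.train(x=features, y="target", training_frame=train_frame)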

H2O requires you to write more code and go through more steps, other things being equal! To showcase this, here is a comparison of the steps you need to take to tune a model in H2O and TPOT:

As you can see, H2O has three extra steps:

  • Launching an H2O cluster.
  • Loading preprocessed data into H2OFrame objects.
  • Converting labels to categoricals for classification.

Additionally, TPOT can perform feature preprocessing (including PCA, scaling, normalization, feature agglomeration, and one-hot encoding) and selection. As of version 3.34.0.3, H2O could only do target encoding. This means that with H2O, you may need to do manual feature preparation, whereas TPOT, if necessary, can do that automatically.

AutoML pipeline fine-tuning

Both H2O and TPOT have advantages when it comes to AutoML fine-tuning. At a high level:

· TPOT provides fine-grained control over the evaluated operators and their hyperparameters. H2O doesn’t have an easy way of tweaking candidate model hyperparameters.

· H2O, unlike TPOT, allows you to set up maximum runtime per model, more easily exclude algorithms, and adjust stopping tolerance for training.

There are some other, more subtle differences between the frameworks’ AutoML capabilities, but these were the biggest ones.

Potentially, the added flexibility of TPOT could allow you to achieve better performance than with H2O. As an example, with TPOT, you could:

1. Run the TPOT optimizer and identify the best pipeline (e.g. DecisionTreeClassifier).

2. Take that best pipeline and retrain it with a wider range of hyperparameters, possibly achieving better performance.

With that said, H2O’s features could likewise allow you to achieve excellent results, so it’s not like TPOT has an astronomic edge here.

Model explainability

H2O makes exploring trained models quite a bit easier than TPOT does.

First up, H2O allows you to have a look at a leaderboard of trained models to compare their performance. If you remember from our H2O post, the leaderboard looked like this:

TPOT doesn’t provide an equally neat way to compare models. You could set the estimator’s verbosity to 3 and log the results to a file, but H2O makes this much easier.

Secondly, H2O provides in-depth performance metrics for its models, including cross-validation results, confusion matrices, gains/lift tables, feature importance tables, and more. To obtain these, you just need to access the respective attributes of the models. You can get these metrics with TPOT and scikit-learn too, but you’ll need to compute them manually by using functions from scikit-learn or elsewhere.

Thirdly, H2O has a wonderful explainability interface that can:

  • Do residual analysis.
  • Generate importance heatmaps.
  • Show the contribution of features to predictions.
  • Compute the correlation between AutoML model predictions.

With all that said, TPOT has one really nice feature that H2O lacks — automatic pipeline code generation. This feature allows you to rebuild the pipeline from scratch in another environment for further testing or production deployment. You may not need this feature at all, but the fact remains — H2O doesn’t offer anything similar.

Final Words and Next Steps

So all in all, here are the key takeaways to remember with H2O and TPOT:

  • H2O is written in Java and will therefore integrate into Java environments easier.
  • TPOT is built on top of scikit-learn, which will be an advantage for scikit-learn users or environments that largely rely on scikit-learn.
  • TPOT and H2O achieve more or less the same performance within the same timeframe, although TPOT manages to evaluate more pipelines.
  • TPOT is overall more flexible since it allows you to adjust hyperparameter ranges and decide what sort of operators should and should not be in the tested pipelines.
  • TPOT evaluates not only ML/DL models but also feature selectors and preprocessors.
  • H2O has an excellent explainability interface that allows you to view the statistical properties of your data and trained models with just a few lines of code.

In the end, H2O and TPOT are both viable options for automated machine learning. Your choice will ultimately come down to two things:

  • The environment you are going to be deploying your AutoML workflow in.
  • Which framework is more appealing to you based on your coding needs and preferences.

In PART 3 of this series, we are going to emulate H2O Flow’s UI for AutoML, but with TPOT. More specifically, we will build a web app where we’ll be able to conveniently select AutoML parameters for TPOT and run its AutoML algorithm with very little code!
