Review: H2O.ai automates machine learning

Driverless AI really is able to create and train good machine learning models without requiring machine learning expertise from users.


Machine learning, and especially deep learning, have turned out to be incredibly useful in the right hands, as well as incredibly demanding of computer hardware. The boom in availability of high-end GPGPUs (general purpose graphics processing units), FPGAs (field-programmable gate arrays), and custom chips such as Google’s Tensor Processing Unit (TPU) isn’t an accident, nor is their appearance on cloud services.

But finding the right hands? There’s the rub—or is it? There is certainly a perceived dearth of qualified data scientists and machine learning programmers. Whether there’s a real lack or not depends on whether the typical corporate hiring process for data scientists and developers makes sense. I would argue that the hiring process is deeply flawed in most organizations.

If companies teamed up domain experts, statistics-literate analysts, SQL programmers, and machine learning programmers, rather than trying to find data scientists with Ph.D.s plus 20 years of experience who were under 39, they would be able to staff up. Further, if they made use of a tool such as H2O.ai’s Driverless AI, which automates a significant portion of the machine learning process, they could make these teams dramatically more efficient.

As we’ll see, Driverless AI is an automatically driven machine learning system that is able to create and train surprisingly good models in a surprisingly short time, without requiring data science expertise. However, while Driverless AI reduces the level of machine learning, feature engineering, and statistical expertise required, it doesn’t eliminate the need to understand your data and the statistical and machine learning algorithms you’re applying to it.  

Bridging the data science skills gap

Before analyzing data, whether with machine learning, deep learning, or statistical models, you need to clean and prepare it. During the model building process, you need to do feature engineering, in which you create new fields that have more relevance to the target results than the data’s original fields, often after doing singular value decomposition (SVD) or cluster analysis. All of that is quite tedious. Building deep neural network models layer by layer and tuning their hyperparameters is also tedious and labor-intensive, and it involves the highly GPU-intensive, processor-intensive, and memory-intensive step of training each model.

There have been at least half a dozen attempts to automate machine learning in the last year. These include Auto-sklearn, Auto-Weka, Prodigy, Google AutoML, Google Vizier, and H2O.ai’s Driverless AI, the subject of this review. H2O.ai, the open source deep learning package underlying Driverless AI, also has an AutoML module, which focuses on tuning the hyperparameters of the different algorithms and finding the best combination or ensemble for a problem, but does not do any feature engineering.

In addition to the issue of the skilled labor needed to create and optimize models, no one really understands the prediction models created by machine learning systems, which are often nonlinear, non-monotonic, and non-continuous even though they are more accurate than statistical models. There have been many attempts to approximate and annotate machine learning predictions and solve the “black box problem” in the last few years. Several of them have been incorporated into Driverless AI.

The black box problem is especially important in regulated application areas such as finance and medicine. It’s not enough to tell a loan applicant that “the system said no.” You also have to explain why it said no, for example “Your income is too low for the amount of credit for which you’ve applied” or “You had too many missed payments.”

[Image: H2O.ai Driverless AI architecture diagram. Credit: IDG]

H2O.ai’s Driverless AI adds proprietary automatic feature engineering and visual model interpretation to the company’s open source stack.

Driverless AI is a proprietary web product (see the architecture diagram above) built on top of the open-source H2O.ai stack (see the architecture diagram below) with the goal of imitating the processes used by Kaggle grandmasters to create great models. The underlying H2O.ai stack is also available pre-built as a JAR file, a Python wheel, and an R package.

[Image: H2O.ai stack architecture diagram. Credit: IDG]

The H2O.ai stack that lives underneath the Driverless AI layer includes a wide range of machine learning and deep learning algorithms, and can integrate with Apache Hadoop and Apache Spark.

Kaggle is a site for data science that offers standard data sets and runs competitions to analyze them. Some competitions are sponsored and offer substantial prizes. (First prize in the ongoing TSA passenger screening algorithm challenge is $500,000. Yes, substantial.) Kaggle also offers tutorials and an online environment for data science. Each challenge has a leaderboard, and all Kaggle users are ranked by their contributions. There are currently 95 grandmasters and 890 masters in the Kaggle rankings. Of course, those rankings only reflect the people and teams that compete. I am sure there are thousands of master-level and grandmaster-level data scientists quietly working for companies, with no time to compete on Kaggle.

I analyzed two data sets listed on Kaggle with Driverless AI: BNP Paribas, a competition run two years ago to predict whether an insurance claim is eligible for accelerated processing, and the Default of Credit Card Clients Dataset, a UCI data set that can be used to predict whether a loan client is likely to default on his or her next payment. I used Driverless AI 1.0.4 and 1.0.5 (for a bug fix) installed on an AWS p2.8xlarge instance, which has eight Nvidia K80 GPUs.

H2O.ai Driverless AI installation and setup

Installing Driverless AI is a simple matter of choosing the H2O.ai Driverless AI AMI when creating an Amazon EC2 instance. On any other supported cloud or local machine, you basically install Docker, nvidia-docker, and the Nvidia driver; add the standard directories; download and install the Driverless AI Docker container; and add your license. When run, the Driverless AI container exposes a web interface on port 12345 that you can view from a Chrome browser. The default view (see the screenshot below) lists your experiments.

[Image: Driverless AI experiments view. Credit: IDG]

The default view of your H2O.ai Driverless AI workspace shows your experiments and gives you links to visualize your data sets and create new experiments. You can add data sets when you add experiments.

GPU support on Ubuntu uses the nvidia-docker program and the Nvidia driver to connect the Driverless AI Docker container to Nvidia GPUs. Driverless AI can use multiple GPUs as well as multiple CPUs.

Driverless AI supports Kepler, Maxwell, Pascal, and Volta GPU microarchitectures. It works fine on Kepler K80 GPUs, which are the type supplied in AWS P2 instances, and which are classified as having a compute capability of 3.7. Driverless AI does not support the older Tesla and Fermi microarchitectures.

You typically load data into the data directory of the VM hosting the Driverless AI container using Secure Copy (scp), assuming that you’re running Driverless AI in the cloud. The container is configured to use this directory as its default when importing.

While you can install a Driverless AI container on your local macOS or Windows 10 machine, you’ll need to use a machine with at least 16GB of RAM, and give at least 8GB of RAM to Docker. You will be able to experiment with this configuration, but you won’t be able to use it for serious work, both for lack of RAM and lack of GPU support.

H2O.ai Driverless AI machine learning

To run an experiment, log into your Driverless AI server with Chrome on port 12345. Click the New Experiment button, pick a training data set, and you’ll see a screen like the one shown below. The experiment settings need a little explanation.

[Image: Driverless AI experiment setup for the BNP Paribas data set. Credit: IDG]

After you pick a training data set for an experiment, you can pick the target column, drop columns, pick the test data set, and adjust the experiment settings. In the screenshot above, I haven’t yet picked the test data set.

The Accuracy setting affects several other parameters: the maximum number of rows, the ensemble level, whether to try target transformations, whether to tune the parameters of the XGBoost model, how many individuals to use in the genetic algorithms, how many cross-validation folds to use in each model, and whether to perform feature selection permutations. The Time setting controls the number of epochs to run. The Interpretability setting controls whether to use a feature selection strategy for the interpretation display. I used the default setting, 5, for all three controls.

[Image: BNP Paribas experiment at 92 percent complete. Credit: IDG]

The screenshot above shows H2O Driverless AI running an experiment on the BNP Paribas training data set. The experiment is 92 percent complete at this point. Note the change in the maximum AUC (Area Under the Curve) score over the epochs shown at the lower left. Also note the bursty pattern of GPU usage to train each epoch at the lower right and the engineered feature names in the lower middle panel.

When you launch an experiment, Driverless AI starts the feature engineering process, which involves quickly training and scoring a lot of models while applying transformations to the data fields to create new features with better predictive power between epochs. Exactly what transformations are applied depends on the data types.
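
To make the idea of scoring engineered features from epoch to epoch more concrete, here is a minimal Python sketch of an evolutionary search over candidate transformations. It is not Driverless AI’s algorithm, whose internals are proprietary. The transformation pool, the `evolve` function, and the use of absolute Pearson correlation as a cheap stand-in for training and scoring a model are all assumptions made for illustration.

```python
import math
import random

# Hypothetical pool of candidate transformations over two numeric columns (a, b).
TRANSFORMS = {
    "identity": lambda a, b: a,
    "square": lambda a, b: a * a,
    "log1p_abs": lambda a, b: math.log1p(abs(a)),
    "product": lambda a, b: a * b,
    "difference": lambda a, b: a - b,
}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def fitness(name, rows, target):
    """Score one engineered feature. A real system would train and score a model here."""
    feature = [TRANSFORMS[name](a, b) for a, b in rows]
    return abs(pearson(feature, target))


def evolve(rows, target, epochs=10, population=2, seed=0):
    """Keep the best `population` transformations after each epoch of random mutation."""
    rng = random.Random(seed)
    names = sorted(TRANSFORMS)
    pop = rng.sample(names, population)
    best_per_epoch = []
    for _ in range(epochs):
        # Mutation: each individual proposes one randomly chosen candidate.
        mutants = [rng.choice(names) for _ in pop]
        pool = set(pop) | set(mutants)
        # Selection: rank by fitness, breaking ties by name for determinism.
        ranked = sorted(pool, key=lambda n: (-fitness(n, rows, target), n))
        pop = ranked[:population]
        best_per_epoch.append(fitness(pop[0], rows, target))
    return pop[0], best_per_epoch


# Toy data: the target depends on an interaction between the two columns.
rows = [(1, 5), (2, 1), (3, 4), (4, 2), (5, 6), (6, 3)]
target = [a * b for a, b in rows]
winner, best_per_epoch = evolve(rows, target, epochs=200)
print(winner, best_per_epoch[-1])
```

Because the previous best individual always stays in the pool, the best score can never drop from one epoch to the next, which matches the general shape of the score curve you see while an experiment runs.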

Text fields might generate TF-IDF (term frequency–inverse document frequency) and word count features. Numeric fields might be converted to categorical values by binning, and averages of a field for each categorical value (e.g., average house prices in New York state) may be computed using out-of-fold cross-validation. Multiple dimensions may be clustered, and the distance between each row and the nearest cluster center may become a new feature.
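
Three of these transformations are easy to sketch in plain Python. The code below is an illustration under my own assumptions, not code taken from Driverless AI: the function names, the fold assignment by row index, and the toy house-price data are all hypothetical. The point of the out-of-fold step is that each row’s encoded value is computed only from rows in other folds, so a row’s own target never leaks into its feature.

```python
import math
from collections import defaultdict


def bin_numeric(value, edges):
    """Return the index of the bin for `value`; `edges` are ascending upper bounds."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)  # values above the last edge fall into a final bin


def out_of_fold_mean_encode(categories, targets, n_folds=3):
    """Replace each category with the mean target of that category in the OTHER folds."""
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate statistics from rows outside the held-out fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Encode the held-out rows; fall back to the global mean for unseen categories.
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded


def nearest_center_distance(row, centers):
    """Euclidean distance from a row to the closest cluster center."""
    return min(math.dist(row, center) for center in centers)


# Made-up data: house prices (in thousands) by state.
states = ["NY", "NY", "CA", "NY", "CA", "TX"]
prices = [500, 700, 900, 600, 800, 300]
print(out_of_fold_mean_encode(states, prices))
print(bin_numeric(25, [18, 40, 65]))
print(nearest_center_distance((0, 0), [(3, 4), (6, 8)]))
```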

[Image: Completed BNP Paribas experiment. Credit: IDG]

The BNP Paribas experiment is complete and shows an AUC (Area Under the Curve) score of 0.7636 after 50 epochs. Eight additional actions are available for this completed experiment, including interpretation, scoring another data set, and downloading various generated data and a scoring package.

After all of the epochs are evaluated, Driverless AI runs a full training and prediction generation with the final feature set and shows itself as complete. At that point you can look at model interpretations, which require a bit more computation, and then view explanations.
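
For readers who haven’t worked with the AUC score reported for these experiments, the following sketch shows one standard way to compute it: as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The `roc_auc` function is my own illustration and says nothing about how Driverless AI computes the metric internally. A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: one of the four positive/negative pairs is ranked the wrong way round.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

The pairwise loop is quadratic in the number of rows, which is fine for illustration but not for large data sets; production libraries use a sort-based equivalent.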

[Image: BNP Paribas interpretable model plot. Credit: IDG]

The Global Interpretable Model Explanation Plot compares the deep learning model prediction, the approximate k-LIME model prediction, and the actual target. You can select individual data points to display their parameters.

The Model Interpretation page includes a global interpretable model explanation plot, a variable importance bar chart, a decision tree surrogate model, and partial dependence and individual conditional expectation plots. All of this is in aid of generating approximate explanations for very exact models, using the k-LIME technique. What’s basically going on here is that Driverless AI runs a k-means analysis to generate clusters, and fits both global and cluster-local generalized linear models to the Driverless AI model prediction. The local models are used to explain rows that are in sufficiently large clusters, and the global model is used to explain rows that are outside the large clusters.
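
The k-LIME idea described above can be sketched in a few dozen lines. This is a deliberately simplified illustration, not Driverless AI’s implementation: it uses a single input feature, a basic one-dimensional k-means, and ordinary least squares in place of a generalized linear model. The names `k_lime`, `kmeans_1d`, `fit_line`, and `min_cluster_size` are my own assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope


def nearest(x, centers):
    """Index of the cluster center closest to x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))


def kmeans_1d(xs, k, iterations=20):
    """Lloyd's algorithm in one dimension with deterministic quantile initialization."""
    ordered = sorted(xs)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        labels = [nearest(x, centers) for x in xs]
        for j in range(k):
            members = [x for x, label in zip(xs, labels) if label == j]
            if members:
                centers[j] = sum(members) / len(members)
    return [nearest(x, centers) for x in xs], centers


def k_lime(xs, model_preds, k=2, min_cluster_size=3):
    """Fit global and cluster-local linear surrogates to a model's predictions."""
    global_model = fit_line(xs, model_preds)
    labels, centers = kmeans_1d(xs, k)
    local_models = {}
    for j in range(k):
        idx = [i for i, label in enumerate(labels) if label == j]
        if len(idx) >= min_cluster_size:  # only trust sufficiently large clusters
            local_models[j] = fit_line([xs[i] for i in idx],
                                       [model_preds[i] for i in idx])

    def explain(x):
        j = nearest(x, centers)
        used_local = j in local_models
        intercept, slope = local_models[j] if used_local else global_model
        return {"cluster": j if used_local else None,
                "slope": slope,
                "approx": intercept + slope * x}

    return explain


# Stand-in for a complex model: predictions that behave differently in two regions.
xs = [0, 1, 2, 3, 10, 11, 12, 13]
preds = [1 + 2 * x if x < 5 else 50 - x for x in xs]
explain = k_lime(xs, preds)
print(explain(2))
print(explain(12))
```

In this toy case a single global line would badly misrepresent both regions, while each local line recovers the slope that actually applies to nearby rows. That is the motivation for preferring the cluster-local model whenever the cluster is large enough.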

[Image: Credit card data set, cluster 0 and global reason codes. Credit: IDG]

This screen shows the cluster 0 and global reason codes for an analysis of the credit card payment default data set. Here we see the top three payment values that are associated with a higher probability of default and the top three that are associated with a lower probability of default. The contributions are similar but not identical for the cluster 0 k-LIME and the global models.

If you view explanations, you can see the variables that are major contributors to the global interpretable model, and to the models for individual clusters. Use the Plot dropdown menu to choose the cluster you wish to view.
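
Reason codes of this kind fall out naturally from a linear surrogate model, because each variable’s contribution to a prediction is simply its coefficient times its value. The sketch below shows the idea. The feature names, coefficients, and applicant values are invented for illustration and are not taken from the credit card data set or from Driverless AI’s output.

```python
def reason_codes(coefficients, row, top_n=3):
    """Rank the variables that push a linear model's prediction up or down."""
    contributions = {name: coefficients[name] * row[name] for name in coefficients}
    increasing = sorted(
        ((name, c) for name, c in contributions.items() if c > 0),
        key=lambda item: -item[1],
    )[:top_n]
    decreasing = sorted(
        ((name, c) for name, c in contributions.items() if c < 0),
        key=lambda item: item[1],
    )[:top_n]
    return increasing, decreasing


# Hypothetical surrogate-model coefficients and one hypothetical applicant.
coefficients = {
    "months_late_last_payment": 0.08,
    "months_late_prior_payment": 0.03,
    "credit_limit_scaled": -0.05,
    "last_payment_amount_scaled": -0.02,
}
applicant = {
    "months_late_last_payment": 2,
    "months_late_prior_payment": 1,
    "credit_limit_scaled": 1.0,
    "last_payment_amount_scaled": 0.5,
}

raises_risk, lowers_risk = reason_codes(coefficients, applicant)
print("Raises default probability:", raises_risk)
print("Lowers default probability:", lowers_risk)
```

The top entry in each list is the kind of statement a regulator or loan applicant can be given, such as “your most recent payment was two months late.”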

Driverless AI can generate a downloadable Python scoring package for any experiment, which can run in TCP or HTTP modes. You need Ubuntu 16.04 or later, Python 3.6, and a bunch of Python modules to run the scoring package.

[Image: Credit card training data visualizations. Credit: IDG]

The H2O Driverless AI visualization page shows an overview of the automatically generated plots and graphs for training or test data sets. This screenshot is for the credit card default training data. You can click on any plot to view it at full size, download it, and see any other plots in its category.

If you click on a data set you can see the key visualizations for that data set, as shown above. If you click on any of the visualizations, you can view it at full size and download it. If there are additional plots in a given category, for example biplots, you can navigate among them.

In addition to controlling Driverless AI via its web UI, you can also write Python client programs using the h2oai_client wheel.

Evaluating automated machine learning

Overall, Driverless AI is impressive; in fact, I’m surprised that it works so well. The company says that its own Kaggle masters, who provided the algorithms for the system, were also surprised. Feature engineering and model training often take weeks before you get a good answer. Driverless AI can often get a good answer in minutes or hours.

At a Glance
  • H2O.ai's Driverless AI is an automatically driven machine learning system that also does feature engineering and annotation, dramatically reducing the time and effort required to produce good models.

    Pros

    • Driverless AI is able to create and train good models without requiring user expertise
    • Good integration with Nvidia GPUs (K80 and above)
    • Approximate linear models help to explain important factors in a decision
    • Makes quick work of generating and evaluating many models
    • Generates and exports prediction pipeline for trained model

    Cons

    • While the H2O.ai AI platform is open source, Driverless AI is proprietary
    • The concepts behind Driverless AI require a strong statistics and machine learning background
    • Trained data scientists will most likely be able to do more with Driverless AI than business analysts