Auto Machine Learning in H2O

John Kamau
Oct 07, 2024 · 5 min read

AutoML in H2O (image source: h2o.ai)

Introduction

When one thinks of machine learning, `scikit-learn` often comes to mind as a go-to library for Python users. It provides most of the tools needed to develop models but requires the data analyst to manually handle tasks like data preprocessing and hyper-parameter tuning. `H2O` is an alternative to `scikit-learn`, designed with both programmers and non-programmers in mind. It simplifies the machine learning process by automating many steps and provides both a code-based approach and a user-friendly graphical interface. This makes H2O particularly useful for beginners and those who may not have extensive coding experience.

Auto Machine Learning (AutoML) in H2O further simplifies machine learning by automating key steps such as data preprocessing, feature selection, model training, and hyper-parameter tuning. This article will guide you through using H2O for AutoML, including setting up the environment, building machine learning models, and working in both Python and R. We will explore the reasons to choose H2O and demonstrate its capabilities with practical examples.

Why Choose H2O?

H2O.ai offers several advantages that make it a great choice for machine learning tasks, particularly for users who want a more automated or non-programming-focused approach:

  • User-Friendly for Non-Programmers: H2O’s Flow interface allows users with little to no programming knowledge to build models. It works like Jupyter Notebook but with front-end features such as buttons, drag-and-drop tools, and easy navigation.
  • Multi-Platform Support: H2O works seamlessly with both Python and R, making it versatile for various data science workflows. The syntax is quite similar across both platforms, so once you learn one, it’s easy to adapt to the other.
  • Extensive Documentation: H2O has comprehensive documentation that includes code examples, problem-solving tips, and tutorials to guide users through its functionalities.
  • Integration with Other Platforms: H2O integrates with big data frameworks such as Apache Spark through Sparkling Water, allowing you to scale your models and enhance performance.
  • Java Deployment: You can deploy H2O models in Java environments, making it easy to integrate machine learning models into production systems (see the sketch after this list).

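As a quick illustration of the Java deployment path, the sketch below exports a trained model as a MOJO (Model Object, Optimized), an artifact that can be scored from a JVM application. It assumes `rf_model` is any trained H2O model, such as the Random Forest built later in this article, and that the current directory is writable.

h2o.ipynb
# Export a trained H2O model as a MOJO for Java/JVM scoring
# `rf_model` is assumed to be a trained model; the output path is just an example
mojo_path = rf_model.download_mojo(path=".", get_genmodel_jar=True)
print(mojo_path)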
Setting Up the Code for Machine Learning

In this example, we’ll use the `Pima Indians Diabetes` dataset, which you can easily download from Kaggle. Ensure that the dataset is available in your working directory. Before proceeding, you need to install H2O, which can be done by running `pip install h2o`. After installing H2O, you’ll need to load your dataset. We will first read the data using `Pandas` and then convert it to an H2O frame, which is H2O's data structure for working with datasets.

h2o.ipynb
# install h2o
!pip install h2o

# imports
import h2o
import pandas as pd

# Initialize H2O cluster
h2o.init()

# Load the dataset
df = pd.read_csv('pima-indians-diabetes.csv')

# Convert the DataFrame to an H2O Frame
data = h2o.H2OFrame(df)

# Display column types and summary statistics
data.describe()
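Note that going through `Pandas` is optional; H2O can also read the CSV directly into an H2OFrame. A minimal alternative, assuming the same file path:

h2o.ipynb
# Load the CSV directly into an H2OFrame, skipping Pandas
data = h2o.import_file('pima-indians-diabetes.csv')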

Building a Classification Model

Let’s build a simple classification model using the Random Forest algorithm to predict diabetes outcomes. H2O's syntax for model building is similar to `scikit-learn`, making it intuitive for those familiar with Python.

h2o.ipynb
from h2o.estimators import H2ORandomForestEstimator

# Define target and features
target = 'Outcome'
features = [col for col in data.columns if col != target]

# Convert the target to a factor so H2O treats this as a classification problem
data['Outcome'] = data['Outcome'].asfactor()

# Split data into training and test sets
train, test = data.split_frame(ratios=[0.8])

# Initialize and train the Random Forest model
rf_model = H2ORandomForestEstimator()
rf_model.train(x=features, y=target, training_frame=train)

# Evaluate model performance on test data
performance = rf_model.model_performance(test_data=test)
print(performance)

# Extract the confusion matrix
confusion_matrix = performance.confusion_matrix()
print(confusion_matrix)

# Plot variable importance
rf_model.varimp_plot()

After building the model, it’s essential to understand which features matter most; the `varimp_plot()` call above handles that. The performance object also exposes the false positive and true positive rates, so you can plot the ROC curve yourself using `Matplotlib`.

h2o.ipynb
# Plotting the ROC curve requires Matplotlib
import matplotlib.pyplot as plt

# Get the false positive rates (FPR) and true positive rates (TPR) from the model performance
fpr = performance.fprs
tpr = performance.tprs

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label="ROC Curve (AUC = {:.4f})".format(performance.auc()))
plt.plot([0, 1], [0, 1], 'k--', label="Random Guess")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
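The same `performance` object exposes other common evaluation metrics besides the ROC curve. A minimal sketch, assuming the `performance` object computed above; note that threshold-dependent metrics such as accuracy are reported as (threshold, value) pairs:

h2o.ipynb
# A few additional metrics from the same performance object
print("AUC:", performance.auc())
print("Log loss:", performance.logloss())
print("Max accuracy (threshold, value):", performance.accuracy())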

Visualization produced from the above code.

Auto Machine Learning (AutoML)

H2O’s AutoML feature automates the entire machine learning process, from model training to hyper-parameter tuning, allowing you to quickly identify the best model for your dataset. Here’s how to implement AutoML in H2O:

h2o.ipynb
from h2o.automl import H2OAutoML

# Plotting the ROC curve requires Matplotlib
import matplotlib.pyplot as plt

# Initialize AutoML with a 30-second training budget and train models
aml = H2OAutoML(max_runtime_secs=30)
aml.train(x=features, y=target, training_frame=train)

# Get the best model and evaluate it on the test set
best_model = aml.leader
performance = best_model.model_performance(test_data=test)
print(performance)

# Get the false positive and true positive rates for the ROC curve
fpr = performance.fprs
tpr = performance.tprs

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label="ROC Curve (AUC = {:.4f})".format(performance.auc()))
plt.plot([0, 1], [0, 1], 'k--', label="Random Guess")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
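AutoML also keeps a leaderboard that ranks every model it trained, which is useful for seeing how close the runner-up models are to the leader. A minimal sketch, assuming the `aml` object from the code above:

h2o.ipynb
# Inspect the AutoML leaderboard (for binary classification, models are ranked by AUC by default)
lb = aml.leaderboard
print(lb.head(rows=10))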

H2O Flow

H2O Flow is a web-based graphical interface that allows users to interact with data and build models without writing any code. Flow simplifies the machine learning process for non-programmers, offering an intuitive drag-and-drop interface to manipulate data, train models, and visualise results. Once an H2O cluster is running (for example, after calling h2o.init()), Flow is available in your browser, by default at http://localhost:54321.

Using H2O in R

H2O also works well in R, making it easy to replicate your Python workflows in another popular data science language. Here’s an example of how you can build a machine learning model in R using H2O.

h2o.r
library(h2o)

# Initialize H2O
h2o.init()

# Load and convert the dataset to an H2O Frame
data <- read.csv('pima-indians-diabetes.csv')
data_h2o <- as.h2o(data)

# Convert the target to a factor so H2O treats this as a classification problem
data_h2o$Outcome <- as.factor(data_h2o$Outcome)

# Define target and features
target <- 'Outcome'
features <- setdiff(names(data_h2o), target)

# Split the data
splits <- h2o.splitFrame(data_h2o, ratios = 0.8)
train <- splits[[1]]
test <- splits[[2]]

# Train a Random Forest model
rf_model <- h2o.randomForest(x = features, y = target, training_frame = train)

# Model performance
perf <- h2o.performance(rf_model, newdata = test)
print(perf)

# Plot the ROC curve
plot(perf, type = "roc")
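As in the Python example, you can inspect feature importance for the R model as well. A minimal sketch, assuming the `rf_model` trained above:

h2o.r
# Plot variable importance for the Random Forest model
h2o.varimp_plot(rf_model)

# Or print the importance table itself
print(h2o.varimp(rf_model))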

AutoML in R

You can also use AutoML in R, following a similar process as in Python:

autoh2o.r
# Initialize AutoML and train models for up to 5 minutes
aml <- h2o.automl(x = features, y = target, training_frame = train, max_runtime_secs = 300)

# Get the best model
best_model <- aml@leader

# Model performance
perf_best <- h2o.performance(best_model, newdata = test)
print(perf_best)

# Plot the ROC curve for the best model
plot(perf_best, type = "roc")
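As in Python, the R AutoML object also keeps a leaderboard of every model it trained. A minimal sketch, assuming the `aml` object from the code above:

autoh2o.r
# View the AutoML leaderboard (models ranked by the default metric)
print(aml@leaderboard)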

Conclusion

H2O makes machine learning accessible to both beginners and experienced data scientists. Whether you prefer coding in Python or R or using a drag-and-drop interface, H2O provides the tools to quickly develop, train, and evaluate machine learning models. Its AutoML capabilities further reduce the time and effort required to identify the best models, making it an ideal tool for rapid experimentation and deployment.


John Kamau

Co-founder & Lead Analyst

Numbers don't lie, but this guy tells stories with them! John Kamau isn't your average data whiz. Sure, he's got the degrees (economics and data science, no less!), but his experience is where things get interesting. For three years, he wrangled numbers like a financial accountant ninja. Then, for seven years, he became a data analyst extraordinaire, leading the data charge for three years at L-IFT (we'll let him explain that one!). What does all this mean? John can predict cash flow like a fortune teller with a spreadsheet, build credit scoring models that make banks jealous, and unearth insights from data that would make Aesop's fables blush. As a co-founder of Aesops, John isn't just crunching numbers; he's using his skills to craft impactful solutions for Kenyans. Think of him as the data Robin Hood, taking insights from the rich (data) and giving them to the people (Kenyans) to make a real difference.
