H2O AutoML


Introduction: AutoML

Over the past few years, demand for machine learning systems has skyrocketed, largely because machine learning techniques have proven effective across a wide variety of applications. By enabling users from different backgrounds to apply machine learning models to complicated scenarios, AutoML is fundamentally altering the face of ML-based solutions today. Nevertheless, despite abundant evidence that machine learning can benefit several industries, many firms still find it difficult to implement ML models, largely owing to a shortage of seasoned and competent data scientists. Additionally, many machine learning procedures call for experience more than knowledge, particularly when determining which models to train and how to evaluate them. Many efforts are being made today to close these rather obvious gaps, and this post examines whether AutoML can be a solution to these obstacles. Automated machine learning (AutoML) is the process of automating the application of machine learning to practical problems: it aims to automate as many steps as possible in an ML pipeline, from training and fine-tuning to deployment, with the least amount of human effort and without sacrificing the model's performance. Without the need for human participation, AutoML can automatically find the optimal model for a particular dataset and task.

Because it automates the process of developing and deploying machine learning models, AutoML is a crucial tool for making machine learning approachable to non-experts; it can expedite machine learning research and save time and resources. The key ingredient in AutoML is the hyperparameter search technique, which is used to select preprocessing elements, choose model types, and tune their hyperparameters. The optimization algorithms involved range from Bayesian and evolutionary algorithms to random and grid search. Various strategies can be used to develop an AutoML solution, depending on the precise issue to be resolved: some approaches concentrate on identifying the best model for a particular job, while others concentrate on optimising a model for a specific dataset. Whatever the strategy, AutoML can be a potent tool for improving the usability and effectiveness of machine learning.
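To make this concrete, here is a minimal, self-contained sketch of a random hyperparameter search with h2o.grid(); the hyperparameter space below is an illustrative assumption, not a recommendation, but a random grid of this kind is the sort of search H2O AutoML runs internally for its GBM models:

library(h2o)
h2o.init()

# any small training frame will do for illustration
train <- as.h2o(iris)

gbm_grid <- h2o.grid(
  algorithm = "gbm",
  x = 1:4, y = 5,                    # predictors and outcome by column index
  training_frame = train,
  hyper_params = list(               # assumed, illustrative search space
    max_depth = c(3, 5, 7),
    learn_rate = c(0.01, 0.05, 0.1)
  ),
  search_criteria = list(
    strategy = "RandomDiscrete",     # sample combinations at random
    max_models = 5                   # instead of exhausting the full grid
  ),
  seed = 1234
)

h2o.getGrid(gbm_grid@grid_id, sort_by = "logloss")   # ranked results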

This post aims to introduce you to H2O, one of the top AutoML tools and platforms.

Pros

  • Time saving: AutoML is a quick prototyping tool. Especially when you are not working on a critical task, you can let AutoML do the job for you while you focus on more pressing work.

  • Benchmarking: Building an ML/DL model is fun, but how do you know whether your model is the best one available? One option is to use AutoML as a benchmark against which to compare yours.

Cons

  • Most AI models we come across are black boxes, and the same holds for these AutoML frameworks. If you don't understand what you are doing, the consequences can be catastrophic.

  • Related to the previous point, AutoML is marketed as a tool for non-data scientists. However, blindly using a model for decision-making without understanding how it works could be disastrous.

H2O AutoML

The AutoML component of the H2O package enables the automatic training and fine-tuning of several models within a user-specified time frame. The current AutoML function can train and cross-validate a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a Stacked Ensemble built from all of these models. For users to complete jobs with the least amount of confusion, AutoML should handle data preparation, model development, and ensembling while exposing as few parameters as possible, and H2O AutoML accomplishes this with only a few user-supplied parameters. The data-related arguments x, y, training_frame, and validation_frame are used in both the R and Python APIs; y and training_frame are required, and the rest are optional. The stopping criteria max_runtime_secs and max_models control how long AutoML runs and how many models it builds; max_models defaults to NULL if you don't supply it, and you can adjust the values of both. If you don't want to use every predictor from the frame you supplied, you can restrict the predictors by passing the x parameter, a vector of predictor names from training_frame. The lares wrapper h2o_automl() builds an entire pipeline around this function; its steps and parameters are described below, and it is worth adjusting the settings even if you are unsure of their purpose, since doing so is a good way to learn about some complex subjects:

  • Enter a dataframe df and select the independent variable y that you want to forecast. If you want to ensure that your results can be replicated, you can set the seed argument.

  • The function determines whether the model is a classification (categorical) or regression (continuous) model by examining the class and the number of distinct values of the independent variable y; this behaviour can be adjusted with the thresh parameter.

  • Test and train datasets are split off from the dataframe. The split percentage can be managed with the split argument. The msplit() function can replicate this step.

  • Before moving on, you can also scale and centre your numerical data, remove outliers with the no_outliers argument, and/or impute missing values with MICE. If the model is a classification model, the function can balance (under-sample) your training data; this behaviour is managed with the balance argument. Everything up to this point can be replicated with the model_preprocess() function.

  • Runs h2o::h2o.automl(...) to train several models and produce a leaderboard of the best (max_models or max_time) trained models, ranked by performance. You can also pass extra arguments on to the underlying function, such as nfolds for k-fold cross-validation, or exclude_algos and include_algos to exclude or include particular algorithms.

  • The best model according to the default performance metric (which can be changed with the stopping_metric option) is chosen to proceed after being cross-validated and evaluated using nfolds. A different model can be chosen with the h2o_selectmodel() function, and all calculations and plots can then be redone using the new model.

  • Performance metrics and charts are created and rendered using the test predictions and test actual values, which were NOT passed to the models as training inputs. This way your model's performance metrics are not biased. The model_metrics() function lets you repeat these calculations.

  • Finally, it returns a list containing all the inputs, performance metrics, plots, the top-ranked model, and the leaderboard results. The results can be exported with the export_results() function, or you can inspect them on the console. A minimal example call is sketched right after this list.
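For instance, a minimal h2o_automl() call wiring together the parameters described above might look like this (the argument values are illustrative assumptions, not recommendations):

library(lares)

# assumed: df is a data.frame with a categorical outcome column named "diabetes"
r <- h2o_automl(
  df,
  y = diabetes,                      # variable to forecast
  seed = 1234,                       # reproducible results
  thresh = 10,                       # unique-value cutoff between classification and regression
  split = 0.7,                       # 70/30 train/test split
  balance = TRUE,                    # under-sample the majority class
  impute = FALSE,                    # skip MICE imputation
  max_models = 5,                    # cap the leaderboard size
  exclude_algos = c("DeepLearning")  # passed through to h2o.automl()
)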

Mapping H2O AutoML Functionalities

  • validation_frame: This frame is used for early stopping of individual models in AutoML. You can pass it explicitly for model validation; if you don't, a part of the training data is used.

  • leaderboard_frame: If passed, the models are scored on this frame for the final leaderboard instead of using cross-validation metrics. Again, a part of the training data is used if you don't pass it.

  • nfolds: Number of folds for k-fold cross-validation, 5 by default; lowering it reduces training time at the cost of less reliable performance estimates.

  • fold_column: Specifies a column that assigns each row to a cross-validation fold.

  • Which splits are performed depends on the frames you pass:

    • All three frames are passed - no splits are performed.

    • Only the training frame is passed - the data is split 80/10/10 into training, validation, and leaderboard frames.

    • Training and leaderboard frames are passed - the data is split 80/20 into training and validation frames.

  • weights_column: If you want to weight specific rows you can use this parameter; assigning a weight of 0 means the row is excluded.

  • ignored_columns: Python only; the converse of x, i.e. the columns that should not be used as predictors.

  • stopping_metric: Specifies the metric for early stopping of the grid searches and individual models. The default value is logloss for classification and deviance for regression.

  • sort_metric: The metric used to sort the leaderboard models at the end. It defaults to AUC for binary classification, mean_per_class_error for multinomial classification, and deviance for regression.

  • How validation_frame and leaderboard_frame are used also depends on the cross-validation setting, i.e. nfolds; a sketch mapping these arguments onto an h2o.automl() call follows below.
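The following sketch shows how the arguments above map onto a direct h2o::h2o.automl() call; the frame names and metric choices are illustrative assumptions:

library(h2o)
h2o.init()

# assumed: train, valid, and board are H2OFrames with the outcome column "diabetes"
aml <- h2o.automl(
  y = "diabetes",
  training_frame = train,
  validation_frame = valid,     # used for early stopping of individual models
  leaderboard_frame = board,    # used to score and rank the final leaderboard
  nfolds = 0,                   # disable cross-validation so the frames above are used
  stopping_metric = "logloss",  # early-stopping metric for grids and models
  sort_metric = "AUC",          # leaderboard sort order
  max_models = 10,
  seed = 1234
)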

H2O AutoML addresses the shortage of machine learning experts by providing intuitive machine learning software. This AutoML application seeks to streamline machine learning by offering clear and uniform user interfaces for the different machine learning methods, and within a user-specified time range it automatically develops and fine-tunes machine learning models. On top of this, the lares package contains several families of functions that enable data scientists and analysts to perform high-quality, reliable analyses without writing much code. One of its more intricate yet valuable functions is h2o_automl, which semi-automatically executes the entire pipeline of a machine learning model given a dataset and some adjustable parameters. You can speed up research and development by using AutoML to train high-quality models tailored to your needs.

Before getting to the code, I recommend checking h2o_automl's full documentation here or within your R session by running ?lares::h2o_automl if you use the lares version. The documentation contains a brief explanation of each parameter that can be passed to the function to control how it operates and obtain the results you need.

First we load the necessary packages:

library(readr)
library(tidyverse)
library(tidymodels)
library(h2o)
library(gridExtra)
library(kableExtra)

We use the diabetes data from the pdp package:

data(pima, package = "pdp")
out <- "diabetes"                 # outcome variable (column 9 of pima)
preds <- colnames(pima)[-9]       # all remaining columns are predictors
df <- pima %>%
  drop_na()                       # keep only complete cases

df_split <- initial_split(df)     # default 3/4 train, 1/4 test
train_data <- training(df_split)
test_data <- testing(df_split)
h2o.init()                        # start the local H2O cluster
train_data <- as.h2o(train_data)  # convert the training set to an H2OFrame

Here we train the model:

h2oAML <- h2o.automl(
  y = out,
  x = preds,
  training_frame = train_data,
  project_name = "ice_the_kicker_bakeoff",
  balance_classes = TRUE,    # rebalance the classes before training
  max_runtime_secs = 100,    # stop AutoML after 100 seconds
  seed = 1234
)
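Incidentally, the fitted object also exposes the winning model directly via its @leader slot, an alternative to walking the leaderboard as shown later:

best <- h2oAML@leader   # the top-ranked model of this AutoML run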

A summary of the trained models can be found in the leaderboard:

leaderboard_tbl <- h2oAML@leaderboard %>% as_tibble()

leaderboard_tbl %>% head() %>% kable()
| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| GBM_grid_1_AutoML_1_20221024_112632_model_40 | 0.8475490 | 0.4392863 | 0.6840091 | 0.1936275 | 0.3764929 | 0.1417469 |
| StackedEnsemble_BestOfFamily_7_AutoML_1_20221024_112632 | 0.8446623 | 0.4507552 | 0.6834712 | 0.1915033 | 0.3765024 | 0.1417541 |
| GBM_grid_1_AutoML_1_20221024_112632_model_30 | 0.8435185 | 0.4571160 | 0.7007188 | 0.1918301 | 0.3830612 | 0.1467359 |
| DeepLearning_grid_1_AutoML_1_20221024_112632_model_3 | 0.8424837 | 0.4846210 | 0.6730490 | 0.2058824 | 0.3863806 | 0.1492899 |
| GBM_grid_1_AutoML_1_20221024_112632_model_87 | 0.8423203 | 0.4560024 | 0.6421818 | 0.2081699 | 0.3858873 | 0.1489090 |
| StackedEnsemble_AllModels_6_AutoML_1_20221024_112632 | 0.8414488 | 0.4586412 | 0.6550735 | 0.1942810 | 0.3806513 | 0.1448954 |

The leaderboard also allows us to identify the best-fitting model and inspect its parameters:

model_names <- leaderboard_tbl$model_id

top_model <- h2o.getModel(model_names[1])   # the leaderboard is sorted, so the first model is the best

top_model@model$model_summary %>%
  pivot_longer(cols = everything(), names_to = "Parameter", values_to = "Value") %>%
  kable()
| Parameter | Value |
|---|---|
| number_of_trees | 36.000000 |
| number_of_internal_trees | 36.000000 |
| model_size_in_bytes | 4499.000000 |
| min_depth | 3.000000 |
| max_depth | 5.000000 |
| mean_depth | 3.638889 |
| min_leaves | 4.000000 |
| max_leaves | 6.000000 |
| mean_leaves | 5.305555 |

Then we can measure the performance on the test data:

h2o_predictions <- h2o.predict(top_model, newdata = as.h2o(test_data)) %>%
  as_tibble() %>%
  bind_cols(test_data)   # attach the actual values to the predictions

h2o_metrics <- bind_rows(
  # calculate performance metrics against the held-out test set
  yardstick::f_meas(h2o_predictions, diabetes, predict),
  yardstick::precision(h2o_predictions, diabetes, predict),
  yardstick::recall(h2o_predictions, diabetes, predict)
) %>%
  mutate(label = "h2o", .before = 1) %>%
  rename_with(~str_remove(.x, '\\.')) %>%
  kable()

Here is the code to build the confusion matrix on the test data:

h2o_cf <- h2o_predictions %>%
  count(diabetes, pred = predict) %>%   # cross-tabulate actual vs. predicted labels
  mutate(label = "h2o", .before = 1) %>%
  kable()
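As a complement (a yardstick-based sketch, not part of the original pipeline), conf_mat() builds the same cross-tabulation as a proper confusion-matrix object and can render it directly:

h2o_cm <- yardstick::conf_mat(h2o_predictions, truth = diabetes, estimate = predict)
autoplot(h2o_cm, type = "heatmap")   # heatmap rendering of the confusion matrix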

The lares package provides an elegant wrapper around the H2O AutoML functions:

library(lares)

# h2o_automl() expects a plain data.frame and performs its own train/test split
r <- h2o_automl(df, y = diabetes, max_models = 10, impute = FALSE, target = "pos")

You can extract the feature importance:

head(r$importance) %>% kable()

| variable | relative_importance | scaled_importance | importance |
|---|---|---|---|
| glucose | 46.2070580 | 1.0000000 | 0.4943360 |
| age | 21.4305954 | 0.4637948 | 0.2292705 |
| insulin | 14.2493601 | 0.3083806 | 0.1524436 |
| triceps | 6.5016131 | 0.1407061 | 0.0695561 |
| pedigree | 5.0137486 | 0.1085061 | 0.0536385 |
| mass | 0.0706036 | 0.0015280 | 0.0007553 |

And the metrics for the performance:

r$metrics %>% kable()
Metric definitions:

  • AUC: Area Under the Curve
  • ACC: Accuracy
  • PRC: Precision = Positive Predictive Value
  • TPR: Sensitivity = Recall = Hit rate = True Positive Rate
  • TNR: Specificity = Selectivity = True Negative Rate
  • Logloss (Error): Logarithmic loss [Neutral classification: 0.69315]
  • Gain: When the best n deciles are selected, what % of the real target observations are picked?
  • Lift: When the best n deciles are selected, how much better than random is the result?

Confusion matrix:

| | neg | pos |
|---|---|---|
| neg | 8 | 47 |
| pos | 20 | 14 |

Gains and lift by decile:

| percentile | value | random | target | total | gain | optimal | lift | response | score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | pos | 13.48315 | 10 | 12 | 29.41176 | 35.29412 | 118.13725 | 29.411765 | 61.522961 |
| 2 | pos | 21.34831 | 6 | 7 | 47.05882 | 55.88235 | 120.43344 | 17.647059 | 59.248090 |
| 3 | pos | 30.33708 | 3 | 8 | 55.88235 | 79.41176 | 84.20479 | 8.823529 | 51.900488 |
| 4 | pos | 40.44944 | 5 | 9 | 70.58824 | 100.00000 | 74.50980 | 14.705882 | 33.315837 |
| 5 | pos | 52.80899 | 4 | 11 | 82.35294 | 100.00000 | 55.94493 | 11.764706 | 20.519447 |
| 6 | pos | 59.55056 | 2 | 6 | 88.23529 | 100.00000 | 48.16870 | 5.882353 | 16.922080 |
| 7 | pos | 70.78652 | 2 | 10 | 94.11765 | 100.00000 | 32.95985 | 5.882353 | 11.298031 |
| 8 | pos | 79.77528 | 2 | 8 | 100.00000 | 100.00000 | 25.35211 | 5.882353 | 7.915044 |
| 9 | pos | 89.88764 | 0 | 9 | 100.00000 | 100.00000 | 11.25000 | 0.000000 | 6.623036 |
| 10 | pos | 100.00000 | 0 | 9 | 100.00000 | 100.00000 | 0.00000 | 0.000000 | 5.608875 |

Global metrics:

| AUC | ACC | PRC | TPR | TNR |
|---|---|---|---|---|
| 0.82834 | 0.24719 | 0.22951 | 0.41176 | 0.14545 |

Cross-validation summary:

| metric | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| accuracy | 0.7951220 | 0.0612190 | 0.7804878 | 0.7073170 | 0.8536586 | 0.7804878 | 0.8536586 |
| auc | 0.7998957 | 0.0351090 | 0.8269231 | 0.7746212 | 0.8333333 | 0.8125000 | 0.7521008 |
| err | 0.2048780 | 0.0612190 | 0.2195122 | 0.2926829 | 0.1463415 | 0.2195122 | 0.1463415 |
| err_count | 8.4000000 | 2.5099800 | 9.0000000 | 12.0000000 | 6.0000000 | 9.0000000 | 6.0000000 |
| f0point5 | 0.6316909 | 0.1223794 | 0.6470588 | 0.4545455 | 0.7446808 | 0.7407407 | 0.5714286 |
| f1 | 0.6559614 | 0.0777845 | 0.7096774 | 0.5714286 | 0.7000000 | 0.7272728 | 0.5714286 |
| f2 | 0.7002074 | 0.0872414 | 0.7857143 | 0.7692308 | 0.6603774 | 0.7142857 | 0.5714286 |
| lift_top_group | 1.7653105 | 1.0404958 | 2.5230770 | 1.7083334 | 2.4848485 | 2.1102940 | 0.0000000 |
| logloss | 0.4915730 | 0.0841900 | 0.5208641 | 0.4600092 | 0.4427218 | 0.6239158 | 0.4103542 |
| max_per_class_error | 0.3399924 | 0.0692233 | 0.2500000 | 0.3636364 | 0.3636364 | 0.2941177 | 0.4285714 |
| mcc | 0.5402240 | 0.0493520 | 0.5589962 | 0.5045250 | 0.6098242 | 0.5445812 | 0.4831933 |
| mean_per_class_accuracy | 0.7824624 | 0.0289855 | 0.7980769 | 0.8181818 | 0.7848485 | 0.7696078 | 0.7415966 |
| mean_per_class_error | 0.2175377 | 0.0289855 | 0.2019231 | 0.1818182 | 0.2151515 | 0.2303922 | 0.2584034 |
| mse | 0.1628366 | 0.0366396 | 0.1757022 | 0.1505909 | 0.1425439 | 0.2198379 | 0.1255082 |
| pr_auc | 0.5545080 | 0.1821300 | 0.6685007 | 0.3423082 | 0.6143393 | 0.7612409 | 0.3861512 |
| precision | 0.6220635 | 0.1521601 | 0.6111111 | 0.4000000 | 0.7777778 | 0.7500000 | 0.5714286 |
| r2 | 0.1422753 | 0.0905973 | 0.1885841 | 0.0411239 | 0.2738899 | 0.0942462 | 0.1135321 |
| recall | 0.7519657 | 0.1721000 | 0.8461539 | 1.0000000 | 0.6363636 | 0.7058824 | 0.5714286 |
| rmse | 0.4015838 | 0.0442591 | 0.4191685 | 0.3880604 | 0.3775499 | 0.4688688 | 0.3542714 |
| specificity | 0.8129590 | 0.1222880 | 0.7500000 | 0.6363636 | 0.9333333 | 0.8333333 | 0.9117647 |

Maximum metrics and their thresholds:

| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.2783620 | 0.6285714 | 36 |
| max f2 | 0.2049758 | 0.7558140 | 56 |
| max f0point5 | 0.3983419 | 0.5921053 | 16 |
| max accuracy | 0.3983419 | 0.7804878 | 16 |
| max precision | 0.3983419 | 0.6279070 | 16 |
| max recall | 0.1054178 | 1.0000000 | 86 |
| max specificity | 0.6115015 | 0.9731544 | 0 |
| max absolute_mcc | 0.2783620 | 0.4686644 | 36 |
| max min_per_class_accuracy | 0.2950617 | 0.7382550 | 33 |
| max mean_per_class_accuracy | 0.2783620 | 0.7586290 | 36 |
| max tns | 0.6115015 | 145.0000000 | 0 |
| max fns | 0.6115015 | 54.0000000 | 0 |
| max fps | 0.0867271 | 149.0000000 | 88 |
| max tps | 0.1054178 | 56.0000000 | 86 |
| max tnr | 0.6115015 | 0.9731544 | 0 |
| max fnr | 0.6115015 | 0.9642857 | 0 |
| max fpr | 0.0867271 | 1.0000000 | 88 |
| max tpr | 0.1054178 | 1.0000000 | 86 |

A general plot of the performance is obtained as follows:

plot(r)

As well as specific plots:

r$plots$metrics
The returned list contains the $gains, $response, $conf_matrix, and $ROC plots.
