Configuration¶
The main difference between the use case of regular users compared to advanced users is reflected in the configuration of the pipeline.
Conceptually, a pipeline is composed from a modular composition of steps. The ones available are the following:
Data: Responsible for raw data management
Model: Responsible for model management
Benchmark: Responsible for searching the best model configuration
The configuration should be done through a YAML file, which contains all the pipeline and steps options that are considered.
Global¶
First of all, it is necessary to specify global options, which are the following:
problem_type: The kind of problem that determines the model/s to use. Valid values are:
classification
,regression
.
global:
problem_type:
Afterwards, the steps configuration is provided, which is detailed below.
Data¶
This step is made up of the following ETL phases:
extract: Read the raw data.
transform: Apply a set of data transformations, e.g. normalization, rename columns, etc.
load: Save the processed dataset.
Example of data step in a pipeline file:
data:
extract:
filepath: data/raw/dataset.csv
features:
- bedrooms
- bathrooms
- sqft_living
target: price
transform:
normalize:
features:
module: sklearn.preprocessing.StandardScaler
target:
module: sklearn.preprocessing.StandardScaler
load:
filepath: data/processed/dataset.
Extract¶
In the extract phase the possible configurations are the following:
filepath (str, optional): File path of dataset to use. It is an optional param. The default value is
data/raw/dataset.csv
.target (str): Column name as a target.
features (list, optional): Set of columns to process. It is an optional param. When features is not indicated, the process uses all columns.
Examples:¶
In the following example, the framework reads dataset from
data/raw/boston_dataset.csv
. Also, it gets price
column as a target.
In addition, it uses bedrooms
, bathrooms
, sqft_living
as features.
extract:
filepath: data/raw/boston_dataset.csv
features:
- bedrooms
- bathrooms
- sqft_living
target: price
Transform¶
In this phase the possible transformations are the following:
Encoding¶
The parameter encoding (dict, optional) defines the dataset encoding for categorical features through One Hot Encoding (OHE) method. This is defined through two additional parameters within the encoding dictionary, which are OHE (boolean), which determines whether to apply the encoding to features or not, and features (list/str, optional), that specify the feature or features to apply OHE to; if empty and OHE is True, OHE would be applied to all categorical features from the dataset, except for the target.
Examples¶
The simplest configuration is the following, which means a OHE for all the categorical features:
transform:
The following example, the framework applies OHE for all categorical features:
transform:
encoding:
OHE: True
The opposite case would be to not apply OHE to any feature:
transform:
encoding:
OHE: False
Another example is then OHE should be applied only to specific features. In the following example, OHE is applied only to columns column_a and column_b.
transform:
encoding:
OHE: True
features:
column_a
column_b
Normalize¶
The parameter normalize (dict, optional) defines the dataset
normalization. It is possible to normalize nothing, features, target or
both. With features parameter, it defines which normalization apply to
features. Furthermore, with target parameter, it defines the target
normalization. If the transform step contains an empty normalize key, it
uses a sklearn.preprocessing.StandardScaler
for features and target as
default. On the other hand, if normalize key does not exist, no
normalization is applied. If only features or target (but not both) are to be
normalized, empty settings should be provided for the part that does not
require normalization.
target (list, optional): Column name as a target. It is an optional param. The default value is
target
.features (list, optional): Set of columns to process. It is an optional param. When empty, the process uses all columns.
For any of the previous mentioned, there are three children keys:
module (str, optional): Normalization module to apply. Right now,
sklean.preprocessing.StandardScaler
is the only one supported.params (dict, optional): Specific parameters of the previous module. Should be specified as key-value pairs.
columns (list, optional): Columns to be considered for normalization. By default, all of features/target if empty, depending on the context.
Examples¶
The simplest configuration is the following, which means no normalization:
transform:
In the example below, the framework applies a default normalization parameters
(sklearn.preprocessing.StandardScaler
for both target and features).
transform:
normalize:
If only features or target are to be normalized, just an empty module should be provided for target.
transform:
normalize:
target:
module:
In the example below, the framework uses a
sklearn.preprocessing.StandardScaler
for normalize target and
features. In the case of features, normalization is applied considering std,
and only for columns named column_a and column_b.
transform:
normalize:
features:
module: sklearn.preprocessing.StandardScaler
params:
with_std: True
columns:
- column_a
- column_b
Load¶
In load phase the possible configurations are the following:
filepath (str, optional): file path to store processed dataset.
Examples¶
The simplest configuration is the following:
load:
When load phase is empty, the framework does not save the processed dataset.
The following example, the framework stores the processed data in
data/processed/dataset.csv
.
load:
filepath: data/processed/dataset.csv
Model¶
This step is responsible for model management.
It is made up for the following ETL phases:
extract: The purpose of this phase is to read a previously saved model.
transform: This phase applies the common model functions: training, testing and cross-validation.
load: Saves the initialized model.
Extract¶
In extract phase the possible configurations are the following:
filepath (str [1]): File path of model to read.
Transform¶
This phase applies the common model functions: fit, predict and cross-validation. The available configurations are the following:
fit (dict [2]): Requests a model training on the current dataset. It may have the following additional information:
estimator (dict, optional): Sppecifies the estimator and its hyperparameters. Consists of the following:
module (str, optional): Learner module to use.
params (dict, optional): Additional parameters to pass to module class.
Available models are the ones available from sklearn, and of course just the ones related to the problem type specified. Default models are
sklearn.ensemble.RandomForestRegressor
for regression andsklearn.ensemble.RandomForestClassifier
for classification problems, both withn_estimator
equal to 100.cross_validation (dict, optional): Defines which cross-validation strategy to use for training the model. Dictionary may have the following keys:
module (str, optional): Cross validation module to use.
params (dict, optional): Additional parameters to pass to module class.
Any cross validation method in sklearn cross-validation should work, provided that it follows their consistent structure. Default:
sklearn.model_selection.KFold
with 3 splits.metrics (list): a list of metrics to evaluate the model. Any metric that exists in sklearn.metrics is allowed, of course that apply to the problem type; only the function name is required. Default values are
mean_squared_error
,mean_absolute_percentage_error
,median_absolute_error
,mean_absolute_error
,root_mean_squared_error
for regression problems andaccuracy_score
,precision_score
,recall_score
,specificity_score
,f1_score
androc_auc_score
for classification problems.It is even possible to define custom metrics. For this, what is needed just to define a function named
compute_{metric_name}_metric
in the filehoncaml/models/evaluate.py
, being {metric_name} the name of the metric, and having as input parameters the series of true values, and the series of predicted ones, in this order (there are already a couple of examples). Then, it is just a matter of include the metric name in the configuration.Both options have the possibility to pass additional parameters to the metric function, by specifying the metric as a dictionary instead of a single string. The dictionary key would be the metric name, whereas its values would refer to function parameters.
predict (dict [3]): Requests to run predictions over the dataset.
path (str, optional): Directory where the predictions will be stored. Default value:
data/processed
.
Compulsory for fit pipelines, and excluded for predict pipelines. Related to benchmark pipelines, see the details in Benchmark.
Compulsory for predict pipelines, and excluded for the rest of pipeline types.
Examples¶
The following snippet shows an example of an advanced model transform definition:
transform:
fit:
estimator:
module: sklearn.ensemble.RandomForestRegressor
params:
n_estimators: 100
cross_validation:
module: sklearn.model_selection.KFold
params:
n_splits: 2
Deep learning models¶
Deep learning models implemented in torch require a specific format, different from sklearn based models or similar, in which parameters are passed directly when instantiating the model class.
First of all, module key should have just as value torch
in order to
indicate that a neural net will be used as estimator. Within the params
key, the following keys should be specified [4]:
epochs (int): Number of training epochs.
layers (list) Layers configuration; the structure of each one is:
module (str): Layer module to use.
params (dict, optional [5]): Additional parameters to pass to layers.
In the case of linear layers, as the parameter in_features is dependent on previous layers, only out_features is required; however, if the last layer of the neural net is another linear layer, no out_features should be provided, as dimension will be inferred from targets.
loader: (dict): Specifies data loader options to use. Internal keys:
batch_size (int): Number of rows to consider for each batch.
shuffle (bool): Whether to shuffle data at every epoch.
loss (dict): Loss to consider; requires the following:
module (str): Loss module to use.
params (dict, optional): Additional parameters to pass to module.
optimizer (dict): Optimizer to consider; requires the following:
module (str): Optimizer module to use.
params (dict, optional): Additional parameters to pass to module.
An example of a training configuration for a deep learning model would be:
model:
transform:
fit:
estimator:
module: torch
params:
epochs: 3
layers:
- module: torch.nn.Linear
params:
out_features: 64
- module: torch.nn.ReLU
- module: torch.nn.Linear
params:
out_features: 32
- module: torch.nn.Dropout
- module: torch.nn.Linear
loader:
batch_size: 20
shuffle: True
loss:
module: torch.nn.MSELoss
optimizer:
module: torch.optim.SGD
params:
lr: 0.001
momentum: 0.9
All options are required for training and benchmark pipelines, whereas dataloader is the only one required by predict pipelines.
Optional for all layer types except for linear ones, except for the last layer if it is linear.
Load¶
In load phase the possible configurations are the following:
filepath (str, required): Directory and file name where the model will be saved. If the user specifies the file name as
{autogenerate}.sav
, the filename is generated by the framework following the following convention:{model_type}.{execution_id}.sav
Otherwise, if the user specifies a custom name, the file is saved with that name. The supported formats for saving a model include the extension.sav
results (str, [6]): Directory where to store training cross validation results; generated file will have the following format:
{results}/{execution_id}/results.csv
. If not set, results will not be exported.
Optional for train pipelines, and excluded for the rest of pipeline types.
Benchmark¶
This step is responsible for searching the best model configuration.
It is made up for the following ETL phases:
transform: this phase runs an hyperparamater search algorithm for each specified model. Furthermore, it gets the best model configuration.
load: it saves the best configuration into a yaml file.
Apart from obtaining the best model configuration, it is possible to train the best model through appending a model key after the benchmark step, taking advantage of the modular definition of the solution:
global:
problem_type: regression
steps:
data:
extract:
filepath: {Input data}
target: {Target}
benchmark:
transform:
load:
path: {gReports path}
model:
transform:
fit:
load:
path: {Path to store best model}
Transform¶
This phase runs an hyperparameter search algorithm for each model defined in pipeline file. Furthermore, the user can define a set of metrics to evaluate the experiments, the model’s hyperparamaters to tune, the strategy to split train and test data and parameters of search algorithm.
The available configurations are the following:
models (dict, optional): Dictionary of models and hyperparameters to search for best configuration. Each entry of the list refers to a model to benchmark. Keys should be the following:
{model_name} (dict, optional): Name of model module, e.g.
sklearn.ensemble.RandomForestRegressor
.
Within each module, there should be as many keys as model parameters to search:
{hyperparameter} (dict, optional): Name of hyperparameter, e.g.
n_estimators
. Within each hyperparameter, the following needs to be specified:method (str, optional): Method to consider for searching hyperparameter values.
values (tuple/list, optional): Values to consider for hyperparameter search, passed to specified method.
Available methods and value parameters are defined in the search space. The default models and hyperparameters for each type of problem are defined at honcaml/config/defaults/search_spaces.py.
In case of deep learning models, the name of the model to use is
torch
, and there is a specific chapter to detail the required configuration in Deep learning benchmark.cross_validation (dict, optional): defines which cross-validation strategy to use for training each model. Dictionary may have the following keys:
module (str, optional): Cross validation module to use.
params (dict, optional): Additional parameters to pass to module class.
Any cross validation method in sklearn cross-validation should work, provided that it follows their consistent structure. Default:
sklearn.model_selection.KFold
with 3 splits.metrics (list/str, optional): a list of metrics to report in the benchmark process, or a single one. Actually, reported metrics may be appended with the one specified in tuner settings, if the latter is different (as it is the one used to select the best model configuration). Any metric that exists in sklearn.metrics is allowed, of course that apply to the problem type; only the function name is required. Default values are
mean_squared_error
,mean_absolute_percentage_error
,median_absolute_error
,mean_absolute_error
,root_mean_squared_error
for regression problems andaccuracy_score
,precision_score
,recall_score
,specificity_score
,f1_score
androc_auc_score
for classification problems.It is even possible to define custom metrics. For this, what is needed just to define a function named
compute_{metric_name}_metric
in the filehoncaml/models/evaluate.py
, being {metric_name} the name of the metric, and having as input parameters the series of true values, and the series of predicted ones, in this order (there are already a couple of examples). Then, it is just a matter of include the metric name in the configuration.tuner (dict): defines the configuration of tune process. Their options are the following:
search_algorithm (dict, optional): Specifies the algorithm to perform the search. Consists of the following:
module (str, optional): Algorithm module to use.
params (dict, optional): Additional parameters to pass to module class.
For all available options, see the search algorithms documentation. Default is
ray.tune.search.optuna.OptunaSearch
.tune_config (dict, optional): Parameters to pass to tuner config object, specified as key-value pairs. For available options, see TuneConfig documentation.
run_config (dict, optional): Parameters to be used during run, specified as key-value pairs. For available options, see RunConfig documentation.
scheduler (dict, optional): Allows to define different strategies during the search process. Consists of the following:
module (str, optional): Algorithm module to use.
params (dict, optional): Additional parameters to pass to module class.
For all available options, see schedulers documentation.
Examples¶
The following snippet shows an example of an advanced benchmark transform definition:
metrics:
- mean_squared_error
- mean_absolute_error
- root_mean_square_error
models:
sklearn.ensemble.RandomForestRegressor:
n_estimators:
method: randint
values: [2, 110]
max_features:
method: choice
values: [sqrt, log2, 1]
sklearn.linear_model.LinearRegression:
fit_intercept:
method: choice
values: [True, False]
cross_validation:
module: sklearn.model_selection.KFold
params:
n_splits: 2
tuner:
search_algorithm:
module: ray.tune.search.optuna.OptunaSearch
tune_config:
num_samples: 5
metric: root_mean_squared_error
mode: min
run_config:
stop:
training_iteration: 2
scheduler:
module: ray.tune.schedulers.HyperBandScheduler
Deep learning benchmark¶
Deep learning models, in a benchmark pipeline, require a specific format, due to the fact that models require a custom format as well (it is advisable to review their structure in Deep learning models). The main structure should be the same:
epochs (dict): Typical keys method (with value
randint
) and values should be specified.layers (dict) Layer structure to benchmark; this key is the only one with a completely different structure than specified in deep learning models; this is because the approach for benchmarking them is through what are called blocks. Blocks are a predefined combination of layers that will be shuffled with a specific layer to generate combinations to benchmark. For example, one block could be a linear layer + rectified linear unit, and another one could be a dropout layer. The required structure is the following:
number_blocks (list): List of two values, which is the minimum and maximum number of blocks considered for the models.
types (list): List of strings that specify succession of layer types to be considered as blocks, assuming that their names are contained within torch nn module. Blocks that contain a sequence of layers should join their names with the symbol
+
.params (dict, optional): In case some layer types require specific parameters to be benchmarked, they should be informed within this key. The structure to follow is the following:
{layer name} (str): Layer name, as specified in types.
{parameter name} (str): Name of parameter to be benchmarked. Its internal structure should have the typical benchmark structure, method and values.
loader: (dict): Should still have both keys, batch_size and shuffle, and each of them follow the standard benchmark structure (method and values).
loss (dict): Loss to consider; requires the following:
method (str): Should be equal to
choice
.values (list): For each possible option to consider, specify the following:
module (str): Loss module.
params (dict, optional): Parameters to benchmark for the specific module, in case there are any. Each of them should have the standard structure method and values.
optimizer (dict): Optimizer to consider; requires the following:
method (str): Should be equal to
choice
.values (list): For each possible option to consider, specify the following:
module (str): Optimizer module.
params (dict, optional): Parameters to benchmark for the specific module, in case there are any. Each of them should have the standard structure method and values.
An example of a benchmark configuration for deep learning models would be:
benchmark:
transform:
models:
torch:
epochs:
method: randint
values: [2, 5]
layers:
number_blocks: [3, 6]
types:
- Linear + ReLU
- Dropout
params:
Dropout:
p:
method: uniform
values: [0.4, 0.6]
loader:
batch_size:
method: randint
values: [20, 40]
shuffle:
method: choice
values:
- True
- False
loss:
method: choice
values:
- module: torch.nn.MSELoss
- module: torch.nn.L1Loss
optimizer:
method: choice
values:
- module: torch.optim.SGD
params:
lr:
method: loguniform
values: [0.001, 0.01]
momentum:
method: uniform
values: [0.5, 1]
- module: torch.optim.Adam
params:
lr:
method: loguniform
values: [0.001, 0.1]
eps:
method: loguniform
values: [0.0000001, 0.00001]
Load¶
In load phase the possible configurations are the following:
path (str): Folder in which to store benchmark results.
save_best_config_params (bool, optional): Whether to store a yaml file with best model configuration or not, within specified path.