Configuration¶

The main difference between the use case of regular users compared to advanced users is reflected in the configuration of the pipeline.

Conceptually, a pipeline is composed from a modular composition of steps. The ones available are the following:

Data: Responsible for raw data management
Model: Responsible for model management
Benchmark: Responsible for searching the best model configuration

The configuration should be done through a YAML file, which contains all the pipeline and steps options that are considered.

Global¶

First of all, it is necessary to specify global options, which are the following:

problem_type: The kind of problem that determines the model/s to use. Valid values are: classification, regression.

global:
    problem_type:

Afterwards, the steps configuration is provided, which is detailed below.

Data¶

This step is made up of the following ETL phases:

extract: Read the raw data.
transform: Apply a set of data transformations, e.g. normalization, rename columns, etc.
load: Save the processed dataset.

Example of data step in a pipeline file:

data:
    extract:
      filepath: data/raw/dataset.csv
      features:
        - bedrooms
        - bathrooms
        - sqft_living
      target: price

    transform:
      normalize:
        features:
          module: sklearn.preprocessing.StandardScaler
        target:
          module: sklearn.preprocessing.StandardScaler

    load:
        filepath: data/processed/dataset.

Extract¶

In the extract phase the possible configurations are the following:

filepath (str, optional): File path of dataset to use. It is an optional param. The default value is data/raw/dataset.csv.
target (str): Column name as a target.
features (list, optional): Set of columns to process. It is an optional param. When features is not indicated, the process uses all columns.

Examples:¶

In the following example, the framework reads dataset from data/raw/boston_dataset.csv. Also, it gets price column as a target. In addition, it uses bedrooms, bathrooms, sqft_living as features.

extract:
  filepath: data/raw/boston_dataset.csv
  features:
    - bedrooms
    - bathrooms
    - sqft_living
  target: price

Transform¶

In this phase the possible transformations are the following:

Encoding¶

The parameter encoding (dict, optional) defines the dataset encoding for categorical features through One Hot Encoding (OHE) method. This is defined through two additional parameters within the encoding dictionary, which are OHE (boolean), which determines whether to apply the encoding to features or not, and features (list/str, optional), that specify the feature or features to apply OHE to; if empty and OHE is True, OHE would be applied to all categorical features from the dataset, except for the target.

Examples¶

The simplest configuration is the following, which means a OHE for all the categorical features:

transform:

The following example, the framework applies OHE for all categorical features:

transform:
  encoding:
   OHE: True

The opposite case would be to not apply OHE to any feature:

transform:
  encoding:
   OHE: False

Another example is then OHE should be applied only to specific features. In the following example, OHE is applied only to columns column_a and column_b.

transform:
  encoding:
   OHE: True
   features:
     column_a
     column_b

Normalize¶

The parameter normalize (dict, optional) defines the dataset normalization. It is possible to normalize nothing, features, target or both. With features parameter, it defines which normalization apply to features. Furthermore, with target parameter, it defines the target normalization. If the transform step contains an empty normalize key, it uses a sklearn.preprocessing.StandardScaler for features and target as default. On the other hand, if normalize key does not exist, no normalization is applied. If only features or target (but not both) are to be normalized, empty settings should be provided for the part that does not require normalization.

target (list, optional): Column name as a target. It is an optional param. The default value is target.
features (list, optional): Set of columns to process. It is an optional param. When empty, the process uses all columns.

For any of the previous mentioned, there are three children keys:

module (str, optional): Normalization module to apply. Right now, sklean.preprocessing.StandardScaler is the only one supported.
params (dict, optional): Specific parameters of the previous module. Should be specified as key-value pairs.
columns (list, optional): Columns to be considered for normalization. By default, all of features/target if empty, depending on the context.

Examples¶

The simplest configuration is the following, which means no normalization:

transform:

In the example below, the framework applies a default normalization parameters (sklearn.preprocessing.StandardScaler for both target and features).

transform:
  normalize:

If only features or target are to be normalized, just an empty module should be provided for target.

transform:
  normalize:
    target:
      module:

In the example below, the framework uses a sklearn.preprocessing.StandardScaler for normalize target and features. In the case of features, normalization is applied considering std, and only for columns named column_a and column_b.

transform:
  normalize:
    features:
      module: sklearn.preprocessing.StandardScaler
      params:
        with_std: True
        columns:
          - column_a
          - column_b

Load¶

In load phase the possible configurations are the following:

filepath (str, optional): file path to store processed dataset.

Examples¶

The simplest configuration is the following:

load:

When load phase is empty, the framework does not save the processed dataset.

The following example, the framework stores the processed data in data/processed/dataset.csv.

load:
  filepath: data/processed/dataset.csv

Model¶

This step is responsible for model management.

It is made up for the following ETL phases:

extract: The purpose of this phase is to read a previously saved model.
transform: This phase applies the common model functions: training, testing and cross-validation.
load: Saves the initialized model.

Extract¶

In extract phase the possible configurations are the following:

filepath (str [1]): File path of model to read.

Transform¶

This phase applies the common model functions: fit, predict and cross-validation. The available configurations are the following:

fit (dict [2]): Requests a model training on the current dataset. It may have the following additional information:
- estimator (dict, optional): Sppecifies the estimator and its hyperparameters. Consists of the following:
  - module (str, optional): Learner module to use.
  - params (dict, optional): Additional parameters to pass to module class.
  Available models are the ones available from sklearn, and of course just the ones related to the problem type specified. Default models are sklearn.ensemble.RandomForestRegressor for regression and sklearn.ensemble.RandomForestClassifier for classification problems, both with n_estimator equal to 100.
- cross_validation (dict, optional): Defines which cross-validation strategy to use for training the model. Dictionary may have the following keys:
  - module (str, optional): Cross validation module to use.
  - params (dict, optional): Additional parameters to pass to module class.
  Any cross validation method in sklearn cross-validation should work, provided that it follows their consistent structure. Default: sklearn.model_selection.KFold with 3 splits.
- metrics (list): a list of metrics to evaluate the model. Any metric that exists in sklearn.metrics is allowed, of course that apply to the problem type; only the function name is required. Default values are mean_squared_error, mean_absolute_percentage_error, median_absolute_error, mean_absolute_error, root_mean_squared_error for regression problems and accuracy_score, precision_score, recall_score, specificity_score, f1_score and roc_auc_score for classification problems.
  
  It is even possible to define custom metrics. For this, what is needed just to define a function named compute_{metric_name}_metric in the file honcaml/models/evaluate.py, being {metric_name} the name of the metric, and having as input parameters the series of true values, and the series of predicted ones, in this order (there are already a couple of examples). Then, it is just a matter of include the metric name in the configuration.
  
  Both options have the possibility to pass additional parameters to the metric function, by specifying the metric as a dictionary instead of a single string. The dictionary key would be the metric name, whereas its values would refer to function parameters.
predict (dict [3]): Requests to run predictions over the dataset.
- path (str, optional): Directory where the predictions will be stored. Default value: data/processed.

Examples¶

The following snippet shows an example of an advanced model transform definition:

transform:
  fit:
    estimator:
      module: sklearn.ensemble.RandomForestRegressor
      params:
        n_estimators: 100
    cross_validation:
      module: sklearn.model_selection.KFold
      params:
        n_splits: 2

Deep learning models¶

Deep learning models implemented in torch require a specific format, different from sklearn based models or similar, in which parameters are passed directly when instantiating the model class.

First of all, module key should have just as value torch in order to indicate that a neural net will be used as estimator. Within the params key, the following keys should be specified [4]:

epochs (int): Number of training epochs.
layers (list) Layers configuration; the structure of each one is:
- module (str): Layer module to use.
- params (dict, optional [5]): Additional parameters to pass to layers.
In the case of linear layers, as the parameter in_features is dependent on previous layers, only out_features is required; however, if the last layer of the neural net is another linear layer, no out_features should be provided, as dimension will be inferred from targets.
loader: (dict): Specifies data loader options to use. Internal keys:
- batch_size (int): Number of rows to consider for each batch.
- shuffle (bool): Whether to shuffle data at every epoch.
loss (dict): Loss to consider; requires the following:
- module (str): Loss module to use.
- params (dict, optional): Additional parameters to pass to module.
optimizer (dict): Optimizer to consider; requires the following:
- module (str): Optimizer module to use.
- params (dict, optional): Additional parameters to pass to module.

An example of a training configuration for a deep learning model would be:

model:
  transform:
    fit:
      estimator:
        module: torch
        params:
          epochs: 3
          layers:
            - module: torch.nn.Linear
              params:
                out_features: 64
            - module: torch.nn.ReLU
            - module: torch.nn.Linear
              params:
                out_features: 32
            - module: torch.nn.Dropout
            - module: torch.nn.Linear
          loader:
            batch_size: 20
            shuffle: True
          loss:
            module: torch.nn.MSELoss
          optimizer:
            module: torch.optim.SGD
            params:
              lr: 0.001
              momentum: 0.9

Load¶

In load phase the possible configurations are the following:

filepath (str, required): Directory and file name where the model will be saved. If the user specifies the file name as {autogenerate}.sav, the filename is generated by the framework following the following convention: {model_type}.{execution_id}.sav Otherwise, if the user specifies a custom name, the file is saved with that name. The supported formats for saving a model include the extension .sav
results (str, [6]): Directory where to store training cross validation results; generated file will have the following format: {results}/{execution_id}/results.csv. If not set, results will not be exported.

Benchmark¶

This step is responsible for searching the best model configuration.

It is made up for the following ETL phases:

transform: this phase runs an hyperparamater search algorithm for each specified model. Furthermore, it gets the best model configuration.
load: it saves the best configuration into a yaml file.

Apart from obtaining the best model configuration, it is possible to train the best model through appending a model key after the benchmark step, taking advantage of the modular definition of the solution:

 global:
   problem_type: regression

 steps:
   data:
     extract:
       filepath: {Input data}
       target: {Target}

benchmark:
  transform:
  load:
    path: {gReports path}

model:
  transform:
    fit:
  load:
    path: {Path to store best model}

Transform¶

This phase runs an hyperparameter search algorithm for each model defined in pipeline file. Furthermore, the user can define a set of metrics to evaluate the experiments, the model’s hyperparamaters to tune, the strategy to split train and test data and parameters of search algorithm.

The available configurations are the following:

models (dict, optional): Dictionary of models and hyperparameters to search for best configuration. Each entry of the list refers to a model to benchmark. Keys should be the following:
- {model_name} (dict, optional): Name of model module, e.g. sklearn.ensemble.RandomForestRegressor.
Within each module, there should be as many keys as model parameters to search:
- {hyperparameter} (dict, optional): Name of hyperparameter, e.g. n_estimators. Within each hyperparameter, the following needs to be specified:
  - method (str, optional): Method to consider for searching hyperparameter values.
  - values (tuple/list, optional): Values to consider for hyperparameter search, passed to specified method.
Available methods and value parameters are defined in the search space. The default models and hyperparameters for each type of problem are defined at honcaml/config/defaults/search_spaces.py.

In case of deep learning models, the name of the model to use is torch, and there is a specific chapter to detail the required configuration in Deep learning benchmark.
cross_validation (dict, optional): defines which cross-validation strategy to use for training each model. Dictionary may have the following keys:
- module (str, optional): Cross validation module to use.
- params (dict, optional): Additional parameters to pass to module class.
Any cross validation method in sklearn cross-validation should work, provided that it follows their consistent structure. Default: sklearn.model_selection.KFold with 3 splits.
metrics (list/str, optional): a list of metrics to report in the benchmark process, or a single one. Actually, reported metrics may be appended with the one specified in tuner settings, if the latter is different (as it is the one used to select the best model configuration). Any metric that exists in sklearn.metrics is allowed, of course that apply to the problem type; only the function name is required. Default values are mean_squared_error, mean_absolute_percentage_error, median_absolute_error, mean_absolute_error, root_mean_squared_error for regression problems and accuracy_score, precision_score, recall_score, specificity_score, f1_score and roc_auc_score for classification problems.

It is even possible to define custom metrics. For this, what is needed just to define a function named compute_{metric_name}_metric in the file honcaml/models/evaluate.py, being {metric_name} the name of the metric, and having as input parameters the series of true values, and the series of predicted ones, in this order (there are already a couple of examples). Then, it is just a matter of include the metric name in the configuration.
tuner (dict): defines the configuration of tune process. Their options are the following:
- search_algorithm (dict, optional): Specifies the algorithm to perform the search. Consists of the following:
  - module (str, optional): Algorithm module to use.
  - params (dict, optional): Additional parameters to pass to module class.
For all available options, see the search algorithms documentation. Default is ray.tune.search.optuna.OptunaSearch.
- tune_config (dict, optional): Parameters to pass to tuner config object, specified as key-value pairs. For available options, see TuneConfig documentation.
- run_config (dict, optional): Parameters to be used during run, specified as key-value pairs. For available options, see RunConfig documentation.
- scheduler (dict, optional): Allows to define different strategies during the search process. Consists of the following:
  - module (str, optional): Algorithm module to use.
  - params (dict, optional): Additional parameters to pass to module class.
For all available options, see schedulers documentation.

Examples¶

The following snippet shows an example of an advanced benchmark transform definition:

metrics:
  - mean_squared_error
  - mean_absolute_error
  - root_mean_square_error
models:
  sklearn.ensemble.RandomForestRegressor:
    n_estimators:
      method: randint
      values: [2, 110]
    max_features:
      method: choice
      values: [sqrt, log2, 1]
  sklearn.linear_model.LinearRegression:
    fit_intercept:
      method: choice
      values: [True, False]
cross_validation:
  module: sklearn.model_selection.KFold
  params:
    n_splits: 2
tuner:
  search_algorithm:
    module: ray.tune.search.optuna.OptunaSearch
  tune_config:
    num_samples: 5
    metric: root_mean_squared_error
    mode: min
  run_config:
    stop:
      training_iteration: 2
  scheduler:
    module: ray.tune.schedulers.HyperBandScheduler

Deep learning benchmark¶

Deep learning models, in a benchmark pipeline, require a specific format, due to the fact that models require a custom format as well (it is advisable to review their structure in Deep learning models). The main structure should be the same:

epochs (dict): Typical keys method (with value randint) and values should be specified.
layers (dict) Layer structure to benchmark; this key is the only one with a completely different structure than specified in deep learning models; this is because the approach for benchmarking them is through what are called blocks. Blocks are a predefined combination of layers that will be shuffled with a specific layer to generate combinations to benchmark. For example, one block could be a linear layer + rectified linear unit, and another one could be a dropout layer. The required structure is the following:
- number_blocks (list): List of two values, which is the minimum and maximum number of blocks considered for the models.
- types (list): List of strings that specify succession of layer types to be considered as blocks, assuming that their names are contained within torch nn module. Blocks that contain a sequence of layers should join their names with the symbol +.
- params (dict, optional): In case some layer types require specific parameters to be benchmarked, they should be informed within this key. The structure to follow is the following:
  - {layer name} (str): Layer name, as specified in types.
    - {parameter name} (str): Name of parameter to be benchmarked. Its internal structure should have the typical benchmark structure, method and values.
loader: (dict): Should still have both keys, batch_size and shuffle, and each of them follow the standard benchmark structure (method and values).
loss (dict): Loss to consider; requires the following:
- method (str): Should be equal to choice.
- values (list): For each possible option to consider, specify the following:
  - module (str): Loss module.
  - params (dict, optional): Parameters to benchmark for the specific module, in case there are any. Each of them should have the standard structure method and values.
optimizer (dict): Optimizer to consider; requires the following:
- method (str): Should be equal to choice.
- values (list): For each possible option to consider, specify the following:
  - module (str): Optimizer module.
  - params (dict, optional): Parameters to benchmark for the specific module, in case there are any. Each of them should have the standard structure method and values.

An example of a benchmark configuration for deep learning models would be:

benchmark:
  transform:
    models:
      torch:
        epochs:
          method: randint
          values: [2, 5]
        layers:
          number_blocks: [3, 6]
          types:
            - Linear + ReLU
            - Dropout
          params:
            Dropout:
              p:
                method: uniform
                values: [0.4, 0.6]
        loader:
          batch_size:
            method: randint
            values: [20, 40]
          shuffle:
            method: choice
            values:
              - True
              - False
        loss:
          method: choice
          values:
            - module: torch.nn.MSELoss
            - module: torch.nn.L1Loss
        optimizer:
          method: choice
          values:
            - module: torch.optim.SGD
              params:
                lr:
                  method: loguniform
                  values: [0.001, 0.01]
                momentum:
                  method: uniform
                  values: [0.5, 1]
            - module: torch.optim.Adam
              params:
                lr:
                  method: loguniform
                  values: [0.001, 0.1]
                eps:
                  method: loguniform
                  values: [0.0000001, 0.00001]

Load¶

In load phase the possible configurations are the following:

path (str): Folder in which to store benchmark results.
save_best_config_params (bool, optional): Whether to store a yaml file with best model configuration or not, within specified path.

Configuration¶

Global¶

Data¶

Extract¶

Examples:¶

Transform¶

Encoding¶

Examples¶

Normalize¶

Examples¶

Load¶

Examples¶

Model¶

Extract¶

Transform¶

Examples¶

Deep learning models¶

Load¶

Benchmark¶

Transform¶

Examples¶

Deep learning benchmark¶

Load¶

HoNCAML

Navigation

Related Topics