.. _configuration:
===============
Configuration
===============
The main difference between the use cases of regular and advanced users is
reflected in the configuration of the pipeline.

Conceptually, a pipeline is a modular composition of steps. The available ones
are the following:

* :ref:`data-step`: Responsible for raw data management
* :ref:`model-step`: Responsible for model management
* :ref:`benchmark-step`: Responsible for searching the best model configuration
The configuration should be done through a `YAML <https://yaml.org>`_
file, which contains all the pipeline and step options that are considered.
Global
======
First of all, it is necessary to specify the global options, which are the
following:

* **problem_type**: The kind of problem, which determines the model(s) to use.
  Valid values are: ``classification``, ``regression``.

.. code:: yaml

   global:
     problem_type: regression

Afterwards, the steps configuration is provided, which is detailed below.
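
As a minimal sketch (the comments are placeholders), a pipeline file therefore
combines the global options with the step definitions under a ``steps`` key:

.. code:: yaml

   global:
     problem_type: regression
   steps:
     data:
       # data step configuration
     model:
       # model step configuration
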
.. _data-step:
Data
====
This step is made up of the following ETL phases:
1. **extract**: Read the raw data.
2. **transform**: Apply a set of data transformations, e.g. normalization,
   column renaming, etc.
3. **load**: Save the processed dataset.
Example of data step in a pipeline file:

.. code:: yaml

   data:
     extract:
       filepath: data/raw/dataset.csv
       features:
         - bedrooms
         - bathrooms
         - sqft_living
       target: price
     transform:
       normalize:
         features:
           module: sklearn.preprocessing.StandardScaler
         target:
           module: sklearn.preprocessing.StandardScaler
     load:
       filepath: data/processed/dataset.csv

Extract
-------
In the extract phase, the possible configurations are the following:

- **filepath** (str, optional): File path of the dataset to use. The default
  value is ``data/raw/dataset.csv``.
- **target** (str): Column name to use as target.
- **features** (list, optional): Set of columns to process. When *features* is
  not indicated, the process uses all columns.
Examples
^^^^^^^^
In the following example, the framework reads the dataset from
``data/raw/boston_dataset.csv``, takes the ``price`` column as target, and
uses ``bedrooms``, ``bathrooms`` and ``sqft_living`` as features.

.. code:: yaml

   extract:
     filepath: data/raw/boston_dataset.csv
     features:
       - bedrooms
       - bathrooms
       - sqft_living
     target: price

Transform
---------
In this phase the possible transformations are the following:
Encoding
^^^^^^^^
The parameter **encoding** (dict, optional) defines the dataset encoding for
categorical features through the *One Hot Encoding* (OHE) method. It is
configured through two additional parameters within the **encoding**
dictionary: **OHE** (bool), which determines whether to apply the encoding to
features or not, and **features** (list/str, optional), which specifies the
feature or features to apply OHE to; if it is empty and **OHE** is True, OHE
is applied to all categorical features of the dataset, except for the target.
Examples
^^^^^^^^
The simplest configuration is the following, which applies OHE to all the
categorical features:

.. code:: yaml

   transform:

In the following example, the framework explicitly applies OHE to all
categorical features:

.. code:: yaml

   transform:
     encoding:
       OHE: True

The opposite case would be to not apply OHE to any feature:

.. code:: yaml

   transform:
     encoding:
       OHE: False

Another example is when OHE should be applied only to specific features. In
the following example, OHE is applied only to columns *column_a* and
*column_b*:

.. code:: yaml

   transform:
     encoding:
       OHE: True
       features:
         - column_a
         - column_b

Normalize
^^^^^^^^^
The parameter **normalize** (dict, optional) defines the dataset
normalization. It is possible to normalize nothing, the features, the target,
or both. The **features** parameter defines which normalization to apply to
the features, whereas the **target** parameter defines the target
normalization. If the transform step contains an empty **normalize** key, a
``sklearn.preprocessing.StandardScaler`` is used for both features and target
as default. On the other hand, if the **normalize** key does not exist, no
normalization is applied. If only features or target (but not both) are to be
normalized, empty settings should be provided for the part that does not
require normalization.

- **target** (dict, optional): Normalization to apply to the target.
- **features** (dict, optional): Normalization to apply to the features.
For either of them, there are three child keys:

- **module** (str, optional): Normalization module to apply. Right now,
  ``sklearn.preprocessing.StandardScaler`` is the only one supported.
- **params** (dict, optional): Specific parameters of the previous module,
  specified as key-value pairs.
- **columns** (list, optional): Columns to be considered for normalization.
  When empty, all feature columns (or the target, depending on the context)
  are considered.
Examples
^^^^^^^^
The simplest configuration is the following, which means no normalization:

.. code:: yaml

   transform:

In the example below, the framework applies the default normalization
(``sklearn.preprocessing.StandardScaler`` for both target and features):

.. code:: yaml

   transform:
     normalize:

If only the features (or only the target) are to be normalized, an empty
**module** should be provided for the other part. In the example below, the
target is excluded from normalization:

.. code:: yaml

   transform:
     normalize:
       target:
         module:

In the example below, the framework uses a
``sklearn.preprocessing.StandardScaler`` to normalize both target and
features. In the case of features, normalization is applied considering the
standard deviation, and only to columns *column_a* and *column_b*:

.. code:: yaml

   transform:
     normalize:
       target:
         module: sklearn.preprocessing.StandardScaler
       features:
         module: sklearn.preprocessing.StandardScaler
         params:
           with_std: True
         columns:
           - column_a
           - column_b

Load
----
In the load phase, the possible configurations are the following:

- **filepath** (str, optional): File path in which to store the processed
  dataset.
Examples
^^^^^^^^
The simplest configuration is the following:

.. code:: yaml

   load:

When the **load** phase is empty, the framework does not save the processed
dataset.

In the following example, the framework stores the processed data in
``data/processed/dataset.csv``:

.. code:: yaml

   load:
     filepath: data/processed/dataset.csv

.. _model-step:
Model
=====
This step is responsible for model management. It is made up of the following
ETL phases:

- **extract**: Reads a previously saved model.
- **transform**: Applies the common model functions: training, testing and
  cross-validation.
- **load**: Saves the initialized model.
Extract
-------
In the extract phase, the possible configurations are the following:

- **filepath** (str [#comp1]_): File path of the model to read.
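
For instance, a predict pipeline could read a previously saved model as
follows (the file name is just an illustration of the
``{model_type}.{execution_id}.sav`` convention described in the load phase
below):

.. code:: yaml

   extract:
     filepath: models/regressor.20230101-120000.sav
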
.. [#comp1]
   Compulsory for predict pipelines, and excluded in the rest of pipeline
   types.
Transform
---------
This phase applies the common model functions: fit, predict and
cross-validation. The available configurations are the following:

- **fit** (dict [#comp2]_): Requests a model training on the current dataset.
  It may have the following additional information:

  - **estimator** (dict, optional): Specifies the estimator and its
    hyperparameters. Consists of the following:

    - **module** (str, optional): Learner module to use.
    - **params** (dict, optional): Additional parameters to pass to the module
      class.

    Available models are the ones `available from sklearn
    <https://scikit-learn.org/stable/supervised_learning.html>`_, and of
    course just the ones related to the specified problem type. Default models
    are ``sklearn.ensemble.RandomForestRegressor`` for regression and
    ``sklearn.ensemble.RandomForestClassifier`` for classification problems,
    both with ``n_estimators`` equal to 100.
  - **cross_validation** (dict, optional): Defines which cross-validation
    strategy to use for training the model. The dictionary may have the
    following keys:

    - **module** (str, optional): Cross-validation module to use.
    - **params** (dict, optional): Additional parameters to pass to the module
      class.

    Any cross-validation method in `sklearn cross-validation
    <https://scikit-learn.org/stable/modules/cross_validation.html>`_
    should work, provided that it follows their consistent structure.
    Default: ``sklearn.model_selection.KFold`` with 3 splits.
  - **metrics** (list): A list of metrics to evaluate the model. Any metric
    that exists in `sklearn.metrics
    <https://scikit-learn.org/stable/modules/model_evaluation.html>`_
    is allowed, as long as it applies to the problem type; only the function
    name is required. Default values are ``mean_squared_error``,
    ``mean_absolute_percentage_error``, ``median_absolute_error``,
    ``mean_absolute_error`` and ``root_mean_squared_error`` for regression
    problems, and ``accuracy_score``, ``precision_score``, ``recall_score``,
    ``specificity_score``, ``f1_score`` and ``roc_auc_score`` for
    classification problems.

    It is even possible to define custom metrics. All that is needed is to
    define a function named ``compute_{metric_name}_metric`` in the file
    ``honcaml/models/evaluate.py``, where {metric_name} is the name of the
    metric, taking as input parameters the series of true values and the
    series of predicted ones, in this order (there are already a couple of
    examples). Then, it is just a matter of including the metric name in the
    configuration.

    In both cases it is possible to pass additional parameters to the metric
    function, by specifying the metric as a dictionary instead of a single
    string. The dictionary key would be the metric name, whereas its values
    would refer to function parameters (see the sketch after this list).
- **predict** (dict [#comp3]_): Requests to run predictions over the dataset.

  - **path** (str, optional): Directory where the predictions will be
    stored. Default value: ``data/processed``.
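
As a sketch of the dictionary form for parameterized metrics (the ``average``
value is just an illustration of ``f1_score``'s parameters; a custom metric
name could be listed the same way), a classification metrics list could look
as follows:

.. code:: yaml

   metrics:
     - accuracy_score
     - f1_score:
         average: macro
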
.. [#comp2]
   Compulsory for fit pipelines, and excluded for predict pipelines. Related
   to benchmark pipelines, see the details in :ref:`benchmark-step`.

.. [#comp3]
   Compulsory for predict pipelines, and excluded for the rest of pipeline
   types.
Examples
^^^^^^^^
The following snippet shows an example of an advanced model transform
definition:

.. code:: yaml

   transform:
     fit:
       estimator:
         module: sklearn.ensemble.RandomForestRegressor
         params:
           n_estimators: 100
       cross_validation:
         module: sklearn.model_selection.KFold
         params:
           n_splits: 2

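A predict pipeline, in turn, only needs the **predict** key in this phase; a
minimal sketch:

.. code:: yaml

   transform:
     predict:
       path: data/processed
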
.. _deep-learning-models:
Deep learning models
^^^^^^^^^^^^^^^^^^^^
Deep learning models implemented in torch require a specific format, different
from sklearn-based models or similar, in which parameters are passed directly
when instantiating the model class.

First of all, the **module** key should just have ``torch`` as its value, in
order to indicate that a neural net will be used as estimator. Within the
**params** key, the following keys should be specified [#comp4]_:
- **epochs** (int): Number of training epochs.
- **layers** (list): Layers configuration; the structure of each one is:

  - **module** (str): Layer module to use.
  - **params** (dict, optional [#comp5]_): Additional parameters to pass to
    layers.

  In the case of linear layers, as the parameter **in_features** depends on
  previous layers, only **out_features** is required; however, if the last
  layer of the neural net is another linear layer, no **out_features** should
  be provided either, as the dimension will be inferred from the targets.

- **loader** (dict): Specifies data loader options to use. Internal keys:

  - **batch_size** (int): Number of rows to consider for each batch.
  - **shuffle** (bool): Whether to shuffle data at every epoch.

- **loss** (dict): Loss to consider; requires the following:

  - **module** (str): Loss module to use.
  - **params** (dict, optional): Additional parameters to pass to the module.

- **optimizer** (dict): Optimizer to consider; requires the following:

  - **module** (str): Optimizer module to use.
  - **params** (dict, optional): Additional parameters to pass to the module.
An example of a training configuration for a deep learning model would be:

.. code:: yaml

   model:
     transform:
       fit:
         estimator:
           module: torch
           params:
             epochs: 3
             layers:
               - module: torch.nn.Linear
                 params:
                   out_features: 64
               - module: torch.nn.ReLU
               - module: torch.nn.Linear
                 params:
                   out_features: 32
               - module: torch.nn.Dropout
               - module: torch.nn.Linear
             loader:
               batch_size: 20
               shuffle: True
             loss:
               module: torch.nn.MSELoss
             optimizer:
               module: torch.optim.SGD
               params:
                 lr: 0.001
                 momentum: 0.9

.. [#comp4]
   All options are required for training and benchmark pipelines, whereas
   **loader** is the only one required by predict pipelines.

.. [#comp5]
   Optional for all layer types except linear ones; for a linear layer it is
   also omitted when it is the last layer of the network.
Load
----
In the load phase, the possible configurations are the following:

- **filepath** (str, required): Directory and file name where the model will
  be saved. If the user specifies the file name as ``{autogenerate}.sav``, the
  file name is generated by the framework following this convention:
  ``{model_type}.{execution_id}.sav``. Otherwise, if the user specifies a
  custom name, the file is saved with that name. The supported format for
  saving a model is the ``.sav`` extension.
- **results** (str [#comp6]_): Directory in which to store training
  cross-validation results; the generated file will have the following format:
  ``{results}/{execution_id}/results.csv``. If not set, results will not be
  exported.
.. [#comp6]
   Optional for train pipelines, and excluded for the rest of pipeline
   types.
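
For instance, a load phase that lets the framework auto-generate the model
file name and exports cross-validation results could be sketched as follows
(the ``reports`` directory is just an illustration):

.. code:: yaml

   load:
     filepath: models/{autogenerate}.sav
     results: reports
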
.. _benchmark-step:
Benchmark
=========
This step is responsible for searching the best model configuration. It is
made up of the following ETL phases:

- **transform**: Runs a hyperparameter search algorithm for each specified
  model and obtains the best model configuration.
- **load**: Saves the best configuration into a YAML file.

Apart from obtaining the best model configuration, it is possible to train the
best model by appending a model step after the benchmark step, taking
advantage of the modular definition of the solution:

.. code:: yaml

   global:
     problem_type: regression
   steps:
     data:
       extract:
         filepath: {Input data}
         target: {Target}
     benchmark:
       transform:
       load:
         path: {Reports path}
     model:
       transform:
         fit:
       load:
         path: {Path to store best model}

Transform
---------
This phase runs a hyperparameter search algorithm for each model defined in
the pipeline file. Furthermore, the user can define the set of metrics to
evaluate the experiments with, the model hyperparameters to tune, the strategy
to split train and test data, and the parameters of the search algorithm.
The available configurations are the following:
- **models** (dict, optional): Dictionary of models and hyperparameters to
  search the best configuration for. Each entry refers to a model to
  benchmark. Keys should be the following:

  - **{model_name}** (dict, optional): Name of the model module,
    e.g. ``sklearn.ensemble.RandomForestRegressor``. Within each module, there
    should be as many keys as model parameters to search:

    - **{hyperparameter}** (dict, optional): Name of the hyperparameter,
      e.g. ``n_estimators``. Within each hyperparameter, the following needs
      to be specified:

      - **method** (str, optional): Method to consider for searching
        hyperparameter values.
      - **values** (tuple/list, optional): Values to consider for the
        hyperparameter search, passed to the specified method.

  Available methods and value parameters are defined in the `Ray Tune search
  space <https://docs.ray.io/en/latest/tune/api/search_space.html>`_. The
  default models and hyperparameters for each type of problem are defined at
  *honcaml/config/defaults/search_spaces.py*.

  In the case of deep learning models, the name of the model to use is
  ``torch``, and there is a specific chapter detailing the required
  configuration: :ref:`deep-learning-benchmark`.
- **cross_validation** (dict, optional): Defines which cross-validation
  strategy to use for training each model. The dictionary may have the
  following keys:

  - **module** (str, optional): Cross-validation module to use.
  - **params** (dict, optional): Additional parameters to pass to the module
    class.

  Any cross-validation method in `sklearn cross-validation
  <https://scikit-learn.org/stable/modules/cross_validation.html>`_
  should work, provided that it follows their consistent structure.
  Default: ``sklearn.model_selection.KFold`` with 3 splits.
- **metrics** (list/str, optional): A list of metrics to report in the
  benchmark process, or a single one. Note that the metric specified in the
  tuner settings is appended to the reported metrics if it is not already
  included, as it is the one used to select the best model configuration. Any
  metric that exists in `sklearn.metrics
  <https://scikit-learn.org/stable/modules/model_evaluation.html>`_
  is allowed, as long as it applies to the problem type; only the function
  name is required. Default values are ``mean_squared_error``,
  ``mean_absolute_percentage_error``, ``median_absolute_error``,
  ``mean_absolute_error`` and ``root_mean_squared_error`` for regression
  problems, and ``accuracy_score``, ``precision_score``, ``recall_score``,
  ``specificity_score``, ``f1_score`` and ``roc_auc_score`` for classification
  problems.

  It is even possible to define custom metrics. All that is needed is to
  define a function named ``compute_{metric_name}_metric`` in the file
  ``honcaml/models/evaluate.py``, where {metric_name} is the name of the
  metric, taking as input parameters the series of true values and the series
  of predicted ones, in this order (there are already a couple of examples).
  Then, it is just a matter of including the metric name in the configuration.
- **tuner** (dict): Defines the configuration of the tuning process. Its
  options are the following:

  - **search_algorithm** (dict, optional): Specifies the algorithm to perform
    the search. Consists of the following:

    - **module** (str, optional): Algorithm module to use.
    - **params** (dict, optional): Additional parameters to pass to the module
      class.

    For all available options, see `the search algorithms documentation
    <https://docs.ray.io/en/latest/tune/api/suggestion.html>`_.
    Default is ``ray.tune.search.optuna.OptunaSearch``.

  - **tune_config** (dict, optional): Parameters to pass to the tuner config
    object, specified as key-value pairs. For available options, see the
    `TuneConfig documentation
    <https://docs.ray.io/en/latest/tune/api/doc/ray.tune.TuneConfig.html>`_.
  - **run_config** (dict, optional): Parameters to be used during the run,
    specified as key-value pairs. For available options, see the `RunConfig
    documentation
    <https://docs.ray.io/en/latest/train/api/doc/ray.train.RunConfig.html>`_.
  - **scheduler** (dict, optional): Allows defining different strategies
    during the search process. Consists of the following:

    - **module** (str, optional): Algorithm module to use.
    - **params** (dict, optional): Additional parameters to pass to the module
      class.

    For all available options, see the `schedulers documentation
    <https://docs.ray.io/en/latest/tune/api/schedulers.html>`_.
Examples
^^^^^^^^
The following snippet shows an example of an advanced benchmark transform
definition:

.. code:: yaml

   metrics:
     - mean_squared_error
     - mean_absolute_error
     - root_mean_squared_error
   models:
     sklearn.ensemble.RandomForestRegressor:
       n_estimators:
         method: randint
         values: [2, 110]
       max_features:
         method: choice
         values: [sqrt, log2, 1]
     sklearn.linear_model.LinearRegression:
       fit_intercept:
         method: choice
         values: [True, False]
   cross_validation:
     module: sklearn.model_selection.KFold
     params:
       n_splits: 2
   tuner:
     search_algorithm:
       module: ray.tune.search.optuna.OptunaSearch
     tune_config:
       num_samples: 5
       metric: root_mean_squared_error
       mode: min
     run_config:
       stop:
         training_iteration: 2
     scheduler:
       module: ray.tune.schedulers.HyperBandScheduler

.. _deep-learning-benchmark:
Deep learning benchmark
^^^^^^^^^^^^^^^^^^^^^^^
Deep learning models require a specific format in a benchmark pipeline, due to
the fact that the models themselves already require a custom format (it is
advisable to review their structure in :ref:`deep-learning-models`). The main
structure should be the same:
- **epochs** (dict): The typical keys **method** (with value ``randint``) and
  **values** should be specified.
- **layers** (dict): Layer structure to benchmark. This key is the only one
  with a completely different structure than the one specified for deep
  learning models; this is because the approach for benchmarking them is
  through what are called blocks. Blocks are predefined combinations of layers
  that will be shuffled with a specific layer to generate combinations to
  benchmark. For example, one block could be a linear layer + a rectified
  linear unit, and another one could be a dropout layer. The required
  structure is the following:

  - **number_blocks** (list): List of two values, which are the minimum and
    maximum number of blocks considered for the models.
  - **types** (list): List of strings that specify the successions of layer
    types to be considered as blocks, assuming that their names are contained
    within the `torch nn module <https://pytorch.org/docs/stable/nn.html>`_.
    Blocks that contain a sequence of layers should join their names with the
    symbol ``+``.
  - **params** (dict, optional): In case some layer types require specific
    parameters to be benchmarked, they should be informed within this key. The
    structure to follow is the following:

    - **{layer name}** (str): Layer name, as specified in **types**.

      - **{parameter name}** (str): Name of the parameter to be benchmarked.
        Its internal structure should have the typical benchmark structure,
        **method** and **values**.

- **loader** (dict): Should still have both keys, **batch_size** and
  **shuffle**, and each of them follows the standard benchmark structure
  (**method** and **values**).
- **loss** (dict): Loss to consider; requires the following:

  - **method** (str): Should be equal to ``choice``.
  - **values** (list): For each possible option to consider, specify the
    following:

    - **module** (str): Loss module.
    - **params** (dict, optional): Parameters to benchmark for the specific
      module, in case there are any. Each of them should have the standard
      structure, **method** and **values**.

- **optimizer** (dict): Optimizer to consider; requires the following:

  - **method** (str): Should be equal to ``choice``.
  - **values** (list): For each possible option to consider, specify the
    following:

    - **module** (str): Optimizer module.
    - **params** (dict, optional): Parameters to benchmark for the specific
      module, in case there are any. Each of them should have the standard
      structure, **method** and **values**.
An example of a benchmark configuration for deep learning models would be:

.. code:: yaml

   benchmark:
     transform:
       models:
         torch:
           epochs:
             method: randint
             values: [2, 5]
           layers:
             number_blocks: [3, 6]
             types:
               - Linear + ReLU
               - Dropout
             params:
               Dropout:
                 p:
                   method: uniform
                   values: [0.4, 0.6]
           loader:
             batch_size:
               method: randint
               values: [20, 40]
             shuffle:
               method: choice
               values:
                 - True
                 - False
           loss:
             method: choice
             values:
               - module: torch.nn.MSELoss
               - module: torch.nn.L1Loss
           optimizer:
             method: choice
             values:
               - module: torch.optim.SGD
                 params:
                   lr:
                     method: loguniform
                     values: [0.001, 0.01]
                   momentum:
                     method: uniform
                     values: [0.5, 1]
               - module: torch.optim.Adam
                 params:
                   lr:
                     method: loguniform
                     values: [0.001, 0.1]
                   eps:
                     method: loguniform
                     values: [0.0000001, 0.00001]

Load
----
In the load phase, the possible configurations are the following:

- **path** (str): Folder in which to store benchmark results.
- **save_best_config_params** (bool, optional): Whether to store a YAML file
  with the best model configuration within the specified **path** or not.
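
A minimal sketch of this phase (the folder name is just an illustration):

.. code:: yaml

   load:
     path: honcaml_reports
     save_best_config_params: True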