Reference

HoNCAML mainly follows an object-oriented programming (OOP) approach based on Python classes. The main ones are detailed in this section.

Execution

The main class used by HoNCAML is Execution, which is a wrapper on top of the Pipeline class.

class honcaml.tools.execution.Execution(pipeline_config_file: str)

Class to execute ML pipelines. It first reads the pipeline configuration file and then creates a new Pipeline instance from its content.

_pipeline_config_file

Pipeline configuration file name.

Type:

str

_execution_id

Execution identifier.

Type:

str

_pipeline

Pipeline instance to run.

Type:

pipeline.Pipeline

run() → None

Parses the pipeline file and creates a new Pipeline instance to run.
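
For example, a minimal run could look like the following sketch; the configuration file name is an assumption, and any valid pipeline configuration file works:

    from honcaml.tools.execution import Execution

    # "pipeline.yaml" is a hypothetical pipeline configuration file
    execution = Execution("pipeline.yaml")
    execution.run()  # parses the file and runs the resulting Pipeline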

Pipeline

A pipeline is made of several Steps to be executed.

class honcaml.tools.pipeline.Pipeline(pipeline_content: Dict, execution_id: str)

The pipeline class contains the steps defined by the user. It defines the pipeline to be executed and runs each of its steps.

_steps

Steps defining the pipeline.

Type:

List[steps.Step]

_metadata

Objects output from each step.

Type:

Dict

_pipeline_content

Settings defining the pipeline steps.

Type:

Dict

_execution_id

Execution identifier.

Type:

str

run()

Runs the pipeline, executing each step consecutively.
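
A Pipeline is normally created by Execution, but it can also be instantiated directly from already-parsed configuration content. A minimal sketch, assuming the content dict has been read from a pipeline file and the execution identifier is arbitrary:

    from honcaml.tools.pipeline import Pipeline

    pipeline_content = ...  # parsed pipeline configuration (structure depends on the steps used)
    pipeline = Pipeline(pipeline_content, execution_id="example-run-001")
    pipeline.run()  # executes each step consecutively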

Steps

The step classes determine the parts of a pipeline to run, and they follow an ETL (extract, transform, load) approach.

class honcaml.steps.base.BaseStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict)

Abstract class to wrap a pipeline step. It defines the base structure for a step from the main pipeline.

_step_settings

Settings that define the step.

Type:

Dict

_extract_settings

Settings defining the extract ETL process.

Type:

Dict

_transform_settings

Settings defining the transform ETL process.

Type:

Dict

_load_settings

Settings defining the load ETL process.

Type:

Dict

execute() → None

Executes the ETL processes from the current step.

property extract_settings: Dict

Getter method for the ‘_extract_settings’ attribute.

Returns:

‘_extract_settings’ current value.

property load_settings: Dict

Getter method for the ‘_load_settings’ attribute.

Returns:

‘_load_settings’ current value.

abstract run(metadata: Dict) → Dict

Runs the step.

Parameters:

metadata – Configuration parameters required to run the step.

property step_settings: Dict

Getter method for the ‘_step_settings’ attribute.

Returns:

‘_step_settings’ current value.

property transform_settings: Dict

Getter method for the ‘_transform_settings’ attribute.

Returns:

‘_transform_settings’ current value.
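
To illustrate the interface, here is a minimal sketch of a hypothetical custom step; only the documented settings properties and the abstract run method are assumed:

    from typing import Dict

    from honcaml.steps.base import BaseStep

    class LoggingStep(BaseStep):
        """Hypothetical step that only reports which ETL settings it received."""

        def run(self, metadata: Dict) -> Dict:
            for phase, settings in (("extract", self.extract_settings),
                                    ("transform", self.transform_settings),
                                    ("load", self.load_settings)):
                if settings:
                    print(f"{phase} settings: {settings}")
            return metadata  # pass the accumulated metadata on to the next step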

Data

The data step is the one related to data management.

class honcaml.steps.data.DataStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)

The data step class is a step of the main pipeline. It contains the functionalities to perform the ETL on the requested data.

_dataset

Dataset to be handled.

Type:

data.Dataset

property dataset: BaseDataset

Getter method for the ‘_dataset’ attribute.

Returns:

‘_dataset’ current value.

run(metadata: Dict) → Dict

Runs the data step. Using the created dataset, it runs the ETL functions for the specific dataset: extract, transform and load.

Parameters:

metadata – Accumulated pipeline metadata.

Returns:

Updated pipeline metadata with the dataset included.

Return type:

Dict

It includes the following classes that further configure the step:

  • BaseDataset: Defines an abstract class that serves as a parent to the rest of the dataset classes (e.g. TabularDataset).

    class honcaml.data.base.BaseDataset

    Base class defining a dataset.

    _normalization

    Class to store the normalization parameters for features and target.

    Type:

    Union[norm.Normalization, None]

    property normalization: Normalization

    Getter method for ‘_normalization’ attribute.

    Returns:

    ‘_normalization’ current value.

    abstract preprocess(settings: Dict)

    ETL data transform. Must be implemented by child classes.

    abstract read(settings: Dict)

    ETL data extract. Must be implemented by child classes.

    abstract save(settings: Dict)

    ETL data load. Must be implemented by child classes.

  • Normalization: Wraps all normalization methods that apply to the dataset.

    class honcaml.data.normalization.Normalization(settings: Dict)

    The aim of this class is to store the normalization parameters for dataset features and target.

    _features

    Columns to normalize.

    Type:

    List[str]

    _target

    Targets to normalize.

    Type:

    List[str]

    _features_normalizer

    Normalization module and parameters to apply to a list of features.

    Type:

    Dict

    _target_normalizer

    Normalization module and parameters to apply to a list of targets.

    Type:

    Dict

    property features: List[str]

    Getter method for ‘_features’ attribute.

    Returns:

    ‘_features’ current value.

    property features_normalizer: Callable

    Getter method that returns a tuple with the normalization module and the parameters to apply to the features.

    Returns:

    a module and parameters for features.

    Return type:

    (Tuple[str, dict])

    property target: List[str]

    Getter method for ‘_target’ attribute.

    Returns:

    ‘_target’ current value.

    property target_normalizer: Callable

    Getter method for ‘_target_normalizer’ attribute.

    Returns:

    ‘_target_normalizer’ current value.

  • CrossValidationSplit: Applies cross-validation splitting to the dataset (see the usage sketch after this list).

    class honcaml.data.transform.CrossValidationSplit(module: str, params: Dict | None = None)

    The aim of this class is to wrap cross-validation classes from the sklearn framework.

    _module

    Cross-validation module.

    Type:

    str

    _data

    Dict with additional parameters to pass to the cross-validation module.

    Type:

    Dict

    split(x: ArrayLike, y: ArrayLike | None = None) → Generator[Tuple[int, ArrayLike, ArrayLike, ArrayLike | None, ArrayLike | None], None, None]

    Executes the split method from the wrapped cross-validation module. The parameters stored at construction time are passed as additional arguments when the module instance is created. The valid x and y dataset types are pd.DataFrame, pd.Series and np.ndarray (abbreviated as ArrayLike in the signature above).

    Parameters:
    • x – Dataset with features to split.

    • y – Dataset with targets to split.

    Yields:
    • Split number

    • Feature array for training

    • Feature array for test

    • Target array for training

    • Target array for test
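
To make the yielded tuples concrete, here is a hedged usage sketch of CrossValidationSplit; the module string format and the params argument shown are assumptions based on the description above:

    import numpy as np

    from honcaml.data.transform import CrossValidationSplit

    x = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
    y = np.arange(10)                 # 10 targets

    # Hypothetical module string; any sklearn cross-validation class should fit
    cv = CrossValidationSplit("sklearn.model_selection.KFold", params={"n_splits": 5})
    for split_number, x_train, x_test, y_train, y_test in cv.split(x, y):
        print(split_number, x_train.shape, x_test.shape)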

Model

The model step is the one related to model management.

class honcaml.steps.model.ModelStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)

The model step class is a step of the main pipeline. It performs different tasks such as training, predicting with and evaluating a model. The extract and load functions allow the step to save or restore a model.

_estimator_config

Definition of the estimator, i.e. the module and its hyperparameters.

Type:

Dict

_model

Model from this library wrapping the specific estimator.

Type:

base_model.BaseModel

property model: BaseModel

Getter method for the ‘_model’ attribute.

Returns:

‘_model’ current value.

run(metadata: Dict) → Dict

Runs the model step. Using the created model, it runs the ETL functions for the specific model: extract, transform and load.

Parameters:

metadata – Accumulated pipeline metadata.

Returns:

Updated pipeline metadata with the best estimator as a model.

Return type:

Dict

  • BaseModel: Defines an abstract class from which models are created (a subclass sketch follows this list).

    class honcaml.models.base.BaseModel(problem_type: str)

    Model base class.

    _estimator_type

    The kind of estimator to be used. Valid values are regressor and classifier.

    Type:

    str

    _estimator

    Estimator defined by child classes.

    abstract build_model(model_config: Dict, *args) → None

    Creates the requested estimator. Must be implemented by child classes.

    Parameters:
    • model_config – Model configuration, i.e. module and its hyperparameters.

    • *args – Extra parameters.

    abstract evaluate(x: ArrayLike, y: ArrayLike, **kwargs: Dict) → Dict

    Evaluates the estimator on the given dataset. Must be implemented by child classes.

    Parameters:
    • x – Dataset features.

    • y – Dataset target.

    • **kwargs – Extra parameters.

    Returns:

    Resulting metrics from the evaluation.

    abstract fit(x: ArrayLike, y: ArrayLike, **kwargs: Dict) → None

    Trains the estimator on the specified dataset. Must be implemented by child classes.

    Parameters:
    • x – Dataset features.

    • y – Dataset target.

    • **kwargs – Extra parameters.

    abstract predict(x: ArrayLike, **kwargs: Dict) → List

    Uses the estimator to make predictions on the given dataset features. Must be implemented by child classes.

    Parameters:
    • x – Dataset features.

    • **kwargs – Extra parameters.

    Returns:

    Resulting predictions from the estimator.

    static read(settings: Dict) → None

    Reads an estimator from disk.

    Parameters:

    settings – Parameter settings defining the read operation.

    save(settings: Dict) → None

    Stores the estimator to disk.

    Parameters:

    settings – Parameter settings defining the store operation.
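
As referenced above, here is a minimal sketch of a BaseModel subclass; it hardcodes an sklearn estimator and a single metric for brevity, and only the abstract interface documented in this section is assumed:

    from typing import Dict, List

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    from honcaml.models.base import BaseModel

    class LinearRegressionModel(BaseModel):
        """Hypothetical model wrapping a fixed sklearn estimator."""

        def build_model(self, model_config: Dict, *args) -> None:
            # A real implementation would import the module named in
            # model_config dynamically; a fixed estimator keeps the sketch short.
            self._estimator = LinearRegression(**model_config.get("params", {}))

        def fit(self, x, y, **kwargs) -> None:
            self._estimator.fit(x, y)

        def predict(self, x, **kwargs) -> List:
            return list(self._estimator.predict(x))

        def evaluate(self, x, y, **kwargs) -> Dict:
            predictions = self._estimator.predict(x)
            return {"mean_squared_error": mean_squared_error(y, predictions)}

    model = LinearRegressionModel("regressor")  # problem type per _estimator_type
    model.build_model({"params": {}})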

Benchmark

The benchmark step is the one related to meta-model management, specifically selecting the best model among all available options.

class honcaml.steps.benchmark.BenchmarkStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)

The benchmark step class is a step of the main pipeline. It ranks models by performing a hyperparameter search and model selection based on the user and default settings. The extract and load methods allow the step to save and restore executions to/from checkpoints.

_store_results_folder

Folder path to store results.

Type:

str

_dataset

Dataset class instance.

_reported_metrics

Metrics to compute during hyperparameter search.

Type:

List[str]

_metric

Metric function to optimize.

Type:

str

_mode

Whether to maximize or minimize the metric.

Type:

str

get_best_model_and_hyperparams_dict() → Dict

Returns a dict with the best model module and the best hyperparameters found during the benchmark transform step.

Returns:

Dict with the best model module and its hyperparameters.

run(metadata: Dict) → Dict

Runs the benchmark step. Using a benchmark of models, it runs the ETL functions to rank them and return the best one.

Parameters:

metadata (Dict) – The objects output from each previous step.

Returns:

The previous objects updated with the ones from the current step: the best estimator as a model from this library.

Return type:

Dict

  • BaseBenchmark: Defines an abstract class for model benchmarking.

    class honcaml.benchmark.base.BaseBenchmark(name: str)

    abstract clean_search_space(search_space: Dict) → dict

    Given a dict with a search space for a model, this function gets the model module to import and the hyperparameter search space, and ensures that each search method exists.

    Must be implemented by child classes.

    Parameters:

    search_space (Dict) – a dict with the search space to explore

    Returns:

    A dict mapping each hyperparameter to the corresponding method that generates all its possible values during the search.

    Return type:

    (Dict)

    invalidate_experiment(search_space: Dict) → bool

    Logic to specify whether an experiment should be invalidated before the estimator is cross-validated, in order to avoid wasting time and resources on incoherent or unrealistic parameter combinations that are known beforehand.

    Must be implemented by child classes.

    Parameters:

    search_space – Search space to explore.

    Returns:

    Whether the experiment should be invalidated or not.

  • EstimatorTrainer: Computes optimised hyperparameters for a specific model, based on the ray.tune.Trainable class.

    class honcaml.benchmark.trainable.EstimatorTrainer(config: Dict[str, Any] = None, logger_creator: Callable[[Dict[str, Any]], Logger] = None, remote_checkpoint_dir: str | None = None, custom_syncer: Syncer | None = None)

    This class runs a set of experiments to search for the best model hyperparameter configuration. It is a child class of ray.tune.Trainable; the functions to override are setup, step, save_checkpoint and load_checkpoint.

    _model_module

    Module of the model to use.

    Type:

    str

    _dataset

    Dataset class instance.

    _cv_split

    Cross-validation object with train configurations.

    _param_space

    Dict with the model’s hyperparameters to search and all their possible values.

    Type:

    Dict

    _reported_metrics

    Metrics to report.

    Type:

    List

    _metric

    Metric used to evaluate model performance.

    Type:

    str

    _model

    Model instance.

    load_checkpoint(checkpoint: Dict | str)

    Subclasses should override this to implement restore().

    Warning

    In this method, do not rely on absolute paths. The absolute path of the checkpoint_dir used in Trainable.save_checkpoint may be changed.

    If Trainable.save_checkpoint returned a prefixed string, the prefix of the checkpoint string returned by Trainable.save_checkpoint may be changed. This is because trial pausing depends on temporary directories.

    The directory structure under the checkpoint_dir provided to Trainable.save_checkpoint is preserved.

    See the example below.

    Example

    >>> from ray.tune.trainable import Trainable
    >>> class Example(Trainable):
    ...    def save_checkpoint(self, checkpoint_path):
    ...        print(checkpoint_path)
    ...        return os.path.join(checkpoint_path, "my/check/point")
    ...    def load_checkpoint(self, checkpoint):
    ...        print(checkpoint)
    >>> trainer = Example()
    >>> # This is used when PAUSED.
    >>> obj = trainer.save_to_object() 
    <logdir>/tmpc8k_c_6hsave_to_object/checkpoint_0/my/check/point
    >>> # Note the different prefix.
    >>> trainer.restore_from_object(obj) 
    <logdir>/tmpb87b5axfrestore_from_object/checkpoint_0/my/check/point
    

    New in version 0.8.7.

    Parameters:

    checkpoint – If dict, the return value is as returned by save_checkpoint. If a string, then it is a checkpoint path that may have a different prefix than that returned by save_checkpoint. The directory structure underneath the checkpoint_dir save_checkpoint is preserved.

    save_checkpoint(checkpoint_dir: str) → str | Dict | None

    Subclasses should override this to implement save().

    Warning

    Do not rely on absolute paths in the implementation of Trainable.save_checkpoint and Trainable.load_checkpoint.

    Use validate_save_restore to catch Trainable.save_checkpoint/ Trainable.load_checkpoint errors before execution.

    >>> from ray.tune.utils import validate_save_restore
    >>> MyTrainableClass = ... 
    >>> validate_save_restore(MyTrainableClass) 
    >>> validate_save_restore( 
    ...     MyTrainableClass, use_object_store=True)
    

    New in version 0.8.7.

    Parameters:

    checkpoint_dir – The directory where the checkpoint file must be stored. In a Tune run, if the trial is paused, the provided path may be temporary and moved.

    Returns:

    A dict or string. If string, the return value is expected to be prefixed by tmp_checkpoint_dir. If dict, the return value will be automatically serialized by Tune and passed to Trainable.load_checkpoint().

    Example

    >>> trainable, trainable1, trainable2 = ... 
    >>> print(trainable1.save_checkpoint("/tmp/checkpoint_1")) 
    "/tmp/checkpoint_1"
    >>> print(trainable2.save_checkpoint("/tmp/checkpoint_2")) 
    {"some": "data"}
    >>> trainable.save_checkpoint("/tmp/bad_example") 
    "/tmp/NEW_CHECKPOINT_PATH/my_checkpoint_file" # This will error.
    
    setup(config: Dict) → None

    Sets up a hyperparameter search for a model, given a dict with configuration parameters. The dict has to contain the following parameters (a sketch of such a dict follows at the end of this section):

    • model_module: module of the model

    • dataset: dataset class instance

    • cv_split: cross-validation object with train configurations

    • param_space: dict with the model’s hyperparameters to search and all possible values

    • metric (str): metric to use for evaluation

    This function is invoked once training starts.

    Parameters:

    config (Dict) – a dict with a set of configuration parameters.

    step() → Dict[str, int | float]

    This function is invoked at each iteration during the search process. For each iteration, it runs a cross-validation training with the selected hyperparameters and returns the mean metrics of the iteration.

    Returns:

    A dict with the scores of the iteration.

    Return type:

    Dict[str, ct.Number]
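
    To tie the documented keys together, here is a hedged sketch of a config dict for setup; the concrete values, the module string and the metric name are assumptions, and the search-space entries follow Ray Tune conventions:

        from ray import tune

        # Placeholders standing in for objects built by earlier pipeline steps
        dataset = ...   # a dataset class instance
        cv_split = ...  # a CrossValidationSplit instance

        config = {
            # Hypothetical model module string
            "model_module": "sklearn.ensemble.RandomForestRegressor",
            "dataset": dataset,
            "cv_split": cv_split,
            "param_space": {
                "n_estimators": tune.randint(10, 200),
                "max_depth": tune.choice([4, 8, 16]),
            },
            "metric": "root_mean_squared_error",  # hypothetical metric name
        }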