Reference¶
HoNCAML mainly follows an object-oriented (OOP) approach through Python classes; the main ones are detailed in this section.
Execution¶
The main class used by HoNCAML is Execution, which is a wrapper on top of the Pipeline class.
- class honcaml.tools.execution.Execution(pipeline_config_file: str)¶
Class to execute ML pipelines. First, it reads the pipeline configuration and creates a new Pipeline instance from its content.
- _pipeline_config_file¶
Pipeline configuration file name.
- Type:
str
- _execution_id¶
Execution identifier.
- Type:
str
- _pipeline¶
Pipeline instance to run.
- Type:
pipeline.Pipeline
- run() None ¶
Parses the pipeline file and creates a new Pipeline instance to run.
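For illustration, a minimal usage sketch; the file name pipeline.yaml is a placeholder for any valid pipeline configuration file:

    from honcaml.tools.execution import Execution

    # Build an execution from a pipeline configuration file and run it;
    # this parses the file and runs the resulting Pipeline instance.
    execution = Execution('pipeline.yaml')
    execution.run()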
Pipeline¶
A pipeline is made of several Steps to be executed.
- class honcaml.tools.pipeline.Pipeline(pipeline_content: Dict, execution_id: str)¶
The pipeline class contains the steps defined by the user. It defines the pipeline to be executed and runs each of the steps defined.
- _steps¶
Steps defining the pipeline.
- Type:
List[steps.Step]
- _metadata¶
Objects output from each step.
- Type:
Dict
- _pipeline_content¶
Settings defining the pipeline steps.
- Type:
Dict
- _execution_id¶
Execution identifier.
- Type:
str
- run()¶
Run the pipeline, which means to run each step consecutively.
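Pipelines are normally created through Execution, but the class can also be used directly. A sketch, assuming the pipeline configuration is a YAML file whose parsed content follows the HoNCAML pipeline schema:

    import yaml
    from honcaml.tools.pipeline import Pipeline

    # 'pipeline.yaml' is a placeholder; its content must follow the
    # HoNCAML pipeline schema.
    with open('pipeline.yaml') as f:
        pipeline_content = yaml.safe_load(f)

    pipeline = Pipeline(pipeline_content, execution_id='example-run')
    pipeline.run()  # runs each step consecutively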
Steps¶
The step class determines the parts of a pipeline to run, and it follows an ETL (extract, transform, load) approach.
- class honcaml.steps.base.BaseStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict)¶
Abstract class to wrap a pipeline step. It defines the base structure for a step from the main pipeline.
- _step_settings¶
Settings that define the step.
- Type:
Dict
- _extract_settings¶
Settings defining the extract ETL process.
- Type:
Dict
- _transform_settings¶
Settings defining the transform ETL process.
- Type:
Dict
- _load_settings¶
Settings defining the load ETL process.
- Type:
Dict
- execute() None ¶
Executes the ETL processes from the current step.
- property extract_settings: Dict¶
Getter method for the ‘_extract_settings’ attribute.
- Returns:
‘_extract_settings’ current value.
- property load_settings: Dict¶
Getter method for the ‘_load_settings’ attribute.
- Returns:
‘_load_settings’ current value.
- abstract run(metadata: Dict) Dict ¶
Runs the step.
- Parameters:
metadata – Configuration parameters in order to run the step.
- property step_settings: Dict¶
Getter method for the ‘_step_settings’ attribute.
- Returns:
‘_step_settings’ current value.
- property transform_settings: Dict¶
Getter method for the ‘_transform_settings’ attribute.
- Returns:
‘_transform_settings’ current value.
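To make the interface concrete, a hypothetical subclass (not part of HoNCAML) implementing the abstract run method:

    from typing import Dict
    from honcaml.steps.base import BaseStep

    class EchoStep(BaseStep):
        """Hypothetical step that only records its settings in the metadata."""

        def run(self, metadata: Dict) -> Dict:
            # A concrete step would trigger its extract/transform/load
            # logic here, driven by the settings parsed by BaseStep.
            metadata['echo_step_settings'] = self.step_settings
            return metadata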
Data¶
The data step is the one related to data management.
- class honcaml.steps.data.DataStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)¶
The data step class is a step of the main pipeline. It contains the functionalities to perform the ETL on the requested data.
- _dataset¶
Dataset to be handled.
- Type:
data.Dataset
- property dataset: BaseDataset¶
Getter method for the ‘_dataset’ attribute.
- Returns:
‘_dataset’ current value.
- run(metadata: Dict) Dict ¶
Runs the data step. Using the created dataset, it runs the ETL functions for the specific dataset: extract, transform and load.
- Parameters:
metadata – Accumulated pipeline metadata.
- Returns:
Updated pipeline metadata with the dataset included.
- Return type:
metadata
It includes the following classes that further configure the step:
BaseDataset: Defines an abstract class that serves as a parent to the rest of the dataset classes (e.g. TabularDataset).
- class honcaml.data.base.BaseDataset¶
Base class defining a dataset.
- _normalization¶
Class to store the normalization parameters for features and target.
- Type:
Union[norm.Normalization, None]
- property normalization: Normalization¶
Getter method for ‘_normalization’ attribute.
- Returns:
‘_normalization’ current value.
- abstract preprocess(settings: Dict)¶
ETL data transform. Must be implemented by child classes.
- abstract read(settings: Dict)¶
ETL data extract. Must be implemented by child classes.
- abstract save(settings: Dict)¶
ETL data load. Must be implemented by child classes.
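As a sketch, a hypothetical CSV-backed dataset implementing the three abstract ETL methods; the settings key 'filepath' and the internal _df attribute are illustrative assumptions:

    from typing import Dict
    import pandas as pd
    from honcaml.data.base import BaseDataset

    class CsvDataset(BaseDataset):
        """Hypothetical dataset stored as a CSV file."""

        def read(self, settings: Dict) -> None:
            # ETL extract: load the raw data from disk.
            self._df = pd.read_csv(settings['filepath'])

        def preprocess(self, settings: Dict) -> None:
            # ETL transform: e.g. drop rows with missing values.
            self._df = self._df.dropna()

        def save(self, settings: Dict) -> None:
            # ETL load: persist the processed data.
            self._df.to_csv(settings['filepath'], index=False)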
Normalization: Wraps all normalization methods that apply to the dataset.
- class honcaml.data.normalization.Normalization(settings: Dict)¶
The aim of this class is to store the normalization parameters for dataset features and target.
- _features¶
Columns to normalize.
- Type:
List[str]
- _target¶
Targets to normalize.
- Type:
List[str]
- _features_normalizer¶
Normalization module and parameters to apply to a list of features.
- Type:
Dict
- _target_normalizer¶
Normalization module and parameters to apply to a list of targets.
- Type:
Dict
- property features: List[str]¶
Getter method for ‘_features’ attribute.
- Returns:
‘_features’ current value.
- property features_normalizer: Callable¶
Getter method. Returns a tuple with the normalization module and the parameters to apply to the features.
- Returns:
A module and parameters for the features.
- Return type:
(Tuple[str, dict])
- property target: List[str]¶
Getter method for ‘_target’ attribute.
- Returns:
‘_target’ current value.
- property target_normalizer: Callable¶
Getter method for ‘_target_normalizer’ attribute.
- Returns:
‘_target_normalizer’ current value.
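A usage sketch; the settings layout below is only a guess for illustration, as the real schema is defined by the HoNCAML configuration:

    from honcaml.data.normalization import Normalization

    # Hypothetical settings layout mirroring the documented attributes.
    settings = {
        'features': ['age', 'income'],
        'target': ['price'],
    }
    norm = Normalization(settings)
    print(norm.features)  # ['age', 'income']
    print(norm.target)    # ['price']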
CrossValidationSplit: Applies cross-validation splitting to the dataset.
- class honcaml.data.transform.CrossValidationSplit(module: str, params: Dict | None = None)¶
The aim of this class is to wrap the possible cross-validation classes from the sklearn framework.
- _module¶
Cross-validation module.
- Type:
str
- _data¶
Dict with additional parameters to pass to the cross-validation module.
- Type:
Dict
- split(x: ArrayLike, y: ArrayLike | None = None) Generator[Tuple[int, ArrayLike, ArrayLike, ArrayLike | None, ArrayLike | None], None, None] ¶
Executes the split method from the cross-validation module. The ‘kwargs’ parameter allows passing additional arguments when the object instance is created. Valid types for the x and y datasets are pd.DataFrame, pd.Series and np.ndarray.
- Parameters:
x – Dataset with features to split.
y – Dataset with targets to split.
- Yields:
Split number
Feature array for training
Feature array for test
Target array for training
Target array for test
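A usage sketch; the exact format of the module string (a full sklearn path is assumed here) may differ:

    import numpy as np
    from honcaml.data.transform import CrossValidationSplit

    x = np.random.rand(20, 3)  # 20 samples, 3 features
    y = np.random.rand(20)

    # Wrap an sklearn cross-validation class.
    cv = CrossValidationSplit('sklearn.model_selection.KFold',
                              params={'n_splits': 5})

    # Each iteration yields the split number plus train/test arrays.
    for split, x_train, x_test, y_train, y_test in cv.split(x, y):
        print(split, x_train.shape, x_test.shape)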
Model¶
The model step is the one related to model management.
- class honcaml.steps.model.ModelStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)¶
The model step class is a step of the main pipeline. It performs tasks such as training, predicting and evaluating a model. The extract and load functions allow the step to save or restore a model.
- _estimator_config¶
Definition of the estimator: its module and hyperparameters.
- Type:
Dict
- _model¶
Model from this library wrapping the specific estimator.
- Type:
base_model.BaseModel
- property model: BaseModel¶
Getter method for the ‘_model’ attribute.
- Returns:
‘_model’ current value.
- run(metadata: Dict) Dict ¶
Runs the model step. Using the created model, it runs the ETL functions for the specific model: extract, transform and load.
- Parameters:
metadata – Accumulated pipeline metadata.
- Returns:
Updated pipeline metadata with the best estimator as a model.
- Return type:
metadata
BaseModel: Defines an abstract class from which models are created.
- class honcaml.models.base.BaseModel(problem_type: str)¶
Model base class.
- _estimator_type¶
The kind of estimator to be used. Valid values are regressor and classifier.
- Type:
str
- _estimator¶
Estimator defined by child classes.
- abstract build_model(model_config: Dict, *args) None ¶
Creates the requested estimator. Must be implemented by child classes.
- Parameters:
model_config – Model configuration, i.e. module and its hyperparameters.
*args – Extra parameters.
- abstract evaluate(x: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], y: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], **kwargs: Dict) Dict ¶
Evaluates the estimator on the given dataset. Must be implemented by child classes.
- Parameters:
x – Dataset features.
y – Dataset target.
**kwargs – Extra parameters.
- Returns:
Resulting metrics from the evaluation.
- abstract fit(x: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], y: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], **kwargs: Dict) None ¶
Trains the estimator on the specified dataset. Must be implemented by child classes.
- Parameters:
x – Dataset features.
y – Dataset target.
**kwargs – Extra parameters.
- abstract predict(x: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], **kwargs: Dict) List ¶
Uses the estimator to make predictions on the given dataset features. Must be implemented by child classes.
- Parameters:
x – Dataset features.
**kwargs – Extra parameters.
- Returns:
Resulting predictions from the estimator.
- static read(settings: Dict) None ¶
Reads an estimator from disk.
- Parameters:
settings – Parameter settings defining the read operation.
- save(settings: Dict) None ¶
Stores the estimator to disk.
- Parameters:
settings – Parameter settings defining the store operation.
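To make the contract concrete, a hypothetical child class wrapping an sklearn regressor; the 'params' key in model_config is an illustrative assumption:

    from typing import Dict, List
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from honcaml.models.base import BaseModel

    class LinearModel(BaseModel):
        """Hypothetical model wrapping an sklearn linear regressor."""

        def build_model(self, model_config: Dict, *args) -> None:
            # 'params' as the hyperparameter key is a guess.
            self._estimator = LinearRegression(
                **model_config.get('params', {}))

        def fit(self, x, y, **kwargs) -> None:
            self._estimator.fit(x, y)

        def predict(self, x, **kwargs) -> List:
            return list(self._estimator.predict(x))

        def evaluate(self, x, y, **kwargs) -> Dict:
            return {'mse': mean_squared_error(y, self._estimator.predict(x))}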
Benchmark¶
The benchmark step is the one related to meta-model management, specifically selecting the best model among all available options.
- class honcaml.steps.benchmark.BenchmarkStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)¶
The benchmark step class is a step of the main pipeline. It ranks models by running a hyperparameter search and model selection based on the user and default settings. The extract and load methods allow the step to save and restore executions to/from checkpoints.
- _store_results_folder¶
Folder path where results are stored.
- Type:
str
- _dataset¶
Dataset class instance.
- _reported_metrics¶
Metrics to compute during the hyperparameter search.
- Type:
List[str]
- _metric¶
Metric function to optimize.
- Type:
str
- _mode¶
Whether to maximize or minimize the metric.
- Type:
str
- get_best_model_and_hyperparams_dict() Dict ¶
Returns a dict with the best model module and the best hyperparameters found by the benchmark transform step.
- Returns:
Best model module and its hyperparameters.
- run(metadata: Dict) Dict ¶
Runs the benchmark step. Using a benchmark of models, it runs the ETL functions to rank them and return the best one.
- Parameters:
metadata (Dict) – Objects output from each previous step.
- Returns:
The previous objects updated with those from the current step: the best estimator as a model from this library.
- Return type:
metadata (Dict)
BaseBenchmark: Defines an abstract class for model benchmarking.
- class honcaml.benchmark.base.BaseBenchmark(name: str)¶
- abstract clean_search_space() dict ¶
Given a dict with a search space for a model, this function gets the model module to import and the hyperparameter search space, and ensures that the specified search method exists.
Must be implemented by child classes.
- Parameters:
search_space (Dict) – a dict with the search space to explore
- Returns:
A dict mapping each hyperparameter to the method that generates its possible values during the search.
- Return type:
(Dict)
- invalidate_experiment() bool ¶
Logic to specify whether an experiment should be invalidated before the estimator is cross-validated, in order to avoid wasting time and resources on incoherent or unrealistic parameter combinations that are known beforehand.
Must be implemented by child classes.
- Parameters:
search_space – Search space to explore.
- Returns:
Whether the experiment should be invalidated or not.
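As a sketch of this contract, a hypothetical benchmark that accepts every experiment; the method parameters follow the documented signatures:

    from honcaml.benchmark.base import BaseBenchmark

    class PassThroughBenchmark(BaseBenchmark):
        """Hypothetical benchmark that never filters experiments."""

        def clean_search_space(self, search_space: dict) -> dict:
            # A real implementation would map each hyperparameter to the
            # search method generating its candidate values.
            return search_space

        def invalidate_experiment(self, search_space: dict) -> bool:
            # Never invalidate: try every parameter combination.
            return False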
EstimatorTrainer: Computes optimised hyperparameters for a specific model, based on the ray.tune.Trainable class.
- class honcaml.benchmark.trainable.EstimatorTrainer(config: Dict[str, Any] = None, logger_creator: Callable[[Dict[str, Any]], Logger] = None, remote_checkpoint_dir: str | None = None, custom_syncer: Syncer | None = None)¶
This class runs a set of experiments to search for the best model hyperparameter configuration. It is a child class of ray.tune.Trainable. The functions to override are the following:
- setup
- step
- save_checkpoint
- load_checkpoint
- _model_module¶
Module of the model to use.
- Type:
str
- _dataset¶
Dataset class instance.
- _cv_split¶
Cross-validation object with train configurations.
- _param_space¶
Dict with the model’s hyperparameters to search and all their possible values.
- Type:
Dict
- _reported_metrics¶
Metrics to report.
- Type:
List
- _metric¶
Metric used to evaluate the model performance.
- Type:
str
- _model¶
Model instance.
- load_checkpoint(checkpoint: Dict | str)¶
Subclasses should override this to implement restore().
Warning
In this method, do not rely on absolute paths. The absolute path of the checkpoint_dir used in Trainable.save_checkpoint may be changed.
If Trainable.save_checkpoint returned a prefixed string, the prefix of the checkpoint string returned by Trainable.save_checkpoint may be changed. This is because trial pausing depends on temporary directories.
The directory structure under the checkpoint_dir provided to Trainable.save_checkpoint is preserved.
See the example below.
Example
>>> from ray.tune.trainable import Trainable
>>> class Example(Trainable):
...     def save_checkpoint(self, checkpoint_path):
...         print(checkpoint_path)
...         return os.path.join(checkpoint_path, "my/check/point")
...     def load_checkpoint(self, checkpoint):
...         print(checkpoint)
>>> trainer = Example()
>>> # This is used when PAUSED.
>>> obj = trainer.save_to_object()
<logdir>/tmpc8k_c_6hsave_to_object/checkpoint_0/my/check/point
>>> # Note the different prefix.
>>> trainer.restore_from_object(obj)
<logdir>/tmpb87b5axfrestore_from_object/checkpoint_0/my/check/point
New in version 0.8.7.
- Parameters:
checkpoint – If dict, the return value is as returned by save_checkpoint. If a string, then it is a checkpoint path that may have a different prefix than that returned by save_checkpoint. The directory structure underneath the checkpoint_dir provided to save_checkpoint is preserved.
- save_checkpoint(checkpoint_dir: str) str | Dict | None ¶
Subclasses should override this to implement save().
Warning
Do not rely on absolute paths in the implementation of Trainable.save_checkpoint and Trainable.load_checkpoint.
Use validate_save_restore to catch Trainable.save_checkpoint / Trainable.load_checkpoint errors before execution.
>>> from ray.tune.utils import validate_save_restore
>>> MyTrainableClass = ...
>>> validate_save_restore(MyTrainableClass)
>>> validate_save_restore(
...     MyTrainableClass, use_object_store=True)
New in version 0.8.7.
- Parameters:
checkpoint_dir – The directory where the checkpoint file must be stored. In a Tune run, if the trial is paused, the provided path may be temporary and moved.
- Returns:
A dict or string. If string, the return value is expected to be prefixed by tmp_checkpoint_dir. If dict, the return value will be automatically serialized by Tune and passed to Trainable.load_checkpoint().
Example
>>> trainable, trainable1, trainable2 = ...
>>> print(trainable1.save_checkpoint("/tmp/checkpoint_1"))
"/tmp/checkpoint_1"
>>> print(trainable2.save_checkpoint("/tmp/checkpoint_2"))
{"some": "data"}
>>> trainable.save_checkpoint("/tmp/bad_example")
"/tmp/NEW_CHECKPOINT_PATH/my_checkpoint_file" # This will error.
- setup(config: Dict) None ¶
Given a dict with configuration parameters, runs a hyperparameter search for a model. The dict has to contain the following parameters:
model_module: module of the model
dataset: dataset class instance
cv_split: cross-validation object with train configurations
param_space: dict with the model’s hyperparameters to search and all their possible values
metric (str): metric to use for evaluation
This function is invoked once training starts.
- Parameters:
config (Dict) – a dict with a set of configuration parameters.
- step() Dict[str, int | float] ¶
This function is invoked for each iteration during the search process. For each iteration, it runs a cross-validation training with the selected hyperparameters and returns the mean metrics of the iteration.
- Returns:
A dict with the scores of the iteration.
- Return type:
Dict[str, ct.Number]
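Putting the pieces together, a hedged sketch of driving EstimatorTrainer with Ray Tune; the config keys follow the setup() documentation above, while the concrete values and the dataset placeholder are assumptions:

    from ray import tune
    from honcaml.benchmark.trainable import EstimatorTrainer
    from honcaml.data.transform import CrossValidationSplit

    dataset = ...  # placeholder for a HoNCAML dataset instance (see Data)
    cv_split = CrossValidationSplit('sklearn.model_selection.KFold',
                                    params={'n_splits': 3})

    # Keys follow the setup() documentation; values are illustrative only.
    config = {
        'model_module': 'sklearn.ensemble.RandomForestRegressor',
        'dataset': dataset,
        'cv_split': cv_split,
        'param_space': {'n_estimators': tune.randint(10, 100)},
        'metric': 'mean_squared_error',
    }

    tuner = tune.Tuner(EstimatorTrainer, param_space=config)
    results = tuner.fit()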