Reference¶
HoNCAML mainly follows an object-oriented (OOP) approach through Python classes; the main ones are detailed in this section.
Execution¶
The main class used by HoNCAML is Execution, which is a wrapper on top of the Pipeline class.
- class honcaml.tools.execution.Execution(pipeline_config_file: str)¶
Class to execute ML pipelines. First, it reads the pipeline configuration and creates a new Pipeline instance from its content.
- _pipeline_config_file¶
Pipeline configuration file name.
- Type:
str
- _execution_id¶
Execution identifier.
- Type:
str
- _pipeline¶
Pipeline instance to run.
- Type:
pipeline.Pipeline
- run() None ¶
Parses the pipeline file and creates a new Pipeline instance to run.
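For illustration, a minimal usage sketch; the file name pipeline.yaml is a placeholder for any valid pipeline configuration file:

    from honcaml.tools.execution import Execution

    # Build an execution from a pipeline configuration file and run it;
    # this parses the file and runs the resulting Pipeline instance.
    execution = Execution('pipeline.yaml')
    execution.run()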
Pipeline¶
A pipeline is made of several Steps to be executed.
- class honcaml.tools.pipeline.Pipeline(pipeline_content: Dict, execution_id: str)¶
The pipeline class contains the steps defined by the user. It defines the pipeline to be executed and runs each of the steps defined.
- _steps¶
Steps defining the pipeline.
- Type:
List[steps.Step]
- _metadata¶
Objects output from each step.
- Type:
Dict
- _pipeline_content¶
Settings defining the pipeline steps.
- Type:
Dict
- _execution_id¶
Execution identifier.
- Type:
str
- run()¶
Run the pipeline, which means to run each step consecutively.
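Pipelines are normally created through Execution, but the class can also be used directly. A sketch, assuming the pipeline configuration is a YAML file whose parsed content follows the HoNCAML pipeline schema:

    import yaml
    from honcaml.tools.pipeline import Pipeline

    # 'pipeline.yaml' is a placeholder; its content must follow the
    # HoNCAML pipeline schema.
    with open('pipeline.yaml') as f:
        pipeline_content = yaml.safe_load(f)

    pipeline = Pipeline(pipeline_content, execution_id='example-run')
    pipeline.run()  # runs each step consecutively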
Steps¶
The step class determines the parts of a pipeline to run, and it follows an ETL (extract, transform, load) approach.
- class honcaml.steps.base.BaseStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict)¶
Abstract class to wrap a pipeline step. It defines the base structure for a step from the main pipeline.
- _step_settings¶
Settings that define the step.
- Type:
Dict
- _extract_settings¶
Settings defining the extract ETL process.
- Type:
Dict
- _transform_settings¶
Settings defining the transform ETL process.
- Type:
Dict
- _load_settings¶
Settings defining the load ETL process.
- Type:
Dict
- execute() None ¶
Executes the ETL processes from the current step.
- property extract_settings: Dict¶
Getter method for the ‘_extract_settings’ attribute.
- Returns:
‘_extract_settings’ current value.
- property load_settings: Dict¶
Getter method for the ‘_load_settings’ attribute.
- Returns:
‘_load_settings’ current value.
- abstract run(metadata: Dict) Dict ¶
Runs the step.
- Parameters:
metadata – Configuration parameters in order to run the step.
- property step_settings: Dict¶
Getter method for the ‘_step_settings’ attribute.
- Returns:
‘_step_settings’ current value.
- property transform_settings: Dict¶
Getter method for the ‘_transform_settings’ attribute.
- Returns:
‘_transform_settings’ current value.
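To make the interface concrete, a hypothetical subclass (not part of HoNCAML) implementing the abstract run method:

    from typing import Dict
    from honcaml.steps.base import BaseStep

    class EchoStep(BaseStep):
        """Hypothetical step that only records its settings in the metadata."""

        def run(self, metadata: Dict) -> Dict:
            # A concrete step would trigger its extract/transform/load
            # logic here, driven by the settings parsed by BaseStep.
            metadata['echo_step_settings'] = self.step_settings
            return metadata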
Data¶
The data step is the one related to data management.
- class honcaml.steps.data.DataStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)¶
The data step class is a step of the main pipeline. It contains the functionalities to perform the ETL on the requested data.
- _dataset¶
Dataset to be handled.
- Type:
data.Dataset
- property dataset: BaseDataset¶
Getter method for the ‘_dataset’ attribute.
- Returns:
‘_dataset’ current value.
- run(metadata: Dict) Dict ¶
Runs the data step. Using the created dataset, it runs the ETL functions for the specific dataset: extract, transform and load.
- Parameters:
metadata – Accumulated pipeline metadata.
- Returns:
Updated pipeline metadata with the dataset included.
- Return type:
metadata
It includes the following classes that further configure the step:
BaseDataset: Defines an abstract class that serves as a parent to the rest of the dataset classes (e.g. TabularDataset).
- class honcaml.data.base.BaseDataset¶
Base class defining a dataset.
- _normalization¶
Class to store the normalization parameters for features and target.
- Type:
Union[norm.Normalization, None]
- property normalization: Normalization¶
Getter method for ‘_normalization’ attribute.
- Returns:
‘_normalization’ current value.
- abstract preprocess(settings: Dict)¶
ETL data transform. Must be implemented by child classes.
- abstract read(settings: Dict)¶
ETL data extract. Must be implemented by child classes.
- abstract save(settings: Dict)¶
ETL data load. Must be implemented by child classes.
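As a sketch, a hypothetical CSV-backed dataset implementing the three abstract ETL methods; the settings key 'filepath' and the internal _df attribute are illustrative assumptions:

    from typing import Dict
    import pandas as pd
    from honcaml.data.base import BaseDataset

    class CsvDataset(BaseDataset):
        """Hypothetical dataset stored as a CSV file."""

        def read(self, settings: Dict) -> None:
            # ETL extract: load the raw data from disk.
            self._df = pd.read_csv(settings['filepath'])

        def preprocess(self, settings: Dict) -> None:
            # ETL transform: e.g. drop rows with missing values.
            self._df = self._df.dropna()

        def save(self, settings: Dict) -> None:
            # ETL load: persist the processed data.
            self._df.to_csv(settings['filepath'], index=False)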
Normalization: Wraps all normalization methods that apply to the dataset.
- class honcaml.data.normalization.Normalization(settings: Dict)¶
The aim of this class is to store the normalization parameters for dataset features and target.
- _features¶
Columns to normalize.
- Type:
List[str]
- _target¶
Targets to normalize.
- Type:
List[str]
- _features_normalizer¶
Normalization module and parameters to apply to a list of features.
- Type:
Dict
- _target_normalizer¶
Normalization module and parameters to apply to a list of targets.
- Type:
Dict
- property features: List[str]¶
Getter method for ‘_features’ attribute.
- Returns:
‘_features’ current value.
- property features_normalizer: Callable¶
Getter method. Returns a tuple with the normalization module and the parameters to apply to the features.
- Returns:
A module and parameters for the features.
- Return type:
(Tuple[str, dict])
- property target: List[str]¶
Getter method for ‘_target’ attribute.
- Returns:
‘_target’ current value.
- property target_normalizer: Callable¶
Getter method for ‘_target_normalizer’ attribute.
- Returns:
‘_target_normalizer’ current value.
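A usage sketch; the settings layout below is only a guess for illustration, as the real schema is defined by the HoNCAML configuration:

    from honcaml.data.normalization import Normalization

    # Hypothetical settings layout mirroring the documented attributes.
    settings = {
        'features': ['age', 'income'],
        'target': ['price'],
    }
    norm = Normalization(settings)
    print(norm.features)  # ['age', 'income']
    print(norm.target)    # ['price']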
CrossValidationSplit: Applies cross-validation splitting to the dataset.
- class honcaml.data.transform.CrossValidationSplit(module: str, params: Dict | None = None)¶
The aim of this class is to wrap the possible cross-validation classes from the sklearn framework.
- _module¶
Cross-validation module.
- Type:
str
- _data¶
Dict with additional parameters to pass to the cross-validation module.
- Type:
Dict
- split(x: ArrayLike, y: ArrayLike | None = None) Generator[Tuple[int, ArrayLike, ArrayLike, ArrayLike | None, ArrayLike | None], None, None] ¶
Executes the split method from the cross-validation module. The ‘kwargs’ parameter allows passing additional arguments when the object instance is created. Valid types for the x and y datasets are pd.DataFrame, pd.Series and np.ndarray.
- Parameters:
x – Dataset with features to split.
y – Dataset with targets to split.
- Yields:
Split number
Feature array for training
Feature array for test
Target array for training
Target array for test
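A usage sketch; the exact format of the module string (a full sklearn path is assumed here) may differ:

    import numpy as np
    from honcaml.data.transform import CrossValidationSplit

    x = np.random.rand(20, 3)  # 20 samples, 3 features
    y = np.random.rand(20)

    # Wrap an sklearn cross-validation class.
    cv = CrossValidationSplit('sklearn.model_selection.KFold',
                              params={'n_splits': 5})

    # Each iteration yields the split number plus train/test arrays.
    for split, x_train, x_test, y_train, y_test in cv.split(x, y):
        print(split, x_train.shape, x_test.shape)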
Model¶
The model step is the one related to model management.
- class honcaml.steps.model.ModelStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)¶
The model step class is a step of the main pipeline. It performs tasks such as training, predicting and evaluating a model. The extract and load functions allow the step to save or restore a model.
- _estimator_config¶
Definition of the estimator: its module and hyperparameters.
- Type:
Dict
- _model¶
Model from this library wrapping the specific estimator.
- Type:
base_model.BaseModel
- property model: BaseModel¶
Getter method for the ‘_model’ attribute.
- Returns:
‘_model’ current value.
- run(metadata: Dict) Dict ¶
Runs the model step. Using the created model, it runs the ETL functions for the specific model: extract, transform and load.
- Parameters:
metadata – Accumulated pipeline metadata.
- Returns:
Updated pipeline metadata with the best estimator as a model.
- Return type:
metadata
BaseModel: Defines an abstract class from which models are created.
- class honcaml.models.base.BaseModel(problem_type: str)¶
Model base class.
- _estimator_type¶
The kind of estimator to be used. Valid values are regressor and classifier.
- Type:
str
- _estimator¶
Estimator defined by child classes.
- abstract build_model(model_config: Dict, *args) None ¶
Creates the requested estimator. Must be implemented by child classes.
- Parameters:
model_config – Model configuration, i.e. module and its hyperparameters.
*args – Extra parameters.
- abstract evaluate(x: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], y: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], **kwargs: Dict) Dict ¶
Evaluates the estimator on the given dataset. Must be implemented by child classes.
- Parameters:
x – Dataset features.
y – Dataset target.
**kwargs – Extra parameters.
- Returns:
Resulting metrics from the evaluation.
- abstract fit(x: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], y: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], **kwargs: Dict) None ¶
Trains the estimator on the specified dataset. Must be implemented by child classes.
- Parameters:
x – Dataset features.
y – Dataset target.
**kwargs – Extra parameters.
- abstract predict(x: DataFrame | Series | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], **kwargs: Dict) List ¶
Uses the estimator to make predictions on the given dataset features. Must be implemented by child classes.
- Parameters:
x – Dataset features.
**kwargs – Extra parameters.
- Returns:
Resulting predictions from the estimator.
- static read(settings: Dict) None ¶
Reads an estimator from disk.
- Parameters:
settings – Parameter settings defining the read operation.
- save(settings: Dict) None ¶
Stores the estimator to disk.
- Parameters:
settings – Parameter settings defining the store operation.
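To make the contract concrete, a hypothetical child class wrapping an sklearn regressor; the 'params' key in model_config is an illustrative assumption:

    from typing import Dict, List
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from honcaml.models.base import BaseModel

    class LinearModel(BaseModel):
        """Hypothetical model wrapping an sklearn linear regressor."""

        def build_model(self, model_config: Dict, *args) -> None:
            # 'params' as the hyperparameter key is a guess.
            self._estimator = LinearRegression(
                **model_config.get('params', {}))

        def fit(self, x, y, **kwargs) -> None:
            self._estimator.fit(x, y)

        def predict(self, x, **kwargs) -> List:
            return list(self._estimator.predict(x))

        def evaluate(self, x, y, **kwargs) -> Dict:
            return {'mse': mean_squared_error(y, self._estimator.predict(x))}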
Benchmark¶
The benchmark step is the one related to meta-model management, specifically selecting the best model among all available options.
- class honcaml.steps.benchmark.BenchmarkStep(default_settings: Dict, user_settings: Dict, global_params: Dict, step_rules: Dict, execution_id: str)¶
The benchmark step class is a step of the main pipeline. It ranks models by running a hyperparameter search and model selection based on the user and default settings. The extract and load methods allow the step to save and restore executions to/from checkpoints.
- _store_results_folder¶
Folder path where results are stored.
- Type:
str
- _dataset¶
Dataset class instance.
- _reported_metrics¶
Metrics to compute during the hyperparameter search.
- Type:
List[str]
- _metric¶
Metric function to optimize.
- Type:
str
- _mode¶
Whether to maximize or minimize the metric.
- Type:
str
- get_best_model_and_hyperparams_dict() Dict ¶
Returns a dict with the best model module and the best hyperparameters found by the benchmark transform step.
- Returns:
Best model module and its hyperparameters.
- run(metadata: Dict) Dict ¶
Runs the benchmark step. Using a benchmark of models, it runs the ETL functions to rank them and return the best one.
- Parameters:
metadata (Dict) – Objects output from each previous step.
- Returns:
The previous objects updated with those from the current step: the best estimator as a model from this library.
- Return type:
metadata (Dict)
BaseBenchmark: Defines an abstract class for model benchmarking.
- class honcaml.benchmark.base.BaseBenchmark(name: str)¶
- abstract clean_search_space() dict ¶
Given a dict with a search space for a model, this function gets the model module to import and the hyperparameter search space, and ensures that the specified search method exists.
Must be implemented by child classes.
- Parameters:
search_space (Dict) – a dict with the search space to explore
- Returns:
A dict mapping each hyperparameter to the method that generates its possible values during the search.
- Return type:
(Dict)
- invalidate_experiment() bool ¶
Logic to specify whether an experiment should be invalidated before the estimator is cross-validated, in order to avoid wasting time and resources on incoherent or unrealistic parameter combinations that are known beforehand.
Must be implemented by child classes.
- Parameters:
search_space – Search space to explore.
- Returns:
Whether the experiment should be invalidated or not.
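As a sketch of this contract, a hypothetical benchmark that accepts every experiment; the method parameters follow the documented signatures:

    from honcaml.benchmark.base import BaseBenchmark

    class PassThroughBenchmark(BaseBenchmark):
        """Hypothetical benchmark that never filters experiments."""

        def clean_search_space(self, search_space: dict) -> dict:
            # A real implementation would map each hyperparameter to the
            # search method generating its candidate values.
            return search_space

        def invalidate_experiment(self, search_space: dict) -> bool:
            # Never invalidate: try every parameter combination.
            return False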
EstimatorTrainer: Computes optimised hyperparameters for a specific model, based on the ray.tune.Trainable class.
- class honcaml.benchmark.trainable.EstimatorTrainer(config: Dict[str, Any] = None, logger_creator: Callable[[Dict[str, Any]], Logger] = None, remote_checkpoint_dir: str | None = None, custom_syncer: Syncer | None = None)¶
This class runs a set of experiments to search for the best model hyperparameter configuration. It is a child class of ray.tune.Trainable. The functions to override are the following:
- setup
- step
- save_checkpoint
- load_checkpoint
- _model_module¶
Module of the model to use.
- Type:
str
- _dataset¶
Dataset class instance.
- _cv_split¶
Cross-validation object with train configurations.
- _param_space¶
Dict with the model’s hyperparameters to search and all their possible values.
- Type:
Dict
- _reported_metrics¶
Metrics to report.
- Type:
List
- _metric¶
Metric used to evaluate the model performance.
- Type:
str
- _model¶
Model instance.
- load_checkpoint(checkpoint: Dict | str)¶
Subclasses should override this to implement restore().
Warning
In this method, do not rely on absolute paths. The absolute path of the checkpoint_dir used in Trainable.save_checkpoint may be changed.
If Trainable.save_checkpoint returned a prefixed string, the prefix of the checkpoint string returned by Trainable.save_checkpoint may be changed. This is because trial pausing depends on temporary directories.
The directory structure under the checkpoint_dir provided to Trainable.save_checkpoint is preserved.
See the example below.
Example
>>> from ray.tune.trainable import Trainable
>>> class Example(Trainable):
...     def save_checkpoint(self, checkpoint_path):
...         print(checkpoint_path)
...         return os.path.join(checkpoint_path, "my/check/point")
...     def load_checkpoint(self, checkpoint):
...         print(checkpoint)
>>> trainer = Example()
>>> # This is used when PAUSED.
>>> obj = trainer.save_to_object()
<logdir>/tmpc8k_c_6hsave_to_object/checkpoint_0/my/check/point
>>> # Note the different prefix.
>>> trainer.restore_from_object(obj)
<logdir>/tmpb87b5axfrestore_from_object/checkpoint_0/my/check/point
New in version 0.8.7.
- Parameters:
checkpoint – If dict, the return value is as returned by save_checkpoint. If a string, then it is a checkpoint path that may have a different prefix than that returned by save_checkpoint. The directory structure underneath the checkpoint_dir provided to save_checkpoint is preserved.
- save_checkpoint(checkpoint_dir: str) str | Dict | None ¶
Subclasses should override this to implement save().
Warning
Do not rely on absolute paths in the implementation of Trainable.save_checkpoint and Trainable.load_checkpoint.
Use validate_save_restore to catch Trainable.save_checkpoint / Trainable.load_checkpoint errors before execution.
>>> from ray.tune.utils import validate_save_restore
>>> MyTrainableClass = ...
>>> validate_save_restore(MyTrainableClass)
>>> validate_save_restore(
...     MyTrainableClass, use_object_store=True)
New in version 0.8.7.
- Parameters:
checkpoint_dir – The directory where the checkpoint file must be stored. In a Tune run, if the trial is paused, the provided path may be temporary and moved.
- Returns:
A dict or string. If string, the return value is expected to be prefixed by tmp_checkpoint_dir. If dict, the return value will be automatically serialized by Tune and passed to Trainable.load_checkpoint().
Example
>>> trainable, trainable1, trainable2 = ...
>>> print(trainable1.save_checkpoint("/tmp/checkpoint_1"))
"/tmp/checkpoint_1"
>>> print(trainable2.save_checkpoint("/tmp/checkpoint_2"))
{"some": "data"}
>>> trainable.save_checkpoint("/tmp/bad_example")
"/tmp/NEW_CHECKPOINT_PATH/my_checkpoint_file" # This will error.
- setup(config: Dict) None ¶
Given a dict with configuration parameters, runs a hyperparameter search for a model. The dict has to contain the following parameters:
model_module: module of the model
dataset: dataset class instance
cv_split: cross-validation object with train configurations
param_space: dict with the model’s hyperparameters to search and all their possible values
metric (str): metric to use for evaluation
This function is invoked once training starts.
- Parameters:
config (Dict) – a dict with a set of configuration parameters.
- step() Dict[str, int | float] ¶
This function is invoked for each iteration during the search process. For each iteration, it runs a cross-validation training with the selected hyperparameters and returns the mean metrics of the iteration.
- Returns:
A dict with the scores of the iteration.
- Return type:
Dict[str, ct.Number]
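Putting the pieces together, a hedged sketch of driving EstimatorTrainer with Ray Tune; the config keys follow the setup() documentation above, while the concrete values and the dataset placeholder are assumptions:

    from ray import tune
    from honcaml.benchmark.trainable import EstimatorTrainer
    from honcaml.data.transform import CrossValidationSplit

    dataset = ...  # placeholder for a HoNCAML dataset instance (see Data)
    cv_split = CrossValidationSplit('sklearn.model_selection.KFold',
                                    params={'n_splits': 3})

    # Keys follow the setup() documentation; values are illustrative only.
    config = {
        'model_module': 'sklearn.ensemble.RandomForestRegressor',
        'dataset': dataset,
        'cv_split': cv_split,
        'param_space': {'n_estimators': tune.randint(10, 100)},
        'metric': 'mean_squared_error',
    }

    tuner = tune.Tuner(EstimatorTrainer, param_space=config)
    results = tuner.fit()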