
Welcome to skrobot’s documentation!

skrobot logo

API Reference

skrobot package

Subpackages

skrobot.core package
Submodules
skrobot.core.experiment module
class skrobot.core.experiment.Experiment(experiments_repository)[source]

Bases: object

The Experiment class can be used to build, track and run an experiment.

It can run BaseTask tasks in the context of an experiment.

When building an experiment and/or running tasks, various metadata as well as task-related files are stored for tracking experiments.

Lastly, an experiment can be configured to send notifications when running a task, which can be useful for teams who need to get notified about the progress of the experiment.

__init__(experiments_repository)[source]

This is the constructor method and can be used to create a new object instance of the Experiment class.

Parameters

experiments_repository (str) – The root directory path under which a unique directory is created for the experiment.

set_notifier(notifier: skrobot.notification.base_notifier.BaseNotifier)[source]

Optional method.

Set the experiment’s notifier.

Parameters

notifier (BaseNotifier) – The experiment’s notifier.

Returns

The object instance itself.

Return type

Experiment

set_source_code_file_path(source_code_file_path)[source]

Optional method.

Set the experiment’s source code file path.

Parameters

source_code_file_path (str) – The experiment’s source code file path.

Returns

The object instance itself.

Return type

Experiment

set_experimenter(experimenter)[source]

Optional method.

Set the experimenter’s name.

By default, the experimenter’s name is anonymous. However, if you want to override it, you can pass a new name.

Parameters

experimenter (str) – The experimenter’s name.

Returns

The object instance itself.

Return type

Experiment

build()[source]

Build the Experiment.

When an experiment is built, it creates a unique directory under which it stores various experiment-related metadata and files for tracking purposes.

Specifically, under the experiment’s directory an experiment.log JSON file is created, which contains a unique auto-generated experiment ID, the current date & time, and the experimenter’s name.

Also, the experiment’s directory name contains the experimenter’s name as well as the current date & time.

Lastly, in case set_source_code_file_path() is used, the experiment’s source code file is also copied under the experiment’s directory.

Returns

The object instance itself.

Return type

Experiment

run(task)[source]

Run a BaseTask task.

When running a task, its recorded parameters (e.g., train_task.params) and any other task-related generated files are stored under the experiment’s directory for tracking purposes.

The task’s recorded parameters are in JSON format.

Also, in case set_notifier() is used to set a notifier, a notification is sent for the success or failure (including the error message) of the task’s execution.

Lastly, in case an exception occurs, a text file (e.g., train_task.errors) is generated under the experiment’s directory containing the error message.

Parameters

task (BaseTask) – The task to run.

Returns

The task’s result.

Return type

Depends on the task parameter.
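
For orientation, here is a minimal sketch of the workflow described above; the repository path, experimenter name, and data file path are placeholders, and TrainTask is documented further below.

from sklearn.linear_model import LogisticRegression

from skrobot.core import Experiment
from skrobot.tasks import TrainTask

# Building the experiment creates a unique, timestamped directory under
# 'experiments-output' and writes the experiment.log metadata file.
experiment = Experiment('experiments-output').set_experimenter('jane').build()

# Running a task stores its parameters (train_task.params) and any generated
# files under the experiment's directory and returns the task's result.
train_results = experiment.run(TrainTask(estimator=LogisticRegression(solver='liblinear'),
                                         train_data_set_file_path='data/train.csv'))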

skrobot.core.task_runner module
class skrobot.core.task_runner.TaskRunner(output_directory_path)[source]

Bases: object

The TaskRunner class is a simplified version (in functionality) of the Experiment class.

It leaves out all the “experiment” stuff and is focused mostly on the execution and tracking of BaseTask tasks.

__init__(output_directory_path)[source]

This is the constructor method and can be used to create a new object instance of the TaskRunner class.

Parameters

output_directory_path (str) – The output directory path under which task-related generated files are stored.

run(task)[source]

Run a BaseTask task.

When running a task, its recorded parameters (e.g., train_task.params) and any other task-related generated files are stored under the output directory for tracking purposes.

The task’s recorded parameters are in JSON format.

Lastly, in case an exception occurs, a text file (e.g., train_task.errors) is generated under the output directory containing the error message.

Parameters

task (BaseTask) – The task to run.

Returns

The task’s result.

Return type

Depends on the task parameter.
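
A similar minimal sketch for TaskRunner follows; the output directory and data file path are placeholders. It mirrors Experiment.run() without the experiment-level metadata and notifications.

from sklearn.linear_model import LogisticRegression

from skrobot.core.task_runner import TaskRunner
from skrobot.tasks import TrainTask

runner = TaskRunner('output')

# Task parameters and any generated files are stored under the 'output' directory.
train_results = runner.run(TrainTask(estimator=LogisticRegression(solver='liblinear'),
                                     train_data_set_file_path='data/train.csv'))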

skrobot.feature_selection package
Submodules
skrobot.feature_selection.column_selector module
class skrobot.feature_selection.column_selector.ColumnSelector(cols, drop_axis=False)[source]

Bases: sklearn.base.BaseEstimator

The ColumnSelector class is an implementation of a column selector for scikit-learn pipelines.

It can be used for manual feature selection to select specific columns from an input data set.

It can select columns either by integer indices or by names.

__init__(cols, drop_axis=False)[source]

This is the constructor method and can be used to create a new object instance of the ColumnSelector class.

Parameters
  • cols (list) – A non-empty list specifying the columns to be selected. For example, [1, 4, 5] to select the 2nd, 5th, and 6th columns, and [‘A’,’C’,’D’] to select the columns A, C and D.

  • drop_axis (bool, optional) – Can be used to reshape the output data set from (n_samples, 1) to (n_samples) by dropping the last axis. It defaults to False.

fit_transform(X, y=None)[source]

Returns a slice of the input data set.

Parameters
  • X ({NumPy array, pandas DataFrame, SciPy sparse matrix}) – Input vectors of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.

  • y (None) – Ignored.

Returns

Subset of the input data set of shape (n_samples, k_features), where n_samples is the number of samples and k_features <= n_features.

Return type

{NumPy array, SciPy sparse matrix}

transform(X, y=None)[source]

Returns a slice of the input data set.

Parameters
  • X ({NumPy array, pandas DataFrame, SciPy sparse matrix}) – Input vectors of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.

  • y (None) – Ignored.

Returns

Subset of the input data set of shape (n_samples, k_features), where n_samples is the number of samples and k_features <= n_features.

Return type

{NumPy array, SciPy sparse matrix}

fit(X, y=None)[source]

This is a mock method and does nothing.

Parameters
  • X (None) – Ignored.

  • y (None) – Ignored.

Returns

The object instance itself.

Return type

ColumnSelector
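
A small, self-contained sketch of ColumnSelector on a toy pandas DataFrame, showing selection by column name and by integer index as described above (the data is made up).

import pandas as pd

from skrobot.feature_selection import ColumnSelector

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Select columns by name; the result is a slice of shape (n_samples, k_features).
print(ColumnSelector(cols=['A', 'C']).fit_transform(df))

# Select a single column by integer index and drop the last axis,
# turning the (n_samples, 1) output into (n_samples,).
print(ColumnSelector(cols=[1], drop_axis=True).fit_transform(df))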

skrobot.notification package
Submodules
skrobot.notification.base_notifier module
class skrobot.notification.base_notifier.BaseNotifier[source]

Bases: abc.ABC

The BaseNotifier is an abstract base class for implementing notifiers.

A notifier can be used to send notifications.

abstract notify(message)[source]

An abstract method for sending the notification.

Parameters

message (str) – The notification’s message.
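
A minimal sketch of a custom notifier follows; the FileNotifier name and its file handling are hypothetical. The Titanic example further below uses the same idea with a ConsoleNotifier.

from skrobot.notification import BaseNotifier

class FileNotifier(BaseNotifier):
    # Hypothetical notifier that appends each notification message to a local file.
    def __init__(self, file_path):
        self._file_path = file_path

    def notify(self, message):
        with open(self._file_path, 'a') as f:
            f.write(message + '\n')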

skrobot.tasks package
Submodules
skrobot.tasks.base_cross_validation_task module
class skrobot.tasks.base_cross_validation_task.BaseCrossValidationTask(type_name, args)[source]

Bases: skrobot.tasks.base_task.BaseTask

The BaseCrossValidationTask is an abstract base class for implementing tasks that use cross-validation functionality.

It can support both stratified k-fold cross-validation and cross-validation with user-defined folds.

By default, stratified k-fold cross-validation is used with the default parameters of the stratified_folds() method.

__init__(type_name, args)[source]

This is the constructor method and can be used by child BaseCrossValidationTask implementations.

Parameters
  • type_name (str) – The task’s type name. A common practice is to pass the name of the task’s class.

  • args (dict) – The task’s parameters. A common practice is to pass the parameters at the time of the task’s object creation. It is a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

custom_folds(folds_file_path, fold_column='fold')[source]

Optional method.

Use cross-validation with user-defined custom folds.

Parameters
  • folds_file_path (str) – The path to the file containing the user-defined folds for the samples. The file needs to be formatted with the same separation delimiter (comma for CSV, tab for TSV, etc.) as the one used in the input data set files provided to the task. The file must contain two data columns and the first row must be the header. The first column is for the sample IDs and needs to be the same as the one used in the input data set files provided to the task. The second column is for the fold IDs (e.g., 1 through 5, A through D, etc.).

  • fold_column (str, optional) – The column name for the fold IDs. It defaults to ‘fold’.

Returns

The object instance itself.

Return type

BaseCrossValidationTask
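
For illustration, a hypothetical folds file 'data/folds.csv' for a comma-delimited data set could look like the commented content below; it is then wired into any concrete task (here an EvaluationCrossValidationTask with placeholder arguments).

# Contents of data/folds.csv (header row, then sample IDs and fold IDs):
#
#   id,fold
#   1,A
#   2,B
#   3,A
#   4,B

from sklearn.linear_model import LogisticRegression

from skrobot.tasks import EvaluationCrossValidationTask

task = EvaluationCrossValidationTask(
    estimator=LogisticRegression(solver='liblinear'),
    train_data_set_file_path='data/train.csv').custom_folds(folds_file_path='data/folds.csv')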

stratified_folds(total_folds=3, shuffle=False)[source]

Optional method.

Use stratified k-fold cross-validation.

The folds are made by preserving the percentage of samples for each class.

Parameters
  • total_folds (int, optional) – Number of folds. Must be at least 2. It defaults to 3.

  • shuffle (bool, optional) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. It defaults to False.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

abstract run(output_directory)

An abstract method for running the task.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

skrobot.tasks.base_task module
class skrobot.tasks.base_task.BaseTask(type_name, args)[source]

Bases: abc.ABC

The BaseTask is an abstract base class for implementing tasks.

A task is a configurable and reproducible piece of code built on top of scikit-learn that can be used in machine learning pipelines.

__init__(type_name, args)[source]

This is the constructor method and can be used by child BaseTask implementations.

Parameters
  • type_name (str) – The task’s type name. A common practice is to pass the name of the task’s class.

  • args (dict) – The task’s parameters. A common practice is to pass the parameters at the time of the task’s object creation. It is a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

get_type()[source]

Get the task’s type name.

Returns

The task’s type name.

Return type

str

get_configuration()[source]

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

abstract run(output_directory)[source]

An abstract method for running the task.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.
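
A minimal sketch of a custom task built on BaseTask; the class name, parameter, and the file it writes are hypothetical.

import os

from skrobot.tasks.base_task import BaseTask

class HelloTask(BaseTask):
    def __init__(self, greeting='hello'):
        # Register the type name and parameters so that get_type(),
        # get_configuration(), and the tracking files reflect this task.
        super().__init__(HelloTask.__name__, {'greeting': greeting})

        self.greeting = greeting

    def run(self, output_directory):
        # Any task-related generated files go under the provided output directory.
        with open(os.path.join(output_directory, 'greeting.txt'), 'w') as f:
            f.write(self.greeting)

        return self.greeting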

skrobot.tasks.evaluation_cross_validation_task module
class skrobot.tasks.evaluation_cross_validation_task.EvaluationCrossValidationTask(estimator, train_data_set_file_path, test_data_set_file_path=None, estimator_params=None, field_delimiter=',', feature_columns='all', id_column='id', label_column='label', random_seed=42, threshold_selection_by='f1', metric_greater_is_better=True, threshold_tuning_range=(0.01, 1.0, 0.01), export_classification_reports=False, export_confusion_matrixes=False, export_roc_curves=False, export_pr_curves=False, export_false_positives_reports=False, export_false_negatives_reports=False, export_also_for_train_folds=False, fscore_beta=1)[source]

Bases: skrobot.tasks.base_cross_validation_task.BaseCrossValidationTask

The EvaluationCrossValidationTask class can be used to evaluate a scikit-learn estimator/pipeline on some data.

The following evaluation results can be generated on demand for the hold-out test data set as well as the train/validation cross-validation folds:

  • PR / ROC Curves

  • Confusion Matrixes

  • Classification Reports

  • Performance Metrics

  • False Positives

  • False Negatives

It can support both stratified k-fold cross-validation and cross-validation with user-defined folds.

By default, stratified k-fold cross-validation is used with the default parameters of the stratified_folds() method.

__init__(estimator, train_data_set_file_path, test_data_set_file_path=None, estimator_params=None, field_delimiter=',', feature_columns='all', id_column='id', label_column='label', random_seed=42, threshold_selection_by='f1', metric_greater_is_better=True, threshold_tuning_range=(0.01, 1.0, 0.01), export_classification_reports=False, export_confusion_matrixes=False, export_roc_curves=False, export_pr_curves=False, export_false_positives_reports=False, export_false_negatives_reports=False, export_also_for_train_folds=False, fscore_beta=1)[source]

This is the constructor method and can be used to create a new object instance of the EvaluationCrossValidationTask class.

Parameters
  • estimator (scikit-learn {estimator, pipeline}) – It can be either an estimator (e.g., LogisticRegression) or a pipeline ending with an estimator. The estimator needs to be able to predict probabilities through a predict_proba method.

  • train_data_set_file_path (str) – The file path of the input train data set. It can be either a URL or a disk file path.

  • test_data_set_file_path (str, optional) – The file path of the input test data set. It can be either a URL or a disk file path. It defaults to None.

  • estimator_params (dict, optional) – The parameters to override in the provided estimator/pipeline. It defaults to None.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input train/test data set files. It defaults to ‘,’.

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns from the input train/test data set files, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input train/test data set files containing the sample IDs. It defaults to ‘id’.

  • label_column (str, optional) – The name of the column in the input train/test data set files containing the ground truth labels. It defaults to ‘label’.

  • random_seed (int, optional) – The random seed used in the random number generator. It can be used to reproduce the output. It defaults to 42.

  • threshold_selection_by ({str, float}, optional) – The evaluation results will be generated either for a specific provided threshold value (e.g., 0.49) or for the best threshold found from threshold tuning, based on a specific provided metric (e.g., ‘f1’, ‘f0.55’). It defaults to ‘f1’.

  • metric_greater_is_better (bool, optional) – This flag controls the direction of the search for the best threshold and depends on the metric provided in threshold_selection_by. True means that greater metric values are better, and False means the opposite. It defaults to True.

  • threshold_tuning_range (tuple, optional) – A range in the form (start_value, stop_value, step_size) for generating a sequence of threshold values in threshold tuning. It generates the sequence by incrementing the start value using the step size until it reaches the stop value. It defaults to (0.01, 1.0, 0.01).

  • export_classification_reports (bool, optional) – Whether this task will export classification reports. It defaults to False.

  • export_confusion_matrixes (bool, optional) – Whether this task will export confusion matrixes. It defaults to False.

  • export_roc_curves (bool, optional) – Whether this task will export ROC curves. It defaults to False.

  • export_pr_curves (bool, optional) – Whether this task will export PR curves. It defaults to False.

  • export_false_positives_reports (bool, optional) – Whether this task will export false positives reports. It defaults to False.

  • export_false_negatives_reports (bool, optional) – Whether this task will export false negatives reports. It defaults to False.

  • export_also_for_train_folds (bool, optional) – Whether this task will also export the evaluation results for the train folds of cross-validation. It defaults to False.

  • fscore_beta (float, optional) – The beta parameter in F-measure. It determines the weight of recall in the score. beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> +inf only recall). It defaults to 1.

run(output_directory)[source]

Run the task.

All of the evaluation results are stored as files under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, the threshold used along with its related performance metrics and summary metrics from all cross-validation splits as well as the hold-out test data set.

Return type

dict
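
A minimal sketch of running the task on its own through a TaskRunner follows; the paths and estimator are placeholders. The returned dictionary is the one whose keys are used in the full examples further below.

from sklearn.linear_model import LogisticRegression

from skrobot.core.task_runner import TaskRunner
from skrobot.tasks import EvaluationCrossValidationTask

task = EvaluationCrossValidationTask(
    estimator=LogisticRegression(solver='liblinear'),
    train_data_set_file_path='data/train.csv',
    test_data_set_file_path='data/test.csv',
    export_pr_curves=True,
    export_roc_curves=True).stratified_folds(total_folds=5, shuffle=True)

results = TaskRunner('output').run(task)

# The threshold chosen by tuning and its hold-out test metrics.
print(results['threshold'])
print(results['test_threshold_metrics'])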

custom_folds(folds_file_path, fold_column='fold')

Optional method.

Use cross-validation with user-defined custom folds.

Parameters
  • folds_file_path (str) – The path to the file containing the user-defined folds for the samples. The file needs to be formatted with the same separation delimiter (comma for CSV, tab for TSV, etc.) as the one used in the input data set files provided to the task. The file must contain two data columns and the first row must be the header. The first column is for the sample IDs and needs to be the same as the one used in the input data set files provided to the task. The second column is for the fold IDs (e.g., 1 through 5, A through D, etc.).

  • fold_column (str, optional) – The column name for the fold IDs. It defaults to ‘fold’.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

stratified_folds(total_folds=3, shuffle=False)

Optional method.

Use stratified k-fold cross-validation.

The folds are made by preserving the percentage of samples for each class.

Parameters
  • total_folds (int, optional) – Number of folds. Must be at least 2. It defaults to 3.

  • shuffle (bool, optional) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. It defaults to False.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

skrobot.tasks.feature_selection_cross_validation_task module
class skrobot.tasks.feature_selection_cross_validation_task.FeatureSelectionCrossValidationTask(estimator, train_data_set_file_path, estimator_params=None, field_delimiter=',', preprocessor=None, preprocessor_params=None, min_features_to_select=1, scoring='f1', feature_columns='all', id_column='id', label_column='label', random_seed=42, verbose=3, n_jobs=1)[source]

Bases: skrobot.tasks.base_cross_validation_task.BaseCrossValidationTask

The FeatureSelectionCrossValidationTask class can be used to perform feature selection with Recursive Feature Elimination using a scikit-learn estimator on some data.

A scikit-learn preprocessor can be used on the input train data set before feature selection runs.

It can support both stratified k-fold cross-validation and cross-validation with user-defined folds.

By default, stratified k-fold cross-validation is used with the default parameters of the stratified_folds() method.

__init__(estimator, train_data_set_file_path, estimator_params=None, field_delimiter=',', preprocessor=None, preprocessor_params=None, min_features_to_select=1, scoring='f1', feature_columns='all', id_column='id', label_column='label', random_seed=42, verbose=3, n_jobs=1)[source]

This is the constructor method and can be used to create a new object instance of the FeatureSelectionCrossValidationTask class.

Parameters
  • estimator (scikit-learn estimator) – An estimator (e.g., LogisticRegression). It needs to provide feature importances through either a coef_ or a feature_importances_ attribute.

  • train_data_set_file_path (str) – The file path of the input train data set. It can be either a URL or a disk file path.

  • estimator_params (dict, optional) – The parameters to override in the provided estimator. It defaults to None.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input train data set file. It defaults to ‘,’.

  • preprocessor (scikit-learn preprocessor, optional) – The preprocessor you want to run on the input train data set before feature selection. You can set, for example, a scikit-learn ColumnTransformer, OneHotEncoder, etc. It defaults to None.

  • preprocessor_params (dict, optional) – The parameters to override in the provided preprocessor. It defaults to None.

  • min_features_to_select (int, optional) – The minimum number of features to be selected. This number of features will always be scored. It defaults to 1.

  • scoring ({str, callable}, optional) – A single scikit-learn scorer string (e.g., ‘f1’) or a callable that is built with scikit-learn make_scorer. Note that when using custom scorers, each scorer should return a single value. It defaults to ‘f1’.

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns from the input train data set file, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input train data set file containing the sample IDs. It defaults to ‘id’.

  • label_column (str, optional) – The name of the column in the input train data set file containing the ground truth labels. It defaults to ‘label’.

  • random_seed (int, optional) – The random seed used in the random number generator. It can be used to reproduce the output. It defaults to 42.

  • verbose (int, optional) – Controls the verbosity of output. The higher, the more messages. It defaults to 3.

  • n_jobs (int, optional) – Number of jobs to run in parallel. -1 means using all processors. It defaults to 1.

run(output_directory)[source]

Run the task.

The selected features are returned as a result and also stored in a features_selected.txt text file under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, the selected features, which can be either column names from the input train data set or column indexes from the preprocessed data set, depending on whether a preprocessor was used or not.

Return type

list
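
A minimal sketch of running the task through a TaskRunner follows; the paths, estimator, and preprocessor are placeholders. With a preprocessor the result is a list of column indexes of the preprocessed data set, otherwise a list of column names.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from skrobot.core.task_runner import TaskRunner
from skrobot.tasks import FeatureSelectionCrossValidationTask

task = FeatureSelectionCrossValidationTask(
    estimator=LogisticRegression(solver='liblinear'),
    train_data_set_file_path='data/train.csv',
    preprocessor=StandardScaler(),
    min_features_to_select=3).stratified_folds(total_folds=5, shuffle=True)

selected_features = TaskRunner('output').run(task)

print(selected_features)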

custom_folds(folds_file_path, fold_column='fold')

Optional method.

Use cross-validation with user-defined custom folds.

Parameters
  • folds_file_path (str) – The path to the file containing the user-defined folds for the samples. The file needs to be formatted with the same separation delimiter (comma for CSV, tab for TSV, etc.) as the one used in the input data set files provided to the task. The file must contain two data columns and the first row must be the header. The first column is for the sample IDs and needs to be the same as the one used in the input data set files provided to the task. The second column is for the fold IDs (e.g., 1 through 5, A through D, etc.).

  • fold_column (str, optional) – The column name for the fold IDs. It defaults to ‘fold’.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

stratified_folds(total_folds=3, shuffle=False)

Optional method.

Use stratified k-fold cross-validation.

The folds are made by preserving the percentage of samples for each class.

Parameters
  • total_folds (int, optional) – Number of folds. Must be at least 2. It defaults to 3.

  • shuffle (bool, optional) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. It defaults to False.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

skrobot.tasks.hyperparameters_search_cross_validation_task module
class skrobot.tasks.hyperparameters_search_cross_validation_task.HyperParametersSearchCrossValidationTask(estimator, search_params, train_data_set_file_path, estimator_params=None, field_delimiter=',', scorers=['roc_auc', 'average_precision', 'f1', 'precision', 'recall', 'accuracy'], feature_columns='all', id_column='id', label_column='label', objective_score='f1', random_seed=42, verbose=3, n_jobs=1, return_train_score=True)[source]

Bases: skrobot.tasks.base_cross_validation_task.BaseCrossValidationTask

The HyperParametersSearchCrossValidationTask class can be used to search the best hyperparameters of a scikit-learn estimator/pipeline on some data.

Cross-Validation

It can support both stratified k-fold cross-validation and cross-validation with user-defined folds.

By default, stratified k-fold cross-validation is used with the default parameters of the stratified_folds() method.

Search

It can support both grid search and random search.

By default, grid search is used.

__init__(estimator, search_params, train_data_set_file_path, estimator_params=None, field_delimiter=',', scorers=['roc_auc', 'average_precision', 'f1', 'precision', 'recall', 'accuracy'], feature_columns='all', id_column='id', label_column='label', objective_score='f1', random_seed=42, verbose=3, n_jobs=1, return_train_score=True)[source]

This is the constructor method and can be used to create a new object instance of the HyperParametersSearchCrossValidationTask class.

Parameters
  • estimator (scikit-learn {estimator, pipeline}) – It can be either an estimator (e.g., LogisticRegression) or a pipeline ending with an estimator.

  • search_params ({dict, list of dictionaries}) – Dictionary with hyperparameters names as keys and lists of hyperparameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of hyperparameter settings.

  • train_data_set_file_path (str) – The file path of the input train data set. It can be either a URL or a disk file path.

  • estimator_params (dict, optional) – The parameters to override in the provided estimator/pipeline. It defaults to None.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input train data set file. It defaults to ‘,’.

  • scorers ({list, dict}, optional) – Multiple metrics to evaluate the predictions on the hold-out data. Either give a list of (unique) strings or a dict with names as keys and callables as values. The callables should be scorers built using scikit-learn make_scorer. Note that when using custom scorers, each scorer should return a single value. It defaults to [‘roc_auc’, ‘average_precision’, ‘f1’, ‘precision’, ‘recall’, ‘accuracy’].

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns from the input train data set file, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input train data set file containing the sample IDs. It defaults to ‘id’.

  • label_column (str, optional) – The name of the column in the input train data set file containing the ground truth labels. It defaults to ‘label’.

  • objective_score (str, optional) – The scorer used to find the best hyperparameters for refitting the best estimator/pipeline at the end. It defaults to ‘f1’.

  • random_seed (int, optional) – The random seed used in the random number generator. It can be used to reproduce the output. It defaults to 42.

  • verbose (int, optional) – Controls the verbosity of output. The higher, the more messages. It defaults to 3.

  • n_jobs (int, optional) – Number of jobs to run in parallel. -1 means using all processors. It defaults to 1.

  • return_train_score (bool, optional) – If False, training scores will not be computed and returned. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. It defaults to True.

grid_search()[source]

Optional method.

Use the grid search method when searching the best hyperparameters.

Returns

The object instance itself.

Return type

HyperParametersSearchCrossValidationTask

random_search(n_iters=200)[source]

Optional method.

Use the random search method when searching the best hyperparameters.

Parameters

n_iters (int, optional) – Number of hyperparameter settings that are sampled. n_iters trades off runtime vs quality of the solution. It defaults to 200.

Returns

The object instance itself.

Return type

HyperParametersSearchCrossValidationTask
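
A short sketch of choosing between the two search strategies follows; the estimator, parameter grid, and file path are placeholders. Grid search is the default, while random_search() opts into sampling a fixed number of settings.

from sklearn.linear_model import LogisticRegression

from skrobot.tasks import HyperParametersSearchCrossValidationTask

search_params = {'C': [0.1, 1.0, 10.0], 'penalty': ['l1', 'l2']}

# Exhaustive grid search (the default behaviour).
grid_task = HyperParametersSearchCrossValidationTask(
    estimator=LogisticRegression(solver='liblinear'),
    search_params=search_params,
    train_data_set_file_path='data/train.csv').grid_search()

# Random search over 50 sampled hyperparameter settings instead.
random_task = HyperParametersSearchCrossValidationTask(
    estimator=LogisticRegression(solver='liblinear'),
    search_params=search_params,
    train_data_set_file_path='data/train.csv').random_search(n_iters=50)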

run(output_directory)[source]

Run the task.

The search results (search_results) are also stored in a search_results.html file as a static HTML table under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically:

  • best_estimator: The estimator/pipeline that was chosen by the search, i.e., the estimator/pipeline which gave the best score on the hold-out data.

  • best_params: The hyperparameter setting that gave the best results on the hold-out data.

  • best_score: Mean cross-validated score of the best_estimator.

  • search_results: Metrics measured for each of the hyperparameter settings in the search.

  • best_index: The index (of search_results) which corresponds to the best candidate hyperparameter setting.

Return type

dict

custom_folds(folds_file_path, fold_column='fold')

Optional method.

Use cross-validation with user-defined custom folds.

Parameters
  • folds_file_path (str) – The path to the file containing the user-defined folds for the samples. The file needs to be formatted with the same separation delimiter (comma for CSV, tab for TSV, etc.) as the one used in the input data set files provided to the task. The file must contain two data columns and the first row must be the header. The first column is for the sample IDs and needs to be the same as the one used in the input data set files provided to the task. The second column is for the fold IDs (e.g., 1 through 5, A through D, etc.).

  • fold_column (str, optional) – The column name for the fold IDs. It defaults to ‘fold’.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

stratified_folds(total_folds=3, shuffle=False)

Optional method.

Use stratified k-fold cross-validation.

The folds are made by preserving the percentage of samples for each class.

Parameters
  • total_folds (int, optional) – Number of folds. Must be at least 2. It defaults to 3.

  • shuffle (bool, optional) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. It defaults to False.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

skrobot.tasks.prediction_task module
class skrobot.tasks.prediction_task.PredictionTask(estimator, data_set_file_path, field_delimiter=',', feature_columns='all', id_column='id', prediction_column='prediction', threshold=0.5)[source]

Bases: skrobot.tasks.base_task.BaseTask

The PredictionTask class can be used to predict new data using a scikit-learn estimator/pipeline.

__init__(estimator, data_set_file_path, field_delimiter=',', feature_columns='all', id_column='id', prediction_column='prediction', threshold=0.5)[source]

This is the constructor method and can be used to create a new object instance of the PredictionTask class.

Parameters
  • estimator (scikit-learn {estimator, pipeline}) – It can be either an estimator (e.g., LogisticRegression) or a pipeline ending with an estimator. The estimator needs to be able to predict probabilities through a predict_proba method.

  • data_set_file_path (str) – The file path of the input data set. It can be either a URL or a disk file path.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input data set file. It defaults to ‘,’.

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns from the input data set file, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input data set file containing the sample IDs. It defaults to ‘id’.

  • prediction_column (str, optional) – The name of the column for the predicted binary class labels. It defaults to ‘prediction’.

  • threshold (float, optional) – The threshold to use for converting the predicted probability into a binary class label. It defaults to 0.5.

run(output_directory)[source]

Run the task.

The predictions are returned as a result and also stored in a predictions.csv CSV file under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, the predictions for the input data set, containing the sample IDs, the predicted binary class labels, and the predicted probabilities for the positive class.

Return type

pandas DataFrame

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

skrobot.tasks.train_task module
class skrobot.tasks.train_task.TrainTask(estimator, train_data_set_file_path, estimator_params=None, field_delimiter=',', feature_columns='all', id_column='id', label_column='label', random_seed=42)[source]

Bases: skrobot.tasks.base_task.BaseTask

The TrainTask class can be used to fit a scikit-learn estimator/pipeline on train data.

__init__(estimator, train_data_set_file_path, estimator_params=None, field_delimiter=',', feature_columns='all', id_column='id', label_column='label', random_seed=42)[source]

This is the constructor method and can be used to create a new object instance of the TrainTask class.

Parameters
  • estimator (scikit-learn {estimator, pipeline}) – It can be either an estimator (e.g., LogisticRegression) or a pipeline ending with an estimator.

  • train_data_set_file_path (str) – The file path of the input train data set. It can be either a URL or a disk file path.

  • estimator_params (dict, optional) – The parameters to override in the provided estimator/pipeline. It defaults to None.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input train data set file. It defaults to ‘,’.

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns from the input train data set file, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input train data set file containing the sample IDs. It defaults to ‘id’.

  • label_column (str, optional) – The name of the column in the input train data set file containing the ground truth labels. It defaults to ‘label’.

  • random_seed (int, optional) – The random seed used in the random number generator. It can be used to reproduce the output. It defaults to 42.

run(output_directory)[source]

Run the task.

The fitted estimator/pipeline is returned as a result and also stored in a trained_model.pkl pickle file under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, a dictionary containing the fitted estimator/pipeline (available under the ‘estimator’ key, as used in the examples further below).

Return type

dict

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

What is it about?

skrobot is a Python module for designing, running and tracking Machine Learning experiments / tasks. It is built on top of the scikit-learn framework.

Why does it exist?

It can help Data Scientists and Machine Learning Engineers:

  • to keep track of modelling experiments / tasks

  • to automate the repetitive (and boring) stuff when designing modelling pipelines

  • to spend more time on the things that truly matter when solving a problem

How do I install it?

$ pip install skrobot

What are the components?

NOTE: Currently, skrobot can be used only for binary classification problems.

For the module’s users

  • Train Task – This task can be used to fit a scikit-learn estimator on some data.

  • Prediction Task – This task can be used to predict new data using a scikit-learn estimator.

  • Evaluation Cross Validation Task – This task can be used to evaluate a scikit-learn estimator on some data.

  • Feature Selection Cross Validation Task – This task can be used to perform feature selection with Recursive Feature Elimination using a scikit-learn estimator on some data.

  • Hyperparameters Search Cross Validation Task – This task can be used to search the best hyperparameters of a scikit-learn estimator on some data.

  • Experiment – This is used to build, track and run an experiment. It can run tasks in the context of an experiment.

  • Task Runner – This is a simplified version (in functionality) of the Experiment component. It leaves out all the “experiment” stuff and is focused mostly on the execution and tracking of tasks.

For the module’s developers

  • Base Task – All tasks inherit from this component. A task is a configurable and reproducible piece of code built on top of scikit-learn that can be used in machine learning pipelines.

  • Base Cross Validation Task – All tasks that use cross-validation functionality inherit from this component.

  • Base Notifier – All notifiers inherit from this component. A notifier can be used to send success / failure notifications for task execution.

How do I use it?

The following examples use many of skrobot’s components to build a machine learning modelling pipeline. Please try them and we would love to have your feedback! Furthermore, many examples can be found in the project’s repository.

Example on Titanic Dataset

The following example was used to generate the sample results shown in the “Sample of generated results?” section further below.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

from skrobot.core import Experiment
from skrobot.tasks import TrainTask
from skrobot.tasks import PredictionTask
from skrobot.tasks import FeatureSelectionCrossValidationTask
from skrobot.tasks import EvaluationCrossValidationTask
from skrobot.tasks import HyperParametersSearchCrossValidationTask
from skrobot.feature_selection import ColumnSelector
from skrobot.notification import BaseNotifier

######### Initialization Code

train_data_set_file_path = 'https://bit.ly/titanic-data-train'

test_data_set_file_path = 'https://bit.ly/titanic-data-test'

new_data_set_file_path = 'https://bit.ly/titanic-data-new'

random_seed = 42

id_column = 'PassengerId'

label_column = 'Survived'

numerical_features = ['Age', 'Fare', 'SibSp', 'Parch']

categorical_features = ['Embarked', 'Sex', 'Pclass']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('numerical_transformer', numeric_transformer, numerical_features),
    ('categorical_transformer', categorical_transformer, categorical_features)])

classifier = LogisticRegression(solver='liblinear', random_state=random_seed)

search_params = {
    "classifier__C" : [ 1.e-01, 1.e+00, 1.e+01 ],
    "classifier__penalty" : [ "l1", "l2" ],
    "preprocessor__numerical_transformer__imputer__strategy" : [ "mean", "median" ]
}

######### skrobot Code

# Define a Notifier (This is optional and you can implement any notifier you want, e.g. for Slack / Trello / Discord)
class ConsoleNotifier(BaseNotifier):
    def notify(self, message):
        print(message)

# Build an Experiment
experiment = Experiment('experiments-output').set_source_code_file_path(__file__).set_experimenter('echatzikyriakidis').set_notifier(ConsoleNotifier()).build()

# Run Feature Selection Task
features_columns = experiment.run(FeatureSelectionCrossValidationTask (estimator=classifier,
                                                                       train_data_set_file_path=train_data_set_file_path,
                                                                       preprocessor=preprocessor,
                                                                       min_features_to_select=4,
                                                                       id_column=id_column,
                                                                       label_column=label_column,
                                                                       random_seed=random_seed).stratified_folds(total_folds=5, shuffle=True))

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('selector', ColumnSelector(cols=features_columns)),
                       ('classifier', classifier)])

# Run Hyperparameters Search Task
hyperparameters_search_results = experiment.run(HyperParametersSearchCrossValidationTask (estimator=pipe,
                                                                                          search_params=search_params,
                                                                                          train_data_set_file_path=train_data_set_file_path,
                                                                                          id_column=id_column,
                                                                                          label_column=label_column,
                                                                                          random_seed=random_seed).random_search(n_iters=100).stratified_folds(total_folds=5, shuffle=True))

# Run Evaluation Task
evaluation_results = experiment.run(EvaluationCrossValidationTask(estimator=pipe,
                                                                  estimator_params=hyperparameters_search_results['best_params'],
                                                                  train_data_set_file_path=train_data_set_file_path,
                                                                  test_data_set_file_path=test_data_set_file_path,
                                                                  id_column=id_column,
                                                                  label_column=label_column,
                                                                  random_seed=random_seed,
                                                                  export_classification_reports=True,
                                                                  export_confusion_matrixes=True,
                                                                  export_pr_curves=True,
                                                                  export_roc_curves=True,
                                                                  export_false_positives_reports=True,
                                                                  export_false_negatives_reports=True,
                                                                  export_also_for_train_folds=True).stratified_folds(total_folds=5, shuffle=True))

# Run Train Task
train_results = experiment.run(TrainTask(estimator=pipe,
                                         estimator_params=hyperparameters_search_results['best_params'],
                                         train_data_set_file_path=train_data_set_file_path,
                                         id_column=id_column,
                                         label_column=label_column,
                                         random_seed=random_seed))

# Run Prediction Task
predictions = experiment.run(PredictionTask(estimator=train_results['estimator'],
                                            data_set_file_path=new_data_set_file_path,
                                            id_column=id_column,
                                            prediction_column=label_column,
                                            threshold=evaluation_results['threshold']))

# Print in-memory results
print(features_columns)

print(hyperparameters_search_results['best_params'])
print(hyperparameters_search_results['best_index'])
print(hyperparameters_search_results['best_estimator'])
print(hyperparameters_search_results['best_score'])
print(hyperparameters_search_results['search_results'])

print(evaluation_results['threshold'])
print(evaluation_results['cv_threshold_metrics'])
print(evaluation_results['cv_splits_threshold_metrics'])
print(evaluation_results['cv_splits_threshold_metrics_summary'])
print(evaluation_results['test_threshold_metrics'])

print(train_results['estimator'])

print(predictions)

Example on SMS Spam Collection Dataset

The following example was used to generate the sample results shown in the “Sample of generated results?” section further below.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import SGDClassifier

from skrobot.core import Experiment
from skrobot.tasks import TrainTask
from skrobot.tasks import PredictionTask
from skrobot.tasks import EvaluationCrossValidationTask
from skrobot.tasks import HyperParametersSearchCrossValidationTask
from skrobot.feature_selection import ColumnSelector

######### Initialization Code

train_data_set_file_path = 'https://bit.ly/sms-spam-ham-data-train'

test_data_set_file_path = 'https://bit.ly/sms-spam-ham-data-test'

new_data_set_file_path = 'https://bit.ly/sms-spam-ham-data-new'

field_delimiter = '\t'

random_seed = 42

pipe = Pipeline(steps=[
    ('column_selection', ColumnSelector(cols=['message'], drop_axis=True)),
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('feature_selection', SelectPercentile(chi2)),
    ('classifier', SGDClassifier(loss='log'))])

search_params = {
    'classifier__max_iter': [ 20, 50, 80 ],
    'classifier__alpha': [ 0.00001, 0.000001 ],
    'classifier__penalty': [ 'l2', 'elasticnet' ],
    "vectorizer__stop_words" : [ "english", None ],
    "vectorizer__ngram_range" : [ (1, 1), (1, 2) ],
    "vectorizer__max_df": [ 0.5, 0.75, 1.0 ],
    "tfidf__use_idf" : [ True, False ],
    "tfidf__norm" : [ 'l1', 'l2' ],
    "feature_selection__percentile" : [ 70, 60, 50 ]
}

######### skrobot Code

# Build an Experiment
experiment = Experiment('experiments-output').set_source_code_file_path(__file__).set_experimenter('echatzikyriakidis').build()

# Run Hyperparameters Search Task
hyperparameters_search_results = experiment.run(HyperParametersSearchCrossValidationTask (estimator=pipe,
                                                                                          search_params=search_params,
                                                                                          train_data_set_file_path=train_data_set_file_path,
                                                                                          field_delimiter=field_delimiter,
                                                                                          random_seed=random_seed).random_search().stratified_folds(total_folds=5, shuffle=True))

# Run Evaluation Task
evaluation_results = experiment.run(EvaluationCrossValidationTask(estimator=pipe,
                                                                  estimator_params=hyperparameters_search_results['best_params'],
                                                                  train_data_set_file_path=train_data_set_file_path,
                                                                  test_data_set_file_path=test_data_set_file_path,
                                                                  field_delimiter=field_delimiter,
                                                                  random_seed=random_seed,
                                                                  export_classification_reports=True,
                                                                  export_confusion_matrixes=True,
                                                                  export_pr_curves=True,
                                                                  export_roc_curves=True,
                                                                  export_false_positives_reports=True,
                                                                  export_false_negatives_reports=True,
                                                                  export_also_for_train_folds=True).stratified_folds(total_folds=5, shuffle=True))

# Run Train Task
train_results = experiment.run(TrainTask(estimator=pipe,
                                         estimator_params=hyperparameters_search_results['best_params'],
                                         train_data_set_file_path=train_data_set_file_path,
                                         field_delimiter=field_delimiter,
                                         random_seed=random_seed))

# Run Prediction Task
predictions = experiment.run(PredictionTask(estimator=train_results['estimator'],
                                            data_set_file_path=new_data_set_file_path,
                                            field_delimiter=field_delimiter,
                                            threshold=evaluation_results['threshold']))

# Print in-memory results
print(hyperparameters_search_results['best_params'])
print(hyperparameters_search_results['best_index'])
print(hyperparameters_search_results['best_estimator'])
print(hyperparameters_search_results['best_score'])
print(hyperparameters_search_results['search_results'])

print(evaluation_results['threshold'])
print(evaluation_results['cv_threshold_metrics'])
print(evaluation_results['cv_splits_threshold_metrics'])
print(evaluation_results['cv_splits_threshold_metrics_summary'])
print(evaluation_results['test_threshold_metrics'])

print(train_results['estimator'])

print(predictions)

Sample of generated results?

Classification Reports

Confusion Matrixes

False Negatives

False Positives

PR Curves

ROC Curves

Performance Metrics (on the train / validation CV folds and on the hold-out test set)

Hyperparameters Search Results

Task Parameters Logging

Experiment Logging

Features Selected

The selected column indexes from the transformed features (generated when a preprocessor is used), or the selected column names from the original features (generated when no preprocessor is used).

Experiment Source Code

Predictions

The people behind it?

Development:

Support, testing and feature recommendations:

And last but not least, all the open-source contributors whose work went into RELEASES.

Can I contribute?

Of course, the project is Free Software and you can contribute to it!

What license do you use?

See our LICENSE for more details.