skrobot.tasks package

Submodules

skrobot.tasks.base_task module

class skrobot.tasks.base_task.BaseTask(type_name, args)[source]

Bases: abc.ABC

The BaseTask is an abstract base class for implementing tasks.

A task is a configurable and reproducible piece of code built on top of scikit-learn that can be used in machine learning pipelines.

__init__(type_name, args)[source]

This is the constructor method. It can be called from child BaseTask implementations.

Parameters
  • type_name (str) – The task’s type name. A common practice is to pass the name of the task’s class.

  • args (dict) – The task’s parameters. A common practice is to pass the parameters given when the task object is created. It is a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

get_type()[source]

Get the task’s type name.

Returns

The task’s type name.

Return type

str

get_configuration()[source]

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

abstract run(output_directory)[source]

An abstract method for running the task.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.
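The interface described above can be sketched with a small stand-in. This is not skrobot's implementation: SketchBaseTask only mirrors the documented method names, and WordCountTask is a hypothetical child task invented for illustration.

```python
from abc import ABC, abstractmethod
import os
import tempfile


class SketchBaseTask(ABC):
    """Stand-in mirroring the documented BaseTask interface (not skrobot's code)."""

    def __init__(self, type_name, args):
        self._type_name = type_name
        # Copy the parameters so later changes by the caller do not leak in.
        self._args = dict(args)

    def get_type(self):
        return self._type_name

    def get_configuration(self):
        return dict(self._args)

    @abstractmethod
    def run(self, output_directory):
        """Child tasks write their generated files under output_directory."""


class WordCountTask(SketchBaseTask):
    """Hypothetical task: counts words and stores the count in a text file."""

    def __init__(self, text):
        # Pass the class name as the type name and the constructor
        # arguments as the parameter dictionary.
        super().__init__(WordCountTask.__name__, {"text": text})
        self.text = text

    def run(self, output_directory):
        count = len(self.text.split())
        path = os.path.join(output_directory, "word_count.txt")
        with open(path, "w") as handle:
            handle.write(str(count))
        return {"word_count": count}


task = WordCountTask("a configurable and reproducible piece of code")
with tempfile.TemporaryDirectory() as directory:
    result = task.run(directory)
    with open(os.path.join(directory, "word_count.txt")) as handle:
        stored = handle.read()
```

Because run() is abstract, instantiating the stand-in base class directly raises a TypeError; only child classes that implement run() can be created.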

skrobot.tasks.base_cross_validation_task module

class skrobot.tasks.base_cross_validation_task.BaseCrossValidationTask(type_name, args)[source]

Bases: skrobot.tasks.base_task.BaseTask

The BaseCrossValidationTask is an abstract base class for implementing tasks that use cross-validation functionality.

It supports both stratified k-fold cross-validation and cross-validation with user-defined folds.

By default, stratified k-fold cross-validation is used with the default parameters of the stratified_folds() method.

__init__(type_name, args)[source]

This is the constructor method. It can be called from child BaseCrossValidationTask implementations.

Parameters
  • type_name (str) – The task’s type name. A common practice is to pass the name of the task’s class.

  • args (dict) – The task’s parameters. A common practice is to pass the parameters given when the task object is created. It is a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

custom_folds(folds_data, fold_column='fold')[source]

Optional method.

Use cross-validation with user-defined custom folds.

Parameters
  • folds_data ({str or pandas DataFrame}) – The input folds data. It can be either a URL, a disk file path or a pandas DataFrame. The folds data contain the user-defined folds for the samples. If a URL or a disk file path is provided the data must be formatted with the same separation delimiter (comma for CSV, tab for TSV, etc.) as the one used in the input data set files provided to the task. The data must contain two columns and the first row must be the header. The first column is for the sample IDs and needs to be the same as the one used in the input data set files provided to the task. The second column is for the fold IDs (e.g., 1 through 5, A through D, etc.).

  • fold_column (str, optional) – The column name for the fold IDs. It defaults to ‘fold’.

Returns

The object instance itself.

Return type

BaseCrossValidationTask
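The folds data format described for custom_folds() can be illustrated with the standard library alone. The sketch below parses a hypothetical two-column folds file and derives one train/validation split per fold. Using each fold once as the validation set is an assumption about how user-defined folds are typically consumed; the helper names read_folds and train_validation_splits are not part of skrobot.

```python
import csv
import io
from collections import defaultdict

# Hypothetical folds file content: a header row, then one row per sample.
# The column names "id" and "fold" match the documented defaults.
folds_csv = "id,fold\ns1,A\ns2,B\ns3,A\ns4,C\ns5,B\ns6,C\n"


def read_folds(text, delimiter=",", id_column="id", fold_column="fold"):
    """Group sample IDs by fold ID, keeping the file order inside each fold."""
    samples_by_fold = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        samples_by_fold[row[fold_column]].append(row[id_column])
    return dict(samples_by_fold)


def train_validation_splits(samples_by_fold):
    """Yield (fold_id, train_ids, validation_ids), one split per fold."""
    for fold_id, validation_ids in samples_by_fold.items():
        train_ids = [
            sample
            for other_id, samples in samples_by_fold.items()
            if other_id != fold_id
            for sample in samples
        ]
        yield fold_id, train_ids, validation_ids


folds = read_folds(folds_csv)
splits = list(train_validation_splits(folds))
```

The sample IDs in such a file must match the IDs in the task's input data set, since they are the only link between the two.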

stratified_folds(total_folds=3, shuffle=False)[source]

Optional method.

Use stratified k-fold cross-validation.

The folds are made by preserving the percentage of samples for each class.

Parameters
  • total_folds (int, optional) – Number of folds. Must be at least 2. It defaults to 3.

  • shuffle (bool, optional) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. It defaults to False.

Returns

The object instance itself.

Return type

BaseCrossValidationTask
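The idea of preserving the percentage of samples for each class can be sketched in plain Python. This is an illustration of the stratification principle only, under the assumption that dealing each class's samples to the folds in turn is an acceptable simplification; it is not the fold assignment that scikit-learn or skrobot actually produce, and stratified_fold_ids is a hypothetical helper.

```python
from collections import Counter, defaultdict
import random


def stratified_fold_ids(labels, total_folds=3, shuffle=False, seed=42):
    """Assign a fold index to every sample, class by class.

    Samples of each class are dealt to the folds in turn, so every fold
    receives roughly the same share of each class.
    """
    if total_folds < 2:
        raise ValueError("total_folds must be at least 2")
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for indices in indices_by_class.values():
        if shuffle:
            rng.shuffle(indices)  # shuffle within the class only
        for position, index in enumerate(indices):
            fold_of[index] = position % total_folds
    return fold_of


# Toy labels: nine samples of class 0 and three of class 1 (a 3:1 ratio).
labels = [0] * 9 + [1] * 3
fold_of = stratified_fold_ids(labels, total_folds=3)
per_fold = [
    Counter(label for label, fold in zip(labels, fold_of) if fold == k)
    for k in range(3)
]
```

With these toy labels every fold ends up with three samples of class 0 and one of class 1, the same 3:1 ratio as the full data set.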

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

abstract run(output_directory)

An abstract method for running the task.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

skrobot.tasks.evaluation_cross_validation_task module

class skrobot.tasks.evaluation_cross_validation_task.EvaluationCrossValidationTask(estimator, train_data_set, test_data_set=None, estimator_params=None, field_delimiter=',', feature_columns='all', id_column='id', label_column='label', random_seed=42, threshold_selection_by='f1', metric_greater_is_better=True, threshold_tuning_range=(0.01, 1.0, 0.01), export_classification_reports=False, export_confusion_matrixes=False, export_roc_curves=False, export_pr_curves=False, export_false_positives_reports=False, export_false_negatives_reports=False, export_also_for_train_folds=False, fscore_beta=1)[source]

Bases: skrobot.tasks.base_cross_validation_task.BaseCrossValidationTask

The EvaluationCrossValidationTask class can be used to evaluate a scikit-learn estimator/pipeline on some data.

The following evaluation results can be generated on demand for the hold-out test data set as well as for the train/validation cross-validation folds:

  • PR / ROC Curves

  • Confusion Matrices

  • Classification Reports

  • Performance Metrics

  • False Positives

  • False Negatives

It supports both stratified k-fold cross-validation and cross-validation with user-defined folds.

By default, stratified k-fold cross-validation is used with the default parameters of the stratified_folds() method.

__init__(estimator, train_data_set, test_data_set=None, estimator_params=None, field_delimiter=',', feature_columns='all', id_column='id', label_column='label', random_seed=42, threshold_selection_by='f1', metric_greater_is_better=True, threshold_tuning_range=(0.01, 1.0, 0.01), export_classification_reports=False, export_confusion_matrixes=False, export_roc_curves=False, export_pr_curves=False, export_false_positives_reports=False, export_false_negatives_reports=False, export_also_for_train_folds=False, fscore_beta=1)[source]

This is the constructor method and can be used to create a new object instance of the EvaluationCrossValidationTask class.

Parameters
  • estimator (scikit-learn {estimator, pipeline}) – It can be either an estimator (e.g., LogisticRegression) or a pipeline ending with an estimator. The estimator needs to be able to predict probabilities through a predict_proba method.

  • train_data_set ({str or pandas DataFrame}) – The input train data set. It can be either a URL, a disk file path or a pandas DataFrame.

  • test_data_set ({str or pandas DataFrame}, optional) – The input test data set. It can be either a URL, a disk file path or a pandas DataFrame. It defaults to None.

  • estimator_params (dict, optional) – The parameters to override in the provided estimator/pipeline. It defaults to None.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input train/test data set files. It defaults to ‘,’.

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns of the input train/test data set files, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input train/test data set files containing the sample IDs. It defaults to ‘id’.

  • label_column (str, optional) – The name of the column in the input train/test data set files containing the ground truth labels. It defaults to ‘label’.

  • random_seed (int, optional) – The random seed used in the random number generator. It can be used to reproduce the output. It defaults to 42.

  • threshold_selection_by ({str, float}, optional) – The evaluation results will be generated either for a specific provided threshold value (e.g., 0.49) or for the best threshold found from threshold tuning, based on a specific provided metric (e.g., ‘f1’, ‘f0.55’). It defaults to ‘f1’.

  • metric_greater_is_better (bool, optional) – Controls the direction of the search for the best threshold, and depends on the metric provided in threshold_selection_by. True means that greater metric values are better; False means the opposite. It defaults to True.

  • threshold_tuning_range (tuple, optional) – A range in form (start_value, stop_value, step_size) for generating a sequence of threshold values in threshold tuning. It generates the sequence by incrementing the start value using the step size until it reaches the stop value. It defaults to (0.01, 1.0, 0.01).

  • export_classification_reports (bool, optional) – Whether this task exports classification reports. It defaults to False.

  • export_confusion_matrixes (bool, optional) – Whether this task exports confusion matrices. It defaults to False.

  • export_roc_curves (bool, optional) – Whether this task exports ROC curves. It defaults to False.

  • export_pr_curves (bool, optional) – Whether this task exports PR curves. It defaults to False.

  • export_false_positives_reports (bool, optional) – Whether this task exports false positives reports. It defaults to False.

  • export_false_negatives_reports (bool, optional) – Whether this task exports false negatives reports. It defaults to False.

  • export_also_for_train_folds (bool, optional) – Whether this task also exports the evaluation results for the train folds of cross-validation. It defaults to False.

  • fscore_beta (float, optional) – The beta parameter in F-measure. It determines the weight of recall in the score. beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> +inf only recall). It defaults to 1.
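The interplay of threshold_tuning_range, metric_greater_is_better and fscore_beta can be sketched as follows. This is a simplified illustration, not skrobot's code: the helper names are hypothetical, the labels and scores are made up, and the sketch assumes the stop value of the range is excluded (as with numpy.arange), which the parameter description does not state.

```python
def threshold_sequence(start, stop, step):
    """Thresholds from start up to, but not including, stop (assumed)."""
    count = round((stop - start) / step)
    # Multiply instead of repeatedly adding to limit floating-point drift.
    return [round(start + i * step, 10) for i in range(count)]


def fbeta(labels, scores, threshold, beta=1.0):
    """F-beta of the positive class when predicting score >= threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    denominator = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denominator if denominator else 0.0


def best_threshold(labels, scores, tuning_range, beta=1.0, greater_is_better=True):
    """Return the (threshold, metric) pair with the best metric in the range."""
    candidates = [
        (t, fbeta(labels, scores, t, beta))
        for t in threshold_sequence(*tuning_range)
    ]
    pick = max if greater_is_better else min
    return pick(candidates, key=lambda pair: pair[1])


# Toy ground truth labels and predicted probabilities (made up).
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
thresholds = threshold_sequence(0.01, 1.0, 0.01)
threshold, score = best_threshold(labels, scores, (0.01, 1.0, 0.01))
```

When several thresholds tie on the metric, this sketch keeps the first one in the sequence; how ties are resolved in the library itself is not covered by the description above.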

run(output_directory)[source]

Run the task.

All of the evaluation results are stored as files under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, the threshold used along with its related performance metrics, and summary metrics from all cross-validation splits as well as from the hold-out test data set.

Return type

dict

custom_folds(folds_data, fold_column='fold')

Optional method.

Use cross-validation with user-defined custom folds.

Parameters
  • folds_data ({str or pandas DataFrame}) – The input folds data. It can be either a URL, a disk file path or a pandas DataFrame. The folds data contain the user-defined folds for the samples. If a URL or a disk file path is provided the data must be formatted with the same separation delimiter (comma for CSV, tab for TSV, etc.) as the one used in the input data set files provided to the task. The data must contain two columns and the first row must be the header. The first column is for the sample IDs and needs to be the same as the one used in the input data set files provided to the task. The second column is for the fold IDs (e.g., 1 through 5, A through D, etc.).

  • fold_column (str, optional) – The column name for the fold IDs. It defaults to ‘fold’.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

stratified_folds(total_folds=3, shuffle=False)

Optional method.

Use stratified k-fold cross-validation.

The folds are made by preserving the percentage of samples for each class.

Parameters
  • total_folds (int, optional) – Number of folds. Must be at least 2. It defaults to 3.

  • shuffle (bool, optional) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. It defaults to False.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

skrobot.tasks.feature_selection_cross_validation_task module

class skrobot.tasks.feature_selection_cross_validation_task.FeatureSelectionCrossValidationTask(estimator, train_data_set, estimator_params=None, field_delimiter=',', preprocessor=None, preprocessor_params=None, min_features_to_select=1, scoring='f1', feature_columns='all', id_column='id', label_column='label', random_seed=42, verbose=3, n_jobs=1)[source]

Bases: skrobot.tasks.base_cross_validation_task.BaseCrossValidationTask

The FeatureSelectionCrossValidationTask class can be used to perform feature selection with Recursive Feature Elimination using a scikit-learn estimator on some data.

A scikit-learn preprocessor can be used on the input train data set before feature selection runs.

It supports both stratified k-fold cross-validation and cross-validation with user-defined folds.

By default, stratified k-fold cross-validation is used with the default parameters of the stratified_folds() method.

__init__(estimator, train_data_set, estimator_params=None, field_delimiter=',', preprocessor=None, preprocessor_params=None, min_features_to_select=1, scoring='f1', feature_columns='all', id_column='id', label_column='label', random_seed=42, verbose=3, n_jobs=1)[source]

This is the constructor method and can be used to create a new object instance of the FeatureSelectionCrossValidationTask class.

Parameters
  • estimator (scikit-learn estimator) – An estimator (e.g., LogisticRegression). It needs to provide feature importances through either a coef_ or a feature_importances_ attribute.

  • train_data_set ({str, pandas DataFrame}) – The input train data set. It can be either a URL, a disk file path or a pandas DataFrame.

  • estimator_params (dict, optional) – The parameters to override in the provided estimator. It defaults to None.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input train data set file. It defaults to ‘,’.

  • preprocessor (scikit-learn preprocessor, optional) – The preprocessor to run on the input train data set before feature selection, for example a scikit-learn ColumnTransformer or OneHotEncoder. It defaults to None.

  • preprocessor_params (dict, optional) – The parameters to override in the provided preprocessor. It defaults to None.

  • min_features_to_select (int, optional) – The minimum number of features to be selected. This number of features will always be scored. It defaults to 1.

  • scoring ({str, callable}, optional) – A single scikit-learn scorer string (e.g., ‘f1’) or a callable that is built with scikit-learn make_scorer. Note that when using custom scorers, each scorer should return a single value. It defaults to ‘f1’.

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns of the input train data set file, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input train data set file containing the sample IDs. It defaults to ‘id’.

  • label_column (str, optional) – The name of the column in the input train data set file containing the ground truth labels. It defaults to ‘label’.

  • random_seed (int, optional) – The random seed used in the random number generator. It can be used to reproduce the output. It defaults to 42.

  • verbose (int, optional) – Controls the verbosity of output. The higher, the more messages. It defaults to 3.

  • n_jobs (int, optional) – Number of jobs to run in parallel. -1 means using all processors. It defaults to 1.
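The column-related parameters above (field_delimiter, feature_columns, id_column, label_column) describe how the input file is interpreted. A minimal sketch of that interpretation, using only the standard library, is shown below. It assumes that ‘all’ means every column except the ID and label columns; split_columns and the sample data are hypothetical.

```python
import csv
import io


def split_columns(text, field_delimiter=",", feature_columns="all",
                  id_column="id", label_column="label"):
    """Return (ids, feature_names, feature_rows, labels) from delimited text."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=field_delimiter))
    if feature_columns == "all":
        # Assumption: "all" means every column except the ID and label ones.
        feature_names = [
            name for name in rows[0] if name not in (id_column, label_column)
        ]
    else:
        feature_names = list(feature_columns)
    ids = [row[id_column] for row in rows]
    labels = [row[label_column] for row in rows]
    feature_rows = [[row[name] for name in feature_names] for row in rows]
    return ids, feature_names, feature_rows, labels


# Hypothetical tab-separated train data set with a header row.
tsv = "id\tage\tincome\tlabel\nu1\t34\t52000\t1\nu2\t29\t48000\t0\n"
ids, names, rows, labels = split_columns(tsv, field_delimiter="\t")
_, only_age, age_rows, _ = split_columns(
    tsv, field_delimiter="\t", feature_columns=["age"]
)
```

The values stay as strings here; converting them to numeric types is left out to keep the sketch short.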

run(output_directory)[source]

Run the task.

The selected features are returned as a result and also stored in a features_selected.txt file under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, the selected features, which are either column names from the input train data set (when no preprocessor is used) or column indexes from the preprocessed data set (when a preprocessor is used).

Return type

list

custom_folds(folds_data, fold_column='fold')

Optional method.

Use cross-validation with user-defined custom folds.

Parameters
  • folds_data ({str or pandas DataFrame}) – The input folds data. It can be either a URL, a disk file path or a pandas DataFrame. The folds data contain the user-defined folds for the samples. If a URL or a disk file path is provided the data must be formatted with the same separation delimiter (comma for CSV, tab for TSV, etc.) as the one used in the input data set files provided to the task. The data must contain two columns and the first row must be the header. The first column is for the sample IDs and needs to be the same as the one used in the input data set files provided to the task. The second column is for the fold IDs (e.g., 1 through 5, A through D, etc.).

  • fold_column (str, optional) – The column name for the fold IDs. It defaults to ‘fold’.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

stratified_folds(total_folds=3, shuffle=False)

Optional method.

Use stratified k-fold cross-validation.

The folds are made by preserving the percentage of samples for each class.

Parameters
  • total_folds (int, optional) – Number of folds. Must be at least 2. It defaults to 3.

  • shuffle (bool, optional) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. It defaults to False.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

skrobot.tasks.hyperparameters_search_cross_validation_task module

class skrobot.tasks.hyperparameters_search_cross_validation_task.HyperParametersSearchCrossValidationTask(estimator, search_params, train_data_set, estimator_params=None, field_delimiter=',', scorers=['roc_auc', 'average_precision', 'f1', 'precision', 'recall', 'accuracy'], feature_columns='all', id_column='id', label_column='label', objective_score='f1', random_seed=42, verbose=3, n_jobs=1, return_train_score=True)[source]

Bases: skrobot.tasks.base_cross_validation_task.BaseCrossValidationTask

The HyperParametersSearchCrossValidationTask class can be used to search for the best hyperparameters of a scikit-learn estimator/pipeline on some data.

Cross-Validation

It supports both stratified k-fold cross-validation and cross-validation with user-defined folds.

By default, stratified k-fold cross-validation is used with the default parameters of the stratified_folds() method.

Search

It supports both grid search and random search.

By default, grid search is used.

__init__(estimator, search_params, train_data_set, estimator_params=None, field_delimiter=',', scorers=['roc_auc', 'average_precision', 'f1', 'precision', 'recall', 'accuracy'], feature_columns='all', id_column='id', label_column='label', objective_score='f1', random_seed=42, verbose=3, n_jobs=1, return_train_score=True)[source]

This is the constructor method and can be used to create a new object instance of the HyperParametersSearchCrossValidationTask class.

Parameters
  • estimator (scikit-learn {estimator, pipeline}) – It can be either an estimator (e.g., LogisticRegression) or a pipeline ending with an estimator.

  • search_params ({dict, list of dictionaries}) – Dictionary with hyperparameter names as keys and lists of hyperparameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of hyperparameter settings.

  • train_data_set ({str, pandas DataFrame}) – The input train data set. It can be either a URL, a disk file path or a pandas DataFrame.

  • estimator_params (dict, optional) – The parameters to override in the provided estimator/pipeline. It defaults to None.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input train data set file. It defaults to ‘,’.

  • scorers ({list, dict}, optional) – Multiple metrics to evaluate the predictions on the hold-out data. Either give a list of (unique) strings or a dict with names as keys and callables as values. The callables should be scorers built using scikit-learn make_scorer. Note that when using custom scorers, each scorer should return a single value. It defaults to [‘roc_auc’, ‘average_precision’, ‘f1’, ‘precision’, ‘recall’, ‘accuracy’].

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns of the input train data set file, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input train data set file containing the sample IDs. It defaults to ‘id’.

  • label_column (str, optional) – The name of the column in the input train data set file containing the ground truth labels. It defaults to ‘label’.

  • objective_score (str, optional) – The scorer used to find the best hyperparameters for refitting the best estimator/pipeline at the end. It defaults to ‘f1’.

  • random_seed (int, optional) – The random seed used in the random number generator. It can be used to reproduce the output. It defaults to 42.

  • verbose (int, optional) – Controls the verbosity of output. The higher, the more messages. It defaults to 3.

  • n_jobs (int, optional) – Number of jobs to run in parallel. -1 means using all processors. It defaults to 1.

  • return_train_score (bool, optional) – If False, training scores will not be computed and returned. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. It defaults to True.

grid_search()

Optional method.

Use the grid search method when searching for the best hyperparameters.

Returns

The object instance itself.

Return type

HyperParametersSearchCrossValidationTask

random_search(n_iters=200)

Optional method.

Use the random search method when searching for the best hyperparameters.

Parameters

n_iters (int, optional) – Number of hyperparameter settings that are sampled. n_iters trades off runtime vs quality of the solution. It defaults to 200.

Returns

The object instance itself.

Return type

HyperParametersSearchCrossValidationTask
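The difference between the two search methods can be sketched with the standard library. The example expands a search_params value (a dict or a list of dicts) into every setting of the grid, then draws a random subset of it. It is a simplified illustration rather than the library's implementation: expand_grid and sample_settings are hypothetical helpers, the parameter names are made up, and only lists of values are handled.

```python
import itertools
import random


def expand_grid(search_params):
    """List every hyperparameter setting spanned by a dict or list of dicts."""
    grids = [search_params] if isinstance(search_params, dict) else search_params
    settings = []
    for grid in grids:
        names = sorted(grid)
        for values in itertools.product(*(grid[name] for name in names)):
            settings.append(dict(zip(names, values)))
    return settings


def sample_settings(search_params, n_iters=200, seed=42):
    """Draw up to n_iters distinct settings at random from the grid."""
    settings = expand_grid(search_params)
    rng = random.Random(seed)
    return rng.sample(settings, min(n_iters, len(settings)))


# Hypothetical parameter names in scikit-learn's step__parameter style.
search_params = [
    {"classifier__C": [0.1, 1.0, 10.0], "classifier__penalty": ["l1", "l2"]},
    {"classifier__C": [1.0], "classifier__penalty": ["none"]},
]
grid_candidates = expand_grid(search_params)  # grid search tries all of these
random_candidates = sample_settings(search_params, n_iters=4)
```

Grid search evaluates all seven settings of this example, while the random variant evaluates only four of them, which is the runtime versus quality trade-off that n_iters controls.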

run(output_directory)[source]

Run the task.

The search results (search_results) are also stored in a search_results.html file as a static HTML table under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically:

  • best_estimator: The estimator/pipeline chosen by the search, i.e., the estimator/pipeline that gave the best score on the hold-out data.

  • best_params: The hyperparameter setting that gave the best results on the hold-out data.

  • best_score: Mean cross-validated score of the best_estimator.

  • search_results: Metrics measured for each hyperparameter setting in the search.

  • best_index: The index (of the search_results) that corresponds to the best candidate hyperparameter setting.

Return type

dict
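How best_index, best_params and best_score relate to search_results can be sketched with a toy table. The rows, the score values and the mean_test_ column prefix (borrowed from scikit-learn's naming for multi-metric results) are assumptions made for illustration; the actual layout of search_results may differ, and summarize is a hypothetical helper.

```python
# Hypothetical search results: one row per hyperparameter setting, holding
# the mean cross-validated score for each scorer (values are made up).
search_results = [
    {"params": {"C": 0.1}, "mean_test_f1": 0.71, "mean_test_roc_auc": 0.83},
    {"params": {"C": 1.0}, "mean_test_f1": 0.78, "mean_test_roc_auc": 0.86},
    {"params": {"C": 10.0}, "mean_test_f1": 0.75, "mean_test_roc_auc": 0.88},
]


def summarize(search_results, objective_score="f1"):
    """Pick the row with the highest mean score for the objective scorer."""
    column = "mean_test_" + objective_score
    best_index = max(
        range(len(search_results)),
        key=lambda i: search_results[i][column],
    )
    best_row = search_results[best_index]
    return {
        "best_index": best_index,
        "best_params": best_row["params"],
        "best_score": best_row[column],
    }


summary_f1 = summarize(search_results)
summary_auc = summarize(search_results, objective_score="roc_auc")
```

Changing the objective changes the winner in this toy table, which is why objective_score matters even when several scorers are reported.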

custom_folds(folds_data, fold_column='fold')

Optional method.

Use cross-validation with user-defined custom folds.

Parameters
  • folds_data ({str or pandas DataFrame}) – The input folds data. It can be either a URL, a disk file path or a pandas DataFrame. The folds data contain the user-defined folds for the samples. If a URL or a disk file path is provided the data must be formatted with the same separation delimiter (comma for CSV, tab for TSV, etc.) as the one used in the input data set files provided to the task. The data must contain two columns and the first row must be the header. The first column is for the sample IDs and needs to be the same as the one used in the input data set files provided to the task. The second column is for the fold IDs (e.g., 1 through 5, A through D, etc.).

  • fold_column (str, optional) – The column name for the fold IDs. It defaults to ‘fold’.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

stratified_folds(total_folds=3, shuffle=False)

Optional method.

Use stratified k-fold cross-validation.

The folds are made by preserving the percentage of samples for each class.

Parameters
  • total_folds (int, optional) – Number of folds. Must be at least 2. It defaults to 3.

  • shuffle (bool, optional) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. It defaults to False.

Returns

The object instance itself.

Return type

BaseCrossValidationTask

skrobot.tasks.deep_feature_synthesis_task module

class skrobot.tasks.deep_feature_synthesis_task.DeepFeatureSynthesisTask(entities=None, relationships=None, entityset=None, target_entity=None, cutoff_time=None, instance_ids=None, agg_primitives=None, trans_primitives=None, groupby_trans_primitives=None, allowed_paths=None, max_depth=2, ignore_entities=None, ignore_variables=None, primitive_options=None, seed_features=None, drop_contains=None, drop_exact=None, where_primitives=None, max_features=-1, save_progress=None, training_window=None, approximate=None, chunk_size=None, n_jobs=1, dask_kwargs=None, verbose=False, return_variable_types=None, progress_callback=None, include_cutoff_time=True, export_feature_graphs=False, export_feature_information=False, label_column='label')[source]

Bases: skrobot.tasks.base_task.BaseTask

The DeepFeatureSynthesisTask class is a wrapper for Featuretools. It can be used to automate feature engineering and create features from temporal and relational datasets.

__init__(entities=None, relationships=None, entityset=None, target_entity=None, cutoff_time=None, instance_ids=None, agg_primitives=None, trans_primitives=None, groupby_trans_primitives=None, allowed_paths=None, max_depth=2, ignore_entities=None, ignore_variables=None, primitive_options=None, seed_features=None, drop_contains=None, drop_exact=None, where_primitives=None, max_features=-1, save_progress=None, training_window=None, approximate=None, chunk_size=None, n_jobs=1, dask_kwargs=None, verbose=False, return_variable_types=None, progress_callback=None, include_cutoff_time=True, export_feature_graphs=False, export_feature_information=False, label_column='label')[source]

This is the constructor method and can be used to create a new object instance of the DeepFeatureSynthesisTask class.

Most of the arguments are documented here: https://featuretools.alteryx.com/en/stable/generated/featuretools.dfs.html#featuretools.dfs

Parameters
  • export_feature_graphs (bool, optional) – Whether this task exports feature computation graphs. It defaults to False.

  • export_feature_information (bool, optional) – Whether this task exports feature information. The feature definitions can be used to recalculate features for a different data set. It defaults to False.

  • label_column (str, optional) – The name of the column containing the ground truth labels. It defaults to ‘label’.

run(output_directory)[source]

Run the task.

The synthesized output data set is returned as a result and also stored in a synthesized_dataset.csv file under the output directory path.

The feature information is stored in a feature_information.html file as a static HTML table under the output directory path.

The feature computation graphs are stored as PNG files under the output directory path.

Also, the feature definitions are stored in a feature_definitions.txt file under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, 1) synthesized_dataset: The synthesized output data set as a pandas DataFrame. 2) feature_definitions: The definitions of features in the synthesized output data set. The feature definitions can be used to recalculate features for a different data set.

Return type

dict

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

skrobot.tasks.dataset_calculation_task module

class skrobot.tasks.dataset_calculation_task.DatasetCalculationTask(feature_definitions, entityset=None, cutoff_time=None, instance_ids=None, entities=None, relationships=None, training_window=None, approximate=None, save_progress=None, verbose=False, chunk_size=None, n_jobs=1, dask_kwargs=None, progress_callback=None, include_cutoff_time=True)[source]

Bases: skrobot.tasks.base_task.BaseTask

The DatasetCalculationTask class is a wrapper for Featuretools. It can be used to calculate a data set using some feature definitions and input data.

__init__(feature_definitions, entityset=None, cutoff_time=None, instance_ids=None, entities=None, relationships=None, training_window=None, approximate=None, save_progress=None, verbose=False, chunk_size=None, n_jobs=1, dask_kwargs=None, progress_callback=None, include_cutoff_time=True)[source]

This is the constructor method and can be used to create a new object instance of the DatasetCalculationTask class.

Most of the arguments are documented here: https://featuretools.alteryx.com/en/stable/generated/featuretools.calculate_feature_matrix.html#featuretools.calculate_feature_matrix

Parameters

feature_definitions ({str or list[FeatureBase]}) – The feature definitions to be calculated. It can be either a disk file path or a list[FeatureBase] as exported by the DeepFeatureSynthesisTask task.

run(output_directory)[source]

Run the task.

The calculated data set is returned as a result.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, the calculated data set for the input data and feature definitions.

Return type

pandas DataFrame

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

skrobot.tasks.train_task module

class skrobot.tasks.train_task.TrainTask(estimator, train_data_set, estimator_params=None, field_delimiter=',', feature_columns='all', id_column='id', label_column='label', random_seed=42)[source]

Bases: skrobot.tasks.base_task.BaseTask

The TrainTask class can be used to fit a scikit-learn estimator/pipeline on train data.

__init__(estimator, train_data_set, estimator_params=None, field_delimiter=',', feature_columns='all', id_column='id', label_column='label', random_seed=42)[source]

This is the constructor method and can be used to create a new object instance of the TrainTask class.

Parameters
  • estimator (scikit-learn {estimator, pipeline}) – It can be either an estimator (e.g., LogisticRegression) or a pipeline ending with an estimator.

  • train_data_set ({str, pandas DataFrame}) – The input train data set. It can be either a URL, a disk file path or a pandas DataFrame.

  • estimator_params (dict, optional) – The parameters to override in the provided estimator/pipeline. It defaults to None.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input train data set file. It defaults to ‘,’.

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns of the input train data set file, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input train data set file containing the sample IDs. It defaults to ‘id’.

  • label_column (str, optional) – The name of the column in the input train data set file containing the ground truth labels. It defaults to ‘label’.

  • random_seed (int, optional) – The random seed used in the random number generator. It can be used to reproduce the output. It defaults to 42.
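To make the roles of field_delimiter, feature_columns, id_column and label_column more concrete, here is a minimal sketch that uses only the Python standard library. The helper split_columns is hypothetical and not part of skrobot, and treating ‘all’ as "every column except the ID and label columns" is an assumption about the intended behaviour.

```python
import csv
import io


def split_columns(text, field_delimiter=",", feature_columns="all",
                  id_column="id", label_column="label"):
    """Hypothetical helper: split delimited text into IDs, features and labels."""
    reader = csv.DictReader(io.StringIO(text), delimiter=field_delimiter)
    rows = list(reader)
    if feature_columns == "all":
        # Assumption: 'all' means every column except the ID and label columns.
        names = [c for c in reader.fieldnames if c not in (id_column, label_column)]
    else:
        names = list(feature_columns)
    ids = [row[id_column] for row in rows]
    labels = [row[label_column] for row in rows]
    features = [[float(row[c]) for c in names] for row in rows]
    return ids, names, features, labels


# A tab-separated data set with two samples and two feature columns.
tsv = "id\tage\tincome\tlabel\n1\t30\t1000\t0\n2\t45\t2500\t1\n"

ids, names, features, labels = split_columns(tsv, field_delimiter="\t")
print(names)

# Select a single feature column by name instead of using 'all'.
_, names2, features2, _ = split_columns(
    tsv, field_delimiter="\t", feature_columns=["income"]
)
print(names2)
```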

run(output_directory)[source]

Run the task.

The fitted estimator/pipeline is returned as a result and also stored in a trained_model.pkl pickle file under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, a dictionary that holds the fitted estimator/pipeline.

Return type

dict

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str

skrobot.tasks.prediction_task module

class skrobot.tasks.prediction_task.PredictionTask(estimator, data_set, field_delimiter=',', feature_columns='all', id_column='id', prediction_column='prediction', threshold=0.5)[source]

Bases: skrobot.tasks.base_task.BaseTask

The PredictionTask class can be used to predict new data using a scikit-learn estimator/pipeline.

__init__(estimator, data_set, field_delimiter=',', feature_columns='all', id_column='id', prediction_column='prediction', threshold=0.5)[source]

This is the constructor method and can be used to create a new object instance of the PredictionTask class.

Parameters
  • estimator (scikit-learn {estimator, pipeline}) – It can be either an estimator (e.g., LogisticRegression) or a pipeline ending with an estimator. The estimator needs to be able to predict probabilities through a predict_proba method.

  • data_set ({str, pandas DataFrame}) – The input data set. It can be either a URL, a disk file path or a pandas DataFrame.

  • field_delimiter (str, optional) – The separation delimiter (comma for CSV, tab for TSV, etc.) used in the input data set file. It defaults to ‘,’.

  • feature_columns ({str, list}, optional) – Either ‘all’ to use all the columns of the input data set file, or a list of column names to select specific columns. It defaults to ‘all’.

  • id_column (str, optional) – The name of the column in the input data set file containing the sample IDs. It defaults to ‘id’.

  • prediction_column (str, optional) – The name of the column for the predicted binary class labels. It defaults to ‘prediction’.

  • threshold (float, optional) – The threshold to use for converting the predicted probability into a binary class label. It defaults to 0.5.
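The effect of the threshold parameter can be sketched in plain Python. The function to_binary_labels is a hypothetical illustration, not skrobot code, and counting a probability exactly equal to the threshold as positive is an assumption.

```python
def to_binary_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels.

    Assumption: a probability equal to the threshold counts as positive.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


probabilities = [0.10, 0.50, 0.72, 0.49]

# With the default threshold of 0.5, two samples are labelled positive.
default_labels = to_binary_labels(probabilities)
print(default_labels)

# A stricter threshold labels fewer samples as positive.
strict_labels = to_binary_labels(probabilities, threshold=0.7)
print(strict_labels)
```

Raising the threshold generally trades recall for precision, which is the usual reason for moving it away from 0.5.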

run(output_directory)[source]

Run the task.

The predictions are returned as a result and also stored in a predictions.csv CSV file under the output directory path.

Parameters

output_directory (str) – The output directory path under which task-related generated files are stored.

Returns

The task’s result. Specifically, the predictions for the input data set, containing the sample IDs, the predicted binary class labels, and the predicted probabilities for the positive class.

Return type

pandas DataFrame
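As a rough sketch of how a predictions.csv file with this shape could be consumed, the example below writes a small stand-in file to a temporary directory and reads it back with the standard csv module. The rows are made-up illustration data, and the column name "probability" is hypothetical, since the name of the probability column is not stated here.

```python
import csv
import os
import tempfile

# Made-up rows shaped like the documented result: ID, label, probability.
rows = [
    {"id": "a1", "prediction": 1, "probability": 0.91},
    {"id": "b2", "prediction": 0, "probability": 0.12},
    {"id": "c3", "prediction": 1, "probability": 0.67},
]

with tempfile.TemporaryDirectory() as output_directory:
    path = os.path.join(output_directory, "predictions.csv")

    # Write a stand-in for the file that the task stores.
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["id", "prediction", "probability"]
        )
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back; the csv module returns every value as a string.
    with open(path, newline="") as handle:
        loaded = list(csv.DictReader(handle))

# Collect the IDs of the samples predicted as the positive class.
positive_ids = [row["id"] for row in loaded if row["prediction"] == "1"]
print(positive_ids)
```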

get_configuration()

Get the task’s parameters.

Returns

The task’s parameters as a dictionary of key-value pairs, where the key is the parameter name and the value is the parameter value.

Return type

dict

get_type()

Get the task’s type name.

Returns

The task’s type name.

Return type

str