deepinsight.doctor.utils package

Submodules

deepinsight.doctor.utils.calibration module

deepinsight.doctor.utils.calibration.dt_calibration_curve(y_true, y_prob, sample_weight=None, n_bins=10, pos_label=None)
deepinsight.doctor.utils.calibration.dt_calibration_loss(freqs, avg_preds, weights, reducer='sum', normalize=True)

deepinsight.doctor.utils.crossval module

class deepinsight.doctor.utils.crossval.DKULeaveOneGroupOut(column_name)

Bases: object

get_n_splits(X, y, groups=None)
set_column_labels(column_labels)
split(X, y, groups=None)
class deepinsight.doctor.utils.crossval.DKULeavePGroupsOut(column_name, p)

Bases: object

get_n_splits(X, y, groups=None)
set_column_labels(column_labels)
split(X, y, groups=None)

deepinsight.doctor.utils.dataframe_cache module

deepinsight.doctor.utils.dataframe_cache.clear_cache()
deepinsight.doctor.utils.dataframe_cache.get_dataframe(dataset, *args, **kwargs)
deepinsight.doctor.utils.dataframe_cache.hashablify(c)

deepinsight.doctor.utils.interrupt_optimization module

deepinsight.doctor.utils.interrupt_optimization.create_interrupt_file()
deepinsight.doctor.utils.interrupt_optimization.must_interrupt()
deepinsight.doctor.utils.interrupt_optimization.set_before_interrupt_check_callback(new_callback)
deepinsight.doctor.utils.interrupt_optimization.set_interrupt_folder(folder_p)

deepinsight.doctor.utils.lift_curve module

class deepinsight.doctor.utils.lift_curve.LiftBuilder(data, actual, predicted, with_weight=False)

Bases: object

Builds the data for lift curves

build()

deepinsight.doctor.utils.listener module

class deepinsight.doctor.utils.listener.ExitState(listener)

Bases: object

class deepinsight.doctor.utils.listener.ProgressListener(verbose=True)

Bases: object

add_future_step(name)
add_future_steps(names)
pop_state()
push_state(name, target=None)
reset()
set_current_progress(progress)
to_jsonifiable()
deepinsight.doctor.utils.listener.unix_time_millis()

deepinsight.doctor.utils.magic_main module

deepinsight.doctor.utils.magic_main.magic_main(main)

deepinsight.doctor.utils.metrics module

deepinsight.doctor.utils.metrics.check_test_set_ok_for_classification(y_true)
deepinsight.doctor.utils.metrics.log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None)

Log loss, aka logistic loss or cross-entropy loss.

The scikit-learn version is buggy when a class never appears in the predictions.

deepinsight.doctor.utils.metrics.log_odds(array, clip_min=0.0, clip_max=1.0)

Compute the log-odds of each element of a numpy array: logodd = log(p / (1 - p)), where p is a probability.

Parameters:
  • array: (numpy array)
  • clip_min: (float) minimum value
  • clip_max: (float) maximum value

Returns: a numpy array with the same dimensions as the input array
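As an illustration of the clipping and the log-odds transform, here is a minimal plain-Python sketch. The name log_odds_sketch is hypothetical, it works on a list rather than a numpy array, and its default clip bounds are an assumption chosen to avoid infinite results; they differ from the documented defaults of 0.0 and 1.0.

```python
import math

def log_odds_sketch(probs, clip_min=1e-6, clip_max=1 - 1e-6):
    """Hypothetical helper: log(p / (1 - p)) for each probability in probs."""
    result = []
    for p in probs:
        # Clip first so that p == 0 or p == 1 cannot produce log(0) or a division by zero.
        p = min(max(p, clip_min), clip_max)
        result.append(math.log(p / (1.0 - p)))
    return result

# A probability of 0.5 has even odds, so its log-odds is zero.
even = log_odds_sketch([0.5])
# Complementary probabilities give log-odds of opposite sign.
low, high = log_odds_sketch([0.25, 0.75])
```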

deepinsight.doctor.utils.metrics.mcalibration_loss(y_true, y_pred, sample_weight=None)
deepinsight.doctor.utils.metrics.mean_absolute_percentage_error(y_true, y_pred, sample_weight=None)
deepinsight.doctor.utils.metrics.mroc_auc_score(y_true, y_predictions, sample_weight=None)

Returns an AUC score. Handles multi-class.

For multi-class, the AUC score is in fact the MAUC score described in:

David J. Hand and Robert J. Till. 2001. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Mach. Learn. 45, 2 (October 2001), 171-186. DOI: 10.1023/A:1010920819831

http://dx.doi.org/10.1023/A:1010920819831
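The MAUC of Hand and Till averages a pairwise AUC over every pair of classes. The sketch below is a plain-Python reading of that definition, not the package's implementation: the function names, the classes argument, and the row-per-sample layout of probas are all assumptions made for illustration.

```python
from itertools import combinations

def pairwise_auc(scores_a, scores_b):
    """Probability that a random item of scores_a outranks one of scores_b (ties count 1/2)."""
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))

def mauc_sketch(y_true, probas, classes):
    """Hypothetical helper. probas[k][c] is the estimated probability that row k is classes[c]."""
    pairs = list(combinations(range(len(classes)), 2))
    total = 0.0
    for i, j in pairs:
        rows_i = [p for y, p in zip(y_true, probas) if y == classes[i]]
        rows_j = [p for y, p in zip(y_true, probas) if y == classes[j]]
        # A(i|j): separate class i from class j using the class-i probability column.
        a_i_j = pairwise_auc([p[i] for p in rows_i], [p[i] for p in rows_j])
        # A(j|i): same pair, using the class-j probability column.
        a_j_i = pairwise_auc([p[j] for p in rows_j], [p[j] for p in rows_i])
        total += (a_i_j + a_j_i) / 2.0
    return total / len(pairs)

# Each row puts most of its mass on its true class, so every pair is perfectly separated.
score = mauc_sketch(
    ["a", "b", "c"],
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
    ["a", "b", "c"],
)
```

The double loop in pairwise_auc is quadratic in the number of rows; a rank-based formulation would be preferable on real data.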

deepinsight.doctor.utils.metrics.rmse_score(y, y_pred, sample_weight=None)

Root Mean Square Error, more readable than MSE
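For reference, a minimal sketch of the quantity being computed. The name rmse_sketch is hypothetical, and treating sample_weight as weights in a weighted mean of squared errors is an assumption; the source does not say how the weights are applied.

```python
import math

def rmse_sketch(y, y_pred, sample_weight=None):
    """Hypothetical helper: square root of the (optionally weighted) mean squared error."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y)
    weighted_sq = sum(w * (a - b) ** 2 for w, a, b in zip(sample_weight, y, y_pred))
    return math.sqrt(weighted_sq / sum(sample_weight))

# Errors of 3 and 4 give sqrt((9 + 16) / 2).
value = rmse_sketch([0.0, 0.0], [3.0, 4.0])
```

Because of the square root, the result is expressed in the same unit as the target, which is what makes it easier to read than MSE.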

deepinsight.doctor.utils.metrics.rmsle_score(y, y_pred, sample_weight=None)

Root Mean Square Logarithmic Error. See https://www.kaggle.com/wiki/RootMeanSquaredLogarithmicError
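A minimal sketch of the usual RMSLE definition, which compares log(1 + value) for the prediction and the actual value. The name rmsle_sketch is hypothetical, and the weighted-mean handling of sample_weight is an assumption rather than something the source states.

```python
import math

def rmsle_sketch(y, y_pred, sample_weight=None):
    """Hypothetical helper: RMSE computed on log(1 + value) instead of the raw values."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y)
    weighted_sq = sum(
        w * (math.log1p(p) - math.log1p(a)) ** 2
        for w, a, p in zip(sample_weight, y, y_pred)
    )
    return math.sqrt(weighted_sq / sum(sample_weight))

# log(1 + (e - 1)) - log(1 + 0) is approximately 1.
value = rmsle_sketch([math.e - 1.0], [0.0])
```

The inputs must be greater than -1 for the logarithm to be defined, so this metric is normally reserved for non-negative targets.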

deepinsight.doctor.utils.split module

deepinsight.doctor.utils.split.df_from_split_desc(split_desc, split, feature_params, prediction_type=None)
deepinsight.doctor.utils.split.df_from_split_desc_no_normalization(split_desc, split, feature_params, prediction_type=None)

deepinsight.doctor.utils.subsampler module

class deepinsight.doctor.utils.subsampler.Subsampler(df, variable, sampling_type='stratified', ratio=0.1)

Bases: object

balanced_subsampling()

Subsample targeting the representation of clusters in a scatter plot. This has no statistical property whatsoever.

Proper stratified subsampling may lead to clusters with too few samples to be visible.

This method tries to draw the same number of points for each class.

The number of rows output is about ratio * nb_rows.

TODO: we may want to change this code to make big clusters actually look big.
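The idea of drawing an equal number of points per class can be sketched as follows. This is an illustration under assumptions, not the method's actual code: balanced_subsample_sketch, its key and seed arguments, and the use of plain lists instead of a dataframe are all hypothetical.

```python
import random
from collections import defaultdict

def balanced_subsample_sketch(rows, key, ratio=0.1, seed=0):
    """Hypothetical helper: draw roughly ratio * len(rows) rows, split evenly across classes."""
    if not rows:
        return []
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[key(row)].append(row)
    # Same quota for every class, whatever its size.
    per_class = max(1, int(ratio * len(rows) / len(by_class)))
    sample = []
    for class_rows in by_class.values():
        # A class smaller than the quota contributes all of its rows.
        sample.extend(rng.sample(class_rows, min(per_class, len(class_rows))))
    return sample

rows = [("a", i) for i in range(100)] + [("b", i) for i in range(10)]
sample = balanced_subsample_sketch(rows, key=lambda r: r[0], ratio=0.1)
```

With 100 rows of class a and 10 of class b, both classes end up with the same number of points, so the small cluster stays visible while the large one is heavily thinned.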

cluster_sampling()

Sample on the categories themselves.

Select a proportion (prop) of the categories.

run()
stratified_forced_subsampling()

Pick samples from each category proportionally, but force a minimal sample size per category.

stratified_subsampling()

Pick samples from each category proportionally.
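Proportional sampling per category, with an optional minimum per category, might look like the sketch below. The function name, the key and seed arguments, the min_per_category parameter, and the list-of-tuples input are assumptions made for illustration; the class itself works on a dataframe and a variable name.

```python
import random
from collections import defaultdict

def stratified_subsample_sketch(rows, key, ratio=0.1, min_per_category=0, seed=0):
    """Hypothetical helper: keep about ratio of each category, with an optional floor."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for row in rows:
        by_category[key(row)].append(row)
    sample = []
    for category_rows in by_category.values():
        # Proportional quota, raised to the floor, capped at the category size.
        quota = max(int(round(ratio * len(category_rows))), min_per_category)
        quota = min(quota, len(category_rows))
        sample.extend(rng.sample(category_rows, quota))
    return sample

rows = [("a", i) for i in range(100)] + [("b", i) for i in range(10)]
plain = stratified_subsample_sketch(rows, key=lambda r: r[0], ratio=0.1)
forced = stratified_subsample_sketch(rows, key=lambda r: r[0], ratio=0.1, min_per_category=5)
```

With min_per_category left at 0 this behaves like plain stratified sampling; a positive value corresponds to the forced variant, where small categories are over-represented so they do not vanish.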

deepinsight.doctor.utils.subsampler.subsample(df, variable, sampling_type='stratified', ratio=0.1)

Module contents

deepinsight.doctor.utils.datetime_to_epoch(series)
deepinsight.doctor.utils.dt_isnan(val)

Safe isnan that accepts non-numeric values

deepinsight.doctor.utils.dt_nonan(val)

Replaces numerical NaNs with None

deepinsight.doctor.utils.dt_nonaninf(val)

Replaces numerical NaNs and Infs with None
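The three helpers above share one idea: only test for NaN or infinity when the value is actually a float, and pass everything else through. A minimal sketch, with hypothetical names and the assumption that only Python floats need checking:

```python
import math

def isnan_sketch(val):
    """Hypothetical helper: True only for a float NaN; strings, None, etc. give False."""
    return isinstance(val, float) and math.isnan(val)

def nonan_sketch(val):
    """Hypothetical helper: NaN becomes None, any other value is returned unchanged."""
    return None if isnan_sketch(val) else val

def nonaninf_sketch(val):
    """Hypothetical helper: NaN and +/-Inf become None, any other value is returned unchanged."""
    if isinstance(val, float) and (math.isnan(val) or math.isinf(val)):
        return None
    return val

cleaned = [nonaninf_sketch(v) for v in [1.5, float("nan"), float("inf"), "text", None]]
```

Calling math.isnan directly on a string raises a TypeError, which is why the type check comes first.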

deepinsight.doctor.utils.dt_write_mode_for_pickling()
deepinsight.doctor.utils.make_running_traininfo(folder, start_time, listener)
deepinsight.doctor.utils.merge_listeners(plistener, mlistener)
deepinsight.doctor.utils.ml_dtype_from_deepinsight_column(schema_column, feature_type, feature_role, prediction_type=None)
deepinsight.doctor.utils.ml_dtypes_from_deepinsight_schema(schema, params, prediction_type=None)
deepinsight.doctor.utils.normalize_dataframe(df, params, missing_columns='ERROR')

Normalizes a dataframe so that it can be used as input for a preprocessing pipeline. You should not have to add anything here …

Does 2 things:
  • Adds missing columns (for API node)
  • Converts datetime to epoch
deepinsight.doctor.utils.remove_all_nan(obj)

Removes all NaN values from an object, recursively (the JSON spec has no representation for NaN).
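A recursive clean-up of this kind can be sketched as below. The name remove_all_nan_sketch is hypothetical, and dropping NaN entries from dicts and lists (rather than replacing them with None) is an assumption; the source does not specify which behaviour the real function has.

```python
import json
import math

def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)

def remove_all_nan_sketch(obj):
    """Hypothetical helper: return a copy of obj with NaN entries dropped at every depth."""
    if isinstance(obj, dict):
        return {k: remove_all_nan_sketch(v) for k, v in obj.items() if not _is_nan(v)}
    if isinstance(obj, list):
        return [remove_all_nan_sketch(v) for v in obj if not _is_nan(v)]
    return obj

raw = {"a": float("nan"), "b": [1.0, float("nan"), {"c": float("nan")}]}
clean = remove_all_nan_sketch(raw)
# allow_nan=False makes json.dumps raise on NaN, so this succeeds only if none remain.
encoded = json.dumps(clean, allow_nan=False)
```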

deepinsight.doctor.utils.strip_accents(s)
deepinsight.doctor.utils.update_gridsearch_info(folder, grid_search_scores)
deepinsight.doctor.utils.write_done_traininfo(folder, start_time, start_training_time, end_time, listener, end_preprocessing_time=None)
deepinsight.doctor.utils.write_model_status(modeling_set, status)
deepinsight.doctor.utils.write_preproc_file(run_folder, filename, obj)
deepinsight.doctor.utils.write_running_traininfo(folder, start_time, listener)