deepinsight.doctor package

Subpackages

Submodules

deepinsight.doctor.clustering_entrypoints module

deepinsight.doctor.clustering_entrypoints.clustering_train_score_save(transformed_src, src_index, preprocessing_params, modeling_params, run_folder, listener, update_fn, pipeline)

Trains one model and saves results to run_folder

deepinsight.doctor.commands module

Commands available from the doctor main kernel server.

To add a command, simply add a method. Methods whose names start with _ are not exposed.

Arguments with default values are supported. *args and **kwargs are not supported.

If one of your JSON parameters has the same name as a Python global, you can suffix the parameter name with an underscore (e.g. input_).
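The conventions above can be illustrated with a small dispatcher. This is only a sketch under stated assumptions: `exposed_commands`, `call_command`, `trim_trailing_underscore`, and the `echo` and `_helper` commands are hypothetical names invented for the example, and the real kernel server may resolve commands differently.

```python
import inspect


def trim_trailing_underscore(name):
    """Map a Python parameter name such as ``input_`` to the JSON key ``input``."""
    return name[:-1] if name.endswith("_") else name


def exposed_commands(namespace):
    """Keep only the functions whose name does not start with an underscore."""
    return {
        name: obj
        for name, obj in namespace.items()
        if inspect.isfunction(obj) and not name.startswith("_")
    }


def call_command(namespace, command, json_params):
    """Call an exposed command, matching JSON keys to parameter names."""
    commands = exposed_commands(namespace)
    if command not in commands:
        raise KeyError("unknown command: %s" % command)
    func = commands[command]
    kwargs = {}
    for param in inspect.signature(func).parameters.values():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            raise TypeError("*args and **kwargs are not supported")
        key = trim_trailing_underscore(param.name)
        if key in json_params:
            kwargs[param.name] = json_params[key]
        elif param.default is param.empty:
            raise TypeError("missing parameter: %s" % key)
    return func(**kwargs)


# Hypothetical commands, for illustration only.
def echo(input_, repeat=1):
    return [input_] * repeat


def _helper():
    return "not exposed"


COMMANDS = {"echo": echo, "_helper": _helper}
```

With these definitions, `call_command(COMMANDS, "echo", {"input": "a", "repeat": 2})` passes the JSON key `input` to the `input_` parameter, while `_helper` cannot be called through the dispatcher.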

deepinsight.doctor.commands.build_pipeline_and_handler(collector_data, core_params, run_folder, preprocessing_params, selection_state_folder=None, allow_empty_mf=False)
deepinsight.doctor.commands.clustering_rescore(split_desc, preprocessing_folder, model_folder)
deepinsight.doctor.commands.compute_pdp(job_id, split_desc, core_params, preprocessing_folder, model_folder, computation_parameters=None)
deepinsight.doctor.commands.compute_subpopulation(job_id, split_desc, core_params, preprocessing_folder, model_folder, computation_parameters=None)
deepinsight.doctor.commands.create_clustering_notebook(model_name, model_date, dataset_smartname, script, preparation_output_schema, split_stuff, preprocessing_params, pre_train, post_train)
deepinsight.doctor.commands.create_ensemble(split_desc, core_params, model_folder, preprocessing_folder, model_folders, preprocessing_folders)
deepinsight.doctor.commands.create_prediction_notebook(model_name, model_date, dataset_smartname, script, preparation_output_schema, split_stuff, core_params, preprocessing_params, pre_train, post_train)
deepinsight.doctor.commands.ping()
deepinsight.doctor.commands.train_clustering_models_nosave(split_desc, preprocessing_set)

Regular (mode 1) train:

  • Non-streamed single split, fit the preprocessing on train, then preprocess test
  • Fit N models sequentially

    • Fit
    • Save clf
    • Compute and save clf performance
    • Score, save scored test set + scored performance
deepinsight.doctor.commands.train_prediction_keras(core_params, preprocessing_set, split_desc)
deepinsight.doctor.commands.train_prediction_kfold(core_params, preprocessing_set, split_desc)
deepinsight.doctor.commands.train_prediction_models_nosave(core_params, preprocessing_set, split_desc)

Regular (mode 1) train:

  • Non-streamed single split, fit the preprocessing on train, then preprocess test
  • Fit N models sequentially

    • Fit
    • Save clf
    • Compute and save clf performance
    • Score, save scored test set + scored performance
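
The flow described above (preprocessing fitted on the train split only, then models fitted and scored one after another) can be sketched with toy components. Everything here is an assumption made for illustration: `fit_mean`, `fit_line`, and `train_and_score` are hypothetical, the preprocessing is a simple centering of x, and the saving steps are left out.

```python
import statistics


def fit_mean(rows):
    """Toy model: always predicts the mean target seen during training."""
    mean_y = statistics.fmean(y for _, y in rows)
    return lambda x: mean_y


def fit_line(rows):
    """Toy model: least-squares line, assuming x was centered on the train split."""
    mean_y = statistics.fmean(y for _, y in rows)
    slope = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return lambda x: mean_y + slope * x


def train_and_score(train, test, fitters):
    """train and test are lists of (x, y) pairs; fitters maps a name to a fit function."""
    # Fit the preprocessing (here: centering x) on the train split only.
    x_mean = statistics.fmean(x for x, _ in train)

    def preprocess(rows):
        return [(x - x_mean, y) for x, y in rows]

    # Apply the same fitted preprocessing to both splits.
    train_t, test_t = preprocess(train), preprocess(test)

    results = {}
    for name, fit in fitters.items():  # models are fitted one after another
        predict = fit(train_t)
        scored = [(y, predict(x)) for x, y in test_t]
        mse = statistics.fmean((y - p) ** 2 for y, p in scored)
        results[name] = {"scored_test": scored, "mse": mse}
    return results


results = train_and_score(
    train=[(1, 2), (2, 4), (3, 6)],
    test=[(4, 8)],
    fitters={"mean": fit_mean, "line": fit_line},
)
```

The point of the sketch is the ordering: the test split never influences the fitted preprocessing, and each model's scored test set and performance are collected separately.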

deepinsight.doctor.constants module

deepinsight.doctor.dtapi module

deepinsight.doctor.dtapi.json_api(api)
deepinsight.doctor.dtapi.trim_underscores(s)

deepinsight.doctor.exception module

exception deepinsight.doctor.exception.DoctorException(message='', code=400, errorType='ExpectedException')

Bases: Exception

deepinsight.doctor.forest module

class deepinsight.doctor.forest.ClassificationIML(**params)

Bases: deepinsight.doctor.forest.IML

class deepinsight.doctor.forest.IML(**params)

Bases: object

classes_
estimators_
feature_importances_
fit(X, Y, sample_weight=None)
get_params(**kwargs)
merge(clf2)
model(params)
predict(X)
predict_proba(X)
set_params(**params)
should_continue(Ytest, Y1, Y2)
class deepinsight.doctor.forest.RandomForestClassifierIML(**params)

Bases: deepinsight.doctor.forest.ClassificationIML

Random forest that automatically stops growing the forest (autostop).

i = 0
merge(clf2)
model(params)
class deepinsight.doctor.forest.RandomForestRegressorIML(**params)

Bases: deepinsight.doctor.forest.RegressionIML

Random forest that automatically stops growing the forest (autostop).

i = 0
merge(clf2)
model(params)
class deepinsight.doctor.forest.RegressionIML(**params)

Bases: deepinsight.doctor.forest.IML

deepinsight.doctor.multiframe module

class deepinsight.doctor.multiframe.DataFrameBuilder(prefix='')

Bases: object

A dataframe builder just receives columns to ultimately create a dataframe, respecting the insertion order.
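
A minimal sketch of such a builder, assuming pandas is available. `SketchDataFrameBuilder` is a hypothetical stand-in, not the actual class; the prefix is stored but not applied, because the source does not say how it is used.

```python
from collections import OrderedDict

import pandas as pd


class SketchDataFrameBuilder:
    """Collects columns and builds a dataframe in insertion order."""

    def __init__(self, prefix=""):
        self.prefix = prefix  # kept for parity with the signature, unused here
        self.columns = OrderedDict()

    def add_column(self, column_name, column_values):
        self.columns[column_name] = column_values

    def to_dataframe(self):
        # Passing the column names explicitly makes the ordering guarantee visible.
        return pd.DataFrame(self.columns, columns=list(self.columns))


builder = SketchDataFrameBuilder()
builder.add_column("b", [1, 2])
builder.add_column("a", [3, 4])
df = builder.to_dataframe()
```

Here the resulting dataframe keeps `b` before `a`, because that is the order in which the columns were added.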

add_column(column_name, column_values)
columns
prefix
to_dataframe()
class deepinsight.doctor.multiframe.DataFrameWrapper(df)

Bases: object

shape
class deepinsight.doctor.multiframe.MultiFrame

Bases: object

The multiframe agglomerates horizontally several blocks of columns. All blocks must have the same number of rows. Each block is named.

Blocks can be:

  • Pandas DataFrames
  • Numpy arrays
  • Scipy sparse matrices

The MultiFrame also gives a single dataframe builder that allows you to build a dataframe from several series.
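
The idea of stacking heterogeneous blocks side by side can be sketched as follows, assuming NumPy, pandas, and SciPy are installed. `hstack_blocks` is a hypothetical helper written for this example; it is not how `as_csr_matrix()` is necessarily implemented.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp


def hstack_blocks(blocks):
    """Stack named blocks horizontally into one CSR matrix.

    ``blocks`` is a list of (name, block) pairs, where each block is a
    pandas DataFrame, a 2-D NumPy array, or a SciPy sparse matrix.
    """
    row_counts = {blk.shape[0] for _, blk in blocks}
    if len(row_counts) != 1:
        raise ValueError("all blocks must have the same number of rows")
    parts = []
    for _, blk in blocks:
        if isinstance(blk, pd.DataFrame):
            blk = blk.to_numpy(dtype=float)
        parts.append(sp.csr_matrix(blk))
    return sp.hstack(parts, format="csr")


blocks = [
    ("frame", pd.DataFrame({"a": [1.0, 2.0]})),
    ("array", np.array([[3.0, 4.0], [5.0, 6.0]])),
    ("sparse", sp.csr_matrix([[0.0], [7.0]])),
]
stacked = hstack_blocks(blocks)
```

The row-count check mirrors the requirement that all blocks have the same number of rows.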

append_df(name, df, keep=True)
append_np_block(name, array, col_names)
append_sparse(name, matrix)
as_csr_matrix()
as_dataframe()
as_np_array()
static block_as_np_array(blk)
col_as_series(block, col_name)
columns()
drop_rows(deletion_mask)
flush_df_builder(name)
get_block(name)
get_df_builder(name)

Helper for building a dataframe from series

has_df_builder(name)
iter_blocks(with_keep_info=False)
iter_columns()
iter_dataframes()
nnz()
select_columns(names)
set_index_from_df(df)
shape()
stats()
class deepinsight.doctor.multiframe.NamedNPArray(array, names)

Bases: object

shape
class deepinsight.doctor.multiframe.SparseMatrixWithNames(matrix, names)

Bases: object

shape
deepinsight.doctor.multiframe.delete_rows_csr(mat, indices)

Remove the rows denoted by indices from the CSR sparse matrix mat. Taken from http://stackoverflow.com/questions/13077527
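
One common way to do this is with a boolean row mask. The following is a sketch assuming NumPy and SciPy are available; `delete_rows_csr_sketch` is a hypothetical name, and the actual function body may differ.

```python
import numpy as np
import scipy.sparse as sp


def delete_rows_csr_sketch(mat, indices):
    """Return a copy of the CSR matrix ``mat`` without the rows in ``indices``."""
    if not isinstance(mat, sp.csr_matrix):
        raise ValueError("expected a CSR matrix; call .tocsr() first")
    keep = np.ones(mat.shape[0], dtype=bool)
    keep[list(indices)] = False  # mark the rows to drop
    return mat[keep]


mat = sp.csr_matrix(np.arange(12).reshape(4, 3))
trimmed = delete_rows_csr_sketch(mat, [1, 3])
```

The result stays sparse, so the matrix never has to be converted to a dense array.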

deepinsight.doctor.multiframe.is_series_like(series)

deepinsight.doctor.notebook_builder module

notebook_builder.py: base classes for creating IPython notebooks.

class deepinsight.doctor.notebook_builder.ClusteringNotebookBuilder(model_name, model_date, dataset_smartname, script_steps, preparation_output_schema, split_stuff, preprocessing_params, pre_train, post_train)

Bases: deepinsight.doctor.notebook_builder.NotebookBuilder

context()
is_supervized()
template_name()
title()
class deepinsight.doctor.notebook_builder.NotebookBuilder

Bases: object

algorithm
categorical_preprocessing_context()
context()
create_notebook()
handle_missing_context()
is_supervized()
rescale_context()
template()
template_name()
text_preprocessing_context()
title()
class deepinsight.doctor.notebook_builder.PredictionNotebookBuilder(model_name, model_date, dataset_smartname, script_steps, preparation_output_schema, split_stuff, core_params, preprocessing_params, pre_train, post_train)

Bases: deepinsight.doctor.notebook_builder.NotebookBuilder

categorical_preprocessing_context()
context()
is_supervized()
prediction_type
target_variable
template_name()
title()
deepinsight.doctor.notebook_builder.code_cell(code)
deepinsight.doctor.notebook_builder.comment_cell(comment)
deepinsight.doctor.notebook_builder.extract_input_columns(preprocessing_params, with_target=False, with_profiling=True)
deepinsight.doctor.notebook_builder.header_cell(msg=None, level=1)
deepinsight.doctor.notebook_builder.parse_cells_from_render(content)

deepinsight.doctor.prediction_entrypoints module

deepinsight.doctor.prediction_entrypoints.prediction_train_model_keras(transformed_normal, train_df, test_df, pipeline, modeling_params, core_params, per_feature, run_folder, listener, update_fn, target_map, generated_features_mapping, save_model=True)

Fit a CLF on Keras, save it, compute intrinsic scores, write them, score a test set with it, then write the scores and extrinsic performance.

deepinsight.doctor.prediction_entrypoints.prediction_train_model_kfold(full_df_clean, core_params, split_desc, preprocessing_params, optimized_params, pp_folder, m_folder, listener, update_fn, with_sample_weight, with_class_weight, calibrate_proba=False)
deepinsight.doctor.prediction_entrypoints.prediction_train_score_save(transformed_train, transformed_test, test_df_index, core_params, split_desc, modeling_params, run_folder, listener, target_map, update_fn, pipeline, m_folder)

Fit a CLF, save it, compute intrinsic scores, write them, score a test set with it, then write the scores and extrinsic performance.

deepinsight.doctor.prediction_entrypoints.prediction_train_score_save_ensemble(train, test, core_params, split_desc, modeling_params, run_folder, listener, target_map, update_fn, pipeline, with_sample_weight)

Fit a CLF, save it, compute intrinsic scores, write them, score a test set with it, then write the scores and extrinsic performance.

deepinsight.doctor.preprocessing_collector module

Performs the initial feature analysis that will drive the actual preprocessor for prediction. Takes the preprocessing params and the train dataframe and outputs the feature analysis data.

class deepinsight.doctor.preprocessing_collector.ClusteringPreprocessingDataCollector(train_df, preprocessing_params)

Bases: deepinsight.doctor.preprocessing_collector.PreprocessingDataCollector

feature_needs_analysis(params)

params is the params object from preprocessing params

class deepinsight.doctor.preprocessing_collector.PredictionPreprocessingDataCollector(train_df, preprocessing_params)

Bases: deepinsight.doctor.preprocessing_collector.PreprocessingDataCollector

feature_needs_analysis(params)

params is the params object from preprocessing params

class deepinsight.doctor.preprocessing_collector.PreprocessingDataCollector(train_df, preprocessing_params)

Bases: object

build()
get_feature_analysis_data(name, params)

Analyzes a single feature (preprocessing params -> feature analysis data). params is the preprocessing params for this feature.

It must contain:

  • name, type, role (role_reason)
  • missing_handling, missing_impute_with, category_handling, rescaling
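
As an illustration, the required keys can be checked before a feature is analyzed. This is a sketch only: `missing_feature_keys` is a hypothetical helper, and the example values ("NUMERIC", "INPUT", and so on) are placeholders rather than values confirmed by the source.

```python
# Keys listed in the docstring above; role_reason is treated as optional here.
REQUIRED_KEYS = (
    "name",
    "type",
    "role",
    "missing_handling",
    "missing_impute_with",
    "category_handling",
    "rescaling",
)


def missing_feature_keys(params):
    """Return the required keys that are absent from one feature's params."""
    return [key for key in REQUIRED_KEYS if key not in params]


# Hypothetical per-feature params; the values are placeholders.
example_params = {
    "name": "age",
    "type": "NUMERIC",
    "role": "INPUT",
    "missing_handling": "IMPUTE",
    "missing_impute_with": "MEAN",
    "category_handling": None,
    "rescaling": "AVGSTD",
}
```

Calling `missing_feature_keys(example_params)` returns an empty list, while an incomplete dictionary yields the names of the keys that still need to be supplied.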

deepinsight.doctor.preprocessing_handler module

class deepinsight.doctor.preprocessing_handler.BinaryClassificationPreprocessingHandler(core_params, preprocessing_params, data_path)

Bases: deepinsight.doctor.preprocessing_handler.PredictionPreprocessingHandler

target_map
class deepinsight.doctor.preprocessing_handler.ClusteringPreprocessingHandler(core_params, preprocessing_params, data_path)

Bases: deepinsight.doctor.preprocessing_handler.PreprocessingHandler

Build the preprocessing pipeline for clustering projects

Clustering preprocessing is especially difficult for several reasons. We need to keep track of the multiframe at different stages of its processing:

  • train

    The model used for clustering operates on preprocessed INPUT columns, from which we may or may not remove outliers, and to which we may or may not apply a PCA.

    • TRAIN
  • profiling

    Columns that are not actually INPUT should still be preprocessed (e.g. dummified) in order to compute statistics on their different values. Such columns have a role called “PROFILING”.

    Preprocessed dataframe (including PROFILING columns)

    • PREPROCESSED
  • feature importance

    Feature importance is computed by running a classification on the variables. To keep the result human-readable, this analysis must be done on pre-PCA values.

    • TRAIN_PREPCA
  • outliers

    The outlier labels are used to make sure we can reannotate the initial datasets (for feature importance and profiling).

    • OUTLIERS
preprocessing_steps()
class deepinsight.doctor.preprocessing_handler.MulticlassPreprocessingHandler(core_params, preprocessing_params, data_path)

Bases: deepinsight.doctor.preprocessing_handler.PredictionPreprocessingHandler

target_map
class deepinsight.doctor.preprocessing_handler.PredictionPreprocessingHandler(core_params, preprocessing_params, data_path)

Bases: deepinsight.doctor.preprocessing_handler.PreprocessingHandler

has_sample_weight_variable
preprocessing_steps(with_target=False, verbose=True, allow_empty_mf=False)
sample_weight_variable
set_selection_state_folder(selection_state_folder)
target_map
weight_map
class deepinsight.doctor.preprocessing_handler.PreprocessingHandler(core_params, preprocessing_params, data_path)

Bases: object

Manager class for the preprocessing

static build(core_params, preprocessing_params, data_path)

Build the proper type of preprocessing handling depending on the preprocessing params

build_preprocessing_pipeline(*args, **kwargs)
get_impact_coder(column)
get_pca_resource()
get_resource(resource_name, type)

Resources are just dictionaries, either:

  • pickled in a .pkl file named after their resource name
  • dumped to a .json file named after their resource name
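
The two storage formats can be sketched with the standard library. `save_resource` and `load_resource` are hypothetical helpers, and the resource names `pca` and `impact` are made up for the example; the real handler resolves paths against its own data folder.

```python
import json
import os
import pickle
import tempfile


def _resource_path(folder, resource_name, kind):
    # The file is named after the resource, with the type as its extension.
    return os.path.join(folder, "%s.%s" % (resource_name, kind))


def save_resource(folder, resource_name, resource, kind):
    """Write a dictionary as <resource_name>.pkl or <resource_name>.json."""
    path = _resource_path(folder, resource_name, kind)
    if kind == "pkl":
        with open(path, "wb") as f:
            pickle.dump(resource, f)
    elif kind == "json":
        with open(path, "w") as f:
            json.dump(resource, f)
    else:
        raise ValueError("unsupported resource type: %r" % kind)
    return path


def load_resource(folder, resource_name, kind):
    """Read back a dictionary saved by save_resource."""
    path = _resource_path(folder, resource_name, kind)
    if kind == "pkl":
        with open(path, "rb") as f:
            return pickle.load(f)
    if kind == "json":
        with open(path) as f:
            return json.load(f)
    raise ValueError("unsupported resource type: %r" % kind)


with tempfile.TemporaryDirectory() as folder:
    save_resource(folder, "pca", {"n_components": 2}, "pkl")
    save_resource(folder, "impact", {"a": 0.5}, "json")
    pca = load_resource(folder, "pca", "pkl")
    impact = load_resource(folder, "impact", "json")
```

Pickle can store arbitrary Python objects inside the dictionary, whereas JSON is limited to basic types but stays readable from other languages.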

get_texthash_svd_data(column)
input_columns(with_target=True, with_profiling=True)

Return the list of input features.

Passing this list to get_dataframe can help limit RAM usage.

(includes profiling columns)

open(relative_filepath, *args, **kargs)

Open a file relative to self.folder_path.

prediction_type
preprocessing_steps(verbose=True, **kwargs)
report(pipeline)
save_data()
target_variable
class deepinsight.doctor.preprocessing_handler.RegressionPreprocessingHandler(core_params, preprocessing_params, data_path)

Bases: deepinsight.doctor.preprocessing_handler.PredictionPreprocessingHandler

target_map
deepinsight.doctor.preprocessing_handler.extract_input_columns(preprocessing_params, with_target=False, with_profiling=True, with_sample_weight=False)
deepinsight.doctor.preprocessing_handler.get_rescaler(in_block, column_name, column_params, column_collector)

Build a rescaler for the original column

deepinsight.doctor.preprocessing_handler.load_relfilepath(basepath, relative_filepath)

Returns None if the file does not exist.

deepinsight.doctor.server module

Main doctor entry point. This is an HTTP server that receives commands from the AnalysisMLKernel Java class.

deepinsight.doctor.server.serve(port, secret)

Module contents