deepinsight.doctor.preprocessing package

Submodules

deepinsight.doctor.preprocessing.dataframe_preprocessing module

Preprocessing takes a dataframe as an input, and returns a dataframe as an output.

At the end of the pipeline, the matrix underlying the dataframe should be ready to use with scikit-learn’s ML algorithms.

class deepinsight.doctor.preprocessing.dataframe_preprocessing.AddReferenceInOutput(output_name_from, output_name_to)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Add an alias in output

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.AllInteractionFeaturesGenerator(in_block, out_block, features)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Generates all polynomial interaction features from the imputed input numericals

process(input_df, current_mf, output_ppr, generated_features_mapping)
report_fit(ret_obj, core_params)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.BaseCountVectorizerProcessor(column_name, min_df, max_df, max_features, min_gram, max_gram, stop_words=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

gen_voc(vec)
init_resources(mp)
report_fit(ret_obj, core_params)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.BinarizeSeries(in_block, in_col, out_block, threshold)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Binarize a single series of a DF block against a threshold, writing the result to an output block

init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.BlockStdRescalingProcessor(in_block)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

An avg/std rescaler that needs to be fit. Operates on a whole DF block.
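As an illustration of the fit-then-process pattern for avg/std rescaling, here is a minimal sketch in plain Python. The class `AvgStdRescaler`, its method names, and the dict-of-lists "block" are hypothetical stand-ins, not the library's implementation; using the population standard deviation and leaving constant columns unscaled are assumptions made for the example.

```python
from statistics import mean, pstdev


class AvgStdRescaler:
    """Hypothetical stand-in for an avg/std rescaler (not the library class)."""

    def __init__(self):
        self.params = {}  # column name -> (mean, std) learned during fit

    def fit(self, block):
        for name, values in block.items():
            std = pstdev(values)
            # Assumption: constant columns are only centered, to avoid dividing by zero.
            self.params[name] = (mean(values), std if std > 0 else 1.0)
        return self

    def process(self, block):
        # Reuse the frozen parameters; nothing is re-estimated here.
        return {
            name: [(v - self.params[name][0]) / self.params[name][1] for v in values]
            for name, values in block.items()
        }


train_block = {"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]}
rescaler = AvgStdRescaler().fit(train_block)
train_scaled = rescaler.process(train_block)
test_scaled = rescaler.process({"a": [2.0], "b": [6.0]})
```

The point of the split is that new data is rescaled with the parameters learned on the training block, never with its own statistics.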

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(mp)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.CategoricalCategoricalInteraction(out_block, column_1, column_2, max_features)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.CategoricalFeatureHashingProcessor(input_block, column_name, n_features=1048576)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Hashing trick for categorical features. This creates a very large sparse matrix and should only be used with algorithms that support sparse input. It takes values from an input block.
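The hashing trick itself can be sketched in a few lines of plain Python. The helper names below (`hash_category`, `hash_rows`), the choice of MD5, and the `column=value` key format are assumptions for illustration only; the library's actual hashing scheme is not documented here. The default of 2**20 buckets matches the `n_features=1048576` default in the signature above.

```python
import hashlib


def hash_category(column_name, value, n_features=2**20):
    """Map a (column, value) pair to a column index in [0, n_features)."""
    key = f"{column_name}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hash_rows(column_name, values, n_features=2**20):
    """Return the sparse one-hot encoding as (row, col) pairs with implicit value 1."""
    return [
        (row, hash_category(column_name, value, n_features))
        for row, value in enumerate(values)
    ]


entries = hash_rows("country", ["FR", "US", "FR"])
```

No vocabulary has to be fitted or stored: equal values always land in the same column, at the cost of possible collisions between distinct values.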

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.CategoricalsCountTransformerGenerator(preprocessing_settings, settings)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, cur_mf, output_ppr, generated_features_mapping)
get_evolution_def()
get_input_features()
init_resources(mp)
process(input_df, cur_mf, output_ppr, generated_features_mapping)
set_evolution_state(es)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.CategoricalsImpactCodingTransformerGenerator(output_name=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

class deepinsight.doctor.preprocessing.dataframe_preprocessing.CopyMultipleColumnsFromInput(columns, output_block_name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.CustomPreprocessingStep(input_col, code, wants_matrix, fit_and_process_only_fits=False, accepts_tensor=False)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.DropNARows(output_name=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Drop rows containing any NA value in all DataFrame and np array blocks of the current multiframe.

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.DropRowsWhereNoTarget(output_name=None, allow_empty_mf=False)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Drop rows for which the target is NA (probably because it was an unknown class).

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.DropRowsWhereNoTargetNorWeight(output_name=None, allow_empty_mf=False)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Drop rows for which the target or the weight is NA (probably because it was an unknown class).

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.DumpFullMF(name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.DumpInputDF(name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.DumpMFDetails(name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.DumpPipelineState(name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.EmitCurrentMFAsResult(output_name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Emits the current multiframe in the result object and optionally injects a brand new multiframe in the pipeline.

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.ExtractColumn(column_name, output_name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Extracts a single column from the current multiframe and puts it as a Series in result

column_name
output_name
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.FastSparseDummifyProcessor(input_block, input_column_name, values, should_drop)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.FeatureSelectorOutputExecStep(selector)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Used if feature selection was already trained

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.FeatureSelectorOutputTrainStep(selector)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Used if feature selection was not already trained

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.FileFunctionPreprocessing(input_col, code, file_reader, func_name, fit_and_process_only_fits=True)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.FlagMissingValue2(feature, output_block_name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.FlushDFBuilder(block_name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.GeneratedFeaturesMapping

Bases: object

ONE_FEATURE_PER_COLUMN = 'one_feature_per_column'
TO_ONE_FEATURE = 'to_one_feature'
add_per_column_mapping(block_name, original_name, new_name)
add_whole_block_mapping(block_name, original_name)
get_per_column_original(block_name, new_name)
get_whole_block_original(block_name)
should_send_block_to_one_feature(block_name)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.ImpactCodingStep(input_block, column_name, impact_coder, target_variable, output_block)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
report_fit(ret_obj, core_params)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.ImpactCodingStep2(input_block, column_name, target_variable, output_block)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(mp)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.MultipleImputeMissingFromInput(impute_map, output_block_name, keep_output_block, as_categorical)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Multi-column imputation of missing values. A sub-df is extracted from the input df and its series are fillna-ed.

The sub-df is added as a single output block.
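A minimal pandas sketch of this extract-then-fill idea is shown below. It assumes that `impute_map` maps column names to replacement values; the sample data and that interpretation of `impute_map` are assumptions for illustration, not taken from the library.

```python
import pandas as pd

input_df = pd.DataFrame(
    {
        "age": [31.0, None, 45.0],
        "city": ["Paris", None, "Lyon"],
        "id": [1, 2, 3],
    }
)

# Assumed shape of the map: column name -> value used to fill missing entries.
impute_map = {"age": 38.0, "city": "missing"}

# Extract only the mapped columns, then fill each one with its own value.
sub_df = input_df[list(impute_map)].fillna(value=impute_map)
```

Because the sub-df is a separate object, the missing values in `input_df` are left untouched.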

init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.NumericalCategoricalInteraction(out_block, cat, num, max_features)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.NumericalDerivativesGenerator(in_block, out_block, features)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Generate derivative features from selected numerical features in a block. Generates square, log() and sqrt.
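For a single feature, the three derivatives could be computed as in the sketch below. The function name `derive_features`, the output column names, and the choice to emit NaN where log or sqrt is undefined are all assumptions; the library may name columns and handle out-of-domain values differently.

```python
import math


def derive_features(name, values):
    """Return the square, log and sqrt of one numerical feature (hypothetical helper)."""
    nan = float("nan")
    return {
        f"{name}^2": [v * v for v in values],
        # Assumption: values outside the domain of log/sqrt become NaN.
        f"log({name})": [math.log(v) if v > 0 else nan for v in values],
        f"sqrt({name})": [math.sqrt(v) if v >= 0 else nan for v in values],
    }


derived = derive_features("x", [4.0, -1.0])
```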

init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.NumericalFeaturesClusteringGenerator(preprocessing_settings, settings)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, cur_mf, output_ppr, generated_features_mapping)
get_evolution_def()
get_numerical_features()
init_resources(mp)
perform_replacement(cur_mf, df, kmeans)
process(input_df, cur_mf, output_ppr, generated_features_mapping)
set_evolution_state(es)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.NumericalNumericalInteraction(out_block, column_1, column_2, rescale)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.OutlierDetection(pca_kept_variance, min_n, min_cum_ratio, outlier_name='OUTLIERS', random_state=1337)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Performs outlier detection. Emits a new multiframe in the output. Does not touch the main multiframe.

fit_and_process(input_df, cur_mf, output_ppr, generated_features_mapping)
init_resources(mp)
process(input_df, cur_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.PCAStep(pca, input_name, output_name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, cur_mf, output_ppr, generated_features_mapping)
normalize(df)
process(input_df, cur_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.PairwiseLinearCombinationsGenerator(in_block, out_block, features)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
report_fit(ret_obj, core_params)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.PreprocessingPipeline(steps)

Bases: object

fit_and_process(input_df, *args, **kwargs)
generated_features_mapping
init_resources(resource_handler)
process(input_df, retain=None)
report_fit(ret_obj, core_params)
results
steps
class deepinsight.doctor.preprocessing.dataframe_preprocessing.PreprocessingResult(retain=None)

Bases: dict

class deepinsight.doctor.preprocessing.dataframe_preprocessing.QuantileBinSeries(in_block, in_col, out_block, nb_bins)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(mp)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.RandomColumnsGenerator(n_columns)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, cur_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.RealignTarget(output_name=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.RealignWeight(output_name=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.RemapValueToOutput(column_name, output_name, values_map)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Remap a value from input df to an output key as a series. Used for target. Makes a deep copy

process(input_df, current_mf, output_ppr, generated_features_mapping)
values_map
class deepinsight.doctor.preprocessing.dataframe_preprocessing.RescalingProcessor2(in_block, in_col, shift=None, scale=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Rescale a single series in-place in a DF block

static from_avgstd(in_block, in_col, mean, standard_deviation)
static from_minmax(in_block, in_col, min_value, max_value)
init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
set_scale(scale)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.SetInputDFAsBlock(output_block_name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.SingleColumnDropNARows(column_name)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Drop rows containing any NA value in input_df

init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.Step(output_name=None)

Bases: object

Since the steps are used in a pipeline, it makes no sense to have a “fit” or “partial_fit” method on them. Anything that must be “fitted” but has to be handled in a streaming fashion is managed by the preprocessing collector.

static drop_rows(idx, current_mf, input_df)
fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
report_fit(ret_obj, core_params)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.TextCountVectorizerProcessor(column_name, min_df, max_df, max_features, min_gram=1, max_gram=2, stop_words=None, custom_code=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.BaseCountVectorizerProcessor

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(mp)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.TextHashingVectorizerProcessor(column_name, n_features=1048576)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Hashing trick for text features using Bag of words. http://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick

This creates a very large sparse matrix and should only be used with algorithms that support sparse input.

It takes values directly from the input df since we don’t do other preprocessing for these features

column_name
n_features
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.TextHashingVectorizerWithSVDProcessor(column_name, svd_res, n_features=100, n_hash=200000, svd_limit=50000)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

Use a restricted version of the hashing trick. http://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick

This is designed to be used with dense matrices. Instead of outputting a huge sparse matrix, it first builds the sparse matrix, then applies an SVD to it to keep only a small number (10-50) of features. It takes values directly from the input df, since we don’t do other preprocessing for these features.

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.TextTFIDFVectorizerProcessor(column_name, min_df, max_df, max_features, min_gram=1, max_gram=2, stop_words=None, custom_code=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.BaseCountVectorizerProcessor

fit_and_process(input_df, current_mf, output_ppr, generated_features_mapping)
init_resources(mp)
process(input_df, current_mf, output_ppr, generated_features_mapping)
class deepinsight.doctor.preprocessing.dataframe_preprocessing.UnfoldVectorProcessor(input_column_name, vector_length, in_block=None)

Bases: deepinsight.doctor.preprocessing.dataframe_preprocessing.Step

init_resources(resources_handler)
process(input_df, current_mf, output_ppr, generated_features_mapping)
deepinsight.doctor.preprocessing.dataframe_preprocessing.add_column_to_builder(builder, new_column, feature, series, generated_features_mapping)
deepinsight.doctor.preprocessing.dataframe_preprocessing.append_sparse_with_prefix(current_mf, prefix, input_column_name, matrix, generated_features_mapping)
deepinsight.doctor.preprocessing.dataframe_preprocessing.cubic_root(x)
deepinsight.doctor.preprocessing.dataframe_preprocessing.detect_outliers(df, pca_kept_variance=0.9, min_n=0, min_cum_ratio=0.01, random_state=1337)

deepinsight.doctor.preprocessing.impact_coding module

class deepinsight.doctor.preprocessing.impact_coding.CategoricalImpactCoding(m=10)

Bases: deepinsight.doctor.preprocessing.impact_coding.ImpactCoding

compute_impact_map(serie, target_serie)
class deepinsight.doctor.preprocessing.impact_coding.ContinuousImpactCoding(m=10, rescaling=False, scaler=None)

Bases: deepinsight.doctor.preprocessing.impact_coding.ImpactCoding

compute_impact_map(serie, target_serie)
rescaling
scaler
class deepinsight.doctor.preprocessing.impact_coding.ImpactCoding(m=10)

Bases: object

ImpactCoding is an alternative way to cope with categorical values in a regression or in a classification project.

The base idea is to replace categorical values by their overall observed impact on the target value.

For instance, let’s consider a dataset with 5000 persons. We aim at predicting their height. Their home country is a feature of the dataset, but it can take as many as 300 different values.

Impact coding consists of replacing the country information by the average height of the people in their home country. (Note that this may not be a good idea if, for instance, the ratio of men to women differs between these countries.)

Because some countries may be underrepresented, we prefer to use a more robust estimate of the average. Here we simply use additive smoothing: if a category is represented X times, we compute lambda = X/(X+10) and, instead of CAT_AVG, we use lambda*CAT_AVG + (1-lambda)*TARGET_AVG. When a category has very low cardinality, such as 2 or 3, most of its value is therefore replaced by the global average.
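The additive smoothing described above can be sketched in plain Python as follows. The body of `lambda_weight` is inferred from the formula X/(X+10) with the default `m=10`, and `smoothed_impact_map` is a hypothetical helper working on plain lists; neither is the library's actual code.

```python
from collections import defaultdict


def lambda_weight(n, m=10):
    """Smoothing weight: close to 0 for rare categories, close to 1 for frequent ones."""
    return n / (n + m)


def smoothed_impact_map(categories, targets, m=10):
    """Return ({category: smoothed average target}, global average target)."""
    global_avg = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    impact = {}
    for category, count in counts.items():
        lam = lambda_weight(count, m)
        cat_avg = sums[category] / count
        # Blend the category average with the global average.
        impact[category] = lam * cat_avg + (1 - lam) * global_avg
    return impact, global_avg


categories = ["A"] * 10 + ["B"] * 10
targets = [1.0] * 10 + [3.0] * 10
impact, global_avg = smoothed_impact_map(categories, targets)
```

With 10 observations per category and m = 10, lambda is 0.5, so each category lands halfway between its own average and the global average of 2.0.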

DEFAULT_VALUE = '__default__'
NULL = '__NULL__'
compute_impact_map(serie, target_serie)

Compute the impact coding value map.

Given a series of values for a categorical feature and the corresponding series of target values, returns the map of impact values as a dataframe indexed by the feature values.

default_value()
fit(serie, target_serie)
fit_transform(X, target)
get_reportable_map()
is_fitted()
m
transform(serie)
class deepinsight.doctor.preprocessing.impact_coding.NestedKFoldImpactCoder

Bases: object

fit(feature_series, target_series)
static impact_coding(data, feature, target)
This function does two things:
  • Directly compute the impact coded series of the feature
  • Compute the mapping to apply to test data and data to score

Notably, the train data is not encoded with this mapping, to avoid leaking target information. Instead, the train data is encoded using nested KFold.

TODO: Check whether the usage of “rsuffix” causes issues. It may be buggy in Pandas + Python 2 if there are non-ASCII elements (even of unicode type) in the columns of the dataframes being joined.
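To make the leakage argument concrete, here is a simplified out-of-fold sketch in plain Python. It uses a single level of K-fold rather than the nested scheme mentioned above, assigns folds deterministically by row index, and reuses the additive smoothing formula from the `ImpactCoding` docstring; the function names and all of these choices are assumptions, not the library's implementation.

```python
from collections import defaultdict


def smoothed_map(categories, targets, m=10):
    """Additively smoothed per-category target average, plus the global average."""
    global_avg = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {}
    for category, count in counts.items():
        lam = count / (count + m)
        mapping[category] = lam * sums[category] / count + (1 - lam) * global_avg
    return mapping, global_avg


def out_of_fold_impact(categories, targets, n_folds=3, m=10):
    """Encode train rows out-of-fold; also return the full mapping for new data."""
    n = len(targets)
    if n_folds < 2 or n < n_folds:
        raise ValueError("need at least 2 folds and at least one row per fold")
    encoded = [None] * n
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        fold_map, fold_avg = smoothed_map(
            [categories[i] for i in train_idx], [targets[i] for i in train_idx], m
        )
        for i in range(fold, n, n_folds):
            # A row is encoded without ever seeing its own target value.
            encoded[i] = fold_map.get(categories[i], fold_avg)
    full_map, default_mean = smoothed_map(categories, targets, m)
    return encoded, full_map, default_mean


cats = ["A", "B", "A", "B", "A", "B"]
ys = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
encoded, full_map, default_mean = out_of_fold_impact(cats, ys)
```

The returned `full_map` and `default_mean` play the role of the mapping applied to test data and data to score, while `encoded` is what the train rows would receive.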

set_data(mapping, default_mean)
transform(feature_series)
deepinsight.doctor.preprocessing.impact_coding.lambda_weight(n, m)

deepinsight.doctor.preprocessing.pca module

class deepinsight.doctor.preprocessing.pca.PCA(kept_variance, normalize=False, prefix='factor_')

Bases: object

Implements PCA for DataFrames.

Supports pre-normalization given frozen parameters. A normalization step can be included before performing the PCA.
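Selecting the number of components from a kept-variance ratio can be sketched with NumPy as below. This is an illustration under assumptions, not the class's implementation: `fit_pca` and `transform_pca` are hypothetical helpers, the data is only centered (no normalization), and numbering output columns from 0 is a guess. Only the `factor_` prefix comes from the signature above.

```python
import numpy as np


def fit_pca(X, kept_variance=0.9):
    """Return (mean, components) keeping enough components to reach kept_variance."""
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    ratios = s**2 / np.sum(s**2)
    n_components = int(np.searchsorted(np.cumsum(ratios), kept_variance) + 1)
    n_components = min(n_components, len(s))  # guard against rounding at 1.0
    return mean, vt[:n_components]


def transform_pca(X, mean, components, prefix="factor_"):
    """Project X on the kept components and name the output columns."""
    projected = (X - mean) @ components.T
    names = [f"{prefix}{i}" for i in range(projected.shape[1])]
    return projected, names


X = np.array([[0.0, 0.0], [1.0, 0.01], [2.0, -0.01], [3.0, 0.0]])
mean, components = fit_pca(X, kept_variance=0.9)
projected, names = transform_pca(X, mean, components)
```

Here almost all the variance lies along the first axis, so a single component is enough to reach the 0.9 threshold.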

do_normalize
fit(df)
fit_transform(df)
get_stats(df, column_name)
input_columns
kept_variance
n_components
normalize(df)
output_columns
pca
prefix
stats
transform(df)
class deepinsight.doctor.preprocessing.pca.PCA2(kept_variance, prefix='factor_', random_state=1337)

Bases: object

Implements PCA for named np arrays. Does not pre-normalize

fit(npa, names)
fit_transform(df)
transform(npa, names)

Module contents