deepinsight.doctor.preprocessing package¶
Submodules¶
deepinsight.doctor.preprocessing.dataframe_preprocessing module¶
Preprocessing takes a dataframe as an input, and returns a dataframe as an output.
At the end of the pipeline, the matrix underlying the dataframe should be ready to use with scikit-learn’s ML algorithms.
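As an illustration of that end goal, the sketch below uses plain pandas (not the `Step` classes documented on this page) to turn a small dataframe into a fully numeric matrix. The column names, the mean imputation and the `__NULL__` marker are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one numerical and one categorical column, with gaps.
input_df = pd.DataFrame({
    "age": [31.0, np.nan, 45.0, 27.0],
    "country": ["FR", "US", None, "FR"],
})

# Impute the numerical column with its mean.
output_df = pd.DataFrame({"age": input_df["age"].fillna(input_df["age"].mean())})

# Treat a missing category as a category of its own (hypothetical marker).
country = input_df["country"].fillna("__NULL__")

# Dummify the categorical column: one 0/1 column per observed value.
dummies = pd.get_dummies(country, prefix="country").astype(float)
output_df = pd.concat([output_df, dummies], axis=1)

# The underlying matrix is now numeric and free of missing values.
X = output_df.to_numpy()
```

`X` can then be passed to any estimator that accepts a dense NumPy array.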
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
AddReferenceInOutput
(output_name_from, output_name_to)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Add an alias in output
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
AllInteractionFeaturesGenerator
(in_block, out_block, features)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Generates all polynomial interaction features from the imputed input numericals
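The exact set of terms this class produces (degree, whether squares are included) is not stated on this page. The sketch below assumes degree-2 products between distinct features of an already imputed block; `pairwise_interactions` and the column naming are hypothetical.

```python
from itertools import combinations

import pandas as pd


def pairwise_interactions(block, features):
    """Return a dataframe with one product column per pair of features."""
    out = {}
    for left, right in combinations(features, 2):
        out[f"{left}*{right}"] = block[left] * block[right]
    return pd.DataFrame(out, index=block.index)


block = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
interactions = pairwise_interactions(block, ["a", "b", "c"])
```

With n input features this yields n(n-1)/2 new columns, so the output block grows quadratically with the number of selected features.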
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
report_fit
(ret_obj, core_params)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
BaseCountVectorizerProcessor
(column_name, min_df, max_df, max_features, min_gram, max_gram, stop_words=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
gen_voc
(vec)¶
-
init_resources
(mp)¶
-
report_fit
(ret_obj, core_params)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
BinarizeSeries
(in_block, in_col, out_block, threshold)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Binarize a single series of a DF block against a threshold
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
BlockStdRescalingProcessor
(in_block)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
An avg/std rescaler that needs to be fit. Operates on a whole DF block.
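A minimal sketch of what "needs to be fit" implies: the statistics are learned once on training data and reused unchanged afterwards. `BlockStdRescaler` is a hypothetical helper, not the class documented here, and the handling of constant columns is an assumption.

```python
import pandas as pd


class BlockStdRescaler:
    """Hypothetical helper: learns per-column mean/std, then rescales."""

    def fit(self, block):
        self.means = block.mean()
        # Replace a zero std by 1 so constant columns do not divide by zero.
        self.stds = block.std(ddof=0).replace(0.0, 1.0)
        return self

    def transform(self, block):
        # DataFrame - Series aligns on column names.
        return (block - self.means) / self.stds


train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "k": [5.0, 5.0, 5.0]})
test = pd.DataFrame({"x": [2.0, 4.0], "k": [5.0, 6.0]})

scaler = BlockStdRescaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses statistics learned on train
```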
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(mp)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
CategoricalCategoricalInteraction
(out_block, column_1, column_2, max_features)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
CategoricalFeatureHashingProcessor
(input_block, column_name, n_features=1048576)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Hashing trick for category features. This creates an extremely huge sparse matrix and should only be used with algorithms that support them. It takes values from an input block.
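The hashing trick can be sketched with the standard library alone. The hash function (`zlib.crc32`), the `column=value` key format and the helper names are assumptions made for the example; the class may hash differently.

```python
import zlib


def hash_category(column_name, value, n_features=1048576):
    """Map a (column, value) pair to a column index in [0, n_features)."""
    key = f"{column_name}={value}".encode("utf-8")
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(key) % n_features


def hash_column(column_name, values, n_features=1048576):
    """Return the non-zero cells of the sparse matrix as (row, col) pairs."""
    return [(row, hash_category(column_name, v, n_features))
            for row, v in enumerate(values)]


cells = hash_column("country", ["FR", "US", "FR"], n_features=1048576)
```

Each row has a single non-zero cell out of `n_features` columns, which is why the result is only practical as a sparse matrix. Two distinct values can land in the same column; a large `n_features` keeps such collisions rare.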
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
CategoricalsCountTransformerGenerator
(preprocessing_settings, settings)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
get_evolution_def
()¶
-
get_input_features
()¶
-
init_resources
(mp)¶
-
process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
set_evolution_state
(es)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
CategoricalsImpactCodingTransformerGenerator
(output_name=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
CopyMultipleColumnsFromInput
(columns, output_block_name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
CustomPreprocessingStep
(input_col, code, wants_matrix, fit_and_process_only_fits=False, accepts_tensor=False)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
DropNARows
(output_name=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Drop rows containing any NA value in all DataFrame and np array blocks of the current multiframe.
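A sketch of the row-dropping logic, assuming a multiframe can be approximated by a dict of row-aligned blocks (the real multiframe type is not described on this page). The block names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical multiframe: named blocks sharing the same row order.
blocks = {
    "numericals": pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0]}),
    "embedding": np.array([[0.1, 0.2], [0.3, 0.4], [np.nan, 0.6], [0.7, 0.8]]),
}

# A row is kept only if it has no NA in any block.
keep = np.ones(4, dtype=bool)
for block in blocks.values():
    values = block.to_numpy() if isinstance(block, pd.DataFrame) else block
    keep &= ~np.isnan(values).any(axis=1)

# Apply the same boolean mask to every block so rows stay aligned.
cleaned = {name: block[keep] for name, block in blocks.items()}
```

Computing one shared mask before filtering is what keeps the blocks aligned: dropping NA rows block by block would leave them with different lengths.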
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
DropRowsWhereNoTarget
(output_name=None, allow_empty_mf=False)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Drop rows for which the target is NA (probably because it was an unknown class)
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
DropRowsWhereNoTargetNorWeight
(output_name=None, allow_empty_mf=False)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Drop rows for which the target or the weight is NA (probably because it was an unknown class)
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
DumpFullMF
(name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
DumpInputDF
(name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
DumpMFDetails
(name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
DumpPipelineState
(name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
EmitCurrentMFAsResult
(output_name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Emits the current multi frame in the result object and optionally injects a brand new multiframe in the pipeline
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
ExtractColumn
(column_name, output_name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Extracts a single column from the current multiframe and puts it as a Series in result
-
column_name
¶
-
output_name
¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
FastSparseDummifyProcessor
(input_block, input_column_name, values, should_drop)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
FeatureSelectorOutputExecStep
(selector)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Used if feature selection was already trained
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
FeatureSelectorOutputTrainStep
(selector)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Used if feature selection was not already trained
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
FileFunctionPreprocessing
(input_col, code, file_reader, func_name, fit_and_process_only_fits=True)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
FlagMissingValue2
(feature, output_block_name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
FlushDFBuilder
(block_name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
GeneratedFeaturesMapping
¶ Bases:
object
-
ONE_FEATURE_PER_COLUMN
= 'one_feature_per_column'¶
-
TO_ONE_FEATURE
= 'to_one_feature'¶
-
add_per_column_mapping
(block_name, original_name, new_name)¶
-
add_whole_block_mapping
(block_name, original_name)¶
-
get_per_column_original
(block_name, new_name)¶
-
get_whole_block_original
(block_name)¶
-
should_send_block_to_one_feature
(block_name)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
ImpactCodingStep
(input_block, column_name, impact_coder, target_variable, output_block)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
report_fit
(ret_obj, core_params)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
ImpactCodingStep2
(input_block, column_name, target_variable, output_block)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(mp)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
MultipleImputeMissingFromInput
(impute_map, output_block_name, keep_output_block, as_categorical)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Multi-column impute missing values. A sub-df is extracted from the input df and series are fillna-ed.
The sub-df is added as a single output block
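In pandas terms, the extraction and fill described above can be sketched as follows. The shape of `impute_map` (column name to replacement value) is an assumption, as are the column names.

```python
import numpy as np
import pandas as pd

input_df = pd.DataFrame({
    "age": [31.0, np.nan, 45.0],
    "income": [np.nan, 52000.0, 48000.0],
    "city": ["Paris", "Lyon", None],
})

# Hypothetical impute map: column name -> replacement value.
impute_map = {"age": 38.0, "income": 50000.0}

# Extract the sub-dataframe, then fill each series with its own value.
sub_df = input_df[list(impute_map)].fillna(value=impute_map)
```

`sub_df` is a new dataframe, so `input_df` keeps its missing values and columns absent from the map (here `city`) are left out of the block.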
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
NumericalCategoricalInteraction
(out_block, cat, num, max_features)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
NumericalDerivativesGenerator
(in_block, out_block, features)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Generates derived features from selected numerical features in a block: square, log and sqrt.
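A sketch of the three transformations, assuming strictly positive inputs (how the class treats zero or negative values is not stated here). `numerical_derivatives` and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def numerical_derivatives(block, features):
    """Return square, log and sqrt columns for each selected feature."""
    out = {}
    for name in features:
        series = block[name]
        out[f"{name}^2"] = series ** 2
        out[f"log({name})"] = np.log(series)
        out[f"sqrt({name})"] = np.sqrt(series)
    return pd.DataFrame(out, index=block.index)


block = pd.DataFrame({"x": [1.0, 4.0, 9.0]})
derived = numerical_derivatives(block, ["x"])
```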
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
NumericalFeaturesClusteringGenerator
(preprocessing_settings, settings)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
get_evolution_def
()¶
-
get_numerical_features
()¶
-
init_resources
(mp)¶
-
perform_replacement
(cur_mf, df, kmeans)¶
-
process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
set_evolution_state
(es)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
NumericalNumericalInteraction
(out_block, column_1, column_2, rescale)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
OutlierDetection
(pca_kept_variance, min_n, min_cum_ratio, outlier_name='OUTLIERS', random_state=1337)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Performs outlier detection. Emits a new multiframe in the output. Does not touch the main multiframe.
-
fit_and_process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(mp)¶
-
process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
PCAStep
(pca, input_name, output_name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
normalize
(df)¶
-
process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
PairwiseLinearCombinationsGenerator
(in_block, out_block, features)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
report_fit
(ret_obj, core_params)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
PreprocessingPipeline
(steps)¶ Bases:
object
-
fit_and_process
(input_df, *args, **kwargs)¶
-
generated_features_mapping
¶
-
init_resources
(resource_handler)¶
-
process
(input_df, retain=None)¶
-
report_fit
(ret_obj, core_params)¶
-
results
¶
-
steps
¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
PreprocessingResult
(retain=None)¶ Bases:
dict
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
QuantileBinSeries
(in_block, in_col, out_block, nb_bins)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(mp)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
RandomColumnsGenerator
(n_columns)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, cur_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
RealignTarget
(output_name=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
RealignWeight
(output_name=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
RemapValueToOutput
(column_name, output_name, values_map)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Remap a value from input df to an output key as a series. Used for target. Makes a deep copy
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
values_map
¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
RescalingProcessor2
(in_block, in_col, shift=None, scale=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Rescale a single series in-place in a DF block
-
static
from_avgstd
(in_block, in_col, mean, standard_deviation)¶
-
static
from_minmax
(in_block, in_col, min_value, max_value)¶
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
set_scale
(scale)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
SetInputDFAsBlock
(output_block_name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
SingleColumnDropNARows
(column_name)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Drop rows containing any NA value in input_df
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
Step
(output_name=None)¶ Bases:
object
Since the steps are used in a pipeline, it makes no sense to expose a “fit” or “partial_fit” on them. Anything that must be “fitted” but has to be handled in a streaming fashion is managed by the preprocessing collector.
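The calling pattern suggested by the `fit_and_process` and `process` method names can be sketched as below. The classes are hypothetical stand-ins with simplified signatures (plain lists instead of dataframes and multiframes), not the ones documented on this page.

```python
class SketchStep:
    """Hypothetical stand-in for a pipeline step."""

    def process(self, rows):
        raise NotImplementedError

    def fit_and_process(self, rows):
        # Default: a step with nothing to learn simply processes.
        return self.process(rows)


class CenterStep(SketchStep):
    """Learns a mean during fit_and_process and reuses it in process."""

    def fit_and_process(self, rows):
        self.mean = sum(rows) / len(rows)
        return self.process(rows)

    def process(self, rows):
        return [value - self.mean for value in rows]


class SketchPipeline:
    """Runs each step in order, feeding it the previous step's output."""

    def __init__(self, steps):
        self.steps = steps

    def fit_and_process(self, rows):
        for step in self.steps:
            rows = step.fit_and_process(rows)
        return rows

    def process(self, rows):
        for step in self.steps:
            rows = step.process(rows)
        return rows


pipeline = SketchPipeline([CenterStep()])
train_out = pipeline.fit_and_process([1.0, 2.0, 3.0])  # learns mean = 2.0
test_out = pipeline.process([4.0])                     # reuses mean = 2.0
```

Fitting and processing in a single pass lets each step learn from the output of the steps before it, which a separate `fit` on the raw input could not do.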
-
static
drop_rows
(idx, current_mf, input_df)¶
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
report_fit
(ret_obj, core_params)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
TextCountVectorizerProcessor
(column_name, min_df, max_df, max_features, min_gram=1, max_gram=2, stop_words=None, custom_code=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.BaseCountVectorizerProcessor
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(mp)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
TextHashingVectorizerProcessor
(column_name, n_features=1048576)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Hashing trick for text features using Bag of words. http://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick
This creates an extremely huge sparse matrix and should only be used with algorithms that support them.
It takes values directly from the input df since we don’t do other preprocessing for these features
-
column_name
¶
-
n_features
¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
TextHashingVectorizerWithSVDProcessor
(column_name, svd_res, n_features=100, n_hash=200000, svd_limit=50000)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
Use a restricted version of the hashing trick. http://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick
This is designed to be used with dense matrices. Rather than keeping the huge sparse matrix, it first creates it and then applies an SVD to keep only a small number (10-50) of features. It takes values directly from the input df since we don’t do other preprocessing for these features.
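The two stages can be sketched with NumPy. For brevity the hashed matrix is built as a dense array here, whereas the description above implies a sparse intermediate; the tokenization, the hash function and the helper names are assumptions.

```python
import zlib

import numpy as np


def hashed_counts(texts, n_hash):
    """Bag-of-words counts, with each token hashed to one of n_hash columns."""
    matrix = np.zeros((len(texts), n_hash))
    for row, text in enumerate(texts):
        for token in text.lower().split():
            matrix[row, zlib.crc32(token.encode("utf-8")) % n_hash] += 1.0
    return matrix


def reduce_with_svd(matrix, n_features):
    """Project rows onto the top n_features right singular vectors."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    components = vt[:n_features]
    return matrix @ components.T, components


texts = ["the cat sat", "the dog sat", "a bird flew away", "the cat flew"]
counts = hashed_counts(texts, n_hash=64)          # wide hashed matrix
reduced, components = reduce_with_svd(counts, 2)  # small dense matrix
```

Keeping `components` is what allows new text to be projected later with `hashed_counts(new_texts, 64) @ components.T`, without refitting the SVD.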
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
TextTFIDFVectorizerProcessor
(column_name, min_df, max_df, max_features, min_gram=1, max_gram=2, stop_words=None, custom_code=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.BaseCountVectorizerProcessor
-
fit_and_process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
init_resources
(mp)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
class
deepinsight.doctor.preprocessing.dataframe_preprocessing.
UnfoldVectorProcessor
(input_column_name, vector_length, in_block=None)¶ Bases:
deepinsight.doctor.preprocessing.dataframe_preprocessing.Step
-
init_resources
(resources_handler)¶
-
process
(input_df, current_mf, output_ppr, generated_features_mapping)¶
-
-
deepinsight.doctor.preprocessing.dataframe_preprocessing.
add_column_to_builder
(builder, new_column, feature, series, generated_features_mapping)¶
-
deepinsight.doctor.preprocessing.dataframe_preprocessing.
append_sparse_with_prefix
(current_mf, prefix, input_column_name, matrix, generated_features_mapping)¶
-
deepinsight.doctor.preprocessing.dataframe_preprocessing.
cubic_root
(x)¶
-
deepinsight.doctor.preprocessing.dataframe_preprocessing.
detect_outliers
(df, pca_kept_variance=0.9, min_n=0, min_cum_ratio=0.01, random_state=1337)¶
deepinsight.doctor.preprocessing.impact_coding module¶
-
class
deepinsight.doctor.preprocessing.impact_coding.
CategoricalImpactCoding
(m=10)¶ Bases:
deepinsight.doctor.preprocessing.impact_coding.ImpactCoding
-
compute_impact_map
(serie, target_serie)¶
-
-
class
deepinsight.doctor.preprocessing.impact_coding.
ContinuousImpactCoding
(m=10, rescaling=False, scaler=None)¶ Bases:
deepinsight.doctor.preprocessing.impact_coding.ImpactCoding
-
compute_impact_map
(serie, target_serie)¶
-
rescaling
¶
-
scaler
¶
-
-
class
deepinsight.doctor.preprocessing.impact_coding.
ImpactCoding
(m=10)¶ Bases:
object
ImpactCoding is an alternative way to cope with categorical values in a regression or in a classification project.
The base idea is to replace categorical values by their overall observed impact on the target value.
For instance, let’s consider a dataset with 5000 persons. We aim at predicting their height. Their home country is a feature of the dataset, but it can take as many as 300 different values.
Impact coding consists of replacing the country information by the average height of the people in their home country. (Note that it may not be a good idea if, for instance, the ratio of men and women is different in these countries.)
Because some countries may be underrepresented, we prefer to use a more robust estimate of the average. Here we simply use additive smoothing, i.e. if a category is represented X times, we compute lambda = X/(X+10) and instead of CAT_AVG, we use lambda*CAT_AVG + (1-lambda) * TARGET_AVG (so when a category has very low cardinality like 2 or 3, most of its actual value is smoothed by the global average)
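The smoothing formula above can be worked through on a toy dataset. The sketch assumes the constant 10 corresponds to the `m=10` constructor default; `impact_map` and the data are hypothetical.

```python
import pandas as pd


def impact_map(feature, target, m=10):
    """Smoothed per-category target mean, indexed by category value."""
    global_mean = target.mean()
    stats = target.groupby(feature).agg(["mean", "count"])
    lam = stats["count"] / (stats["count"] + m)
    return lam * stats["mean"] + (1 - lam) * global_mean


# Ten people from FR (average 175) and only two from IS (average 185).
country = pd.Series(["FR"] * 10 + ["IS"] * 2)
height = pd.Series([175.0] * 10 + [185.0] * 2)
impact = impact_map(country, height, m=10)
```

The global average is about 176.67. `FR` has lambda = 10/20 = 0.5 and is coded as roughly 175.83, while `IS` has lambda = 2/12 and is coded as roughly 178.06, pulled most of the way from its raw average of 185 towards the global one.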
-
DEFAULT_VALUE
= '__default__'¶
-
NULL
= '__NULL__'¶
-
compute_impact_map
(serie, target_serie)¶ Compute the impact coding value map.
Given a series of values for a categorical feature and the corresponding series of target values, returns the map of impact values as a dataframe indexed by the series values.
-
default_value
()¶
-
fit
(serie, target_serie)¶
-
fit_transform
(X, target)¶
-
get_reportable_map
()¶
-
is_fitted
()¶
-
m
¶
-
transform
(serie)¶
-
-
class
deepinsight.doctor.preprocessing.impact_coding.
NestedKFoldImpactCoder
¶ Bases:
object
-
fit
(feature_series, target_series)¶
-
static
impact_coding
(data, feature, target)¶ This function does two things:
- Directly compute the impact coded series of the feature
- Compute the mapping to apply to test data and data to score
Notably, the train data does not use the mapping, to avoid leaking information. Instead, train data is computed using nested KFold.
TODO: Check whether the usage of “rsuffix” is buggy in Pandas + Python 2 when there are non-ASCII elements (even of unicode type) in the columns of the dataframes being joined.
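The leak-avoidance idea can be sketched with a single level of K-fold, which is a simplification of the nested scheme named above: each training row is coded from category means computed on the other folds, while unseen data uses a mapping learned on the whole training set. The function name, fold assignment and absence of smoothing are assumptions.

```python
import numpy as np
import pandas as pd


def out_of_fold_impact(feature, target, n_folds=3, seed=0):
    """Return (train_codes, mapping, default_mean) without target leakage."""
    default_mean = target.mean()
    # Mapping for unseen data: learned on the full training set.
    mapping = target.groupby(feature).mean()

    # Assign each row to a fold at random.
    rng = np.random.RandomState(seed)
    folds = rng.permutation(len(feature)) % n_folds

    train_codes = pd.Series(default_mean, index=feature.index, dtype=float)
    for fold in range(n_folds):
        held_out = folds == fold
        # Category means computed only on rows outside the held-out fold.
        fold_map = target[~held_out].groupby(feature[~held_out]).mean()
        coded = feature[held_out].map(fold_map).fillna(default_mean)
        train_codes.loc[held_out] = coded.to_numpy()
    return train_codes, mapping, default_mean


feature = pd.Series(["a", "a", "b", "b", "a", "b"])
target = pd.Series([1.0, 2.0, 10.0, 12.0, 3.0, 11.0])
train_codes, mapping, default_mean = out_of_fold_impact(feature, target)

# Unseen data uses the full mapping; unknown categories get the default mean.
test_codes = pd.Series(["a", "c"]).map(mapping).fillna(default_mean)
```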
-
set_data
(mapping, default_mean)¶
-
transform
(feature_series)¶
-
-
deepinsight.doctor.preprocessing.impact_coding.
lambda_weight
(n, m)¶
deepinsight.doctor.preprocessing.pca module¶
-
class
deepinsight.doctor.preprocessing.pca.
PCA
(kept_variance, normalize=False, prefix='factor_')¶ Bases:
object
Implements PCA for DataFrames.
Supports pre-normalization given frozen parameters. A normalization step can be included before performing the PCA.
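A sketch of how a `kept_variance` threshold can determine the number of components, using NumPy's SVD. This is an illustration of the general technique under that assumption, not the class's implementation; `fit_pca` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_pca(X, kept_variance):
    """Return (mean, components) keeping enough axes for kept_variance."""
    mean = X.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # Smallest number of components whose cumulative ratio reaches the target.
    n_components = int(np.searchsorted(cumulative, kept_variance) + 1)
    n_components = min(n_components, len(cumulative))
    return mean, vt[:n_components]


rng = np.random.RandomState(0)
base = rng.normal(size=(200, 1))
# Three columns: the second is almost a multiple of the first, the third is
# low-variance noise, so one direction carries nearly all the variance.
X = np.hstack([
    base,
    2 * base + 0.01 * rng.normal(size=(200, 1)),
    rng.normal(scale=0.05, size=(200, 1)),
])

mean, components = fit_pca(X, kept_variance=0.9)
factors = (X - mean) @ components.T
```

Storing `mean` and `components` at fit time is what lets later data be transformed with frozen parameters, in the spirit of the pre-normalization note above.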
-
do_normalize
¶
-
fit
(df)¶
-
fit_transform
(df)¶
-
get_stats
(df, column_name)¶
-
input_columns
¶
-
kept_variance
¶
-
n_components
¶
-
normalize
(df)¶
-
output_columns
¶
-
pca
¶
-
prefix
¶
-
stats
¶
-
transform
(df)¶
-