deepinsight.core package

Submodules

deepinsight.core.base module

class deepinsight.core.base.Computable

Bases: object

add_read_partitions(spec)

Add a partition or range of partitions to read.

The spec argument must be given in the partition spec format. You cannot manually set partitions when running inside a Python recipe; they are set automatically from the recipe’s dependencies.

full_name
set_write_partition(spec)

Sets which partition of the dataset gets written to when you create a DatasetWriter. Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow.

deepinsight.core.base.get_data_home()
deepinsight.core.base.get_shared_secret()

deepinsight.core.dataset module

class deepinsight.core.dataset.Dataset(name, project_id=None, ignore_flow=False)

Bases: object

This is a handle to obtain readers and writers on a DeepInsight dataset. From this Dataset class, you can:

  • Read a dataset as a Pandas dataframe
  • Read a dataset as a chunked Pandas dataframe
  • Read a dataset row-by-row
  • Write a Pandas dataframe to a dataset
  • Write a series of chunked Pandas dataframes to a dataset
  • Write to a dataset row-by-row
  • Edit the schema of a dataset
add_read_partitions(spec)

Add a partition or range of partitions to read.

The spec argument must be given in the partition spec format. You cannot manually set partitions when running inside a Python recipe; they are set automatically from the recipe’s dependencies.
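
A minimal sketch (the dataset name and partition spec are hypothetical, and this only works outside a Python recipe):

    from deepinsight.core.dataset import Dataset

    ds = Dataset("transactions")
    ds.add_read_partitions("2024-01-01")  # hypothetical partition spec
    df = ds.get_dataframe()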

full_name
get_config()
get_dataframe(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, infer_with_pandas=True, parse_dates=True, bool_as_str=False, float_precision=None)

Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.

Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.

Keyword arguments:

  • columns – When not None, returns only the given list of columns (default None)

  • limit – Limits the number of rows returned (default None)

  • sampling – Sampling method, one of:

    • ‘head’ returns the first rows of the dataset. Incompatible with the ratio parameter.
    • ‘random’ returns a random sample of the dataset
    • ‘random-column’ returns a random sample of the dataset. Incompatible with the limit parameter.
  • sampling_column – Select the column used for ‘random-column’ sampling (default None)

  • ratio – Limits the result to this fraction of the dataset (default None)

  • infer_with_pandas – use the types detected by pandas rather than the dataset schema as defined in DeepInsight (default True)

  • parse_dates – Date columns in the dataset schema are parsed (default True)

  • bool_as_str – Leave boolean values as strings (default False)

Inconsistent sampling parameters raise a ValueError.

Note about encoding:

  • Column labels are “unicode” objects
  • When a column is of string type, the content is made of utf-8 encoded “str” objects
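
A sketch of a typical call (the dataset and column names are hypothetical):

    from deepinsight.core.dataset import Dataset

    ds = Dataset("customers")
    # Random 10% sample of two columns; 'head' would be incompatible with ratio
    df = ds.get_dataframe(columns=["customer_id", "age"], sampling="random", ratio=0.1)
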
static get_dataframe_schema_st(schema, columns=None, parse_dates=True, infer_with_pandas=False, bool_as_str=False)
get_files_info(partitions=[])
get_last_metric_values(partition='')

Get the set of last values of the metrics on this dataset, as a deepinsight.ComputedMetrics object

get_location_info(sensitive_info=False)
get_metric_history(metric_lookup, partition='')

Get the set of all values a given metric took on this dataset

Parameters:
  • metric_lookup – metric name or unique identifier
  • partition – optionally, the partition for which the values are to be fetched
get_writer()

Get a stream writer for this dataset (or its target partition, if applicable). The writer must be closed as soon as you no longer need it.

The schema of the dataset MUST be set before using this. If you don’t set the schema, your data will generally not be stored by the output writers.
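
A sketch of the recommended pattern (dataset, column names and types are hypothetical):

    from deepinsight.core.dataset import Dataset

    ds = Dataset("output_dataset")
    ds.write_schema([{"name": "id", "type": "int"},
                     {"name": "label", "type": "string"}])
    writer = ds.get_writer()
    try:
        writer.write_row_dict({"id": 1, "label": u"first"})  # strings as Unicode
    finally:
        writer.close()  # always close, even on error
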
iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, float_precision=None)

Read the dataset to Pandas dataframes by chunks of fixed size.

Returns a generator over pandas dataframes.

Useful if the dataset doesn’t fit in RAM.
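
For example, to count the rows of a dataset too large for memory (the dataset name is hypothetical):

    from deepinsight.core.dataset import Dataset

    total_rows = 0
    for chunk_df in Dataset("events").iter_dataframes(chunksize=50000):
        total_rows += len(chunk_df)
    print(total_rows)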

iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None)
iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

Returns a generator on the rows (as a dict-like object) of the data (or its selected partitions, if applicable)

Keyword arguments:

  • limit – maximum number of rows to be emitted
  • log_every – print out the number of rows read on stdout

Field values are cast according to their types. Strings are parsed into “unicode” values.
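
A sketch (the dataset and column names are hypothetical):

    from deepinsight.core.dataset import Dataset

    for row in Dataset("events").iter_rows(limit=10):
        print(row.get("user_id"))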

iter_tuples(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

Returns the rows of the dataset as tuples. The order and type of the values match the dataset’s schema.

Keyword arguments:

  • limit – maximum number of rows to be emitted
  • log_every – print out the number of rows read on stdout
  • timeout – time (in seconds) of inactivity after which the generator is closed if nothing has been read. Without it, notebooks tend to leak “DMC” processes.

Field values are cast according to their types. Strings are parsed into “unicode” values.
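
A sketch (the dataset name is hypothetical):

    from deepinsight.core.dataset import Dataset

    for t in Dataset("events").iter_tuples(limit=10):
        print(t[0])  # values follow the order of the dataset schema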

static list(project_id=None)

Lists the names of datasets. If project_id is None, the current project id is used.

list_partitions(raise_if_empty=True)

List the partitions of this dataset, as an array of partition specifications

raw_formatted_data(sampling=None, columns=None, format='tsv-excel-noheader', format_params=None)

Get a stream of raw bytes from a dataset as a file-like object, formatted in a supported output format.

You MUST close the file handle. Failure to do so will result in resource leaks.
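
A sketch (the dataset name is hypothetical):

    from deepinsight.core.dataset import Dataset

    stream = Dataset("events").raw_formatted_data()
    try:
        head = stream.read(1024)  # first bytes, in tsv-excel-noheader format
    finally:
        stream.close()  # mandatory, see above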

read_metadata()

Reads the dataset metadata object

read_schema(raise_if_empty=True)

Gets the schema of this dataset, as an array of objects like this one: { ‘type’: ‘string’, ‘name’: ‘foo’, ‘maxLength’: 1000 }. There is more information for the map, array and object types.

set_preparation_steps(steps, requested_output_schema)
set_write_partition(spec)

Sets which partition of the dataset gets written to when you create a DatasetWriter. Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow.

write_from_dataframe(df, infer_schema=False, dropAndCreate=False)

Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.

This variant does not edit the schema of the output dataset, so you must take care to only write dataframes that have a compatible schema. Also see “write_with_schema”.

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Arguments: df – input Pandas dataframe.

write_metadata(meta)

Writes the dataset metadata object

write_schema(columns, dropAndCreate=False)

Write the dataset schema into the dataset JSON definition file.

Sometimes, the schema of a dataset being written is known only by the code of the Python script itself. In that case, it can be useful for the Python script to actually modify the schema of the dataset. Obviously, this must be used with caution. ‘columns’ must be an array of dicts like { ‘name’ : ‘column name’, ‘type’ : ‘column type’}
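
For example (dataset, column names and types are hypothetical):

    from deepinsight.core.dataset import Dataset

    Dataset("output_dataset").write_schema([
        {"name": "customer_id", "type": "string"},
        {"name": "score", "type": "double"},
    ])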

write_schema_from_dataframe(df, dropAndCreate=False)
write_with_schema(df, dropAndCreate=False)

Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.

This variant replaces the schema of the output dataset with the schema of the dataframe.

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
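
A sketch (the dataset name is hypothetical):

    import pandas as pd
    from deepinsight.core.dataset import Dataset

    df = pd.DataFrame({"id": [1, 2], "score": [0.1, 0.9]})
    Dataset("scores").write_with_schema(df)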

class deepinsight.core.dataset.DatasetCursor(val, col_names, col_idx)

Bases: object

A dataset cursor that helps iterating on rows.

column_id(name)
get(col_name, default_value=None)
items()
keys()
values()
class deepinsight.core.dataset.IteratorWithTimeOut(iterator, timeout=-1)

Bases: object

check_timeout()
generator
iterate()
iterator
state
timeout
touched
wake_me_up
class deepinsight.core.dataset.Schema(data)

Bases: list

deepinsight.core.dataset.create_sampling_argument(sampling='head', sampling_column=None, limit=None, ratio=None)
deepinsight.core.dataset.none_if_throws(f)
deepinsight.core.dataset.parse_local_date(s)
deepinsight.core.dataset.unique(g)

deepinsight.core.dataset_write module

class deepinsight.core.dataset_write.DatasetWriter(dataset)

Bases: object

Handle to write to a dataset. Use Dataset.get_writer() to obtain a DatasetWriter.

Very important: a DatasetWriter MUST be closed after use. Failure to close a DatasetWriter will lead to incomplete data, or no data at all, being written to the output dataset.

active_writers = {}
static atexit_handler()
close()

Closes this dataset writer

write_dataframe(df)

Appends a Pandas dataframe to the dataset being written.

This method can be called multiple times (especially when you have been using iter_dataframes to read from an input dataset)

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
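
A sketch of the chunked-copy pattern this enables (dataset names are hypothetical):

    from deepinsight.core.dataset import Dataset

    inp = Dataset("raw_events")
    out = Dataset("clean_events")
    writer = None
    try:
        for chunk in inp.iter_dataframes(chunksize=10000):
            if writer is None:
                # Set the output schema from the first chunk, then open the writer
                out.write_schema_from_dataframe(chunk)
                writer = out.get_writer()
            writer.write_dataframe(chunk)
    finally:
        if writer is not None:
            writer.close()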

write_row_array(row)
write_row_dict(row_dict)

Write a single row from a dict of column name -> column value.

Some columns can be omitted; empty values will be inserted instead.

Note: The schema of the dataset MUST be set before using this.

Encoding note: strings MUST be given as Unicode objects. Giving str objects will fail.

write_tuple(row)

Write a single row from a tuple or list of column values. Columns must be given in the order of the dataset schema.

Note: The schema of the dataset MUST be set before using this.

Encoding note: strings MUST be given as Unicode objects. Giving str objects will fail.

class deepinsight.core.dataset_write.FakeDatasetWriter(dataset)

Bases: object

For tests only

write_dataframe(df)
class deepinsight.core.dataset_write.RemoteStreamWriter(id, waiter)

Bases: threading.Thread

close()
flush()
read()
run()
write(data)
class deepinsight.core.dataset_write.StreamingAPI

Bases: object

init_write_session(request)
push_data(id, generator)
wait_write_session(id)
exception deepinsight.core.dataset_write.TimeoutExpired

Bases: Exception

class deepinsight.core.dataset_write.TimeoutableQueue(size)

Bases: queue.Queue

join_with_timeout(timeout)
class deepinsight.core.dataset_write.WriteSessionWaiter(session_id, session_init_message)

Bases: threading.Thread

is_still_alive()
raise_on_failure()
run()
wait_end()

deepinsight.core.debugging module

deepinsight.core.debugging.debug_sighandler(sig, frame)

Interrupt the running process and provide a Python prompt for interactive debugging.

deepinsight.core.debugging.install_handler()
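
A sketch of intended usage (which signal install_handler() registers is an assumption of this example):

    from deepinsight.core import debugging

    debugging.install_handler()
    # Later, send the registered signal to the process (e.g. kill -USR1 <pid>)
    # to interrupt it and get an interactive Python prompt.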

deepinsight.core.dt_pandas_csv module

class deepinsight.core.dt_pandas_csv.DTCSVFormatter(obj, path_or_buf=None, sep=', ', na_rep='', float_format=None, cols=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, line_terminator='\n', chunksize=None, tupleize_cols=False, quotechar='"', date_format=None, doublequote=True, escapechar=None, decimal='.')

Bases: object

save()
class deepinsight.core.dt_pandas_csv.Python2UTF8Writer(f, dialect=None, **kwds)

Bases: object

A CSV writer which will write rows to CSV file “f” while ensuring UTF-8 output. Supports both UTF-8-encoded str and unicode input.

Does not handle dates specially

writerow(row)
writerows(rows)
class deepinsight.core.dt_pandas_csv.Python3UTF8Writer(f, dialect=None, **kwds)

Bases: object

A CSV writer which will write rows to CSV file “f” while ensuring UTF-8 output. Does not handle dates specially.

writerow(row)
writerows(rows)

deepinsight.core.dtio module

class deepinsight.core.dtio.PipeToGeneratorThread(id, consumer)

Bases: threading.Thread

close()
flush()
new_buffer()
run()
wait_for_completion()
write(data)
class deepinsight.core.dtio.Python2UTF8CSVReader(f, **kwds)

Bases: object

A CSV reader which will iterate over lines in the CSV file-like binary object “f”, which is encoded in UTF-8.

next()
class deepinsight.core.dtio.Python2UTF8CSVWriter(f, **kwds)

Bases: object

A CSV writer which will write rows to binary CSV file “f”, encoded in UTF-8.

It also encodes dates

writerow(row)
writerows(rows)
class deepinsight.core.dtio.Python3UTF8CSVReader(f, **kwds)

Bases: object

A CSV reader which will iterate over lines in the CSV file-like binary object “f”, which is encoded in UTF-8.

class deepinsight.core.dtio.Python3UTF8CSVWriter(f, **kwds)

Bases: object

A CSV writer which will write rows to binary CSV file “f”, encoded in UTF-8.

It also encodes dates

writerow(row)
writerows(rows)
exception deepinsight.core.dtio.TimeoutExpired

Bases: Exception

class deepinsight.core.dtio.TimeoutableQueue(size)

Bases: queue.Queue

join_with_timeout(timeout)
deepinsight.core.dtio.new_bytesoriented_io(data=None)
deepinsight.core.dtio.new_utf8_csv_reader(f, **kwargs)
deepinsight.core.dtio.new_utf8_csv_writer(f, **kwargs)

deepinsight.core.dtjson module

class deepinsight.core.dtjson.JSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: json.encoder.JSONEncoder

default(obj)
deepinsight.core.dtjson.dump(f, obj)

Write human-readable JSON

We first serialize the object to avoid corrupting the file if the object is not serializable.

deepinsight.core.dtjson.dump_to_filepath(filepath, obj)

Write human-readable JSON

We first serialize the object to avoid corrupting the file if the object is not serializable.
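
A sketch (the file path and object are hypothetical):

    from deepinsight.core import dtjson

    dtjson.dump_to_filepath("/tmp/state.json", {"step": 1, "done": False})
    state = dtjson.load_from_filepath("/tmp/state.json")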

deepinsight.core.dtjson.dumps(*args, **kvargs)
deepinsight.core.dtjson.load_from_filepath(filepath)
deepinsight.core.dtjson.set_default_decorator(fn, **param)

deepinsight.core.flow module

deepinsight.core.flow.load_flow_spec()

deepinsight.core.intercom module

deepinsight.core.intercom.backend_api_get_call(path, data, **kwargs)

For read-only calls that can go directly to the backend

deepinsight.core.intercom.backend_api_post_call(path, data, **kwargs)

For calls that can go directly to the backend

deepinsight.core.intercom.backend_api_put_call(path, data, **kwargs)

For calls that can go directly to the backend

deepinsight.core.intercom.backend_get_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.backend_json_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.backend_put_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.backend_stream_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.backend_void_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.create_session_if_needed()
deepinsight.core.intercom.get_auth_headers()
deepinsight.core.intercom.get_backend_url()
deepinsight.core.intercom.get_jek_url()
deepinsight.core.intercom.get_location_data()
deepinsight.core.intercom.has_a_jek()
deepinsight.core.intercom.jek_api_get_call(path, data, **kwargs)

For read-only calls that can go directly to the jek

deepinsight.core.intercom.jek_api_post_call(path, data, **kwargs)

For calls that go directly to the jek

deepinsight.core.intercom.jek_get_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.jek_json_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.jek_or_backend_get_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.jek_or_backend_json_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.jek_or_backend_stream_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.jek_or_backend_void_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.jek_stream_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.jek_void_call(path, data=None, err_msg=None, **kwargs)
deepinsight.core.intercom.set_remote_dt(url, api_key, no_check_certificate=False)

deepinsight.core.metrics module

class deepinsight.core.metrics.ComputedMetrics(raw)

Bases: object

Handle to the metrics of a DeepInsight object and their last computed value

get_all_ids()

Get the identifiers of all metrics defined in this object

get_data(metric_id)

Get the global value point of a given metric, or throws.

For a partitioned dataset, the global value is the value of the metric computed on the whole dataset (coded as partition ‘ALL’).

Parameters: metric_id – unique identifier of the metric
get_metric_by_id(metric_id)

Retrieve the info for a given metric

Parameters: metric_id – unique identifier of the metric
get_partition_data(metric_id, partition)

Get the value point of a given metric for a given partition, or throws.

Parameters:
  • metric_id – unique identifier of the metric
  • partition – partition identifier
get_partition_value(metric_id, partition)

Get the value of a given metric for a given partition, or throws.

Parameters:
  • metric_id – unique identifier of the metric
  • partition – partition identifier
get_value(metric_id)

Get the global value of a given metric, or throws.

For a partitioned dataset, the global value is the value of the metric computed on the whole dataset (coded as partition ‘ALL’).

Parameters: metric_id – unique identifier of the metric
static get_value_from_data(data)

Retrieves the value from a metric point, cast to the appropriate type (str, int or float).

For other types, the value is not cast and is left as a string.

Parameters: data – a value point for a metric, retrieved with deepinsight.ComputedMetrics.get_data() or deepinsight.ComputedMetrics.get_partition_data()
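
A sketch of reading the last metric values of a dataset (the dataset name is hypothetical):

    from deepinsight.core.dataset import Dataset
    from deepinsight.core.metrics import ComputedMetrics

    metrics = Dataset("events").get_last_metric_values()
    for metric_id in metrics.get_all_ids():
        point = metrics.get_data(metric_id)
        print(metric_id, ComputedMetrics.get_value_from_data(point))
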
class deepinsight.core.metrics.MetricDataPoint(raw)

Bases: object

A value of a metric, on a partition

get_compute_time()

Returns the time at which the value was computed

get_metric()

Returns the metric as a JSON object

get_metric_id()

Returns the metric’s id

get_partition()

Returns the partition on which the value was computed

get_type()

Returns the type of the value

get_value()

Returns the value of the metric, as a string

deepinsight.core.pandasutils module

deepinsight.core.pandasutils.getSeriesNonzero(series)
deepinsight.core.pandasutils.split_train_valid(df, prop=0.8, seed=None)

Takes an input dataframe df and splits it into two dataframes according to prop, the proportion of rows going to the first one (80% by default).
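
A sketch (assuming the two dataframes are returned as a tuple):

    import pandas as pd
    from deepinsight.core import pandasutils

    df = pd.DataFrame({"x": range(100)})
    train_df, valid_df = pandasutils.split_train_valid(df, prop=0.8, seed=42)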

deepinsight.core.saved_model module

class deepinsight.core.saved_model.BasePredictor(params, clf)

Bases: object

Object used to preprocess a dataframe and make predictions on it.

get_classes()

Returns the classes from which this model will predict if a classifier, None if a regressor

get_conditional_output_names()

Returns the name of all conditional outputs defined for this model (note: limited to binary classifiers)

get_proba_columns()

Returns the names of the probability columns if a classifier, None if a regressor

class deepinsight.core.saved_model.EnsemblePredictor(params, clf)

Bases: deepinsight.core.saved_model.BasePredictor

A predictor for Ensemble models. Unlike regular models, they have neither a preprocessing nor feature names (the underlying models use different features and preprocessings). Calls to preprocess, get_preprocessing and get_features will therefore raise an AttributeError.

get_prediction_dataframe(input_df, with_prediction, with_probas, with_conditional_outputs, with_proba_percentile)
predict(df, with_input_cols=False, with_prediction=True, with_probas=True, with_conditional_outputs=False, with_proba_percentile=False)

Predict a dataframe. The results are returned as a dataframe with prediction columns added.

class deepinsight.core.saved_model.KerasPredictor(params, preprocessing, model, modeling_params, batch_size=100)

Bases: deepinsight.core.saved_model.Predictor

class deepinsight.core.saved_model.KerasPreprocessing(pipeline, modeling_params, per_feature)

Bases: deepinsight.core.saved_model.Preprocessing

preprocess(df)
class deepinsight.core.saved_model.Model(lookup, project_id=None, ignore_flow=False)

Bases: deepinsight.core.base.Computable

This is a handle to interact with a saved model

activate_version(version_id)

Activate a version in the model

Parameters: version_id – the unique identifier of the version to activate
get_definition()
get_id()

Get the unique identifier of the model

get_info()
get_name()

Get the name of the model

get_predictor(version_id=None)

Returns a Predictor for the given version of this Saved Model. If no version is specified, the current active version will be used.

get_type()

Get the type of the model: ‘prediction’ or ‘clustering’

static list_models(project_id=None)

Retrieve the list of saved models

Parameters: project_id – key of the project from which to list models
list_versions()

List the versions this saved model contains

class deepinsight.core.saved_model.ModelParams(model_type, modeling_params, preprocessing_params, core_params, schema, user_meta, model_perf, conditional_outputs, cluster_name_map)

Bases: object

class deepinsight.core.saved_model.Predictor(params, preprocessing, features, clf)

Bases: deepinsight.core.saved_model.BasePredictor

Object used to preprocess a dataframe and make predictions on it.

get_features()

Returns the feature names generated by this predictor’s preprocessing

get_preprocessing()
predict(df, with_input_cols=False, with_prediction=True, with_probas=True, with_conditional_outputs=False, with_proba_percentile=False)

Predict a dataframe. The results are returned as a dataframe with columns corresponding to the various prediction information.

Parameters:
  • with_input_cols – whether the input columns should also be present in the output
  • with_prediction – whether the prediction column should be present
  • with_probas – whether the probability columns should be present
  • with_conditional_outputs – whether the conditional outputs for this model should be present (binary classification only)
  • with_proba_percentile – whether the percentile of the probability should be present (binary classification only)
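
A sketch of scoring a dataframe (the dataset and model names are hypothetical):

    from deepinsight.core.dataset import Dataset
    from deepinsight.core.saved_model import Model

    input_df = Dataset("to_score").get_dataframe()
    predictor = Model("churn_model").get_predictor()  # active version
    scored_df = predictor.predict(input_df, with_probas=True)
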
preprocess(df)

Preprocess a dataframe. The results are returned as a numpy 2-dimensional matrix (which may be sparse). The columns of this matrix correspond to the generated features, which can be listed with the get_features method of this Predictor.

class deepinsight.core.saved_model.Preprocessing(pipeline, modeling_params)

Bases: object

preprocess(df)
deepinsight.core.saved_model.build_predictor(model_type, model_folder, preprocessing_folder, conditional_outputs, core_params, split_desc)
deepinsight.core.saved_model.build_predictor_for_saved_model(model_folder, model_type, conditional_outputs)
deepinsight.core.saved_model.is_model_prediction(model_type)

deepinsight.core.schema_handling module

deepinsight.core.schema_handling.get_schema_from_df(df)

A simple function that returns a DeepInsight schema from a Pandas dataframe, to be used when writing to a dataset from a data frame
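
A sketch (assuming the returned schema is accepted by Dataset.write_schema; names are hypothetical):

    import pandas as pd
    from deepinsight.core import schema_handling
    from deepinsight.core.dataset import Dataset

    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    schema = schema_handling.get_schema_from_df(df)
    Dataset("output_dataset").write_schema(schema)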

deepinsight.core.schema_handling.pandas_dt_type(dtype)

Return the DeepInsight type for a Pandas dtype

deepinsight.core.schema_handling.parse_local_date(s)
deepinsight.core.schema_handling.str_to_bool(s)

deepinsight.core.sql module

class deepinsight.core.sql.HiveExecutor(dataset=None, database='default', connection=None)

Bases: deepinsight.core.sql._HiveLikeExecutor

static exec_recipe_fragment(query, overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={})
query_to_df(query, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)
query_to_iter(query, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)
class deepinsight.core.sql.ImpalaExecutor(dataset=None, database='default', connection=None)

Bases: deepinsight.core.sql._HiveLikeExecutor

static exec_recipe_fragment(output_dataset, query, overwrite_output_schema=True, use_stream_mode=True)
query_to_df(query, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)
query_to_iter(query, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)
class deepinsight.core.sql.QueryReader(connection, query, find_connection_from_dataset=False, db_type='sql', extra_conf={}, timeOut=600000, script_steps=None, script_input_schema=None, script_output_schema=None)

Bases: object

get_schema()
iter_tuples(log_every=-1, no_header=False)
class deepinsight.core.sql.SQLExecutor(connection=None, dataset=None)

Bases: object

static exec_recipe_fragment(output_dataset, query, overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)
query_to_df(query, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)
query_to_iter(query, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)
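
A sketch of a typical query (the connection name and table are hypothetical):

    from deepinsight.core.sql import SQLExecutor

    executor = SQLExecutor(connection="my_sql_connection")
    df = executor.query_to_df("SELECT COUNT(*) AS n FROM mytable")
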
class deepinsight.core.sql.SparkExecutor(dataset=None, database='default', connection=None)

Bases: deepinsight.core.sql._HiveLikeExecutor

query_to_df(query, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)
query_to_iter(query, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)

Module contents

deepinsight.core.default_project_id()