deepinsight package¶
Subpackages¶
- deepinsight.apinode package
- deepinsight.base package
- deepinsight.core package
- Submodules
- deepinsight.core.base module
- deepinsight.core.dataset module
- deepinsight.core.dataset_write module
- deepinsight.core.debugging module
- deepinsight.core.dt_pandas_csv module
- deepinsight.core.dtio module
- deepinsight.core.dtjson module
- deepinsight.core.flow module
- deepinsight.core.intercom module
- deepinsight.core.metrics module
- deepinsight.core.pandasutils module
- deepinsight.core.saved_model module
- deepinsight.core.schema_handling module
- deepinsight.core.sql module
- Module contents
- deepinsight.doctor package
- Subpackages
- deepinsight.doctor.clustering package
- Submodules
- deepinsight.doctor.clustering.anomaly_detection module
- deepinsight.doctor.clustering.clustering_fit module
- deepinsight.doctor.clustering.clustering_scorer module
- deepinsight.doctor.clustering.common module
- deepinsight.doctor.clustering.reg_cluster_recipe module
- deepinsight.doctor.clustering.reg_scoring_recipe module
- deepinsight.doctor.clustering.reg_train_recipe module
- deepinsight.doctor.clustering.two_step_clustering module
- Module contents
- deepinsight.doctor.crossval package
- deepinsight.doctor.deep_learning package
- Submodules
- deepinsight.doctor.deep_learning.gpu module
- deepinsight.doctor.deep_learning.keras_callbacks module
- deepinsight.doctor.deep_learning.keras_support module
- deepinsight.doctor.deep_learning.keras_utils module
- deepinsight.doctor.deep_learning.load_model module
- deepinsight.doctor.deep_learning.preprocessing module
- deepinsight.doctor.deep_learning.sequences module
- deepinsight.doctor.deep_learning.shared_variables module
- Module contents
- deepinsight.doctor.posttraining package
- deepinsight.doctor.prediction package
- Submodules
- deepinsight.doctor.prediction.classification_fit module
- deepinsight.doctor.prediction.classification_scoring module
- deepinsight.doctor.prediction.common module
- deepinsight.doctor.prediction.dt_xgboost module
- deepinsight.doctor.prediction.ensembles module
- deepinsight.doctor.prediction.feature_selection module
- deepinsight.doctor.prediction.keras_evaluation_recipe module
- deepinsight.doctor.prediction.keras_scoring_recipe module
- deepinsight.doctor.prediction.lars module
- deepinsight.doctor.prediction.prediction_model_serialization module
- deepinsight.doctor.prediction.reg_evaluation_recipe module
- deepinsight.doctor.prediction.reg_scoring_recipe module
- deepinsight.doctor.prediction.reg_train_recipe module
- deepinsight.doctor.prediction.regression_fit module
- deepinsight.doctor.prediction.regression_scoring module
- deepinsight.doctor.prediction.scoring_base module
- Module contents
- deepinsight.doctor.preprocessing package
- deepinsight.doctor.utils package
- Submodules
- deepinsight.doctor.utils.calibration module
- deepinsight.doctor.utils.crossval module
- deepinsight.doctor.utils.dataframe_cache module
- deepinsight.doctor.utils.interrupt_optimization module
- deepinsight.doctor.utils.lift_curve module
- deepinsight.doctor.utils.listener module
- deepinsight.doctor.utils.magic_main module
- deepinsight.doctor.utils.metrics module
- deepinsight.doctor.utils.split module
- deepinsight.doctor.utils.subsampler module
- Module contents
- Submodules
- deepinsight.doctor.clustering_entrypoints module
- deepinsight.doctor.commands module
- deepinsight.doctor.constants module
- deepinsight.doctor.dtapi module
- deepinsight.doctor.exception module
- deepinsight.doctor.forest module
- deepinsight.doctor.multiframe module
- deepinsight.doctor.notebook_builder module
- deepinsight.doctor.prediction_entrypoints module
- deepinsight.doctor.preprocessing_collector module
- deepinsight.doctor.preprocessing_handler module
- deepinsight.doctor.server module
- Module contents
- deepinsight.insights package
- deepinsight.spark package
- deepinsight.timer package
Module contents¶
- class deepinsight.Dataset(name, project_id=None, ignore_flow=False)¶
Bases: object
This is a handle to obtain readers and writers on a deepinsight Dataset. From this Dataset class, you can:
- Read a dataset as a Pandas dataframe
- Read a dataset as a chunked Pandas dataframe
- Read a dataset row-by-row
- Write a pandas dataframe to a dataset
- Write a series of chunked Pandas dataframes to a dataset
- Write to a dataset row-by-row
- Edit the schema of a dataset
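For example, a minimal sketch of the read-transform-write round trip described above, assuming a project with input and output datasets named "orders" and "orders_enriched" (the dataset and column names are hypothetical):

    import deepinsight

    # Handles on an input and an output dataset of the current project
    orders = deepinsight.Dataset("orders")
    orders_enriched = deepinsight.Dataset("orders_enriched")

    # Read the input as a Pandas dataframe, add a column, then write the
    # result out, replacing the output schema with the dataframe's schema
    df = orders.get_dataframe()
    df["order_value"] = df["quantity"] * df["unit_price"]
    orders_enriched.write_with_schema(df)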
- add_read_partitions(spec)¶
Add a partition or range of partitions to read.
The spec argument must be given in the partition spec format. You cannot manually set partitions when running inside a Python recipe. They are automatically set using the dependencies.
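A sketch of restricting a read to a single partition outside of a Python recipe, assuming a dataset partitioned by day and a partition spec such as "2024-01-15" (the dataset name and spec value are hypothetical):

    import deepinsight

    logs = deepinsight.Dataset("web_logs")
    logs.add_read_partitions("2024-01-15")   # partition spec format
    df = logs.get_dataframe()                # reads only that partition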
- full_name¶
- get_config()¶
- get_dataframe(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, infer_with_pandas=True, parse_dates=True, bool_as_str=False, float_precision=None)¶
Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.
Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.
Keyword arguments:
- columns – When not None, returns only the given list of columns (default None)
- limit – Limits the number of rows returned (default None)
- sampling – Sampling method:
- 'head' returns the first rows of the dataset. Incompatible with the ratio parameter.
- 'random' returns a random sample of the dataset
- 'random-column' returns a random sample of the dataset, based on the column given in sampling_column. Incompatible with the limit parameter.
- sampling_column – Selects the column used for 'random-column' sampling (default None)
- ratio – Limits the sample to the given ratio of the dataset (default None)
- infer_with_pandas – Uses the types detected by pandas rather than the dataset schema as detected in DeepInsight (default True)
- parse_dates – Date columns in the dataset schema are parsed (default True)
- bool_as_str – Leaves boolean values as strings (default False)
Inconsistent sampling parameters raise ValueError.
Note about encoding:
- Column labels are “unicode” objects
- When a column is of string type, the content is made of utf-8 encoded “str” objects
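For instance, a sketch of a column-restricted, sampled read (the dataset and column names are hypothetical):

    import deepinsight

    customers = deepinsight.Dataset("customers")

    # Read two columns from a random sample covering roughly 10% of the
    # dataset; with 'random' sampling, ratio bounds the size of the sample
    df = customers.get_dataframe(
        columns=["customer_id", "country"],
        sampling="random",
        ratio=0.1,
    )
    print(df.shape)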
- static get_dataframe_schema_st(schema, columns=None, parse_dates=True, infer_with_pandas=False, bool_as_str=False)¶
- get_files_info(partitions=[])¶
- get_last_metric_values(partition='')¶
Get the set of last values of the metrics on this dataset, as a deepinsight.ComputedMetrics object.
- get_location_info(sensitive_info=False)¶
- get_metric_history(metric_lookup, partition='')¶
Get the set of all values a given metric took on this dataset.
Parameters:
- metric_lookup – metric name or unique identifier
- partition – optionally, the partition for which the values are to be fetched
- get_writer()¶
Get a stream writer for this dataset (or its target partition, if applicable). The writer must be closed as soon as you no longer need it.
The schema of the dataset MUST be set before using this. If you don't set the schema of the dataset, your data will generally not be stored by the output writers.
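A minimal sketch of the writer pattern under the constraints above. The schema is set first with write_schema (documented below); write_row_dict and close are assumed writer methods, since the writer object itself is not documented in this section:

    import deepinsight

    output = deepinsight.Dataset("output_dataset")   # hypothetical name
    output.write_schema([
        {"name": "id", "type": "int"},
        {"name": "label", "type": "string"},
    ])

    writer = output.get_writer()
    try:
        # write_row_dict is an assumed method of the writer object
        writer.write_row_dict({"id": 1, "label": "first"})
    finally:
        writer.close()   # close the writer as soon as it is no longer needed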
- iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, float_precision=None)¶
Read the dataset as Pandas dataframes in chunks of fixed size.
Returns a generator over Pandas dataframes.
Useful if the dataset doesn't fit in RAM.
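A sketch of chunked processing for a dataset that does not fit in RAM (the dataset name is hypothetical):

    import deepinsight

    events = deepinsight.Dataset("events")

    total_rows = 0
    # Each iteration yields a Pandas dataframe of at most `chunksize` rows
    for chunk in events.iter_dataframes(chunksize=50000):
        total_rows += len(chunk)
    print("row count:", total_rows)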
- iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None)¶
- iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)¶
Returns a generator over the rows (as dict-like objects) of the data (or its selected partitions, if applicable).
Keyword arguments:
- limit – maximum number of rows to be emitted
- log_every – print the number of rows read on stdout
Field values are cast according to their types. Strings are parsed into "unicode" values.
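For example, a row-by-row scan (the dataset and column names are hypothetical):

    import deepinsight

    transactions = deepinsight.Dataset("transactions")

    # Rows are dict-like objects; values are cast according to the schema
    for row in transactions.iter_rows(limit=100, log_every=10):
        print(row["transaction_id"], row["amount"])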
- iter_tuples(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)¶
Returns the rows of the dataset as tuples. The order and type of the values match the dataset's schema.
Keyword arguments:
- limit – maximum number of rows to be emitted
- log_every – print the number of rows read on stdout
- timeout – time of inactivity (in seconds) after which the generator is closed if nothing has been read. Without it, notebooks typically tend to leak "DMC" processes.
Field values are cast according to their types. Strings are parsed into "unicode" values.
- static list(project_id=None)¶
Lists the names of datasets. If project_id is None, the current project id is used.
- list_partitions(raise_if_empty=True)¶
List the partitions of this dataset, as an array of partition specifications.
- raw_formatted_data(sampling=None, columns=None, format='tsv-excel-noheader', format_params=None)¶
Get a stream of raw bytes from a dataset as a file-like object, formatted in a supported output format.
You MUST close the file handle. Failure to do so will result in resource leaks.
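A sketch of streaming the formatted bytes to a local file while making sure the handle is closed (the dataset name and output path are hypothetical):

    import shutil
    import deepinsight

    ds = deepinsight.Dataset("exports")
    stream = ds.raw_formatted_data(format="tsv-excel-noheader")
    try:
        with open("/tmp/exports.tsv", "wb") as out:
            shutil.copyfileobj(stream, out)
    finally:
        stream.close()   # mandatory, otherwise resources leak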
- read_metadata()¶
Reads the dataset metadata object.
- read_schema(raise_if_empty=True)¶
Gets the schema of this dataset, as an array of objects like this one: { 'type': 'string', 'name': 'foo', 'maxLength': 1000 }. There is more information for the map, array and object types.
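For example, listing column names and types from the schema (the dataset name is hypothetical):

    import deepinsight

    ds = deepinsight.Dataset("customers")
    for column in ds.read_schema():
        print(column["name"], column["type"])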
- set_preparation_steps(steps, requested_output_schema)¶
- set_write_partition(spec)¶
Sets which partition of the dataset gets written to when you create a DatasetWriter. Setting the write partition is not allowed in Python recipes, where the write partition is controlled by the Flow.
- write_from_dataframe(df, infer_schema=False, dropAndCreate=False)¶
Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.
This variant does not edit the schema of the output dataset, so you must take care to only write dataframes that have a compatible schema. Also see "write_with_schema".
Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
Arguments:
- df – input Pandas dataframe
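A sketch of this variant, assuming the output dataset's schema has already been set and matches the dataframe (the dataset and column names are hypothetical):

    import pandas as pd
    import deepinsight

    scores = deepinsight.Dataset("scores")

    # The dataframe's columns must be compatible with the existing schema
    df = pd.DataFrame({"id": [1, 2], "score": [0.4, 0.9]})
    scores.write_from_dataframe(df)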
- write_metadata(meta)¶
Writes the dataset metadata object.
- write_schema(columns, dropAndCreate=False)¶
Writes the dataset schema into the dataset JSON definition file.
Sometimes, the schema of a dataset being written is known only by the code of the Python script itself. In that case, it can be useful for the Python script to actually modify the schema of the dataset. Obviously, this must be used with caution. 'columns' must be an array of dicts like { 'name': 'column name', 'type': 'column type' }.
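For instance, a sketch of setting the schema explicitly before writing rows (the dataset name, column names, and type names are illustrative assumptions):

    import deepinsight

    predictions = deepinsight.Dataset("predictions")
    predictions.write_schema([
        {"name": "id", "type": "int"},
        {"name": "predicted_label", "type": "string"},
        {"name": "probability", "type": "double"},
    ])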
- write_schema_from_dataframe(df, dropAndCreate=False)¶
- write_with_schema(df, dropAndCreate=False)¶
Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.
This variant replaces the schema of the output dataset with the schema of the dataframe.
Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
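A sketch of the schema-replacing variant, typically the simplest way to write a dataframe to an output dataset (the names are hypothetical):

    import pandas as pd
    import deepinsight

    summary = deepinsight.Dataset("daily_summary")
    df = pd.DataFrame({"day": ["2024-01-01"], "orders": [42]})

    # Replaces the output schema with the dataframe's schema, then writes the rows
    summary.write_with_schema(df)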
- deepinsight.default_project_id()¶
- deepinsight.set_remote_dt(url, api_key, no_check_certificate=False)¶
- deepinsight.get_schema_from_df(df)¶
A simple function that returns a DeepInsight schema from a Pandas dataframe, to be used when writing to a dataset from a dataframe.
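For example, a sketch that derives a schema from a dataframe and applies it before writing, assuming the returned schema can be passed directly to write_schema (this section does not specify the exact return format):

    import pandas as pd
    import deepinsight

    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

    out = deepinsight.Dataset("output_dataset")      # hypothetical name
    schema = deepinsight.get_schema_from_df(df)      # schema derived from df
    out.write_schema(schema)                         # assumption: usable directly here
    out.write_from_dataframe(df)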