abacusai.dataset

Module Contents

Classes

Dataset

A dataset reference

class abacusai.dataset.Dataset(client, datasetId=None, name=None, sourceType=None, dataSource=None, createdAt=None, ignoreBefore=None, ephemeral=None, lookbackDays=None, databaseConnectorId=None, databaseConnectorConfig=None, connectorType=None, featureGroupTableName=None, applicationConnectorId=None, applicationConnectorConfig=None, incremental=None, isAboveDataLimitingThreshold=None, isCdsAvailable=None, isCdsActive=None, schema={}, refreshSchedules={}, latestDatasetVersion={})

Bases: abacusai.return_class.AbstractApiClass

A dataset reference

Parameters
  • client (ApiClient) – An authenticated API Client instance

  • datasetId (str) – The unique identifier of the dataset.

  • name (str) – The user-friendly name of the dataset.

  • sourceType (str) – The source of the dataset: EXTERNAL_SERVICE, UPLOAD, or STREAMING.

  • dataSource (str) – The location of the data. It may be a URI, such as an S3 bucket, or a database table.

  • createdAt (str) – The timestamp at which this dataset was created.

  • ignoreBefore (str) – The timestamp before which all events are ignored when training.

  • ephemeral (bool) – The dataset is ephemeral and not used for training.

  • lookbackDays (int) – Specific to streaming datasets, this specifies how many days' worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system.

  • databaseConnectorId (str) – The Database Connector used.

  • databaseConnectorConfig (dict) – The database connector query used to retrieve data.

  • connectorType (str) – The type of connector used to get this dataset: FILE or DATABASE.

  • featureGroupTableName (str) – The table name of the dataset’s feature group.

  • applicationConnectorId (str) – The Application Connector used.

  • applicationConnectorConfig (dict) – The application connector query used to retrieve data.

  • incremental (bool) – Whether the dataset is incremental.

  • isAboveDataLimitingThreshold (bool) – Whether the dataset has more rows than the threshold limit (1 million).

  • isCdsAvailable (bool) – Whether a custom dataserver (CDS) is available to be deployed.

  • isCdsActive (bool) – Whether a custom dataserver (CDS) is present.

  • latestDatasetVersion (DatasetVersion) – The latest version of this dataset.

  • schema (DatasetColumn) – List of resolved columns.

  • refreshSchedules (RefreshSchedule) – List of schedules that determine when the next version of the dataset will be created.

__repr__(self)

Return repr(self).

to_dict(self)

Get a dict representation of the parameters in this class

Returns

The dict value representation of the class parameters

Return type

dict
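
As a minimal sketch, the dict returned by to_dict() can be serialized for logging or debugging. The helper name dataset_to_json is hypothetical, and default=str is an assumption made here to cope with any nested values that are not JSON-native; the keys in the dict are not specified in this reference.

```python
import json


def dataset_to_json(dataset):
    """Serialize the dict from dataset.to_dict() to a JSON string.

    `dataset` is any object exposing to_dict(), such as a Dataset.
    """
    # Assumption: default=str stringifies values the json module cannot encode.
    return json.dumps(dataset.to_dict(), sort_keys=True, default=str)
```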

create_version_from_file_connector(self, location=None, file_format=None, csv_delimiter=None)

Creates a new version of the specified dataset.

Parameters
  • location (str) – A new external URI to import the dataset from. If not specified, the last location will be used.

  • file_format (str) – The file format to use. If not specified, the service will try to detect the file format.

  • csv_delimiter (str) – If the file format is CSV, the delimiter to use.

Returns

The new Dataset Version created.

Return type

DatasetVersion
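
A possible workflow, sketched under assumptions: create a new version from the file connector, then wait for the import with wait_for_import. The helper name reimport_and_wait is hypothetical, and pairing the two calls this way is a reading of this reference, not a documented recipe.

```python
def reimport_and_wait(dataset, location=None, timeout=900):
    """Create a new dataset version and wait until it has been imported.

    `dataset` is expected to behave like a Dataset object. Leaving
    `location` as None reuses the last location, as described above.
    """
    version = dataset.create_version_from_file_connector(location=location)
    # 900 mirrors the documented default of wait_for_import.
    dataset.wait_for_import(timeout=timeout)
    return version
```

The function returns whatever create_version_from_file_connector returns, which this reference describes as a DatasetVersion.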

create_version_from_database_connector(self, object_name=None, columns=None, query_arguments=None, sql_query=None)

Creates a new version of the specified dataset.

Parameters
  • object_name (str) – If applicable, the name/id of the object in the service to query. If not specified, the last name will be used.

  • columns (str) – The columns to query from the external service object. If not specified, the last columns will be used.

  • query_arguments (str) – Additional query arguments to filter the data. If not specified, the last arguments will be used.

  • sql_query (str) – The full SQL query to use when fetching data. If present, this parameter overrides object_name, columns, and query_arguments.

Returns

The new Dataset Version created.

Return type

DatasetVersion
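
Because sql_query overrides the other three parameters, a caller may prefer to send it alone. The sketch below builds the keyword arguments accordingly; the helper name database_connector_kwargs is hypothetical and not part of the library.

```python
def database_connector_kwargs(object_name=None, columns=None,
                              query_arguments=None, sql_query=None):
    """Build keyword arguments for create_version_from_database_connector."""
    if sql_query is not None:
        # sql_query overrides the other parameters, so pass it on its own.
        return {"sql_query": sql_query}
    candidates = {
        "object_name": object_name,
        "columns": columns,
        "query_arguments": query_arguments,
    }
    # Omitted values are left out so the last-used values apply.
    return {name: value for name, value in candidates.items() if value is not None}


# Hypothetical usage with an existing Dataset object named `dataset`:
# dataset.create_version_from_database_connector(
#     **database_connector_kwargs(sql_query="SELECT * FROM events")
# )
```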

create_version_from_application_connector(self, object_id=None, start_timestamp=None, end_timestamp=None)

Creates a new version of the specified dataset.

Parameters
  • object_id (str) – If applicable, the ID of the object in the service to query. If not specified, the last object ID will be used.

  • start_timestamp (int) – The Unix timestamp of the start of the period that will be queried.

  • end_timestamp (int) – The Unix timestamp of the end of the period that will be queried.

Returns

The new Dataset Version created.

Return type

DatasetVersion
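
The timestamp parameters are integers, so datetime values need converting first. This is a minimal sketch that assumes the service expects Unix time in seconds; the helper names to_unix_seconds and query_window are hypothetical.

```python
from datetime import datetime, timezone


def to_unix_seconds(moment):
    """Convert a timezone-aware datetime to an integer Unix timestamp."""
    if moment.tzinfo is None:
        # Naive datetimes are interpreted in local time, which is easy to get wrong.
        raise ValueError("use a timezone-aware datetime")
    return int(moment.timestamp())


def query_window(start, end):
    """Build start/end keyword arguments for the application connector call."""
    start_ts, end_ts = to_unix_seconds(start), to_unix_seconds(end)
    if start_ts >= end_ts:
        raise ValueError("start must be earlier than end")
    return {"start_timestamp": start_ts, "end_timestamp": end_ts}


# Hypothetical usage with an existing Dataset object named `dataset`:
# window = query_window(datetime(2021, 1, 1, tzinfo=timezone.utc),
#                       datetime(2021, 1, 2, tzinfo=timezone.utc))
# dataset.create_version_from_application_connector(**window)
```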

create_version_from_upload(self, file_format=None)

Creates a new version of the specified dataset using a local file upload.

Parameters

file_format (str) – The file format to use. If not specified, the service will try to detect the file format.

Returns

A token to be used when uploading file parts.

Return type

Upload

snapshot_streaming_data(self)

Snapshots the current data in the streaming dataset for training.

Parameters

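dataset_id (str) – The unique ID associated with the dataset. The method signature takes only self, so the caller does not pass this value; it comes from this Dataset object.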

Returns

The new Dataset Version created.

Return type

DatasetVersion

set_column_data_type(self, column, data_type)

Set a column’s type in a specified dataset.

Parameters
  • column (str) – The name of the column.

  • data_type (str) – The type of the data in the column: INTEGER, FLOAT, STRING, DATE, DATETIME, BOOLEAN, LIST, or STRUCT. Refer to the guide on data types (https://api.abacus.ai/app/help/class/DataType) for more information. Note: Some ColumnMappings will restrict the options or explicitly set the DataType.

Returns

The dataset and schema after the data_type has been set.

Return type

Dataset
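
A client-side check against the data types listed above can catch typos before the request is sent. This is only a sketch: the helper name normalize_data_type is hypothetical, and the service remains the authority on which types a given column accepts.

```python
# The data types named in the parameter description above.
ALLOWED_DATA_TYPES = frozenset({
    "INTEGER", "FLOAT", "STRING", "DATE",
    "DATETIME", "BOOLEAN", "LIST", "STRUCT",
})


def normalize_data_type(data_type):
    """Return the upper-cased data type, or raise ValueError if unknown."""
    candidate = data_type.strip().upper()
    if candidate not in ALLOWED_DATA_TYPES:
        raise ValueError(f"unsupported data type: {data_type!r}")
    return candidate


# Hypothetical usage with an existing Dataset object named `dataset`:
# dataset.set_column_data_type("price", normalize_data_type("float"))
```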

set_streaming_retention_policy(self, retention_hours=None, retention_row_count=None)

Sets the streaming retention policy.

Parameters
  • retention_hours (int) – The number of hours to retain streamed data in memory.

  • retention_row_count (int) – The number of rows of streamed data to retain in memory.
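
Retention windows are often reasoned about in days, while the method takes hours. The sketch below does the conversion; the helper name retention_policy_kwargs and its retention_days parameter are hypothetical.

```python
def retention_policy_kwargs(retention_days=None, retention_row_count=None):
    """Build keyword arguments for set_streaming_retention_policy."""
    kwargs = {}
    if retention_days is not None:
        if retention_days <= 0:
            raise ValueError("retention_days must be positive")
        # The method expects hours, so convert from days.
        kwargs["retention_hours"] = int(retention_days * 24)
    if retention_row_count is not None:
        if retention_row_count <= 0:
            raise ValueError("retention_row_count must be positive")
        kwargs["retention_row_count"] = int(retention_row_count)
    return kwargs


# Hypothetical usage with an existing Dataset object named `dataset`:
# dataset.set_streaming_retention_policy(**retention_policy_kwargs(retention_days=7))
```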

get_schema(self)

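Retrieves the column schema of a dataset.

Parameters

dataset_id (str) – The dataset whose schema to look up. The method signature takes only self, so the caller does not pass this value; it comes from this Dataset object.

Returns

A list of column schema definitions.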

Return type

DatasetColumn

refresh(self)

Calls describe and refreshes the current object’s fields.

Returns

The current object

Return type

Dataset

describe(self)

Retrieves a full description of the specified dataset, with attributes such as its ID, name, source type, etc.

Parameters

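dataset_id (str) – The unique ID associated with the dataset. The method signature takes only self, so the caller does not pass this value; it comes from this Dataset object.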

Returns

The dataset.

Return type

Dataset

list_versions(self, limit=100, start_after_version=None)

Retrieves a list of all dataset versions for the specified dataset.

Parameters
  • limit (int) – The maximum number of dataset versions to return.

  • start_after_version (str) – The ID of the version after which the list starts.

Returns

A list of dataset versions.

Return type

DatasetVersion
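
The limit and start_after_version parameters suggest cursor-style paging. The generator below is a sketch under two assumptions: list_versions returns a list, and passing the ID of the last item of one page yields the next page. Because this reference does not name the attribute that holds a version's ID, the caller supplies it through the hypothetical version_id_of argument.

```python
def iter_all_versions(dataset, version_id_of, page_size=100):
    """Yield every dataset version, fetching one page at a time.

    `version_id_of` is a caller-supplied function that returns the ID
    of a version object.
    """
    start_after = None
    while True:
        page = dataset.list_versions(limit=page_size, start_after_version=start_after)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch.
            return
        start_after = version_id_of(page[-1])
```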

attach_to_project(self, project_id, dataset_type)

[DEPRECATED] Attaches the dataset to the project.

Use this method to attach a dataset that is already in the organization to another project. The dataset type is required to let the AI engine know what type of schema should be used.

Parameters
  • project_id (str) – The project to attach the dataset to.

  • dataset_type (str) – The dataset has to be a type that is associated with the use case of your project. Please see the Use Case Documentation (https://api.abacus.ai/app/help/useCases) for the dataset types that are supported per use case.

Returns

An array of column descriptions.

Return type

Schema

remove_from_project(self, project_id)

[DEPRECATED] Removes a dataset from a project.

Parameters

project_id (str) – The unique ID associated with the project.

rename(self, name)

Rename a dataset.

Parameters

name (str) – The new name for the dataset.

delete(self)

Deletes the specified dataset from the organization.

The dataset cannot be deleted if it is currently attached to a project.

Parameters

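dataset_id (str) – The dataset to delete. The method signature takes only self, so the caller does not pass this value; it comes from this Dataset object.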

wait_for_import(self, timeout=900)

Waits until the dataset has been imported.

Parameters

timeout (int, optional) – The maximum time to wait for the call to finish. If the call does not finish within the allotted time, it times out. The default is 900 milliseconds.

wait_for_inspection(self, timeout=None)

Waits until the dataset has been completely inspected.

Parameters

timeout (int, optional) – The maximum time to wait for the call to finish. If the call does not finish within the allotted time, it times out.

get_status(self)

Gets the status of the latest dataset version.

Returns

A string describing the status of a dataset (importing, inspecting, complete, etc.).

Return type

str
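
Where custom waiting behavior is needed, get_status can be polled directly. The sketch below is not how wait_for_import or wait_for_inspection are implemented. The helper name poll_status is hypothetical, its timeout and interval are in seconds by this helper's own convention, and the set of pending status strings must come from the caller because their exact spelling is not listed in this reference.

```python
import time


def poll_status(dataset, pending_states, timeout=None, interval=5.0,
                sleep=time.sleep, clock=time.monotonic):
    """Poll dataset.get_status() until it leaves `pending_states`.

    Returns the first status outside `pending_states`, or raises
    TimeoutError once `timeout` seconds have passed.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        status = dataset.get_status()
        if status not in pending_states:
            return status
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(f"dataset still in state {status!r}")
        sleep(interval)
```

The sleep and clock arguments exist only so the loop can be exercised without real delays.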

describe_feature_group(self)

Gets the feature group attached to the dataset.

Returns

A feature group object.

Return type

FeatureGroup

create_refresh_policy(self, cron)

Creates a refresh policy for the dataset.

Parameters

cron (str) – A cron-style string that sets the refresh time.

Returns

The refresh policy object.

Return type

RefreshPolicy
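
A cron string for a daily refresh might be built as follows. This sketch assumes the service accepts the conventional five-field cron layout (minute, hour, day of month, month, day of week); this reference does not specify the exact format or the time zone. The helper name daily_cron is hypothetical.

```python
def daily_cron(hour, minute=0):
    """Return a five-field cron expression that fires once a day."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    # Field order: minute, hour, day of month, month, day of week.
    return f"{minute} {hour} * * *"


# Hypothetical usage with an existing Dataset object named `dataset`:
# policy = dataset.create_refresh_policy(daily_cron(6, 30))
```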

list_refresh_policies(self)

Lists the refresh policies for the dataset.

Returns

A list of refresh policy objects.

Return type

List[RefreshPolicy]