abacusai.dataset
Module Contents
Classes
Dataset – A dataset reference
- class abacusai.dataset.Dataset(client, datasetId=None, name=None, sourceType=None, dataSource=None, createdAt=None, ignoreBefore=None, ephemeral=None, lookbackDays=None, databaseConnectorId=None, databaseConnectorConfig=None, connectorType=None, featureGroupTableName=None, applicationConnectorId=None, applicationConnectorConfig=None, incremental=None, isAboveDataLimitingThreshold=None, isCdsAvailable=None, isCdsActive=None, schema={}, refreshSchedules={}, latestDatasetVersion={})
Bases: abacusai.return_class.AbstractApiClass
A dataset reference
- Parameters
client (ApiClient) – An authenticated API Client instance
datasetId (str) – The unique identifier of the dataset.
name (str) – The user-friendly name of the dataset.
sourceType (str) – The source of the dataset: EXTERNAL_SERVICE, UPLOAD, or STREAMING.
dataSource (str) – Location of data. It may be a URI such as an s3 bucket or the database table.
createdAt (str) – The timestamp at which this dataset was created.
ignoreBefore (str) – The cutoff timestamp; all events before it are ignored when training.
ephemeral (bool) – The dataset is ephemeral and not used for training.
lookbackDays (int) – Specific to streaming datasets, this specifies how many days' worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system.
databaseConnectorId (str) – The Database Connector used.
databaseConnectorConfig (dict) – The database connector query used to retrieve data.
connectorType (str) – The type of connector used to get this dataset: FILE or DATABASE.
featureGroupTableName (str) – The table name of the dataset’s feature group.
applicationConnectorId (str) – The Application Connector used.
applicationConnectorConfig (dict) – The application connector query used to retrieve data.
incremental (bool) – Whether the dataset is an incremental dataset.
isAboveDataLimitingThreshold (bool) – Whether the dataset has more rows than the threshold limit (1 million).
isCdsAvailable (bool) – Whether a custom dataserver (CDS) is available to be deployed.
isCdsActive (bool) – Whether a custom dataserver (CDS) is present.
latestDatasetVersion (DatasetVersion) – The latest version of this dataset.
schema (DatasetColumn) – List of resolved columns.
refreshSchedules (RefreshSchedule) – List of schedules that determine when the next version of the dataset will be created.
- __repr__(self)
Return repr(self).
- to_dict(self)
Get a dict representation of the parameters in this class
- Returns
The dict value representation of the class parameters
- Return type
dict
- create_version_from_file_connector(self, location=None, file_format=None, csv_delimiter=None)
Creates a new version of the specified dataset.
- Parameters
location (str) – A new external URI to import the dataset from. If not specified, the last location will be used.
file_format (str) – The file format to use. If not specified, the service will try to detect the file format.
csv_delimiter (str) – If the file format is CSV, use a specific csv delimiter.
- Returns
The new Dataset Version created.
- Return type
DatasetVersion
- create_version_from_database_connector(self, object_name=None, columns=None, query_arguments=None, sql_query=None)
Creates a new version of the specified dataset.
- Parameters
object_name (str) – If applicable, the name/id of the object in the service to query. If not specified, the last name will be used.
columns (str) – The columns to query from the external service object. If not specified, the last columns will be used.
query_arguments (str) – Additional query arguments to filter the data. If not specified, the last arguments will be used.
sql_query (str) – The full SQL query to use when fetching data. If present, this parameter overrides object_name, columns, and query_arguments.
- Returns
The new Dataset Version created.
- Return type
DatasetVersion
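Because a full SQL query takes precedence over the other three parameters, one way to keep calls unambiguous is to send either the query alone or the individual pieces. The sketch below illustrates that choice; `build_query_kwargs`, the table name, and the comma-separated column string are assumptions made for illustration, not part of the library.

```python
def build_query_kwargs(object_name=None, columns=None, query_arguments=None, sql_query=None):
    """Return keyword arguments for create_version_from_database_connector."""
    if sql_query is not None:
        # A full SQL query overrides the other three parameters, so send it alone.
        return {"sql_query": sql_query}
    candidate = {
        "object_name": object_name,
        "columns": columns,
        "query_arguments": query_arguments,
    }
    # Unset values are omitted so the previous version's values are reused.
    return {key: value for key, value in candidate.items() if value is not None}


kwargs = build_query_kwargs(object_name="orders", sql_query="SELECT id, total FROM orders")
# With a real Dataset object: dataset.create_version_from_database_connector(**kwargs)
```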
- create_version_from_application_connector(self, object_id=None, start_timestamp=None, end_timestamp=None)
Creates a new version of the specified dataset.
- Parameters
object_id (str) – If applicable, the ID of the object in the service to query. If not specified, the last ID will be used.
start_timestamp (int) – The Unix timestamp of the start of the period that will be queried.
end_timestamp (int) – The Unix timestamp of the end of the period that will be queried.
- Returns
The new Dataset Version created.
- Return type
DatasetVersion
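The two timestamp parameters are integers, so a query window has to be converted from calendar dates first. The sketch below computes a seven-day window, assuming the service expects Unix time in seconds (the text above does not say whether seconds or milliseconds are meant). The helper `unix_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def unix_window(end, days):
    """Return (start_timestamp, end_timestamp) in whole Unix seconds for a window ending at `end`."""
    start = end - timedelta(days=days)
    return int(start.timestamp()), int(end.timestamp())


# Use a timezone-aware datetime so the conversion does not depend on the local zone.
end = datetime(2024, 1, 8, tzinfo=timezone.utc)
start_timestamp, end_timestamp = unix_window(end, days=7)
# With a real Dataset object:
# dataset.create_version_from_application_connector(
#     start_timestamp=start_timestamp, end_timestamp=end_timestamp)
```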
- create_version_from_upload(self, file_format=None)
Creates a new version of the specified dataset using a local file upload.
- snapshot_streaming_data(self)
Snapshots the current data in the streaming dataset for training.
- Parameters
dataset_id (str) – The unique ID associated with the dataset.
- Returns
The new Dataset Version created.
- Return type
DatasetVersion
- set_column_data_type(self, column, data_type)
Set a column’s type in a specified dataset.
- Parameters
column (str) – The name of the column.
data_type (str) – The type of the data in the column: INTEGER, FLOAT, STRING, DATE, DATETIME, BOOLEAN, LIST, or STRUCT. Refer to the guide on data types (https://api.abacus.ai/app/help/class/DataType) for more information. Note: Some ColumnMappings will restrict the options or explicitly set the DataType.
- Returns
The dataset and schema after the data_type has been set
- Return type
Dataset
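Since `data_type` is passed as a plain string, a typo only surfaces once the request reaches the service. A small client-side check against the values listed above can catch it earlier. This is a sketch: `normalize_data_type` and the column name in the comment are hypothetical, and the service may accept or reject values differently.

```python
ALLOWED_DATA_TYPES = {"INTEGER", "FLOAT", "STRING", "DATE", "DATETIME", "BOOLEAN", "LIST", "STRUCT"}


def normalize_data_type(data_type):
    """Upper-case a data type name and check it against the documented values."""
    normalized = data_type.strip().upper()
    if normalized not in ALLOWED_DATA_TYPES:
        raise ValueError(f"Unsupported data type: {data_type!r}")
    return normalized


data_type = normalize_data_type("datetime")
# With a real Dataset object: dataset.set_column_data_type("created_at", data_type)
```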
- set_streaming_retention_policy(self, retention_hours=None, retention_row_count=None)
Sets the streaming retention policy
- get_schema(self)
Retrieves the column schema of a dataset
- Parameters
dataset_id (str) – The Dataset schema to lookup.
- Returns
List of Column schema definitions
- Return type
- refresh(self)
Calls describe and refreshes the current object’s fields
- Returns
The current object
- Return type
Dataset
- describe(self)
Retrieves a full description of the specified dataset, with attributes such as its ID, name, source type, etc.
- list_versions(self, limit=100, start_after_version=None)
Retrieves a list of all dataset versions for the specified dataset.
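- Parameters
limit (int) – The maximum number of dataset versions to return. Defaults to 100.
start_after_version (str) – The dataset version after which the list starts.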
- Returns
A list of dataset versions.
- Return type
List[DatasetVersion]
- attach_to_project(self, project_id, dataset_type)
[DEPRECATED] Attaches the dataset to the project.
Use this method to attach a dataset that is already in the organization to another project. The dataset type is required to let the AI engine know what type of schema should be used.
- Parameters
project_id (str) – The project to attach the dataset to.
dataset_type (str) – The dataset type must be one that is associated with the use case of your project. See the Use Case Documentation (https://api.abacus.ai/app/help/useCases) for the dataset types supported per use case.
- Returns
An array of column descriptions.
- Return type
- remove_from_project(self, project_id)
[DEPRECATED] Removes a dataset from a project.
- Parameters
project_id (str) – The unique ID associated with the project.
- delete(self)
Deletes the specified dataset from the organization.
The dataset cannot be deleted if it is currently attached to a project.
- Parameters
dataset_id (str) – The dataset to delete.
- wait_for_import(self, timeout=900)
Waits until the dataset has been imported.
- Parameters
timeout (int, optional) – The maximum time to wait for the call to finish. If the import has not finished within this time, the call times out. The default is 900 seconds.
- wait_for_inspection(self, timeout=None)
Waits until the dataset has been completely inspected.
- Parameters
timeout (int, optional) – The maximum time to wait for the call to finish. If the inspection has not finished within this time, the call times out.
- get_status(self)
Gets the status of the latest dataset version.
- Returns
A string describing the status of a dataset (importing, inspecting, complete, etc.).
- Return type
str
- describe_feature_group(self)
Gets the feature group attached to the dataset.
- Returns
A feature group object.
- Return type
- create_refresh_policy(self, cron)
Creates a refresh policy for the dataset.
- Parameters
cron (str) – A cron-style string that sets the refresh time.
- Returns
The refresh policy object.
- Return type
RefreshPolicy
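To make the `cron` argument concrete, the sketch below builds a daily schedule string. It assumes the conventional five-field layout (minute, hour, day of month, month, day of week); the text above does not state which cron dialect the service accepts, and `daily_cron` is a hypothetical helper.

```python
def daily_cron(hour, minute=0):
    """Build a five-field cron expression that fires once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"{minute} {hour} * * *"


cron = daily_cron(hour=2, minute=30)
# With a real Dataset object: policy = dataset.create_refresh_policy(cron)
```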
- list_refresh_policies(self)
Gets the refresh policies in a list.
- Returns
A list of refresh policy objects.
- Return type
List[RefreshPolicy]