kedro.io.CSVS3DataSet

class kedro.io.CSVS3DataSet(filepath, bucket_name, credentials=None, load_args=None, save_args=None, version=None)

CSVS3DataSet loads data from and saves data to a CSV file in S3. It uses s3fs to read from and write to S3, and pandas to parse and serialize the CSV file.

Example:

from kedro.io import CSVS3DataSet
import pandas as pd

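data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
                     'col3': [5, 6]})

# Point the data set at test.csv inside the S3 bucket "test_bucket".
# index=False keeps the DataFrame index out of the file, so the
# reloaded frame has the same columns as the original.
data_set = CSVS3DataSet(filepath="test.csv",
                        bucket_name="test_bucket",
                        load_args=None,
                        save_args={"index": False})
data_set.save(data)          # write the DataFrame to S3 as CSV
reloaded = data_set.load()   # read it back into a new DataFrame

assert data.equals(reloaded)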
__init__(filepath, bucket_name, credentials=None, load_args=None, save_args=None, version=None)

Creates a new instance of CSVS3DataSet pointing to a concrete csv file on S3.

Parameters:
  • filepath (str) – Path to a csv file.
  • bucket_name (str) – S3 bucket name.
  • credentials (Optional[Dict[str, Any]]) – Credentials to access the S3 bucket, such as aws_access_key_id and aws_secret_access_key.
  • load_args (Optional[Dict[str, Any]]) – Pandas options for loading CSV files. All available arguments are listed at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html All defaults are preserved.
  • save_args (Optional[Dict[str, Any]]) – Pandas options for saving CSV files. All available arguments are listed at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html All defaults are preserved except “index”, which is set to False.
  • version (Optional[Version]) – If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version is loaded. If its save attribute is None, the save version is autogenerated.
Return type: None
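
The default handling described for save_args and version can be sketched in plain Python. This is an illustration under stated assumptions, not kedro's actual implementation: Version is modelled as a (load, save) named tuple standing in for kedro.io.core.Version, and the helper names (merge_save_args, resolve_load_version, resolve_save_version) and the timestamp format are hypothetical.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Stand-in for kedro.io.core.Version: a (load, save) pair (assumed shape).
Version = namedtuple("Version", ["load", "save"])

# Default taken from the save_args description: "index" is set to False.
DEFAULT_SAVE_ARGS = {"index": False}


def merge_save_args(save_args=None):
    """Overlay user options on the defaults; user-supplied keys win."""
    return {**DEFAULT_SAVE_ARGS, **(save_args or {})}


def resolve_load_version(version, existing_versions):
    """Return the pinned load version, or the latest existing one if unset."""
    if version.load is not None:
        return version.load
    # Timestamp-style names sort lexicographically, so max() is the newest.
    return max(existing_versions)


def resolve_save_version(version, now=None):
    """Return the pinned save version, or a generated timestamp if unset."""
    if version.save is not None:
        return version.save
    now = now or datetime.now(timezone.utc)
    # Hypothetical timestamp format, chosen only for this sketch.
    return now.strftime("%Y-%m-%dT%H.%M.%S.%fZ")


merged = merge_save_args({"sep": ";"})
existing = ["2019-01-01T00.00.00.000000Z", "2019-02-01T00.00.00.000000Z"]
latest = resolve_load_version(Version(None, None), existing)
pinned = resolve_load_version(Version(existing[0], None), existing)
generated = resolve_save_version(Version(None, None))
```

Here merged keeps "index" set to False alongside the user-supplied separator, latest is the newer of the two existing versions, and pinned is the explicitly requested one.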

Methods

  • __init__(filepath, bucket_name[, …]) – Creates a new instance of CSVS3DataSet pointing to a concrete csv file on S3.
  • exists() – Checks whether a data set’s output already exists by calling the provided _exists() method.
  • from_config(name, config[, load_version, …]) – Create a data set instance using the configuration provided.
  • load() – Loads data by delegation to the provided load method.
  • save(data) – Saves data by delegation to the provided save method.
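
The delegation mentioned in the method summaries can be illustrated with a small in-memory stand-in. This is a sketch only: InMemoryDataSet is a hypothetical class, not part of kedro, and the hook names _load and _save are assumed by analogy with the _exists() method named above.

```python
class InMemoryDataSet:
    """Hypothetical stand-in: public methods delegate to private hooks."""

    def __init__(self):
        self._data = None

    # Hooks that a concrete data set would implement.
    def _load(self):
        return self._data

    def _save(self, data):
        self._data = data

    def _exists(self):
        return self._data is not None

    # Public API: thin wrappers that delegate to the hooks above.
    def load(self):
        return self._load()

    def save(self, data):
        self._save(data)

    def exists(self):
        return self._exists()


data_set = InMemoryDataSet()
existed_before = data_set.exists()   # nothing has been saved yet
data_set.save([{"col1": 1, "col2": 4}])
existed_after = data_set.exists()    # output now exists
reloaded = data_set.load()
```

With this shape, callers can check exists() before load() without knowing how a particular data set stores its output.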