Kedro 0.17 Documentation

Introduction

  • What is Kedro?
    • Learn how to use Kedro
    • Assumptions
      • Official Python programming language website
      • List of free programming books and tutorials

Get Started

  • Installation prerequisites
    • Virtual environments
      • conda
      • venv (instead of conda)
      • pipenv (instead of conda)
  • Install Kedro
    • Install a development version
  • A “Hello World” example
    • Node
    • Pipeline
    • DataCatalog
    • Runner
    • Hello Kedro!
  • Create a new project
    • Create a new project interactively
    • Create a new project from a configuration file
    • Initialise a git repository
  • Iris dataset example project
    • Create the example project
      • Project directory structure
        • conf/
        • data/
        • src/
      • What best practice should I follow to avoid leaking confidential data?
    • Run the example project
    • Under the hood: Pipelines and nodes
  • Kedro starters
    • How to use Kedro starters
      • Starter aliases
    • List of official starters
    • Starter versioning
    • Use a starter in interactive mode
    • Use a starter with a configuration file
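
The "Hello World" entries above introduce Kedro's four core abstractions: Node, Pipeline, DataCatalog and Runner. As a minimal sketch of how they fit together (assuming Kedro 0.17 is installed; the dataset and node names are illustrative, not the documentation's verbatim example):

    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import SequentialRunner

    # A node wraps a plain Python function with named inputs and outputs.
    def return_greeting():
        return "Hello"

    def join_statements(greeting):
        return f"{greeting} Kedro!"

    greeting_node = node(return_greeting, inputs=None, outputs="my_salutation")
    message_node = node(join_statements, inputs="my_salutation", outputs="my_message")

    # A pipeline resolves execution order from each node's inputs and outputs.
    pipeline = Pipeline([greeting_node, message_node])

    # The DataCatalog maps dataset names to dataset implementations.
    catalog = DataCatalog({"my_salutation": MemoryDataSet()})

    # A runner executes the pipeline against the catalog.
    print(SequentialRunner().run(pipeline, catalog))  # {'my_message': 'Hello Kedro!'}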

Tutorial

  • Kedro spaceflights tutorial
    • Kedro project development workflow
      • 1. Set up the project template
      • 2. Set up the data
      • 3. Create the pipeline
      • 4. Package the project
    • Optional: Git workflow
      • Creating a project repository
      • Submitting your changes to GitHub
  • Set up the spaceflights project
    • Create a new project
    • Install project dependencies
      • Add and remove project-specific dependencies
    • kedro install
    • Configure the project
  • Set up the data
    • Add your datasets to data
      • reviews.csv
      • companies.csv
      • shuttles.xlsx
    • Register the datasets
      • csv
      • xlsx
    • Custom data
  • Create a pipeline
    • Data engineering pipeline
      • Node functions
      • Assemble nodes into the data engineering pipeline
      • Update the project pipeline
      • Test the example
      • Persist pre-processed data
      • Extend the data engineering pipeline
      • Test the example
    • Data science pipeline
      • Update dependencies
      • Create a data science node
      • Configure the input parameters
      • Register the dataset
      • Assemble the data science pipeline
      • Update the project pipeline
      • Test the pipelines
    • Kedro runners
    • Slice a pipeline
  • Packaging a project
    • Add documentation to your project
    • Package your project
      • Docker and Airflow
  • Visualise pipelines
    • Use Kedro-Viz
      • Install Kedro-Viz
      • Visualise a whole pipeline
      • Exit an open visualisation
      • Interact with the Data Engineering Convention
      • Share a pipeline
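
The spaceflights tutorial above registers companies.csv, reviews.csv and shuttles.xlsx in conf/base/catalog.yml. An equivalent Code API sketch (the paths assume the tutorial's data/01_raw layout; this is not the tutorial's verbatim YAML):

    from kedro.extras.datasets.pandas import CSVDataSet, ExcelDataSet
    from kedro.io import DataCatalog

    catalog = DataCatalog(
        {
            "companies": CSVDataSet(filepath="data/01_raw/companies.csv"),
            "reviews": CSVDataSet(filepath="data/01_raw/reviews.csv"),
            "shuttles": ExcelDataSet(filepath="data/01_raw/shuttles.xlsx"),
        }
    )

    companies = catalog.load("companies")  # returns a pandas DataFrame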

Kedro Project Setup

  • Dependencies
    • Project-specific dependencies
    • kedro install
    • Workflow dependencies
      • Install dependencies related to the Data Catalog
        • Install dependencies at a group-level
        • Install dependencies at a type-level
  • Configuration
    • Local and base configuration
    • Loading
    • Additional configuration environments
    • Templating configuration
      • Jinja2 support
    • Parameters
      • Loading parameters
      • Specifying parameters at runtime
      • Using parameters
    • Credentials
      • AWS credentials
    • Configuring kedro run arguments
  • Lifecycle management with KedroSession
    • Overview
    • Create a session
  • The mini-kedro Kedro starter
    • Introduction
    • Usage
    • Content
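
For the KedroSession lifecycle entries above, a minimal sketch of creating and using a session in 0.17 ("my_project" is a placeholder package name for an existing Kedro project):

    from kedro.framework.session import KedroSession

    with KedroSession.create("my_project") as session:
        context = session.load_context()  # access the catalog, parameters, etc.
        session.run()                     # execute the default pipeline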

Data Catalog

  • The Data Catalog
    • Using the Data Catalog within Kedro configuration
    • Specifying the location of the dataset
    • Data Catalog *_args parameters
    • Using the Data Catalog with the YAML API
    • Creating a Data Catalog YAML configuration file via CLI
    • Adding parameters
    • Feeding in credentials
    • Loading multiple datasets that have similar configuration
    • Transcoding datasets
      • A typical example of transcoding
      • How does transcoding work?
    • Transforming datasets
      • Applying built-in transformers
      • Transformer scope
    • Versioning datasets and ML models
    • Using the Data Catalog with the Code API
      • Configuring a Data Catalog
      • Loading datasets
        • Behind the scenes
      • Viewing the available data sources
      • Saving data
        • Saving data to memory
        • Saving data to a SQL database for querying
        • Saving data in Parquet
  • Kedro IO
    • Error handling
    • AbstractDataSet
    • Versioning
      • version namedtuple
      • Versioning using the YAML API
      • Versioning using the Code API
      • Supported datasets
    • Partitioned dataset
      • Partitioned dataset definition
        • Dataset definition
        • Partitioned dataset credentials
      • Partitioned dataset load
      • Partitioned dataset save
      • Incremental loads with IncrementalDataSet
        • Incremental dataset load
        • Incremental dataset save
        • Incremental dataset confirm
        • Checkpoint configuration
        • Special checkpoint config keys
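
For the versioning entries above, a sketch of the Code API with the Version namedtuple (the file path is illustrative): Version(load=None, save=None) loads the latest version and saves a new timestamped one.

    import pandas as pd

    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import DataCatalog, Version

    cars = CSVDataSet(
        filepath="data/01_raw/cars.csv",
        version=Version(load=None, save=None),
    )
    catalog = DataCatalog({"cars": cars})

    catalog.save("cars", pd.DataFrame({"speed": [1, 2]}))  # writes data/01_raw/cars.csv/<timestamp>/cars.csv
    latest = catalog.load("cars")                          # reads the latest version back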

Nodes and Pipelines

  • Nodes
    • How to create a node
      • Node definition syntax
      • Syntax for input variables
      • Syntax for output variables
    • How to tag a node
    • How to run a node
  • Pipelines
    • How to build a pipeline
      • How to tag a pipeline
      • How to merge multiple pipelines
      • Information about the nodes in a pipeline
      • Information about pipeline inputs and outputs
    • Bad pipelines
      • Pipeline with bad nodes
      • Pipeline with circular dependencies
  • Modular pipelines
    • What are modular pipelines?
    • How do I create a modular pipeline?
    • Recommendations
    • How to share a modular pipeline
      • Package a modular pipeline
      • Pull a modular pipeline
    • A modular pipeline example template
      • Configuration
      • Datasets
    • How to connect existing pipelines
    • How to use a modular pipeline twice
    • How to use a modular pipeline with different parameters
    • How to clean up a modular pipeline
  • Run a pipeline
    • Runners
      • SequentialRunner
      • ParallelRunner
        • Multiprocessing
        • Multithreading
    • Custom runners
    • Load and save asynchronously
    • Run a pipeline by name
    • Run pipelines with IO
    • Output to a file
  • Slice a pipeline
    • Slice a pipeline by providing inputs
    • Slice a pipeline by specifying nodes
    • Slice a pipeline by specifying final nodes
    • Slice a pipeline with tagged nodes
    • Slice a pipeline by running specified nodes
    • How to recreate missing outputs
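
As an illustrative sketch of the tagging and slicing entries above (the functions and dataset names are placeholders):

    from kedro.pipeline import Pipeline, node

    def clean(raw):
        return raw

    def train(clean_data):
        return "model"

    pipeline = Pipeline(
        [
            node(clean, inputs="raw_data", outputs="clean_data", tags="de"),
            node(train, inputs="clean_data", outputs="model", tags="ds"),
        ]
    )

    # Slicing returns a new, smaller Pipeline rather than mutating the original.
    de_only = pipeline.only_nodes_with_tags("de")
    up_to_clean = pipeline.to_outputs("clean_data")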

Extend Kedro

  • Common use cases
    • Use Case 1: How to add extra behaviour to Kedro’s execution timeline
    • Use Case 2: How to integrate Kedro with additional data sources
    • Use Case 3: How to add CLI commands that are reusable across projects
    • Use Case 4: How to customise the initial boilerplate of your project
  • Hooks
    • Introduction
    • Concepts
      • Hook specification
        • Execution timeline Hooks
        • Registration Hooks
      • Hook implementation
        • Registering your Hook implementations with Kedro
        • Disable auto-registered plugins’ Hooks
    • Common use cases
      • Use Hooks to extend a node’s behaviour
      • Use Hooks to customise the dataset load and save methods
    • Under the hood
    • Hooks examples
  • Custom datasets
    • Scenario
    • Project setup
    • The anatomy of a dataset
    • Implement the _load method with fsspec
    • Implement the _save method with fsspec
    • Implement the _describe method
    • The complete example
    • Integration with PartitionedDataSet
    • Versioning
    • Thread-safety
    • How to handle credentials and different filesystems
    • How to contribute a custom dataset implementation
  • Kedro plugins
    • Overview
    • Example of a simple plugin
    • Working with click
    • Project context
    • Initialisation
    • global and project commands
    • Suggested command convention
    • Hooks
    • Contributing process
    • Supported Kedro plugins
    • Community-developed plugins
  • Create a Kedro starter
    • How to create a Kedro starter
    • Configuration variables
      • Example Kedro starter
  • Dataset transformers (deprecated)
    • Develop your own dataset transformer
  • Decorators (deprecated)
    • How to apply a decorator to nodes
    • How to apply multiple decorators to nodes
    • How to apply a decorator to a pipeline
    • Kedro decorators
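
A minimal Hook implementation sketch for the execution-timeline entries above (the logging body is illustrative; in 0.17 the class is typically registered via the HOOKS tuple in the project's settings.py):

    import logging

    from kedro.framework.hooks import hook_impl

    class ProjectHooks:
        @hook_impl
        def before_node_run(self, node, inputs):
            # Called on Kedro's execution timeline just before each node runs.
            logging.getLogger(__name__).info("About to run node: %s", node.name)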

Logging

  • Logging
    • Configure logging
    • Use logging
    • Logging for anyconfig
  • Journal
    • Overview
      • Context journal record
      • Dataset journal record
    • Steps to manually reproduce your code and run the previous pipeline
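
For the "Use logging" entry above, project code uses Python's standard logging module, which Kedro configures from conf/base/logging.yml; a generic sketch:

    import logging

    log = logging.getLogger(__name__)
    log.warning("Issue warning")
    log.info("Send information")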

Development

  • Set up Visual Studio Code
    • Advanced: For those using venv / virtualenv
    • Setting up tasks
    • Debugging
      • Advanced: Remote Interpreter / Debugging
    • Configuring the Kedro catalog validation schema
  • Set up PyCharm
    • Set up Run configurations
    • Debugging
    • Advanced: Remote SSH interpreter
    • Configuring the Kedro catalog validation schema
  • Kedro’s command line interface
    • Autocompletion (optional)
    • Invoke Kedro CLI from Python (optional)
    • Kedro commands
    • Global Kedro commands
      • Get help on Kedro commands
      • Confirm the Kedro version
      • Confirm Kedro information
      • Create a new Kedro project
      • Open the Kedro documentation in your browser
    • Project-specific Kedro commands
      • Project setup
        • Build the project’s dependency tree
        • Install all package dependencies
      • Run the project
        • Modifying a kedro run
      • Deploy the project
      • Pull a modular pipeline
      • Project quality
        • Build the project documentation
        • Lint your project
        • Test your project
      • Project development
        • Modular pipelines
        • Datasets
        • Data Catalog
        • Notebooks
  • Linting your Kedro project
  • Debugging
    • Introduction
    • Debugging Node
    • Debugging Pipeline

Deployment

  • Deployment guide
    • Deployment choices
  • Single-machine deployment
    • Container based
      • How to use a container registry
    • Package based
    • CLI based
      • Use GitHub workflow to copy your project
      • Install and run the Kedro project
  • Distributed deployment
    • 1. Containerise the pipeline
    • 2. Convert your Kedro pipeline into the target platform’s primitives
    • 3. Parameterise the runs
    • 4. (Optional) Create starters
  • Deployment with Argo Workflows
    • Why would you use Argo Workflows?
    • Prerequisites
    • How to run your Kedro pipeline using Argo Workflows
      • Containerise your Kedro project
      • Create Argo Workflows spec
      • Submit Argo Workflows spec to Kubernetes
      • Kedro-Argo plugin
  • Deployment with Prefect
    • Prerequisites
    • How to run your Kedro pipeline using Prefect
      • Convert your Kedro pipeline to Prefect flow
      • Run Prefect flow
  • Deployment with Kubeflow Pipelines
    • Why would you use Kubeflow Pipelines?
    • Prerequisites
    • How to run your Kedro pipeline using Kubeflow Pipelines
      • Containerise your Kedro project
      • Create a workflow spec
      • Authenticate Kubeflow Pipelines
      • Upload workflow spec and execute runs
  • Deployment with AWS Batch
    • Why would you use AWS Batch?
    • Prerequisites
    • How to run a Kedro pipeline using AWS Batch
      • Containerise your Kedro project
      • Provision resources
        • Create IAM Role
        • Create AWS Batch job definition
        • Create AWS Batch compute environment
        • Create AWS Batch job queue
      • Configure the credentials
      • Submit AWS Batch jobs
        • Create a custom runner
        • Set up Batch-related configuration
        • Update CLI implementation
      • Deploy
  • Deployment to a Databricks cluster
    • Prerequisites
    • Run the Kedro project with Databricks Connect
      • 1. Project setup
      • 2. Install dependencies and run locally
      • 3. Create a Databricks cluster
      • 4. Install Databricks Connect
      • 5. Configure Databricks Connect
      • 6. Copy local data into DBFS
      • 7. Run the project
    • Run Kedro project from a Databricks notebook
      • Extra requirements
      • 1. Create Kedro project
      • 2. Create GitHub personal access token
      • 3. Create a GitHub repository
      • 4. Push Kedro project to the GitHub repository
      • 5. Configure the Databricks cluster
      • 6. Run your Kedro project from the Databricks notebook
  • How to integrate Amazon SageMaker into your Kedro pipeline
    • Why would you use Amazon SageMaker?
    • Prerequisites
    • Prepare the environment
      • Install SageMaker package dependencies
      • Create SageMaker execution role
      • Create S3 bucket
    • Update the Kedro project
      • Create the configuration environment
      • Update the project hooks
      • Update the Data Science pipeline
        • Create node functions
        • Update the pipeline definition
      • Create the SageMaker entry point
    • Run the project
    • Cleanup

Tools Integration

  • Build a Kedro pipeline with PySpark
    • Centralise Spark configuration in conf/base/spark.yml
    • Initialise a SparkSession in custom project context class
    • Use Kedro’s built-in Spark datasets to load and save raw data
      • spark.SparkDataSet
      • spark.SparkJDBCDataSet
      • spark.SparkHiveDataSet
    • Use MemoryDataSet for intermediary DataFrames
    • Use MemoryDataSet with copy_mode="assign" for non-DataFrame Spark objects
    • Tips for maximising concurrency using ThreadRunner
  • Use Kedro with IPython and Jupyter Notebooks/Lab
    • Why use a Notebook?
    • Kedro and IPython
      • Load DataCatalog in IPython
        • Dataset versioning
    • Kedro and Jupyter
    • How to use context
      • Run the pipeline
      • Parameters
      • Load/Save DataCatalog in Jupyter
      • Additional parameters for session.run()
    • Global variables
    • Convert functions from Jupyter Notebooks into Kedro nodes
    • IPython loader
      • Installation
      • Prerequisites
      • Troubleshooting and FAQs
        • How can I stop my notebook terminating?
        • Why can’t I run kedro jupyter notebook?
        • How can I reload the session, context, catalog and startup_error variables?
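
Inside a notebook launched with kedro jupyter notebook, the startup script injects the context, session and catalog variables; a sketch of typical cell contents (the dataset name is a placeholder, and this only runs inside a Kedro-launched notebook):

    # context, session and catalog are pre-loaded by Kedro's IPython startup script.
    df = catalog.load("companies")            # load a registered dataset
    params = context.params                   # read parameters from conf/
    session.run(pipeline_name="__default__")  # run a pipeline by name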

FAQs

  • Frequently asked questions
    • What is Kedro?
    • Who maintains Kedro?
    • What are the primary advantages of Kedro?
    • How does Kedro compare to other projects?
    • What is the data engineering convention?
    • How do I upgrade Kedro?
    • How can I use a development version of Kedro?
    • How can I find out more about Kedro?
    • How can I get my question answered?
  • Kedro architecture overview
    • Building blocks
      • Project
        • cli.py
        • run.py
        • pyproject.toml
        • settings.py
        • 00-kedro-init.py
      • Framework
        • kedro cli
        • kedro/cli/cli.py
        • plugins
        • get_project_context()
        • load_context()
        • KedroSession
        • KedroContext
      • Library
        • ConfigLoader
        • Pipeline
        • AbstractRunner
        • DataCatalog
        • AbstractDataSet
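
For the Framework building blocks above, a sketch of loading a project programmatically with load_context (the path is a placeholder):

    from kedro.framework.context import load_context

    context = load_context("path/to/project")  # a Kedro project root directory
    catalog = context.catalog                  # the project's DataCatalog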

Resources

  • Images and icons
    • White background
      • Icon
      • Icon with text
    • Black background
      • Icon
      • Icon with text

API Docs

  • kedro
    • kedro.config
      • kedro.config.ConfigLoader
      • kedro.config.TemplatedConfigLoader
    • kedro.framework.hooks
      • Data Catalog Hooks
        • kedro.framework.hooks.specs.DataCatalogSpecs
      • Node Hooks
        • kedro.framework.hooks.specs.NodeSpecs
      • Pipeline Hooks
        • kedro.framework.hooks.specs.PipelineSpecs
    • kedro.io
      • Data Catalog
        • kedro.io.DataCatalog
      • Data Sets
        • kedro.io.LambdaDataSet
        • kedro.io.MemoryDataSet
        • kedro.io.PartitionedDataSet
        • kedro.io.IncrementalDataSet
        • kedro.io.CachedDataSet
        • kedro.io.DataCatalogWithDefault
      • Errors
        • kedro.io.DataSetAlreadyExistsError
        • kedro.io.DataSetError
        • kedro.io.DataSetNotFoundError
      • Base Classes
        • kedro.io.AbstractDataSet
        • kedro.io.AbstractVersionedDataSet
        • kedro.io.AbstractTransformer
        • kedro.io.Version
    • kedro.pipeline
      • kedro.pipeline.Pipeline
      • kedro.pipeline.node.Node
      • kedro.pipeline.node
      • kedro.pipeline.decorators.log_time
    • kedro.runner
      • kedro.runner.AbstractRunner
      • kedro.runner.SequentialRunner
      • kedro.runner.ParallelRunner
      • kedro.runner.ThreadRunner
    • kedro.framework.context
      • Base Classes
        • kedro.framework.context.KedroContext
      • Functions
        • kedro.framework.context.load_context
      • Errors
        • kedro.framework.context.KedroContextError
    • kedro.framework.cli
      • kedro.framework.cli.get_project_context
    • kedro.framework.startup
      • Base Classes
        • kedro.framework.startup.ProjectMetadata
    • kedro.versioning
      • Base Classes
        • kedro.versioning.Journal
      • Modules
      • Errors
    • kedro.extras.datasets
      • Data Sets
        • kedro.extras.datasets.api.APIDataSet
        • kedro.extras.datasets.biosequence.BioSequenceDataSet
        • kedro.extras.datasets.dask.ParquetDataSet
        • kedro.extras.datasets.geopandas.GeoJSONDataSet
        • kedro.extras.datasets.matplotlib.MatplotlibWriter
        • kedro.extras.datasets.holoviews.HoloviewsWriter
        • kedro.extras.datasets.networkx.NetworkXDataSet
        • kedro.extras.datasets.pandas.CSVDataSet
        • kedro.extras.datasets.pandas.ExcelDataSet
        • kedro.extras.datasets.pandas.AppendableExcelDataSet
        • kedro.extras.datasets.pandas.FeatherDataSet
        • kedro.extras.datasets.pandas.GBQTableDataSet
        • kedro.extras.datasets.pandas.HDFDataSet
        • kedro.extras.datasets.pandas.JSONDataSet
        • kedro.extras.datasets.pandas.ParquetDataSet
        • kedro.extras.datasets.pandas.SQLQueryDataSet
        • kedro.extras.datasets.pandas.SQLTableDataSet
        • kedro.extras.datasets.pickle.PickleDataSet
        • kedro.extras.datasets.pillow.ImageDataSet
        • kedro.extras.datasets.spark.SparkDataSet
        • kedro.extras.datasets.spark.SparkHiveDataSet
        • kedro.extras.datasets.spark.SparkJDBCDataSet
        • kedro.extras.datasets.tensorflow.TensorFlowModelDataset
        • kedro.extras.datasets.text.TextDataSet
        • kedro.extras.datasets.yaml.YAMLDataSet
    • kedro.extras.decorators
      • kedro.extras.decorators.retry_node.retry
      • kedro.extras.decorators.memory_profiler.mem_profile
    • kedro.extras.transformers
      • kedro.extras.transformers.memory_profiler.ProfileMemoryTransformer
      • kedro.extras.transformers.time_profiler.ProfileTimeTransformer
    • kedro.extras.logging
      • kedro.extras.logging.ColorHandler