A “Hello World” example¶
To learn how basic Kedro projects work, you can create a project interactively and explore it as you read this section. Feel free to name your project as you like, but this guide will assume the project is named `getting-started`.

Be sure to enter `Y` to include Kedro's example so your new project template contains the well-known Iris dataset, to get you started.
The Iris dataset, introduced in 1936 by the British statistician and biologist Ronald Fisher, is a simple but frequently referenced dataset. It contains 150 samples in total, comprising 50 samples of each of 3 different species of Iris plant (Iris Setosa, Iris Versicolour and Iris Virginica). For each sample, four flower measurements are recorded: sepal length, sepal width, petal length and petal width.
Classification, in the context of machine learning, is the task of determining which group an object belongs to, based on the known categorisation of similar objects.
The Iris dataset can be used by a machine learning model to illustrate classification. The classification algorithm, once trained on data with known values of species, takes an input of sepal and petal measurements, and compares them to the values it has stored from its training data. It will then output a predictive classification of the Iris species.
Project directory structure¶
The project directory will be structured as shown below. You are free to adapt the folder structure to your project's needs, but the example shows a convenient starting point and some best practices:
getting-started # Parent directory of the template
├── .gitignore # Prevent staging of unnecessary files to git
├── kedro_cli.py # A collection of Kedro command line interface (CLI) commands
├── .kedro.yml # Path to discover project context
├── README.md # Project README
├── .ipython # IPython startup scripts
├── conf # Project configuration files
├── data # Local project data (not committed to version control)
├── docs # Project documentation
├── logs # Project output logs (not committed to version control)
├── notebooks # Project related Jupyter notebooks
├── references # Sharable project references (tables, pdfs, etc.)
├── results # Shareable results
└── src # Project source code
If you opted to include Kedro's built-in example when you created the project, then the `conf/`, `data/` and `src/` directories will be pre-populated with an example configuration, input data and Python source code respectively.
Project source code¶
The project's source code can be found in the `src` directory. It contains 2 subfolders:

- `getting_started/` - this is the Python package for your project (a short, hedged sketch of how its main files fit together follows this list):
  - `pipelines/data_engineering/nodes.py` and `pipelines/data_science/nodes.py` - Example node functions, which perform the actual operations on the data (more on this in the Example pipeline below)
  - `pipelines/data_engineering/pipeline.py` and `pipelines/data_science/pipeline.py` - Where each individual pipeline is created from the above nodes to form the business logic flow
  - `pipeline.py` - Where the project's main pipelines are collated and named
  - `run.py` - The main entry point of the project, which brings all the components together and runs the pipeline
- `tests/` - This is where you should keep the project unit tests. Newly generated projects are preconfigured to run these tests using `pytest`. To kick off project testing, simply run the following from the project's root directory:

kedro test
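To give a feel for how `nodes.py` and `pipeline.py` relate, here is a minimal, hypothetical sketch (not the code the template actually generates): a plain function that would live in a `nodes.py`, and the `node()` / `Pipeline` wrapping that would live in the corresponding `pipeline.py`. The dataset names used here are illustrative only.

```python
# Hypothetical contents of a nodes.py: plain Python functions that do the work.
def lowercase_columns(data):
    """Return a copy of the dataframe with lowercase column names."""
    data = data.copy()
    data.columns = [name.lower() for name in data.columns]
    return data


# Hypothetical contents of the matching pipeline.py: wrap the function in a node.
from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                lowercase_columns,
                inputs="example_iris_data",   # a dataset name from the Data Catalog
                outputs="cleaned_iris_data",  # the name the result is saved under
            ),
        ]
    )
```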
Writing code¶
Use the `notebooks` folder for experimental code and move the code to `src/` as it develops.
Project components¶
A `kedro` project consists of the following main components:

| Component | Description |
| --- | --- |
| Data Catalog | A collection of datasets that can be used to form the data pipeline. Each dataset provides load and save capabilities for a specific data type, e.g. `CSVS3DataSet` loads and saves data to a CSV file in S3. |
| Pipeline | A collection of nodes. A pipeline takes care of node dependencies and execution order. |
| Node | A Python function which executes some business logic, e.g. data cleaning, dropping columns, validation, model training, scoring, etc. |
| Runner | An object that runs the kedro pipeline using the specified data catalog. Currently kedro supports 2 runner types: `SequentialRunner` and `ParallelRunner`. |
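The easiest way to see the four components working together is a tiny, self-contained sketch. The code below is not part of the generated project - the dataset and node names are made up, and exact class names (e.g. `MemoryDataSet`) can differ between Kedro versions - but it shows a Data Catalog, a node, a pipeline and a runner interacting:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

# Data Catalog: named datasets the pipeline can load from and save to.
catalog = DataCatalog({"numbers": MemoryDataSet([1, 2, 3, 4])})


# Node function: a plain Python function holding the business logic.
def total(numbers):
    return sum(numbers)


# Pipeline: a collection of nodes wired together by dataset names.
pipeline = Pipeline([node(total, inputs="numbers", outputs="numbers_total")])

# Runner: executes the pipeline against the catalog.
result = SequentialRunner().run(pipeline, catalog)
print(result)  # {'numbers_total': 10}
```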
Data¶
You can store data under the appropriate layer in the `data` folder. We recommend that all raw data goes into `raw` and that processed data moves to other layers, according to data engineering convention.
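As a hedged illustration of pointing a dataset at a file kept in the raw layer (the `CSVLocalDataSet` class and the `data/01_raw/` layout are assumptions that depend on your Kedro version and template):

```python
from kedro.io import CSVLocalDataSet, DataCatalog

# Hypothetical dataset entry: a CSV file stored under the raw data layer.
catalog = DataCatalog(
    {"example_iris_data": CSVLocalDataSet(filepath="data/01_raw/iris.csv")}
)

iris = catalog.load("example_iris_data")  # loads the file as a pandas DataFrame
```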
Example pipeline¶
The `getting-started` project contains two pipelines: a `data_engineering` pipeline and a `data_science` pipeline, found in `src/getting_started/pipelines`, with relevant example node functions pertaining to each of them. The following data engineering node is provided in `src/getting_started/pipelines/data_engineering/nodes.py`:
| Node | Description | Node Function Name |
| --- | --- | --- |
| Split data | Splits the example Iris dataset into train and test samples | `split_data` |

As well as the following data science nodes in `src/getting_started/pipelines/data_science/nodes.py`:

| Node | Description | Node Function Name |
| --- | --- | --- |
| Train model | Trains a simple multi-class logistic regression model | `train_model` |
| Predict | Makes class predictions given a pre-trained model and a test set | `predict` |
| Report accuracy | Reports the accuracy of the predictions performed by the previous node | `report_accuracy` |
Node execution order is determined by resolving the input and output data dependencies between the nodes and not by the order in which the nodes were passed into the pipeline.
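The sketch below illustrates this with two made-up nodes (they are not the project's real `split_data` and `train_model` functions): the nodes are passed to the pipeline in the "wrong" order, yet the runner still executes the splitting node first, because the training node's input is produced by it.

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def split(data):
    # Stand-in for a train / test split.
    midpoint = len(data) // 2
    return data[:midpoint], data[midpoint:]


def train(train_set):
    # Stand-in for model training.
    return "model trained on %d rows" % len(train_set)


# The training node is listed first, but its input `train_set` is an
# output of the splitting node, so the splitting node runs first.
pipeline = Pipeline(
    [
        node(train, inputs="train_set", outputs="model"),
        node(split, inputs="raw_data", outputs=["train_set", "test_set"]),
    ]
)

catalog = DataCatalog({"raw_data": MemoryDataSet(list(range(10)))})
print(SequentialRunner().run(pipeline, catalog))  # free outputs: model, test_set
```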
Configuration¶
There are two default folders for adding configuration - `conf/base/` and `conf/local/`:

- `conf/base/` - Used for project-specific configuration
- `conf/local/` - Used for access credentials, personal IDE configuration or other sensitive / personal content
Project-specific configuration¶
There are three files used for project-specific configuration:

- `catalog.yml` - The Data Catalog allows you to define the file paths and loading / saving configuration required for different datasets
- `logging.yml` - Uses Python's default `logging` library to set up logging
- `parameters.yml` - Allows you to define parameters for machine learning experiments, e.g. train / test split and number of iterations
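As a rough, hedged sketch of how this configuration is consumed (the `ConfigLoader` API and file patterns shown here are assumptions that vary between Kedro versions; the generated project wires this up for you inside `ProjectContext`):

```python
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# Read configuration from both folders; values in conf/local take
# precedence over the same keys in conf/base.
conf_loader = ConfigLoader(["conf/base", "conf/local"])

catalog_config = conf_loader.get("catalog*", "catalog*/**")
credentials = conf_loader.get("credentials*", "credentials*/**")
parameters = conf_loader.get("parameters*", "parameters*/**")

# Build the Data Catalog described by catalog.yml, using any credentials found.
catalog = DataCatalog.from_config(catalog_config, credentials)
```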
Sensitive or personal configuration¶
As we described above, any access credentials, personal IDE configuration or other sensitive and personal content should be stored in `conf/local/`. By default, `credentials.yml` is generated in `conf/base/` (because `conf/local/` is ignored by `git`) and, to populate and use the file, you should first move it to `conf/local/`. Further safeguards for preventing sensitive information from being leaked onto `git` are discussed in the FAQs.
Running the example¶
In order to run the `getting-started` project, simply execute the following from the root project directory:
kedro run
This command calls the `run()` method on the `ProjectContext` class defined in `src/getting_started/run.py`, which in turn does the following:
- Instantiates the `ProjectContext` class, which:
  - Reads relevant configuration
  - Configures Python `logging`
  - Instantiates the `DataCatalog` and feeds it a dictionary containing the `parameters` config
  - Instantiates the pipeline
- Instantiates the `SequentialRunner` and runs it by passing the following arguments:
  - `Pipeline` object
  - `DataCatalog` object
Upon successful completion, you should see the following log messages in your console:
2019-02-13 16:59:26,293 - kedro.runner.sequential_runner - INFO - Completed 4 out of 4 tasks
2019-02-13 16:59:26,293 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
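If you prefer to trigger the same run from a Python session or a notebook instead of the CLI, something along the following lines has the equivalent effect. This is a hedged sketch: the `load_context` helper and its import path have changed between Kedro versions, so check the API docs for your release.

```python
# Hedged equivalent of `kedro run`, executed from Python.
from kedro.context import load_context

context = load_context(".")  # discovers the project via .kedro.yml
context.run()                # same run() as described above
```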
Summary¶
Congratulations! In this chapter you have set up Kedro and used it to create a first example project, which has illustrated the basic concepts of using nodes to form a pipeline, a Data Catalog and the project configuration. This example uses a simple and familiar dataset to keep your first experience basic and easy to follow. In the next chapter, we will revisit the core concepts in more detail and walk through a more complex example.