Working with Databricks
Note: This documentation is based on Kedro 0.15.1. If you spot anything that is incorrect, please create an issue or pull request.
Databricks Connect (recommended)
We recommend using Databricks Connect to easily execute your Kedro pipeline on a Databricks cluster.
Databricks Connect connects your favourite IDE (IntelliJ, Eclipse, VS Code and PyCharm), notebook server (Zeppelin, Jupyter), and other custom applications to Databricks clusters to run Spark code.
You can set up Databricks Connect by following the official Databricks Connect setup instructions.
Note: You will need to uninstall PySpark, as Databricks Connect will install it for you. This method only works for 5.x versions of Databricks clusters and disables use of Databricks Notebook.
GitHub workflow with Databricks
This workflow assumes that the Kedro project is developed in a local environment under Git version control, and that commits are pushed to a remote server (e.g. GitHub, GitLab or Bitbucket).
Deployment of the (latest) code on the Databricks driver is accomplished through cloning and the periodic pulling of changes from the Git remote.
The pipeline is then executed on the Databricks cluster.
This example uses GitHub Personal Access Tokens (or the equivalents for Bitbucket, GitLab, etc.). You should be able to use your GitHub password instead, although this is less secure.
Firstly, you will need to generate a GitHub Personal Access Token with the relevant privileges.
Add your username and token to the environment variables of your running Databricks environment (all the following commands should be run inside a Notebook):
import os
os.environ["GITHUB_USER"] = "YOUR_USERNAME"
os.environ["GITHUB_TOKEN"] = "YOUR_TOKEN"
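Before cloning, it can help to confirm that both variables are actually set, since an empty value only shows up later as a failed `git clone`. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of Kedro or Databricks.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical usage inside the Notebook:
# missing = missing_env_vars(["GITHUB_USER", "GITHUB_TOKEN"])
# if missing:
#     raise RuntimeError("Set these variables first: " + ", ".join(missing))
```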
Then clone your project to a directory of your choosing:
%sh mkdir -vp ~/projects/ && cd ~/projects/ &&
git clone https://${GITHUB_USER}:${GITHUB_TOKEN}@github.com/**/your_project.git
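If your username or token contains characters such as `@` or `/`, embedding them directly in the URL may break the clone. One possible approach, sketched below, is to percent-encode the credentials before building the URL. The function `build_clone_url` and its `repo_path` and `host` parameters are illustrative assumptions, not part of any library.

```python
from urllib.parse import quote


def build_clone_url(user, token, repo_path, host="github.com"):
    """Build an HTTPS clone URL with percent-encoded credentials.

    `repo_path` is the part after the host, e.g. "<owner>/your_project.git"
    (a placeholder; substitute your own repository path).
    """
    # safe="" makes quote() encode every reserved character, including "/" and "@".
    safe_user = quote(user, safe="")
    safe_token = quote(token, safe="")
    return f"https://{safe_user}:{safe_token}@{host}/{repo_path}"
```

Avoid printing the resulting URL in a shared Notebook, because it contains the token.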
Then cd into your project directory:
cd ~/projects/your_project
You’ll need to add the src directory to the Python path (sys.path) using:
import sys
import os.path
sys.path.append(os.path.abspath("./src"))
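Re-running the cell above appends the same directory again each time. This is harmless, but it clutters `sys.path`. A small sketch of an idempotent variant follows; the helper name `add_to_sys_path` is hypothetical.

```python
import os.path
import sys


def add_to_sys_path(directory, path_list=None):
    """Append the absolute form of `directory` to the path list once.

    Returns True if the entry was added, False if it was already present.
    """
    # Default to the interpreter's real search path; a plain list can be passed for testing.
    path_list = sys.path if path_list is None else path_list
    absolute = os.path.abspath(directory)
    if absolute in path_list:
        return False
    path_list.append(absolute)
    return True


# Hypothetical usage from the project directory:
# add_to_sys_path("./src")
```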
Then, import and execute the run module to run your pipeline:
import your_project.run as run
run.main()
To pull in updates to your code, run the following from your project directory:
%sh git pull
Detach and re-attach your Notebook or re-import the run module for changes to be picked up:
import importlib
run = importlib.reload(run)
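Note that `importlib.reload(run)` re-executes only the `run` module itself. Modules it imports (for example, a pipeline definition) keep their previously loaded code. If detaching the Notebook is inconvenient, one possible workaround is to reload every already-imported module of the project package, deepest modules first. This is a sketch under that assumption; `reload_package` is a hypothetical helper, and objects created before the reload still refer to the old classes.

```python
import importlib
import sys
from types import ModuleType


def reload_package(package_name):
    """Reload all imported modules of a package and return their names in reload order."""
    names = [
        name
        for name, module in list(sys.modules.items())
        if isinstance(module, ModuleType)
        and (name == package_name or name.startswith(package_name + "."))
    ]
    # Reload deeper submodules first so that parent modules, which are
    # reloaded afterwards, re-import the freshly loaded children.
    names.sort(key=lambda name: name.count("."), reverse=True)
    for name in names:
        importlib.reload(sys.modules[name])
    return names


# Hypothetical usage after `%sh git pull`:
# reload_package("your_project")
# run.main()
```

Because `importlib.reload` updates module objects in place, the existing `run` name in the Notebook continues to point at the reloaded module.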