Basic Usage

[1]:
%load_ext autoreload
%autoreload 2
The central data structure provided by the library is the BlobPath type.
This type abstracts away the internals of how the file is stored and works in a cloud-agnostic manner.

Note that you would need to install the aws extra to work with S3 paths:

pip install 'blob-path[aws]'
[2]:
from blob_path.backends.s3 import S3BlobPath
from pathlib import PurePath

bucket_name = "narang-public-s3"
object_key = PurePath("hello_world.txt")
region = "us-east-1"
blob_path = S3BlobPath(bucket_name, region, object_key)
A blob path is simply a path representation, like pathlib.Path; it does not require that the file actually exists.
You can check for existence using exists
[3]:
blob_path.exists()
[3]:
True
The main method that BlobPath provides is open, which mimics the builtin open function to some extent.
This method is the central abstraction: many operations are handled in a generic way using it.
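As a small illustration of that genericity, here is a sketch of a helper that relies only on open, so it works unchanged whichever backend the path comes from (word_count is a made-up name, not part of the library):

# a hypothetical helper: it depends only on `open`, so S3, Azure and local paths all work
def word_count(path) -> int:
    with path.open("r") as f:
        return len(f.read().split())

# word_count(blob_path) would count the words stored in the S3 object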

Let's write something to the object in our bucket

[4]:
with blob_path.open("w") as f:
    f.write("hello world")

# the file would exist in S3 now, you should check it out
blob_path.exists()
[4]:
True
S3 and other cloud storage blob paths can be fully serialised and deserialised.
You can pass these path objects around across processes (and servers) and easily locate the file
[5]:
# a single blob path can be serialised using the method `serialise`
blob_path.serialise()
[5]:
{'kind': 'blob-path-aws',
 'payload': {'bucket': 'narang-public-s3',
  'region': 'us-east-1',
  'object_key': ['hello_world.txt']}}
[6]:
# let's deserialise it
# deserialise is a separate function: you can pass it any kind of blob path and it will deserialise it correctly

from blob_path.deserialise import deserialise

deserialised_s3_blob = deserialise(
    {
        "kind": "blob-path-aws",
        "payload": {
            "bucket": "narang-public-s3",
            "region": "us-east-1",
            "object_key": ["hello_world.txt"],
        },
    }
)

deserialised_s3_blob
[6]:
kind=blob-path-aws bucket=narang-public-s3 region=us-east-1 object_key=hello_world.txt
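The serialised form is a plain dict of JSON-friendly values (as the output above shows), so shipping a path to another process is just a matter of dumping and loading it. A minimal sketch, where the JSON string stands in for whatever transport you actually use:

import json

from blob_path.deserialise import deserialise

# producer: turn the path into a JSON string and send it anywhere (queue, HTTP, ...)
message = json.dumps(blob_path.serialise())

# consumer: parse the JSON and deserialise it back into a usable path
received = deserialise(json.loads(message))
received.exists()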
Let's try another path backend, LocalRelativeBlobPath. This path models a local FS relative path, which is always rooted at a single root directory.
Say you store all your application's files inside a single directory “/tmp/my-apps-files”.
In that case, instead of using pathlib.Path, you could use LocalRelativeBlobPath (this allows you to easily switch between cloud storage and local storage for your files)
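A sketch of what that switch could look like, assuming you pick the backend from your own configuration (report_path and use_cloud are made-up names, not part of the library):

from pathlib import PurePath

from blob_path.backends.local_relative import LocalRelativeBlobPath
from blob_path.backends.s3 import S3BlobPath


def report_path(name: str, use_cloud: bool):
    # the caller only ever sees a blob path and never cares which backend it got
    relpath = PurePath("reports") / name
    if use_cloud:
        return S3BlobPath("narang-public-s3", "us-east-1", relpath)
    return LocalRelativeBlobPath(relpath)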
[7]:
from blob_path.backends.local_relative import LocalRelativeBlobPath

# PurePath is a simple path representation; it does not care whether the path actually exists in your FS
# It's useful for logically representing various path-like data structures; as an example, you could represent S3 object keys as `PurePath`s
from pathlib import PurePath

relpath = PurePath("local") / "storage.txt"
local_blob = LocalRelativeBlobPath(relpath)
[8]:
local_blob.exists()
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 local_blob.exists()

File ~/Desktop/personal/blob-path/src/blob_path/backends/local_relative.py:74, in LocalRelativeBlobPath.exists(self)
     73 def exists(self) -> bool:
---> 74     return (self._p()).exists()

File ~/Desktop/personal/blob-path/src/blob_path/backends/local_relative.py:94, in LocalRelativeBlobPath._p(self)
     93 def _p(self) -> Path:
---> 94     return _get_implicit_base_path() / self._relpath

File ~/Desktop/personal/blob-path/src/blob_path/backends/local_relative.py:110, in _get_implicit_base_path()
    109 def _get_implicit_base_path() -> Path:
--> 110     base_path = Path(get_implicit_var(BASE_VAR))
    111     base_path.mkdir(exist_ok=True, parents=True)
    112     return base_path

File ~/Desktop/personal/blob-path/src/blob_path/implicit.py:30, in get_implicit_var(var)
     28 result = _PROVIDER(var)
     29 if result is None:
---> 30     raise Exception(
     31         "tried fetching implicit variable from environment "
     32         + f"but the var os.environ['{var}'] does not exist"
     33     )
     34 return result

Exception: tried fetching implicit variable from environment but the var os.environ['IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR'] does not exist

Uh oh, we got an error, and really early too ;_; It says that we have not defined IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR in our environment

This environment variable stores the root directory of your relative paths

[9]:
from pathlib import Path
import os

os.environ["IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR"] = str(
    Path.home() / "tmp" / "local_fs_root"
)

# the call works now that the implicit variable is set
local_blob.exists()
[9]:
True
So why does LocalRelativeBlobPath take the root directory as an environment variable? Couldn't we pass it to __init__?
We could argue about this, but then the path would be pretty much the same as any absolute path. Even the serialised representation of LocalRelativeBlobPath leaves out the root directory (it's not part of the path representation)
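You can verify this yourself by serialising the local path and inspecting the payload; whatever exact shape your installed version produces, the root directory will not appear in it:

# the root directory (the implicit variable) is not part of this payload
local_blob.serialise()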

Implicit variables

Variables which modify the behavior of a BlobPath in this way are called implicit variables. By default, they are picked up from the environment
Fetching the root directory from the environment has multiple benefits
  • You could mount the same path between multiple containers at different mount points and still pass around the serialised representation correctly (assuming you provide the implicit variables correctly; a sketch of this follows below)

  • Same for servers mounted with an NFS

  • This also works well for presigned URLs, where you can simply start an nginx server and pass that server’s base URL as an implicit variable to the path

Implicit variables will change the behavior and location of your blobs implicitly (hah! perfect naming). Every implicit variable follows the naming convention: IMPLICIT_BLOB_PATH_<BACKEND>_...
Currently, only LocalRelativeBlobPath has implicit variables
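To make the container scenario from the list above concrete, here is a sketch of what each container would run; the mount points are made up, and the payload would normally arrive from another process (e.g. the output of local_blob.serialise() shipped over a queue):

import os

from blob_path.deserialise import deserialise

# container A mounts the shared volume at /mnt/shared-a
os.environ["IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR"] = "/mnt/shared-a"

# container B mounts the same volume at /data/shared instead
# os.environ["IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR"] = "/data/shared"

# both containers deserialise the exact same payload and end up reading the same file,
# even though its absolute location differs inside each container
blob = deserialise(payload)
with blob.open("r") as f:
    print(f.read())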

Let’s do a simple copy operation between an S3 path and a local path

[10]:
import shutil

# the long way
with deserialised_s3_blob.open("r") as fr:
    with local_blob.open("w") as fw:
        shutil.copyfileobj(fr, fw)

with local_blob.open("r") as f:
    print(f.read())
hello world
Let's use a shortcut now.
Whenever possible, prefer the library's shortcuts for your operations.
Currently, they only provide ease of use, but they let us optimise special cases later (for example, copying between two S3 blobs could be triggered as a remote copy with boto3, without routing the data through your local machine)
[11]:
# delete first for the example
local_blob.delete()

deserialised_s3_blob.cp(local_blob)
with local_blob.open("r") as f:
    print("local blob content copied from s3:", f.read())


# using a shortcut from the library
# this shortcut provides more convenience: either `src` or `dest` can be a `pathlib.Path` too
# this makes it easy to deal with normal paths in your FS
from blob_path.shortcuts import cp

local_blob.delete()
cp(deserialised_s3_blob, local_blob)
with local_blob.open("r") as f:
    print("copied using shortcut:", f.read())
local blob content copied from s3: hello world
copied using shortcut: hello world
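Since either end of cp can be a plain pathlib.Path, mixing ordinary filesystem paths with blob paths is a one-liner (the notes.txt file below is made up):

from pathlib import Path

from blob_path.shortcuts import cp

# copy a plain local file into S3; `src` here is an ordinary pathlib.Path
cp(Path.home() / "notes.txt", deserialised_s3_blob)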
Let's play a bit with an Azure path now. If you want, you can change it to any of the other backends; this is simply to show that everything works the same with Azure paths.
We will copy data from the S3 path to the Azure path now

You will need to install the azure extra

pip install 'blob-path[azure]'
[12]:
from blob_path.backends.azure_blob_storage import AzureBlobPath
from pathlib import PurePath

destination = AzureBlobPath(
    "narang99blobstore", "testcontainer", PurePath("copied") / "from" / "s3.txt"
)
[13]:
deserialised_s3_blob.cp(destination)
destination.exists()
[13]:
True
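And just like the S3 path, the Azure path round-trips through serialise and deserialise (the exact payload shape is backend-specific, so we only round-trip it here rather than spelling it out):

from blob_path.deserialise import deserialise

payload = destination.serialise()
restored = deserialise(payload)
restored.exists()  # True: the same blob we just copied to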