pypeln.process module
The `process` module lets you create pipelines using objects from Python's multiprocessing module according to Pypeline's general architecture. Use this module when you need true parallelism for CPU-heavy operations, but be aware of its implications (continue reading).
```python
from pypeln import process as pr
import time
from random import random

def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

def slow_gt3(x):
    time.sleep(random()) # <= some slow computation
    return x > 3

data = range(10) # [0, 1, 2, ..., 9]

stage = pr.map(slow_add1, data, workers = 3, maxsize = 4)
stage = pr.filter(slow_gt3, stage, workers = 2)

data = list(stage) # e.g. [5, 6, 9, 4, 8, 10, 7]
```
Stage
All functions from this module return a private `pypeln.process._Stage` object. Stages are lazy: a `_Stage` object merely contains the information needed to perform its own computation and that of the stages it depends on. Stages are iterables, i.e. they implement `__iter__`; to actually execute the pipeline you can either iterate over them directly or iterate over the generator returned by `to_iterable`, which gives you more control.
```python
from pypeln import process as pr
import time
from random import random

def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

data = range(10) # [0, 1, 2, ..., 9]

stage = pr.map(slow_add1, data, workers = 3, maxsize = 4)

for x in stage:
    print(x) # e.g. 2, 1, 5, 6, 3, 4, 7, 8, 9, 10
```
This iterable API makes Pypeline very intuitive/pythonic to use and compatible with most other Python code.
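For example, a small sketch (the `maxsize` values here are arbitrary) of using `to_iterable` to consume the pipeline with an explicit bound on the output queue:

```python
from pypeln import process as pr

stage = pr.map(slow_add1, range(10), workers = 3, maxsize = 4)

# to_iterable returns a generator; maxsize bounds the queue that feeds
# the consuming (main) process.
for x in pr.to_iterable(stage, maxsize = 8):
    print(x)
```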
Workers
The worker type of this module is a multiprocessing.Process. Each worker process is instantiated with `daemon = True`. Creating each process is slow and consumes a lot of memory. Since processes are technically separate programs managed by the OS, they are great for doing operations in parallel because they avoid the GIL (or rather, each process has its own GIL).
The number of workers on each stage can usually be controlled by the `workers` parameter of `pypeln.process`'s various functions. Try not to create more processes than the number of cores on your machine, or else they will end up fighting for resources and computation will be suboptimal.
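For instance, a common pattern (a sketch, not prescribed by Pypeline itself) is to derive `workers` from the machine's core count:

```python
from multiprocessing import cpu_count
from pypeln import process as pr

# one worker per core for the CPU-bound stage (illustrative choice)
stage = pr.map(slow_add1, range(100), workers = cpu_count(), maxsize = 8)

data = list(stage)
```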
Queue
The queue type of this module is a multiprocessing.Queue. Since processes don't share memory, all information passed between them through these queues must first be serialized (pickled), which is slow; be aware of this and try to avoid sending large objects.
The number of elements each stage can hold can usually be controlled by the `maxsize` parameter of `pypeln.process`'s various functions. When given, this parameter sets a maximum size for the stage's input Queue, which serves as a backpressure mechanism: any stage pushing data to a Queue that becomes full (reaches its `maxsize`) has to stop its computation until space becomes available, thus potentially preventing `OutOfMemory` errors due to overpressure from faster stages.
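As a sketch of the backpressure behavior (the sizes and sleep times below are illustrative), a fast stage feeding a slow one can only run ahead by as many elements as the slow stage's input queue allows:

```python
from pypeln import process as pr
import time

def fast_inc(x):
    return x + 1

def slow_double(x):
    time.sleep(0.1) # <= slow consumer
    return 2 * x

data = range(1000)

stage = pr.map(fast_inc, data, workers = 1, maxsize = 32)
# slow_double's input queue holds at most 32 elements, so fast_inc
# blocks whenever it gets 32 elements ahead.
stage = pr.map(slow_double, stage, workers = 1, maxsize = 32)

data = list(stage)
```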
Resource Management
There are many occasions where you need to create resource objects (e.g. http or database sessions) that, for efficiency, are expected to last the whole lifespan of each worker. To handle such objects, many functions have the `on_start` and `on_done` arguments, which expect callback functions.

When a worker is created it calls the `on_start` function; this function should create and return the resource objects. These objects will be passed as extra arguments to the main function and also to the `on_done` function.
```python
from pypeln import process as pr

def on_start():
    http_session = get_http_session()
    db_session = get_db_session()
    return http_session, db_session

def on_done(_stage_status, http_session, db_session):
    http_session.close()
    db_session.close()

def f(x, http_session, db_session):
    # some logic
    return y

stage = pr.map(f, stage, workers = 3, on_start = on_start, on_done = on_done)
```
A few notes:
- The resource objects are created per worker.
- `on_start` should return an object other than `None`, or a tuple of resource objects.
- If `on_start` returns some arguments then they must be accepted by `f` and `on_done`.
- `on_done` receives a `StageStatus` object followed by the resource objects created by `on_start`.
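For instance, with a single resource the object can be returned directly instead of a tuple (a sketch using hypothetical `get_db_session` / `lookup` helpers):

```python
from pypeln import process as pr

def on_start():
    # a single resource object (anything other than None), not wrapped in a tuple
    return get_db_session()

def on_done(stage_status, db_session):
    db_session.close()

def query(x, db_session):
    return db_session.lookup(x) # hypothetical method

stage = pr.map(query, data, workers = 3, on_start = on_start, on_done = on_done)
```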
Pipe Operator
Functions that accept a `stage` parameter return a `Partial` instead of a new stage when `stage` is not given. These `Partial`s are callables that accept the missing `stage` parameter and return the full output of the original function. For example:
pr.map(f, stage, **kwargs) = pr.map(f, **kwargs)(stage)
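In other words, you can keep the `Partial` around and apply it later (a small sketch):

```python
from pypeln import process as pr

# no stage given, so this returns a Partial instead of a stage
add_stage = pr.map(slow_add1, workers = 3, maxsize = 4)

# calling the Partial with the missing stage builds the actual stage
stage = add_stage(range(10))
data = list(stage)
```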
The important thing about partials is that they implement the pipe `|` operator as
x | partial = partial(x)
This allows you to define pipelines in the following way:
```python
from pypeln import process as pr

data = (
    range(10)
    | pr.map(slow_add1, workers = 3, maxsize = 4)
    | pr.filter(slow_gt3, workers = 2)
    | list
)
```
Recommendations
Creating processes and doing Inter-Process Communication (IPC) is expensive; therefore we recommend the following:
- Minimize the number of stages based on this module.
- Tune the number of workers based on the number of cores.
- When processing large datasets, set the `maxsize` of the stage to prevent `OutOfMemory` errors.
- If possible, don't send large objects.
- If you just need a single stage to perform a task over a collection in parallel, use the `each` function (see the sketch below).
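Putting these recommendations together, here is a sketch of a single `each` sink stage with workers tuned to the core count and a bounded `maxsize` (the `heavy_computation`, `save_result` and `load_large_dataset` helpers are hypothetical):

```python
from multiprocessing import cpu_count
from pypeln import process as pr

def process_and_save(x):
    # hypothetical CPU-heavy work followed by a write to disk
    save_result(heavy_computation(x))

data = load_large_dataset() # hypothetical source

# a single process-based stage, workers tuned to the core count, bounded maxsize
pr.each(process_and_save, data, workers = cpu_count(), maxsize = 2 * cpu_count(), run = True)
```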
""" The `process` module lets you create pipelines using objects from python's [multiprocessing](https://docs.python.org/3.4/library/multiprocessing.html) module according to Pypeline's general [architecture](https://cgarciae.gitbook.io/pypeln/#architecture). Use this module when you are in need of true parallelism for CPU heavy operations but be aware of its implications (continue reading). from pypeln import process as pr import time from random import random def slow_add1(x): time.sleep(random()) # <= some slow computation return x + 1 def slow_gt3(x): time.sleep(random()) # <= some slow computation return x > 3 data = range(10) # [0, 1, 2, ..., 9] stage = pr.map(slow_add1, data, workers = 3, maxsize = 4) stage = pr.filter(slow_gt3, stage, workers = 2) data = list(stage) # e.g. [5, 6, 9, 4, 8, 10, 7] ## Stage All functions from this module return a private `pypeln.process._Stage` object. Stages are lazy, that is, a `_Stage` objects merely contains the information needed to perform the computation of itself and the Stages it depends on. Stages are [iterables](https://docs.python.org/3/glossary.html#term-iterable) i.e. they implement `__iter__`, to actually execute the pipeline you can directly iterable them or iterate over the generator returned by `pypeln.process.to_iterable` which gives you more control. from pypeln import process as pr import time from random import random def slow_add1(x): time.sleep(random()) # <= some slow computation return x + 1 data = range(10) # [0, 1, 2, ..., 9] stage = pr.map(slow_add1, data, workers = 3, maxsize = 4) for x in stage: print(x) # e.g. 2, 1, 5, 6, 3, 4, 7, 8, 9, 10 This iterable API makes Pypeline a very intuitive/pythonic to use and compatible with most other python code. ## Workers The worker type of this module is a [multiprocessing.Process](https://docs.python.org/3.4/library/multiprocessing.html#multiprocessing.Process). Each worker process is instantiated with `daemon = True`. Creating each process is slow and consumes a lot of memory. Since processes are technically separate programs managed by the OS they are great for doing operations in parallel by avoiding the [GIL](https://realpython.com/python-gil) (or rather having their separate GIL). The number of workers on each stage can usually be controled by the `workers` parameter on `pypeln.process`'s various functions. Try not to create more processes than the number of cores you have on your machine or else they will end up fighting for resources and computation will be suboptimal. ## Queue The queue type of this module is a [multiprocessing.Queue](https://docs.python.org/3.4/library/multiprocessing.html#multiprocessing.Queue). Since processes don't share memory, all information passed between them through these queues must first be serialized (pickled) which is slow, be aware of this and try to avoid sending large objects. The number of elements each stage can hold usually be controled by the `maxsize` parameter on `pypeln.process`'s various functions. When passed this parameter sets a maximum size for the input Queue of the stage, this serves as a [backpressure](https://www.quora.com/What-is-backpressure-in-the-context-of-data-streaming) mechanism because any stages pushing data to a Queue that becomes full (reaches its `maxsize`) will have to stop their computation until space becomes available, thus, potentially preveting `OutOfMemeory` errors due to overpressure from faster stages. ## Resource Management There are many occasions where you need to create some resource objects (e.g. 
http or database sessions) that for efficiency are expected to last the whole lifespan of each worker. To handle such objects many functions have the `on_start` and `on_done` arguments which expect some callback functions. When a worker is created it calls the `on_start` function, this functions should create and return the resource objects. These object will be passed as extra arguments to the main function and also to the `on_end` function. from pypeln import process as pr def x(): http_session = get_http_session() db_session = get_db_session() return http_session, db_session def on_end(_stage_status, http_session, db_session): http_session.close() db_session.close() def f(x, http_session, db_session): # some logic return y stage = pr.map(f, stage, workers = 3, on_start = on_start) A few notes: * The resource objects are created per worker. * `on_start` should return a object other that `None` or a tuple of resource objects. * If `on_start` returns some arguments then they must be accepted by `f` and `on_end`. * `on_end` receives a `pypeln.process.StageStatus` object followed by the resource objects created by `on_start`. ## Pipe Operator Functions that accept a `stage` parameter return a `Partial` instead of a new stage when `stage` is not given. These `Partial`s are callables that accept the missing `stage` parameter and return the full output of the original function. For example pr.map(f, stage, **kwargs) = pr.map(f, **kwargs)(stage) The important thing about partials is that they implement the pipe `|` operator as x | partial = partial(x) This allows you to define pipelines in the following way: from pypenl import process as pr data = ( range(10) | pr.map(slow_add1, workers = 3, maxsize = 4) | pr.filter(slow_gt3, workers = 2) | list ) ## Recomendations Creating processes and doing Inter-Process Communication (IPC) is expensive, therefore we recommend the following: * Minimize the number of stages based on this module. * Tune the number of workers based on the number of cores. * When processing large datasets set the maxsize of the stage to prevent `OutOfMemory` errors. * If possible don't send large objects. * If you just need a single stage to perform a task over a collection in parallel use the `pypeln.process.each` function. """ from __future__ import absolute_import, print_function import functools from collections import namedtuple from . import utils import sys import traceback import types import inspect ############# # imports pr ############# from multiprocessing import Process as WORKER from multiprocessing import Manager, Lock, Queue from multiprocessing.queues import Full, Empty from threading import Thread from collections import namedtuple from . 
import utils _MANAGER = Manager() def _get_namespace(): return _MANAGER.Namespace() ############# # imports th ############# # from threading import Thread as WORKER # from threading import Thread # from.utils import Namespace # from six.moves.queue import Queue, Empty, Full # from threading import Lock # def _get_namespace(): # return Namespace() #################### # classes #################### class _Stage(utils.BaseStage): def __init__(self, worker_constructor, workers, maxsize, on_start, on_done, target, args, dependencies): self.worker_constructor = worker_constructor self.workers = workers self.maxsize = maxsize self.on_start = on_start self.on_done = on_done self.target = target self.args = args self.dependencies = dependencies def __iter__(self): return to_iterable(self) def __repr__(self): return "_Stage(worker_constructor = {worker_constructor}, workers = {workers}, maxsize = {maxsize}, target = {target}, args = {args}, dependencies = {dependencies})".format( worker_constructor = self.worker_constructor, workers = self.workers, maxsize = self.maxsize, target = self.target, args = self.args, dependencies = len(self.dependencies), ) class _StageParams(namedtuple("_StageParams", [ "input_queue", "output_queues", "on_start", "on_done", "stage_namespace", "stage_lock", "pipeline_namespace", "pipeline_error_queue", "index", ])): pass WorkerInfo = namedtuple("WorkerInfo", ["index"]) class StageStatus(object): """ Object passed to various `on_done` callbacks. It contains information about the stage in case book keeping is needed. """ def __init__(self, namespace, lock): self._namespace = namespace self._lock = lock @property def done(self): """ `bool` : `True` if all workers finished. """ with self._lock: return self._namespace.active_workers == 0 @property def active_workers(self): """ `int` : Number of active workers. 
""" with self._lock: return self._namespace.active_workers def __str__(self): return "StageStatus(done = {done}, active_workers = {active_workers})".format( done = self.done, active_workers = self.active_workers, ) class _InputQueue(object): def __init__(self, maxsize, total_done, pipeline_namespace, **kwargs): self.queue = Queue(maxsize = maxsize, **kwargs) self.lock = Lock() self.namespace = _get_namespace() self.namespace.remaining = total_done self.pipeline_namespace = pipeline_namespace def __iter__(self): while not self.is_done(): x = self.get() if self.pipeline_namespace.error: return if not utils.is_continue(x): yield x def get(self): try: x = self.queue.get(timeout = utils.TIMEOUT) except (Empty, Full): return utils.CONTINUE if not utils.is_done(x): return x else: with self.lock: self.namespace.remaining -= 1 return utils.CONTINUE def is_done(self): return self.namespace.remaining == 0 and self.queue.empty() def put(self, x): self.queue.put(x) def done(self): self.queue.put(utils.DONE) class _OutputQueues(list): def put(self, x): for queue in self: queue.put(x) def done(self): for queue in self: queue.put(utils.DONE) def _handle_exceptions(params): def handle_exceptions(f): @functools.wraps(f) def wrapper(*args, **kwargs): try: return f(*args, **kwargs) except BaseException as e: params.pipeline_error_queue.put((type(e), e, "".join(traceback.format_exception(*sys.exc_info())))) params.pipeline_namespace.error = True return wrapper return handle_exceptions def _run_task(f_task, params): try: if params.on_start is not None: n_args = len(inspect.getargspec(params.on_start).args) if n_args == 0: args = params.on_start() elif n_args == 1: worker_info = WorkerInfo( index = params.index ) args = params.on_start(worker_info) else: args = None else: args = None if args is None: args = () elif not isinstance(args, tuple): args = (args,) if params.input_queue: for x in params.input_queue: f_task(x, args) else: f_task(args) params.output_queues.done() if params.on_done is not None: with params.stage_lock: params.stage_namespace.active_workers -= 1 stage_status = StageStatus( namespace = params.stage_namespace, lock = params.stage_lock, ) params.on_done(stage_status, *args) except BaseException as e: try: params.pipeline_error_queue.put((type(e), e, "".join(traceback.format_exception(*sys.exc_info())))) params.pipeline_namespace.error = True except BaseException as e: print(e) ########### # map ########### def _map(f, params): @_handle_exceptions(params) def f_task(x, args): y = f(x, *args) params.output_queues.put(y) _run_task(f_task, params) def map(f, stage = utils.UNDEFINED, workers = 1, maxsize = 0, on_start = None, on_done = None): """ Creates a stage that maps a function `f` over the data. Its intended to behave like python's built-in `map` function but with the added concurrency. from pypeln import process as pr import time from random import random def slow_add1(x): time.sleep(random()) # <= some slow computation return x + 1 data = range(10) # [0, 1, 2, ..., 9] stage = pr.map(slow_add1, data, workers = 3, maxsize = 4) data = list(stage) # e.g. [2, 1, 5, 6, 3, 4, 7, 8, 9, 10] Note that because of concurrency order is not guaranteed. # **Args** * **`f`** : a function with signature `f(x, *args) -> y`, where `args` is the return of `on_start` if present, else the signature is just `f(x) -> y`. * **`stage = Undefined`** : a stage or iterable. * **`workers = 1`** : the number of workers the stage should contain. 
* **`maxsize = 0`** : the maximum number of objects the stage can hold simultaneously, if set to `0` (default) then the stage can grow unbounded. * **`on_start = None`** : a function with signature `on_start() -> args`, where `args` can be any object different than `None` or a tuple of objects. The returned `args` are passed to `f` and `on_done`. This function is executed once per worker at the beggining. * **`on_done = None`** : a function with signature `on_done(stage_status, *args)`, where `args` is the return of `on_start` if present, else the signature is just `on_done(stage_status)`, and `stage_status` is of type `pypeln.process.StageStatus`. This function is executed once per worker when the worker is done. # **Returns** * If the `stage` parameters is given then this function returns a new stage, else it returns a `Partial`. """ if utils.is_undefined(stage): return utils.Partial(lambda stage: map(f, stage, workers=workers, maxsize=maxsize, on_start=on_start, on_done=on_done)) stage = _to_stage(stage) return _Stage( worker_constructor = WORKER, workers = workers, maxsize = maxsize, on_start = on_start, on_done = on_done, target = _map, args = (f,), dependencies = [stage], ) ########### # flat_map ########### def _flat_map(f, params): @_handle_exceptions(params) def f_task(x, args): for y in f(x, *args): params.output_queues.put(y) _run_task(f_task, params) def flat_map(f, stage = utils.UNDEFINED, workers = 1, maxsize = 0, on_start = None, on_done = None): """ Creates a stage that maps a function `f` over the data, however unlike `pypeln.process.map` in this case `f` returns an iterable. As its name implies, `flat_map` will flatten out these iterables so the resulting stage just contains their elements. from pypeln import process as pr import time from random import random def slow_integer_pair(x): time.sleep(random()) # <= some slow computation if x == 0: yield x else: yield x yield -x data = range(10) # [0, 1, 2, ..., 9] stage = pr.flat_map(slow_integer_pair, data, workers = 3, maxsize = 4) list(stage) # e.g. [2, -2, 3, -3, 0, 1, -1, 6, -6, 4, -4, ...] Note that because of concurrency order is not guaranteed. Also, `flat_map` is more general than both `pypeln.process.map` and `pypeln.process.filter`, as such these expressions are equivalent: from pypeln import process as pr pr.map(f, stage) = pr.flat_map(lambda x: [f(x)], stage) pr.filter(f, stage) = pr.flat_map(lambda x: [x] if f(x) else [], stage) Using `flat_map` with a generator function is very useful as we are able to filter out unwanted elements when e.g. there are exceptions, missing data, etc. # **Args** * **`f`** : a function with signature `f(x, *args) -> [y]`, where `args` is the return of `on_start` if present, else the signature is just `f(x) -> [y]`. * **`stage = Undefined`** : a stage or iterable. * **`workers = 1`** : the number of workers the stage should contain. * **`maxsize = 0`** : the maximum number of objects the stage can hold simultaneously, if set to `0` (default) then the stage can grow unbounded. * **`on_start = None`** : a function with signature `on_start() -> args`, where `args` can be any object different than `None` or a tuple of objects. The returned `args` are passed to `f` and `on_done`. This function is executed once per worker at the beggining. * **`on_done = None`** : a function with signature `on_done(stage_status, *args)`, where `args` is the return of `on_start` if present, else the signature is just `on_done(stage_status)`, and `stage_status` is of type `pypeln.process.StageStatus`. 
This function is executed once per worker when the worker is done. # **Returns** * If the `stage` parameters is given then this function returns a new stage, else it returns a `Partial`. """ if utils.is_undefined(stage): return utils.Partial(lambda stage: flat_map(f, stage, workers=workers, maxsize=maxsize, on_start=on_start, on_done=on_done)) stage = _to_stage(stage) if isinstance(f, types.GeneratorType): _f = f f = lambda *args, **kwargs: _f(*args, **kwargs) return _Stage( worker_constructor = WORKER, workers = workers, maxsize = maxsize, on_start = on_start, on_done = on_done, target = _flat_map, args = (f,), dependencies = [stage], ) ########### # filter ########### def _filter(f, params): @_handle_exceptions(params) def f_task(x, args): if f(x, *args): params.output_queues.put(x) _run_task(f_task, params) def filter(f, stage = utils.UNDEFINED, workers = 1, maxsize = 0, on_start = None, on_done = None): """ Creates a stage that filter the data given a predicate function `f`. It is intended to behave like python's built-in `filter` function but with the added concurrency. from pypeln import process as pr import time from random import random def slow_gt3(x): time.sleep(random()) # <= some slow computation return x > 3 data = range(10) # [0, 1, 2, ..., 9] stage = pr.filter(slow_gt3, data, workers = 3, maxsize = 4) data = list(stage) # e.g. [5, 6, 3, 4, 7, 8, 9] Note that because of concurrency order is not guaranteed. # **Args** * **`f`** : a function with signature `f(x, *args) -> bool`, where `args` is the return of `on_start` if present, else the signature is just `f(x)`. * **`stage = Undefined`** : a stage or iterable. * **`workers = 1`** : the number of workers the stage should contain. * **`maxsize = 0`** : the maximum number of objects the stage can hold simultaneously, if set to `0` (default) then the stage can grow unbounded. * **`on_start = None`** : a function with signature `on_start() -> args`, where `args` can be any object different than `None` or a tuple of objects. The returned `args` are passed to `f` and `on_done`. This function is executed once per worker at the beggining. * **`on_done = None`** : a function with signature `on_done(stage_status, *args)`, where `args` is the return of `on_start` if present, else the signature is just `on_done(stage_status)`, and `stage_status` is of type `pypeln.process.StageStatus`. This function is executed once per worker when the worker is done. # **Returns** * If the `stage` parameters is given then this function returns a new stage, else it returns a `Partial`. """ if utils.is_undefined(stage): return utils.Partial(lambda stage: filter(f, stage, workers=workers, maxsize=maxsize, on_start=on_start, on_done=on_done)) stage = _to_stage(stage) return _Stage( worker_constructor = WORKER, workers = workers, maxsize = maxsize, on_start = on_start, on_done = on_done, target = _filter, args = (f,), dependencies = [stage], ) ########### # each ########### def _each(f, params): @_handle_exceptions(params) def f_task(x, args): f(x, *args) _run_task(f_task, params) def each(f, stage = utils.UNDEFINED, workers = 1, maxsize = 0, on_start = None, on_done = None, run = False): """ Creates a stage that runs the function `f` for each element in the data but the stage itself yields no elements. Its useful for sink stages that perform certain actions such as writting to disk, saving to a database, etc, and dont produce any results. 
For example: from pypeln import process as pr def process_image(image_path): image = load_image(image_path) image = transform_image(image) save_image(image_path, image) files_paths = get_file_paths() stage = pr.each(process_image, file_paths, workers = 4) pr.run(stage) or alternatively files_paths = get_file_paths() pr.each(process_image, file_paths, workers = 4, run = True) Note that because of concurrency order is not guaranteed. # **Args** * **`f`** : a function with signature `f(x, *args) -> None`, where `args` is the return of `on_start` if present, else the signature is just `f(x)`. * **`stage = Undefined`** : a stage or iterable. * **`workers = 1`** : the number of workers the stage should contain. * **`maxsize = 0`** : the maximum number of objects the stage can hold simultaneously, if set to `0` (default) then the stage can grow unbounded. * **`on_start = None`** : a function with signature `on_start() -> args`, where `args` can be any object different than `None` or a tuple of objects. The returned `args` are passed to `f` and `on_done`. This function is executed once per worker at the beggining. * **`on_done = None`** : a function with signature `on_done(stage_status, *args)`, where `args` is the return of `on_start` if present, else the signature is just `on_done(stage_status)`, and `stage_status` is of type `pypeln.process.StageStatus`. This function is executed once per worker when the worker is done. * **`run = False`** : specify whether to run the stage immediately. # **Returns** * If the `stage` parameters is not given then this function returns a `Partial`, else if `return = False` (default) it return a new stage, if `run = True` then it runs the stage and returns `None`. """ if utils.is_undefined(stage): return utils.Partial(lambda stage: each(f, stage, workers=workers, maxsize=maxsize, on_start=on_start, on_done=on_done)) stage = _to_stage(stage) stage = _Stage( worker_constructor = WORKER, workers = workers, maxsize = maxsize, on_start = on_start, on_done = on_done, target = _each, args = (f,), dependencies = [stage], ) if not run: return stage for _ in stage: pass ########### # concat ########### def _concat(params): def f_task(x, args): params.output_queues.put(x) _run_task(f_task, params) def concat(stages, maxsize = 0): """ Concatenates / merges many stages into a single one by appending elements from each stage as they come, order is not preserved. from pypeln import process as pr stage_1 = [1, 2, 3] stage_2 = [4, 5, 6, 7] stage_3 = pr.concat([stage_1, stage_2]) # e.g. [1, 4, 5, 2, 6, 3, 7] # **Args** * **`stages`** : a list of stages or iterables. * **`maxsize = 0`** : the maximum number of objects the stage can hold simultaneously, if set to `0` (default) then the stage can grow unbounded. # **Returns** * A stage object. """ stages = [ _to_stage(s) for s in stages ] return _Stage( worker_constructor = WORKER, workers = 1, maxsize = maxsize, on_start = None, on_done = None, target = _concat, args = tuple(), dependencies = stages, ) ################ # run ################ def run(stages, maxsize = 0): """ Iterates over one or more stages until their iterators run out of elements. from pypeln import process as pr data = get_data() stage = pr.each(slow_fn, data, workers = 6) # execute pipeline pr.run(stage) # **Args** * **`stages`** : a stage/iterable or list of stages/iterables to be iterated over. If a list is passed, stages are first merged using `pypeln.process.concat` before iterating. 
* **`maxsize = 0`** : the maximum number of objects the stage can hold simultaneously, if set to `0` (default) then the stage can grow unbounded. # **Returns** * `None` """ if isinstance(stages, list) and len(stages) == 0: raise ValueError("Expected at least 1 stage to run") elif isinstance(stages, list): stage = concat(stages, maxsize = maxsize) else: stage = stages stage = to_iterable(stage, maxsize = maxsize) for _ in stages: pass ################ # _to_stage ################ def _to_stage(obj): if isinstance(obj, _Stage): return obj elif hasattr(obj, "__iter__"): return from_iterable(obj) else: raise ValueError("Object {obj} is not iterable".format(obj = obj)) ################ # from_iterable ################ def _from_iterable(iterable, params): def f_task(args): for x in iterable: params.output_queues.put(x) _run_task(f_task, params) def from_iterable(iterable = utils.UNDEFINED, maxsize = None, worker_constructor = Thread): """ Creates a stage from an iterable. This function gives you more control of how a stage is created through the `worker_constructor` parameter which can be either: * `threading.Thread`: (default) is efficient for iterables that already have the data in memory like lists or numpy arrays because threads can share memory so no serialization is needed. * `multiprocessing.Process`: is efficient for iterables who's data is not in memory like arbitrary generators and benefit from escaping the GIL. This is inefficient for iterables which have data in memory because they have to be serialized when sent to the background process. All functions that accept stages or iterables use this function when an iterable is passed to convert it into a stage using the default arguments. # **Args** * **`iterable`** : a source iterable. * **`maxsize = None`** : this parameter is not used and only kept for API compatibility with the other modules. * **`worker_constructor = threading.Thread`** : defines the worker type for the producer stage. # **Returns** * If the `iterable` parameters is given then this function returns a new stage, else it returns a `Partial`. 
""" if utils.is_undefined(iterable): return utils.Partial(lambda iterable: from_iterable(iterable, maxsize=maxsize, worker_constructor=worker_constructor)) return _Stage( worker_constructor = worker_constructor, workers = 1, maxsize = None, on_start = None, on_done = None, target = _from_iterable, args = (iterable,), dependencies = [], ) ############## # to_iterable ############## def _build_queues(stage, stage_input_queue, stage_output_queues, visited, pipeline_namespace): if stage in visited: return stage_input_queue, stage_output_queues else: visited.add(stage) if len(stage.dependencies) > 0: total_done = sum([ s.workers for s in stage.dependencies ]) input_queue = _InputQueue(stage.maxsize, total_done, pipeline_namespace) stage_input_queue[stage] = input_queue for _stage in stage.dependencies: if _stage not in stage_output_queues: stage_output_queues[_stage] = _OutputQueues([input_queue]) else: stage_output_queues[_stage].append(input_queue) stage_input_queue, stage_output_queues = _build_queues( _stage, stage_input_queue, stage_output_queues, visited, pipeline_namespace = pipeline_namespace, ) return stage_input_queue, stage_output_queues def _create_worker(f, args, output_queues, input_queue): kwargs = dict( output_queues = output_queues) if input_queue is not None: kwargs.update(input_queue = input_queue) return WORKER(target = f, args = args, kwargs = kwargs) def _to_iterable(stage, maxsize): pipeline_namespace = _get_namespace() pipeline_namespace.error = False pipeline_error_queue = Queue() input_queue = _InputQueue(maxsize, stage.workers, pipeline_namespace) stage_input_queue, stage_output_queues = _build_queues( stage = stage, stage_input_queue = dict(), stage_output_queues = dict(), visited = set(), pipeline_namespace = pipeline_namespace, ) stage_output_queues[stage] = _OutputQueues([ input_queue ]) processes = [] for _stage in stage_output_queues: if _stage.on_done is not None: stage_lock = Lock() stage_namespace = _get_namespace() stage_namespace.active_workers = _stage.workers else: stage_lock = None stage_namespace = None for index in range(_stage.workers): stage_params = _StageParams( output_queues = stage_output_queues[_stage], input_queue = stage_input_queue.get(_stage, None), on_start = _stage.on_start, on_done = _stage.on_done, stage_lock = stage_lock, stage_namespace = stage_namespace, pipeline_namespace = pipeline_namespace, pipeline_error_queue = pipeline_error_queue, index = index, ) process = _stage.worker_constructor( target = _stage.target, args = _stage.args + (stage_params,) ) processes.append(process) for p in processes: p.daemon = True p.start() try: for x in input_queue: yield x if pipeline_namespace.error: error_class, _, trace = pipeline_error_queue.get() raise error_class("\n\nOriginal {trace}".format(trace = trace)) for p in processes: p.join() except: for q in stage_input_queue.values(): q.done() raise def to_iterable(stage = utils.UNDEFINED, maxsize = 0): """ Creates an iterable from a stage. This function is used by the stage's `__iter__` method with the default arguments. # **Args** * **`stage`** : a stage object. * **`maxsize = 0`** : the maximum number of objects the stage can hold simultaneously, if set to `0` (default) then the stage can grow unbounded. # **Returns** * If the `stage` parameters is given then this function returns an iterable, else it returns a `Partial`. """ if utils.is_undefined(stage): return utils.Partial(lambda stage: _to_iterable(stage, maxsize)) else: return _to_iterable(stage, maxsize)
Functions
def concat(stages, maxsize=0)
Concatenates / merges many stages into a single one by appending elements from each stage as they come; order is not preserved.
```python
from pypeln import process as pr

stage_1 = [1, 2, 3]
stage_2 = [4, 5, 6, 7]

stage_3 = pr.concat([stage_1, stage_2]) # e.g. [1, 4, 5, 2, 6, 3, 7]
```
Args
- `stages` : a list of stages or iterables.
- `maxsize = 0` : the maximum number of objects the stage can hold simultaneously; if set to `0` (default) the stage can grow unbounded.
Returns
- A stage object.
def each(f, stage=Undefined, workers=1, maxsize=0, on_start=None, on_done=None, run=False)
Creates a stage that runs the function `f` for each element in the data, but the stage itself yields no elements. It's useful for sink stages that perform certain actions such as writing to disk, saving to a database, etc., and don't produce any results. For example:
```python
from pypeln import process as pr

def process_image(image_path):
    image = load_image(image_path)
    image = transform_image(image)
    save_image(image_path, image)

file_paths = get_file_paths()

stage = pr.each(process_image, file_paths, workers = 4)
pr.run(stage)
```
or alternatively
```python
file_paths = get_file_paths()
pr.each(process_image, file_paths, workers = 4, run = True)
```
Note that because of concurrency order is not guaranteed.
Args
- `f` : a function with signature `f(x, *args) -> None`, where `args` is the return of `on_start` if present, else the signature is just `f(x)`.
- `stage = Undefined` : a stage or iterable.
- `workers = 1` : the number of workers the stage should contain.
- `maxsize = 0` : the maximum number of objects the stage can hold simultaneously; if set to `0` (default) the stage can grow unbounded.
- `on_start = None` : a function with signature `on_start() -> args`, where `args` can be any object other than `None`, or a tuple of objects. The returned `args` are passed to `f` and `on_done`. This function is executed once per worker at the beginning.
- `on_done = None` : a function with signature `on_done(stage_status, *args)`, where `args` is the return of `on_start` if present, else the signature is just `on_done(stage_status)`, and `stage_status` is of type `StageStatus`. This function is executed once per worker when the worker is done.
- `run = False` : specifies whether to run the stage immediately.
Returns
- If the `stage` parameter is not given then this function returns a `Partial`; otherwise, if `run = False` (default) it returns a new stage, and if `run = True` it runs the stage and returns `None`.
def filter(f, stage=Undefined, workers=1, maxsize=0, on_start=None, on_done=None)
Creates a stage that filters the data given a predicate function `f`. It is intended to behave like Python's built-in `filter` function but with added concurrency.
```python
from pypeln import process as pr
import time
from random import random

def slow_gt3(x):
    time.sleep(random()) # <= some slow computation
    return x > 3

data = range(10) # [0, 1, 2, ..., 9]

stage = pr.filter(slow_gt3, data, workers = 3, maxsize = 4)

data = list(stage) # e.g. [5, 6, 3, 4, 7, 8, 9]
```
Note that because of concurrency order is not guaranteed.
Args
- `f` : a function with signature `f(x, *args) -> bool`, where `args` is the return of `on_start` if present, else the signature is just `f(x)`.
- `stage = Undefined` : a stage or iterable.
- `workers = 1` : the number of workers the stage should contain.
- `maxsize = 0` : the maximum number of objects the stage can hold simultaneously; if set to `0` (default) the stage can grow unbounded.
- `on_start = None` : a function with signature `on_start() -> args`, where `args` can be any object other than `None`, or a tuple of objects. The returned `args` are passed to `f` and `on_done`. This function is executed once per worker at the beginning.
- `on_done = None` : a function with signature `on_done(stage_status, *args)`, where `args` is the return of `on_start` if present, else the signature is just `on_done(stage_status)`, and `stage_status` is of type `StageStatus`. This function is executed once per worker when the worker is done.
Returns
- If the `stage` parameter is given then this function returns a new stage, else it returns a `Partial`.
def flat_map(f, stage=Undefined, workers=1, maxsize=0, on_start=None, on_done=None)
Creates a stage that maps a function `f` over the data; however, unlike `map`, in this case `f` returns an iterable. As its name implies, `flat_map` will flatten out these iterables so the resulting stage just contains their elements.
```python
from pypeln import process as pr
import time
from random import random

def slow_integer_pair(x):
    time.sleep(random()) # <= some slow computation
    if x == 0:
        yield x
    else:
        yield x
        yield -x

data = range(10) # [0, 1, 2, ..., 9]

stage = pr.flat_map(slow_integer_pair, data, workers = 3, maxsize = 4)

list(stage) # e.g. [2, -2, 3, -3, 0, 1, -1, 6, -6, 4, -4, ...]
```
Note that because of concurrency order is not guaranteed. Also, `flat_map` is more general than both `map` and `filter`; as such, these expressions are equivalent:
```
from pypeln import process as pr

pr.map(f, stage) = pr.flat_map(lambda x: [f(x)], stage)
pr.filter(f, stage) = pr.flat_map(lambda x: [x] if f(x) else [], stage)
```
Using `flat_map` with a generator function is very useful as we are able to filter out unwanted elements when e.g. there are exceptions, missing data, etc.
Args
- `f` : a function with signature `f(x, *args) -> [y]`, where `args` is the return of `on_start` if present, else the signature is just `f(x) -> [y]`.
- `stage = Undefined` : a stage or iterable.
- `workers = 1` : the number of workers the stage should contain.
- `maxsize = 0` : the maximum number of objects the stage can hold simultaneously; if set to `0` (default) the stage can grow unbounded.
- `on_start = None` : a function with signature `on_start() -> args`, where `args` can be any object other than `None`, or a tuple of objects. The returned `args` are passed to `f` and `on_done`. This function is executed once per worker at the beginning.
- `on_done = None` : a function with signature `on_done(stage_status, *args)`, where `args` is the return of `on_start` if present, else the signature is just `on_done(stage_status)`, and `stage_status` is of type `StageStatus`. This function is executed once per worker when the worker is done.
Returns
- If the `stage` parameter is given then this function returns a new stage, else it returns a `Partial`.
def from_iterable(iterable=Undefined, maxsize=None, worker_constructor=threading.Thread)
Creates a stage from an iterable. This function gives you more control over how a stage is created through the `worker_constructor` parameter, which can be either:
- `threading.Thread` : (default) efficient for iterables that already have their data in memory, like lists or numpy arrays, because threads can share memory so no serialization is needed.
- `multiprocessing.Process` : efficient for iterables whose data is not in memory, like arbitrary generators, and which benefit from escaping the GIL. This is inefficient for iterables that have their data in memory because it has to be serialized when sent to the background process.
All functions that accept stages or iterables use this function when an iterable is passed to convert it into a stage using the default arguments.
Args
- `iterable` : a source iterable.
- `maxsize = None` : this parameter is not used and is only kept for API compatibility with the other modules.
- `worker_constructor = threading.Thread` : defines the worker type for the producer stage.
Returns
- If the `iterable` parameter is given then this function returns a new stage, else it returns a `Partial`.
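For instance, a sketch following the note above about generators (the `read_records` generator is hypothetical):

```python
from multiprocessing import Process
from pypeln import process as pr

def read_records():
    # hypothetical generator whose data is not already in memory
    for i in range(1000000):
        yield i

stage = pr.from_iterable(read_records(), worker_constructor = Process)
stage = pr.map(slow_add1, stage, workers = 3) # continue the pipeline as usual

data = list(stage)
```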
def map(f, stage=Undefined, workers=1, maxsize=0, on_start=None, on_done=None)
Creates a stage that maps a function `f` over the data. It's intended to behave like Python's built-in `map` function but with added concurrency.
```python
from pypeln import process as pr
import time
from random import random

def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

data = range(10) # [0, 1, 2, ..., 9]

stage = pr.map(slow_add1, data, workers = 3, maxsize = 4)

data = list(stage) # e.g. [2, 1, 5, 6, 3, 4, 7, 8, 9, 10]
```
Note that because of concurrency order is not guaranteed.
Args
- `f` : a function with signature `f(x, *args) -> y`, where `args` is the return of `on_start` if present, else the signature is just `f(x) -> y`.
- `stage = Undefined` : a stage or iterable.
- `workers = 1` : the number of workers the stage should contain.
- `maxsize = 0` : the maximum number of objects the stage can hold simultaneously; if set to `0` (default) the stage can grow unbounded.
- `on_start = None` : a function with signature `on_start() -> args`, where `args` can be any object other than `None`, or a tuple of objects. The returned `args` are passed to `f` and `on_done`. This function is executed once per worker at the beginning.
- `on_done = None` : a function with signature `on_done(stage_status, *args)`, where `args` is the return of `on_start` if present, else the signature is just `on_done(stage_status)`, and `stage_status` is of type `StageStatus`. This function is executed once per worker when the worker is done.
Returns
- If the `stage` parameter is given then this function returns a new stage, else it returns a `Partial`.
def run(stages, maxsize=0)
Iterates over one or more stages until their iterators run out of elements.
```python
from pypeln import process as pr

data = get_data()
stage = pr.each(slow_fn, data, workers = 6)

# execute pipeline
pr.run(stage)
```
Args
- `stages` : a stage/iterable or list of stages/iterables to be iterated over. If a list is passed, stages are first merged using `concat` before iterating.
- `maxsize = 0` : the maximum number of objects the stage can hold simultaneously; if set to `0` (default) the stage can grow unbounded.
Returns
- `None`
def to_iterable(stage=Undefined, maxsize=0)
Creates an iterable from a stage. This function is used by the stage's `__iter__` method with the default arguments.
Args
- `stage` : a stage object.
- `maxsize = 0` : the maximum number of objects the stage can hold simultaneously; if set to `0` (default) the stage can grow unbounded.
Returns
- If the `stage` parameter is given then this function returns an iterable, else it returns a `Partial`.
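A short sketch: since `to_iterable` also returns a `Partial` when `stage` is omitted, it composes with the pipe operator and lets you set an explicit output `maxsize`:

```python
from pypeln import process as pr

data = list(
    range(10)
    | pr.map(slow_add1, workers = 3, maxsize = 4)
    | pr.to_iterable(maxsize = 4)
)
```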
Classes
class StageStatus
Object passed to various `on_done` callbacks. It contains information about the stage in case bookkeeping is needed.
Ancestors (in MRO)
- StageStatus
- builtins.object
Static methods
def __init__(self, namespace, lock)
Initialize self. See help(type(self)) for accurate signature.
Instance variables
var active_workers
`int` : Number of active workers.
var done
`bool` : `True` if all workers finished.
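A sketch of using these fields from an `on_done` callback for simple bookkeeping (the printout and the `get_http_session` helper are illustrative):

```python
from pypeln import process as pr

def on_done(stage_status, http_session):
    http_session.close()
    # StageStatus properties can be used for bookkeeping
    print("done:", stage_status.done, "active workers:", stage_status.active_workers)

stage = pr.map(f, data, workers = 3, on_start = get_http_session, on_done = on_done)
```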
class WorkerInfo
WorkerInfo(index)
Ancestors (in MRO)
- WorkerInfo
- builtins.tuple
- builtins.object
Instance variables
var index
Alias for field number 0
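Based on the module's internals, `on_start` may optionally accept a single argument, in which case it is called with a `WorkerInfo` whose `index` identifies the worker within its stage; a hedged sketch:

```python
from pypeln import process as pr

def on_start(worker_info):
    # worker_info.index is the worker's position within the stage (0, 1, ...)
    log_file = open("worker_{}.log".format(worker_info.index), "w")
    return log_file

def on_done(stage_status, log_file):
    log_file.close()

stage = pr.map(f, data, workers = 3, on_start = on_start, on_done = on_done)
```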