kedro.contrib.decorators.pyspark.pandas_to_spark
kedro.contrib.decorators.pyspark.pandas_to_spark(spark)

Inspects the decorated function's inputs and converts all pandas DataFrame inputs to Spark DataFrames.
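As an illustration of the behaviour described above, a minimal sketch of what such a decorator could look like is shown below. This is an assumption-laden sketch, not Kedro's actual implementation: it simply wraps the function and converts any pandas DataFrame arguments via spark.createDataFrame.

```python
import functools

import pandas as pd


def pandas_to_spark(spark):
    """Illustrative sketch: convert pandas DataFrame arguments to Spark DataFrames."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Convert any positional pandas DataFrame inputs.
            args = [
                spark.createDataFrame(a) if isinstance(a, pd.DataFrame) else a
                for a in args
            ]
            # Convert any keyword pandas DataFrame inputs.
            kwargs = {
                k: spark.createDataFrame(v) if isinstance(v, pd.DataFrame) else v
                for k, v in kwargs.items()
            }
            return func(*args, **kwargs)
        return wrapper
    return decorator
```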
Note that in the example below we have enabled spark.sql.execution.arrow.enabled. For this to work, you should first pip install pyarrow and add pyarrow to requirements.txt. Enabling this option makes the conversion between pandas and PySpark DataFrames much faster.

Parameters: spark (SparkSession) – The Spark session singleton object to use for the creation of the PySpark DataFrames. A possible pattern you can use here is the following:
spark.py

```python
from pyspark.sql import SparkSession


def get_spark():
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("kedro")
        .config("spark.driver.memory", "4g")
        .config("spark.driver.maxResultSize", "3g")
        .config("spark.sql.execution.arrow.enabled", "true")
        .getOrCreate()
    )
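```

Because SparkSession.builder.getOrCreate() returns the already-running session when one exists, get_spark() behaves as a safe singleton no matter how many nodes call it.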
nodes.py

```python
from kedro.contrib.decorators.pyspark import pandas_to_spark
from spark import get_spark


@pandas_to_spark(get_spark())
def node_1(data):
    data.show()  # data is a pyspark.sql.DataFrame
```
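For example, the decorated node can then be called directly with a pandas DataFrame (a hypothetical call, for illustration only):

```python
import pandas as pd

from nodes import node_1

df = pd.DataFrame({"id": [1, 2, 3]})
node_1(df)  # the decorator converts df to a Spark DataFrame before node_1 runs
```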
Return type: Callable

Returns: The original function with any pandas DataFrame inputs translated to Spark DataFrames.