You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Linar Savion (Jira)" <ji...@apache.org> on 2020/09/10 11:57:00 UTC

[jira] [Created] (SPARK-32846) Support createDataFrame from an RDD of pd.DataFrames

Linar Savion created SPARK-32846:
------------------------------------

             Summary: Support createDataFrame from an RDD of pd.DataFrames
                 Key: SPARK-32846
                 URL: https://issues.apache.org/jira/browse/SPARK-32846
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.0.1
            Reporter: Linar Savion


Add support to createDataFrame from a distributed collection of pandas.DataFrames by converting the RDD of pd.DFs to an RDD of arrow records batches, then directly creating the spark DataFrame from it.

 

Performance is significantly better (vectorized) than creating a spark DF by converting each df to a list of rows, similar to the improvement of SPARK-20791.

 

Initial example & benchmark for older spark versions: [https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5|https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5,]

 

I'm currently working on a PR and will post it soon.

 

Extends the work done in: 

https://issues.apache.org/jira/browse/SPARK-20791 

https://issues.apache.org/jira/browse/SPARK-23030 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org