You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Linar Savion (Jira)" <ji...@apache.org> on 2020/09/10 11:57:00 UTC
[jira] [Created] (SPARK-32846) Support createDataFrame from an RDD
of pd.DataFrames
Linar Savion created SPARK-32846:
------------------------------------
Summary: Support createDataFrame from an RDD of pd.DataFrames
Key: SPARK-32846
URL: https://issues.apache.org/jira/browse/SPARK-32846
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 3.0.1
Reporter: Linar Savion
Add support to createDataFrame from a distributed collection of pandas.DataFrames by converting the RDD of pd.DFs to an RDD of arrow records batches, then directly creating the spark DataFrame from it.
Performance is significantly better (vectorized) than creating a spark DF by converting each df to a list of rows, similar to the improvement of SPARK-20791.
Initial example & benchmark for older spark versions: [https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5|https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5,]
I'm currently working on a PR and will post it soon.
Extends the work done in:
https://issues.apache.org/jira/browse/SPARK-20791
https://issues.apache.org/jira/browse/SPARK-23030
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org