Posted to dev@spark.apache.org by Linar Savion <li...@jether-energy.com> on 2018/07/08 11:13:01 UTC

[SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

We've created a snippet that creates a Spark DF from an RDD of many pandas
DFs in a distributed manner that does not require the driver to collect the
entire dataset.

Early tests show a performance improvement of 6x-10x over using
pandasDF->Rows->sparkDF.

I've seen that there are some open pull requests that change the way Arrow
serialization works. Should I open a pull request to add this functionality
to SparkSession? (`createFromPandasDataframesRDD`)

https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

Thanks,
Linar

Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

Posted by Reynold Xin <rx...@databricks.com>.
Yes I would just reuse the same function.

On Sun, Jul 8, 2018 at 5:01 AM Li Jin <ic...@gmail.com> wrote:

> Hi Linar,
>
> This seems useful. But perhaps reusing the same function name is better?
>
>
> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
>
> Currently createDataFrame takes an RDD of any kind of SQL data
> representation (e.g. row, tuple, int, boolean, etc.), or list, or
> pandas.DataFrame.
>
> Perhaps we can support taking an RDD of *pandas.DataFrame* as the "data"
> arg too?
>
> What do other people think?
>
> Li
>
> On Sun, Jul 8, 2018 at 1:13 PM, Linar Savion <li...@jether-energy.com>
> wrote:
>
>> We've created a snippet that creates a Spark DF from an RDD of many pandas
>> DFs in a distributed manner that does not require the driver to collect the
>> entire dataset.
>>
>> Early tests show a performance improvement of 6x-10x over using
>> pandasDF->Rows->sparkDF.
>>
>> I've seen that there are some open pull requests that change the way
>> Arrow serialization works. Should I open a pull request to add this
>> functionality to SparkSession? (`createFromPandasDataframesRDD`)
>>
>> https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
>>
>> Thanks,
>> Linar
>>
>
>

Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

Posted by Li Jin <ic...@gmail.com>.
Hi Linar,

This seems useful. But perhaps reusing the same function name is better?

http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

Currently createDataFrame takes an RDD of any kind of SQL data
representation (e.g. row, tuple, int, boolean, etc.), or list, or
pandas.DataFrame.

Perhaps we can support taking an RDD of *pandas.DataFrame* as the "data"
arg too?

What do other people think?

Li

On Sun, Jul 8, 2018 at 1:13 PM, Linar Savion <li...@jether-energy.com>
wrote:

> We've created a snippet that creates a Spark DF from an RDD of many pandas
> DFs in a distributed manner that does not require the driver to collect the
> entire dataset.
>
> Early tests show a performance improvement of 6x-10x over using
> pandasDF->Rows->sparkDF.
>
> I've seen that there are some open pull requests that change the way Arrow
> serialization works. Should I open a pull request to add this functionality
> to SparkSession? (`createFromPandasDataframesRDD`)
>
> https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
>
> Thanks,
> Linar
>
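[Editor's note: reusing the existing name, as suggested above, would mean createDataFrame dispatching on the element type of the incoming RDD. A hypothetical sketch of that dispatch follows; the function name and the list-of-partitions stand-in for an RDD are assumptions for illustration, not Spark's implementation:]

```python
import pandas as pd

def dispatch_create_dataframe(partitions):
    """Hypothetical routing logic mirroring what an extended
    SparkSession.createDataFrame might do: if the RDD (modelled here as
    a plain list of partition elements) holds pandas DataFrames, take
    the Arrow per-partition path; otherwise fall back to the existing
    row-based path used for rows, tuples, ints, etc.
    """
    first = partitions[0] if partitions else None
    if isinstance(first, pd.DataFrame):
        return "arrow-path"  # would convert each pandas DF to Arrow batches
    return "row-path"        # existing createDataFrame behaviour

pandas_route = dispatch_create_dataframe([pd.DataFrame({"a": [1]})])
row_route = dispatch_create_dataframe([(1, "x"), (2, "y")])
```

Keeping a single entry point this way preserves the documented createDataFrame signature while adding the new input type, which is presumably why reusing the name was preferred over a separate `createFromPandasDataframesRDD`.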