Posted to user@spark.apache.org by Debabrata Ghosh <ma...@gmail.com> on 2018/03/19 05:54:18 UTC

Calling Pyspark functions in parallel

Hi,
             My dataframe has 2000 rows. Processing each row takes
about 3 seconds, so a sequential run takes 2000 * 3 = 6000 seconds,
which is far too long.

              Further, I am contemplating running the function in
parallel. For example, I would like to divide the rows of my dataframe
into 4 sets of 500 rows each and call my PySpark function on each set
in parallel. I wanted to know if there is any library or PySpark
function which I can leverage to do this execution in parallel. A
rough sketch of what I have in mind is below.
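
Just to illustrate the idea (process_row below is only a placeholder
for my actual per-row function, not working code from my job):

    def process_partition(rows):
        # Each Spark task runs this over its own slice of rows, so
        # the 4 partitions can be processed concurrently.
        for row in rows:
            yield process_row(row)  # ~3 seconds of work per row

    result = (df.repartition(4)  # 4 partitions of ~500 rows each
                .rdd
                .mapPartitions(process_partition)
                .collect())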

               I would really appreciate your feedback at your
earliest convenience. Thanks,

Debu

Re: Calling Pyspark functions in parallel

Posted by Debabrata Ghosh <ma...@gmail.com>.
Thanks, Jules! Appreciate it a lot indeed!


Re: Calling Pyspark functions in parallel

Posted by Jules Damji <dm...@comcast.net>.
What’s your PySpark function? Is it a UDF? If so, consider using a pandas UDF, introduced in Spark 2.3.

More info here: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
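
To give a concrete flavour, here is a minimal sketch of a scalar
pandas UDF (the column name "value" and the doubling logic are just
placeholders for whatever your per-row computation actually does):

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("double", PandasUDFType.SCALAR)
    def my_func(v):
        # v arrives as a whole pandas Series per Arrow batch, so the
        # Python overhead is paid once per batch rather than per row.
        return v * 2.0  # placeholder: replace with your real logic

    df = df.withColumn("result", my_func(df["value"]))

Because the function runs over batches of rows at once, this is
usually a much bigger win than hand-splitting the dataframe.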


Sent from my iPhone
Pardon the dumb thumb typos :)
