Posted to user@spark.apache.org by Bitfox <bi...@bitfox.top> on 2022/01/30 10:10:20 UTC

why the pyspark RDD API is so slow?

Hello list,

I did a comparison of pyspark RDD, scala RDD, pyspark dataframe and a pure
scala program. The results show that the pyspark RDD is far too slow.

For the operations and dataset please see:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

The result table is below.
Can you give suggestions on how to optimize the RDD operations?

Thanks a lot.


program             time
-----------------   -----
scala program       49s
pyspark dataframe   56s
scala RDD           1m31s
pyspark RDD         7m15s

Re: why the pyspark RDD API is so slow?

Posted by Sebastian Piu <se...@gmail.com>.
When you operate on a dataframe from the Python side, you are just invoking
methods in the JVM via a proxy (py4j), so it is almost as fast as coding in
Java itself. This holds as long as you don't define any UDFs or any other
code that needs to invoke Python for processing.

Check the High Performance Spark book, the PySpark chapter, for a good
explanation of what's going on.
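
To illustrate (a minimal sketch only — the file name and column here are
made up, not taken from the benchmark): the first pipeline below is planned
and executed entirely in the JVM, while the UDF variant ships every row out
to a Python worker and back.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.text("words.txt")  # hypothetical input

    # Built-in functions: Python only drives the JVM through the py4j
    # proxy; no row-level data crosses the language boundary.
    jvm_side = df.select(F.length("value").alias("len")).groupBy("len").count()

    # A Python UDF forces each value to be serialised to a Python
    # worker for processing, which is the overhead described above.
    py_len = F.udf(lambda s: len(s), "long")
    py_side = df.select(py_len("value").alias("len")).groupBy("len").count()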

Re: why the pyspark RDD API is so slow?

Posted by Bitfox <bi...@bitfox.top>.
Hi

In PySpark, RDDs need to be serialised/deserialised, but DataFrames don't? Why?

Thanks

Re: why the pyspark RDD API is so slow?

Posted by Khalid Mammadov <kh...@gmail.com>.
Your Scala program does not use any Spark API, hence it is faster than the
others. If you wrote the same code in pure Python, I think it would be even
faster than the Scala program, especially taking into account that these
two programs run on a single VM.

Regarding DataFrame vs RDD, I would suggest using DataFrames anyway, since
that has been the recommended approach since Spark 2.0.
The RDD API in PySpark is slow because, as others have said, the data needs
to be serialised/deserialised between the JVM and the Python workers.

One general note: Spark is written in Scala and its core runs on the JVM;
the Python API is a wrapper around the Scala API, and most PySpark calls
are delegated to Scala/JVM for execution. Hence most big data
transformation tasks will complete in almost the same time in Scala and
Python, since they use the same engine under the hood. That is also why the
APIs are very similar and code is written in the same fashion.
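
For illustration, here is roughly how a word count can move from the RDD
API to DataFrame built-ins. This is a sketch only — the actual code is on
the blog linked above, and the file name here is assumed:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # RDD version: every lambda below runs in Python worker processes,
    # so each record is pickled across the JVM/Python boundary.
    rdd_counts = (spark.sparkContext.textFile("words.txt")
                  .flatMap(lambda line: line.split())
                  .map(lambda w: (w, 1))
                  .reduceByKey(lambda a, b: a + b))

    # DataFrame version: split/explode/groupBy are JVM built-ins, so
    # the per-row Python round trip disappears.
    df_counts = (spark.read.text("words.txt")
                 .select(F.explode(F.split("value", r"\s+")).alias("word"))
                 .groupBy("word")
                 .count())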


RE: why the pyspark RDD API is so slow?

Posted by Theodore J Griesenbrock <te...@ibm.com>.
Is there a particular code sample you can suggest that demonstrates your tips?

Re: why the pyspark RDD API is so slow?

Posted by Sebastian Piu <se...@gmail.com>.
It's because all data needs to be pickled back and forth between the JVM
and a spawned Python worker, so there is additional overhead compared with
staying fully in Scala.

Your Python code might make this worse too, for example if it does not
yield from operations.

You can look at using pandas UDFs with Arrow, or try to stay on DataFrame
operations as much as possible.
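
For example, a rough sketch of the pandas UDF route (this assumes Spark 3.x
with pyarrow installed; the column name and toy data are made up):

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("spark",), ("pyspark",)], ["word"])

    # Plain Python UDF: values are pickled to the worker one row at a time.
    slow_len = F.udf(lambda s: len(s), "long")

    # pandas UDF: columns travel as Arrow batches and are processed
    # vectorised in pandas, avoiding per-row serialisation.
    @pandas_udf("long")
    def fast_len(s: pd.Series) -> pd.Series:
        return s.str.len()

    df.select(slow_len("word"), fast_len("word")).show()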
