Posted to dev@spark.apache.org by rtshadow <pa...@gmail.com> on 2015/01/29 21:45:20 UTC

How to speed PySpark to match Scala/Java performance

Hi,

In my company, we've been trying to use PySpark to run ETLs on our data.
Alas, it turned out to be terribly slow compared to the Java or Scala API (which
we ended up using to meet our performance criteria).

To be more quantitative, let's consider a simple case:
I've generated a test file (848 MB): seq 1 100000000 > /tmp/test

and tried to run a simple computation on it, which consists of three steps: read
-> multiply each row by 2 -> take the max
Code in Python: sc.textFile("/tmp/test").map(lambda x: x * 2).max()
Code in Scala: sc.textFile("/tmp/test").map(x => x * 2).max()
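
For completeness, here's a minimal, self-contained version of the PySpark side
of the benchmark (just a sketch of roughly what I ran; the app name and the
timing wrapper are mine, and the PyPy number comes from pointing PYSPARK_PYTHON
at a PyPy binary before submitting):

    # run with:              bin/spark-submit benchmark.py
    # for the PyPy run:      PYSPARK_PYTHON=pypy bin/spark-submit benchmark.py
    import time
    from pyspark import SparkContext

    sc = SparkContext(appName="pyspark-benchmark")

    start = time.time()
    # textFile yields strings, so x * 2 repeats each line (the Scala version does the same)
    result = sc.textFile("/tmp/test").map(lambda x: x * 2).max()
    print("max: %s, took %.1fs" % (result, time.time() - start))

    sc.stop()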

Here are the results of this simple benchmark:
CPython - 59s
PyPy - 26s
Scala version - 7s

I didn't dig into what exactly contributes to the execution times of CPython /
PyPy, but it seems that serialization / deserialization when sending data
to the workers may be the issue.
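(For what it's worth, PySpark does expose a couple of knobs on that path - the
batch size and the serializer used between the JVM and the Python workers. A
hedged sketch below, using the MarshalSerializer that ships with PySpark;
whether it helps at all depends on the data types and the Spark version, and I
wouldn't expect it to close a gap this large:)

    from pyspark import SparkContext
    from pyspark.serializers import MarshalSerializer

    # marshal is faster than pickle but supports fewer types;
    # batchSize controls how many Python objects are serialized together per call
    sc = SparkContext(appName="serde-tuning",
                      batchSize=1024,
                      serializer=MarshalSerializer())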
I know some people have already asked about using Jython
(http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658,
http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html),
but it seems that no one has really done this with Spark.
It looks like the performance gain from using Jython could be huge: you wouldn't
need to spawn PythonWorkers, and all the code would be executed inside the
Spark executor JVM, with the Python code compiled to Java bytecode. Do you think
that's possible to achieve? Do you see any obvious obstacles? Of course,
Jython doesn't support C extensions, but if one doesn't need them, then it
should fit here nicely.
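
Just to make the C-extension caveat concrete, here's a hypothetical snippet:
any job like it, which pulls in numpy, is tied to an interpreter that can load
C extensions and could not run under Jython:

    import numpy as np  # C extension: fine on CPython, unavailable on Jython

    def double(line):
        # the numpy call here is gratuitous, but any such dependency rules Jython out
        return float(np.float64(line) * 2)

    sc.textFile("/tmp/test").map(double).max()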

I'm willing to try to marry Spark with Jython and see how it goes.

What do you think about this?







Re: How to speed PySpark to match Scala/Java performance

Posted by Sasha Kacanski <sk...@gmail.com>.
Thanks for the quick reply, I will check the link.
Hopefully, with the move to Python 3 (3.4), we could take advantage of
asyncio and other cool new features ...

On Thu, Jan 29, 2015 at 7:41 PM, Reynold Xin <rx...@databricks.com> wrote:

> [earlier messages in this thread quoted in full; trimmed]


-- 
Aleksandar Kacanski

Re: How to speed PySpark to match Scala/Java performance

Posted by Reynold Xin <rx...@databricks.com>.
It is something like this: https://issues.apache.org/jira/browse/SPARK-5097

On the master branch, we already have a Pandas-like API.


On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski <sk...@gmail.com> wrote:

> [earlier messages in this thread quoted in full; trimmed]

Re: How to speed PySpark to match Scala/Java performance

Posted by Sasha Kacanski <sk...@gmail.com>.
Hi Reynold,
In my project I want to use the Python API too.
When you mention DFs, are we talking about pandas, or is this something
internal to the Spark Python API?
Could you elaborate a bit on this, or point me to the relevant documentation?
Thanks much --sasha

On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin <rx...@databricks.com> wrote:

> [earlier messages in this thread quoted in full; trimmed]



-- 
Aleksandar Kacanski

Re: How to speed PySpark to match Scala/Java performance

Posted by Reynold Xin <rx...@databricks.com>.
Once the DataFrame API is released in 1.3, you can write your job in
Python and get the same performance. It can't express everything, but for
basic things like projection, filter, join, aggregate, and simple numeric
computation, it should work pretty well.
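
To make that concrete, here's a rough sketch of how the benchmark from this
thread could look in Python (illustrative only; the column name is made up, and
exact method names may still shift before the 1.3 release):

    from pyspark.sql import SQLContext, Row
    from pyspark.sql import functions as F

    sqlContext = SQLContext(sc)

    # same data as the benchmark earlier in the thread, parsed into one integer column
    rows = sc.textFile("/tmp/test").map(lambda line: Row(value=int(line)))
    df = sqlContext.createDataFrame(rows)

    # the multiply-and-max is expressed as column expressions, so it is planned
    # and executed on the JVM side rather than row-by-row in Python workers
    result = df.agg(F.max(df.value * 2)).collect()[0][0]

(The int() parse above still runs in Python workers; to get the full benefit
you'd load the data through a JVM-side source such as Parquet or JSON.)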


On Thu, Jan 29, 2015 at 12:45 PM, rtshadow <pa...@gmail.com>
wrote:

> [original message quoted in full; trimmed]

Re: How to speed PySpark to match Scala/Java performance

Posted by Davies Liu <da...@databricks.com>.
Hey,

Without making Python itself as fast as Scala/Java, I think it's impossible to get
similar performance in PySpark as in Scala/Java. Jython is also much slower than
Scala/Java.

With Jython, we could avoid the cost of managing multiple processes and the RPC,
but we would still need to do the data conversion between Java and Python.
Given that Jython is not widely used in production, it may introduce
more trouble than the performance gain is worth.

Spark jobs can easily be sped up by scaling out (by adding more resources).
I think the biggest advantage of PySpark is that it lets you prototype quickly.
Once you have your ETL finalized, it's not that hard to translate your
pure-Python jobs into Scala to reduce the cost (and that's optional).

Nowadays, engineering time is much more expensive than CPU time, so I think we
should focus more on the former.

That's my 2 cents.

Davies

On Thu, Jan 29, 2015 at 12:45 PM, rtshadow
<pa...@gmail.com> wrote:
> [original message quoted in full; trimmed]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org