Posted to user@spark.apache.org by nileshc <ni...@nileshc.com> on 2014/01/30 17:30:01 UTC

Python API Performance

Hi there,

*Background:*
I need to do some matrix multiplication inside the mappers, and am trying
to choose between Python and Scala for writing the Spark MR jobs. I'm
equally fluent with Python and Java, and find Scala pretty easy too, for
what it's worth. Going with Python would let me use numpy + scipy, which are
blazing fast compared to Java libraries like Colt. Configuring Java with
native BLAS seems to be a pain compared to scipy (direct apt-get installs,
or pip).

*Question:*
I posted a couple of comments on this answer at StackOverflow:
http://stackoverflow.com/questions/17236936/api-compatibility-between-scala-and-python.
Basically it states that as of Spark 0.7.2, the Python API would be slower
than Scala. What's the performance situation now? The fork issue seems to be
fixed. How about serialization? Can it match the performance of Java/Scala
Writable-style serialization (where the object type is known beforehand,
reducing I/O)? Also, a probably silly question: loops seem to be slow in
Python in general; do you think this could turn out to be an issue?

Bottom line: should I choose Python for computation-intensive algorithms
like PageRank? Scipy gives me an edge, but does the framework kill it?
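To make the use case concrete, here is roughly the kind of mapper I have in
mind in PySpark (an untested sketch; the RDD, keys, and shapes are made up):

    import numpy as np

    def multiply_partition(pairs):
        # pairs: iterator of (key, 1-D numpy array) records in this partition
        W = np.random.rand(100, 100)   # stand-in for the real local matrix
        for key, vec in pairs:
            # BLAS-backed multiply; no per-element Python loop
            yield key, W.dot(vec)

    # rdd: an RDD of (key, np.array) pairs, assumed to exist
    result = rdd.mapPartitions(multiply_partition)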

Any help, insights, benchmarks will be much appreciated. :)

Cheers,
Nilesh




Re: Python API Performance

Posted by Evan Sparks <ev...@gmail.com>.
We used breeze in some early MLlib prototypes last year. It feels very "scala", which is a huge plus, but unfortunately we found that the object overhead, and the difficulty of tracking down performance problems caused by breeze's heavy use of implicit conversions, made writing high-performance matrix code with it difficult. Further, at least for the early algorithms, we didn't need all the extra flexibility that breeze provides, since our use cases were pretty straightforward.

> [quoted messages trimmed; they appear in full below]

Re: Python API Performance

Posted by 尹绪森 <yi...@gmail.com>.
How about breeze (http://www.scalanlp.org/)? It is written in Scala and
uses netlib-java as the backend
(https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra#wiki-performance).

I think breeze is closer to MATLAB and numpy/scipy in terms of ease of use.
That would also be a good aspect to test.
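For a taste, basic breeze code looks something like this (a small sketch in
the spirit of the breeze docs; not run here):

    import breeze.linalg._

    val a = DenseMatrix.rand(1000, 1000)  // random dense matrix
    val b = DenseMatrix.rand(1000, 1000)
    val c = a * b                         // matrix multiply via netlib-java
    val x = DenseVector.rand(1000)
    val y = a * x                         // matrix-vector product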


> [quoted replies trimmed; they appear in full below]

-- 
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and
Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/

Re: Python API Performance

Posted by Ankur Chauhan <ac...@brightcove.com>.
How does Julia interact with Spark? I would be interested, mainly because I find Scala syntax a little obscure, and it would be great to see actual numbers comparing Scala, Python, and Julia workloads.

> [quoted messages trimmed; they appear in full below]

Re: Python API Performance

Posted by Aureliano Buendia <bu...@gmail.com>.
A much (much) better solution than Python (and also Scala, if that doesn't
make you upset) is Julia (http://julialang.org/).

Libraries like numpy and scipy are bloated compared with Julia's C-like
performance. Julia comes with everything that numpy + scipy offer, and more,
without the performance hit.

I hope we can see official support for Julia on Spark very soon.


> [original message quoted in full; trimmed - see the top of this thread]

Re: Python API Performance

Posted by nileshc <ni...@nileshc.com>.
Hi Jeremy,

Thanks for the reply.


Jeremy Freeman wrote
> That said, there's a performance hit. In my testing (v0.8.1) a simple
> algorithm, KMeans (the versions included with Spark), is ~2x faster per
> iteration in Scala than Python in our set up. [...]

So you measured a Scala/Java library on Spark vs numpy/scipy on PySpark,
right? Can you tell me which library you used?

A benchmark (or even an initial ballpark figure for the performance
difference) on matrix calculations would be awesome - that's the thing I'm
wondering about: whether the difference evens out. I'm still working on
something else and will get to Spark/PySpark in a couple of weeks. If you
can share the results before then, it'll save me a great deal of time and
toil.

Best,
Nilesh




Re: Python API Performance

Posted by nileshc <ni...@nileshc.com>.
OK, I did some uber-basic testing of the Python ALS example and the Scala
ALS example (I wouldn't call this real benchmarking, given the casual nature
of the test and the configuration).

CPU: i5-2500K
Memory allotted per example: -Dspark.executor.memory=2g
One master and one slave running.

Results, in <API language> <movies users features iterations slices> :
<time taken> format:

<Scala>  500 2000 100 5 2 : 1m21s
<Scala>  500 2000 100 5 4 : 0m50s
<Scala>  700 2000 100 5 2 : 1m41s
<Scala>  700 2000 100 5 4 : 1m14s
<Python> 500 2000 100 5 4 : 8m18s

(Sorry, no more for Python; I'm pressed for time at the moment.)

I noticed that average CPU utilization on the quad-core was always 99%+
during the Scala runs (except for drops to ~90% between iterations). During
the Python run, however, it was around 55-67%, and the rest was I/O wait.
Evidently a huge amount of time was being wasted (on I/O? slow loops?).

Stranger still, the RMSE over the 5 iterations for Scala started at 0.82 and
ended at 0.73, while the Python version started at an RMSE of 1294.1236 and
ended at 210.2984. That's a pretty huge gap. Can someone verify all this, at
least on a single node?

I haven't modified any code, so the Scala version is using the usual Colt.




Re: Python API Performance

Posted by Jeremy Freeman <fr...@gmail.com>.
The test I was referring to was the included KMeans algorithm, which uses
NumPy in PySpark but can be done without jblas in Scala, so it's testing
basic performance rather than matrix libraries.

I can certainly try the ALS test, though note that the Scala example you
pointed to uses Colt, whereas most of MLlib at this point uses jblas, so
it's probably most relevant to compare against something that uses jblas
(or simply rewrite that example to use jblas).

I basically agree with Evan that if you're only using matrices, and not the
richer features of SciPy/NumPy, Scala is the way to go, but I'll report back
with more tests. I also like Josh's suggestion of adding proper PySpark
benchmarking; I'll take a stab at that.

-- Jeremy




Re: Python API Performance

Posted by Josh Rosen <ro...@gmail.com>.
If anyone wants to benchmark PySpark against the Scala/Java APIs, it might
be nice to add Python benchmarks to the spark-perf performance testing
suite: https://github.com/amplab/spark-perf.


> [quoted message trimmed; it appears in full below]

Re: Python API Performance

Posted by nileshc <ni...@nileshc.com>.
Hi Jeremy,

Can you try comparing the Scala ALS code
(https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkALS.scala)
and the Python ALS code
(https://github.com/apache/incubator-spark/blob/master/python/examples/als.py)
from the Spark repo?

This might be the easiest way to compare Scala+Colt vs Python+numpy on
Spark! Both involve sparse matrix manipulation and multiplication. If
someone already has a small Spark cluster (even standalone) set up, please
let us know how this fares.

I'll try setting up Spark on a few nodes next week.

Best,
Nilesh




Re: Python API Performance

Posted by Jeremy Freeman <fr...@gmail.com>.
Hi Nilesh,

We're building a data analysis library purely in PySpark that uses a fair
bit of numerical computing (https://github.com/freeman-lab/thunder), and we
faced the same decision as you when starting out.

We went with PySpark because of NumPy and SciPy. So many functions are
included, with robust implementations - signal processing, optimization,
matrix math, etc. - and it's trivial to set up. In Scala, we needed
different libraries for specific problems, and many are still in their early
days (bugs, missing features, etc.). The PySpark API is relatively complete,
though a few bits of functionality aren't there (zipping is probably the
only one we sometimes miss; it's useful for certain matrix operations). It
was definitely feasible to build a functional library entirely in PySpark.

That said, there's a performance hit. In my testing (v0.8.1), a simple
algorithm, KMeans (the versions included with Spark), is ~2x faster per
iteration in Scala than in Python on our setup (a private HPC cluster, ~30
nodes, each with 128GB of RAM and 16 cores, roughly comparable to the
higher-end EC2 instances). I'm preparing more extensive benchmarks, esp. re:
matrix calculations, where the difference may shrink (I'll post them to this
forum when ready). For our purposes (purely research), things are fast
enough already that the benefits of PySpark outweigh the costs, but it will
depend on your use case.
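(For reference, the Python side of that test is essentially the bundled
kmeans.py example, whose inner step looks roughly like this - paraphrased
from memory, variable names mine:)

    import numpy as np

    def closest_point(p, centers):
        # index of the center nearest to point p (squared Euclidean distance)
        return int(np.argmin([np.sum((p - c) ** 2) for c in centers]))

    # one assignment step over an RDD of numpy points (points, centers assumed)
    closest = points.map(lambda p: (closest_point(p, centers), (p, 1)))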

I can't speak much to the current roadblocks and future plans for speed-ups,
though I know Josh has mentioned he's working on new custom serializers.

-- Jeremy




Re: Python API Performance

Posted by Aureliano Buendia <bu...@gmail.com>.
On Thu, Jan 30, 2014 at 7:51 PM, Evan R. Sparks <ev...@gmail.com> wrote:

> If you just need basic matrix operations - Spark is dependent on jblas (
> http://mikiobraun.github.io/jblas/) to have access to quick linear
> algebra routines inside of MLlib and GraphX. [...]

jblas is not the top Java matrix library when it comes to performance:

https://code.google.com/p/java-matrix-benchmark/wiki/RuntimeCorei7v2600_2013_10


> [remainder of quoted thread trimmed]

Re: Python API Performance

Posted by nileshc <ni...@nileshc.com>.
Hi Evan,

Thanks! I didn't know that Spark has a dependency on jblas. That's good to
know. Does this mean I can use jblas directly from my Spark MR code and not
worry about the painstaking setup of getting Java to recognize the native
BLAS libraries on my system? Does Spark take care of that?

But then again, my particular use case deals with large sparse matrices, in
which case my only option on the Java/Scala side seems to be Colt (which is
pretty slow compared to both jblas and scipy/numpy). MTJ is another option,
but I'm not sure how much BLAS/ATLAS setup that will need. That's what's
confusing me - I can't figure out how this will balance out until I take
some time off to code some benchmarks myself. :(
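Just to illustrate how low the bar is on the Python side, the sparse case
there is a few lines (untested sketch; sizes and density are made up):

    from scipy import sparse

    # random sparse matrices, ~1% non-zero entries, CSR format
    A = sparse.rand(10000, 10000, density=0.01, format='csr')
    B = sparse.rand(10000, 10000, density=0.01, format='csr')
    C = A.dot(B)  # sparse-sparse multiply; result stays sparse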

Nilesh


> [quoted message and Nabble list-footer links trimmed; Evan's full reply
> appears below]



-- 
A quest eternal, a life so small! So don't just play the guitar, build one.
You can also email me at contact@nileshc.com or visit my website:
http://www.nileshc.com/





Re: Python API Performance

Posted by "Evan R. Sparks" <ev...@gmail.com>.
If you just need basic matrix operations: Spark depends on jblas
(http://mikiobraun.github.io/jblas/) to get quick linear algebra routines
inside MLlib and GraphX. jblas does a nice job of avoiding boxing/unboxing
issues when calling out to BLAS, so it might be what you're looking for. The
programming patterns you'll be able to support with jblas (matrix ops on
local partitions) are very similar to what you'd get with numpy, etc.

I agree that the Python libraries are more complete and feature-rich, but if
you really crave high performance then I'd recommend staying pure Scala and
giving jblas a try.
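The pattern looks something like this (a sketch, not tested; the RDD and the
sizes are assumptions):

    import org.jblas.DoubleMatrix

    // vectors: RDD[Array[Double]], each array of length 100 (assumed)
    val result = vectors.mapPartitions { iter =>
      val w = DoubleMatrix.randn(100, 100)        // stand-in for a real model
      iter.map(a => w.mmul(new DoubleMatrix(a)))  // native BLAS multiply
    }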


> [original message quoted in full; trimmed - see the top of this thread]