Posted to user@spark.apache.org by "utk.pat" <ut...@gmail.com> on 2015/08/24 14:22:27 UTC

Performance - Python streaming v/s Scala streaming

I am new to Spark Streaming. I was running the "kafka_wordcount" example with
a local Kafka and Spark instance. It was very easy to set this up and get
going :) I tried running both the Scala and Python versions of the word count
example. The Python version seems to be extremely slow. Sometimes it has
delays of more than a couple of minutes. The Scala version, on the other
hand, seems to be much better. I am running on a Windows machine. I am trying
to understand what causes the slowness in Python streaming. Is there anything
that I am missing? For real-time streaming analysis, should I prefer Scala?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-Python-streaming-v-s-Scala-streaming-tp24415.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Performance - Python streaming v/s Scala streaming

Posted by Utkarsh Patkar <ut...@gmail.com>.
Thanks for the quick response.
I have tried the direct word count Python example and it also seems to be
slow. Often it does not fetch the words that are sent by the producer.
I am using Spark version 1.4.1 and Kafka 2.10-0.8.2.0.


On Tue, Aug 25, 2015 at 2:05 AM, Tathagata Das <td...@databricks.com> wrote:

> The Scala version of the Kafka integration is something that we have been
> working on for a while, and is likely to be more optimized than the Python
> one. The Python one definitely requires passing the data back and forth
> between the JVM and the Python VM and decoding the raw bytes to Python
> strings (probably less efficient than Java's byte-to-UTF-8 decoder), so
> that may cause some extra overhead compared to Scala.
>
> Also consider trying the direct API. Read more in the Kafka integration
> guide -
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
> Overall, it has much higher throughput than the earlier receiver-based
> approach.
>
> BTW, a disclaimer: do not take this difference as a generalization of the
> performance difference between Scala and Python for all of Spark. For
> example, DataFrames provide performance parity between the Scala and Python
> APIs.

Re: Performance - Python streaming v/s Scala streaming

Posted by Tathagata Das <td...@databricks.com>.
The Scala version of the Kafka integration is something that we have been
working on for a while, and is likely to be more optimized than the Python
one. The Python one definitely requires passing the data back and forth
between the JVM and the Python VM and decoding the raw bytes to Python
strings (probably less efficient than Java's byte-to-UTF-8 decoder), so that
may cause some extra overhead compared to Scala.
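
To get a rough feel for the decoding step mentioned above, here is a minimal
pure-Python sketch. It is not Spark code and it does not measure the transfer
between the JVM and the Python VM; it only times decoding raw bytes into
Python strings, one record at a time, before counting words. The payloads and
the helper names decode_records and count_words are made up for illustration.

```python
import timeit


def decode_records(raw_records):
    """Decode each raw payload to a Python string, one record at a time."""
    return [record.decode("utf-8") for record in raw_records]


def count_words(lines):
    """Count whitespace-separated words across all lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical payloads standing in for Kafka message values (assumption).
raw_records = [b"spark streaming kafka", b"kafka word count"] * 50000

# Time only the per-record decode step, repeated five times.
decode_seconds = timeit.timeit(lambda: decode_records(raw_records), number=5)
print("decoding 5 x %d records took %.3f s" % (len(raw_records), decode_seconds))

# Word count over the first two decoded records.
print(count_words(decode_records(raw_records[:2])))
```

The absolute numbers depend on the machine, so treat them only as a way to
see that the decode step is per-record work done on the Python side.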

Also consider trying the direct API. Read more in the Kafka integration
guide - http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Overall, it has much higher throughput than the earlier receiver-based
approach.
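
A minimal sketch of a word count on the direct API in Python, assuming the
pyspark.streaming.kafka module from the Spark 1.4 line discussed in this
thread. The function names split_words and build_direct_word_count are
hypothetical, and the caller is assumed to supply an existing
StreamingContext plus a broker list and topic.

```python
def split_words(line):
    """Split one message value into whitespace-separated words."""
    return line.split()


def build_direct_word_count(ssc, brokers, topic):
    """Attach a word count to a direct (receiver-less) Kafka stream.

    ssc is an existing StreamingContext; brokers and topic are supplied by
    the caller (assumptions, not values from the thread).
    """
    # Imported here so this file can be loaded without Spark installed.
    from pyspark.streaming.kafka import KafkaUtils

    # Each record arrives as a (key, value) pair; the value holds the text.
    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers})
    lines = stream.map(lambda kv: kv[1])
    counts = (lines.flatMap(split_words)
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()
    return counts
```

With a StreamingContext in hand, this would be used roughly as
build_direct_word_count(ssc, "localhost:9092", "test") followed by
ssc.start() and ssc.awaitTermination(). The broker address and topic name
here are placeholders, and the Kafka integration artifact has to be on the
classpath as the integration guide describes.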

BTW, a disclaimer: do not take this difference as a generalization of the
performance difference between Scala and Python for all of Spark. For
example, DataFrames provide performance parity between the Scala and Python
APIs.

