Posted to user@spark.apache.org by maddenpj <ma...@gmail.com> on 2014/09/24 23:14:52 UTC

Spark Streaming unable to handle production Kafka load

I am attempting to use Spark Streaming to summarize event data streaming in
and save it to a MySQL table. The input source data is stored in 4 topics on
Kafka with each topic having 12 partitions. My approach to doing this worked
both in local development and a simulated load testing environment but I
cannot seem to get it working when hooking it up to our production source.
I'm having a hard time figuring out what is going on because I'm not seeing
any obvious errors in the logs; the first batch just never finishes
processing. I believe it's a data rate problem (the most active topic clocks
in at around 4k messages per second; the least active is around 0.5 msg/s),
but I'm completely stuck on the best way to resolve this and maybe I'm not
following best practices.

Here is a gist with the essentials of my program. I use an updateStateByKey
approach to keep around the MySQL id of each piece of data (so if we've
already seen a particular piece, we just update its existing total in
MySQL with the total Spark computed in the current window):
https://gist.github.com/maddenpj/74a4c8ce372888ade92d
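For anyone following along without opening the gist, the pattern described above looks roughly like this. This is a minimal sketch, not the gist's actual code; the names (EventState, counts, updateTotals) are placeholders, and the state carries an optional MySQL row id alongside the running total:

```scala
// State kept per key across batches: the MySQL row id (if we've already
// inserted this key) plus the running total.
case class EventState(mysqlId: Option[Long], total: Long)

// updateStateByKey's update function: (new values this batch, previous state) => new state.
def updateTotals(newCounts: Seq[Long], state: Option[EventState]): Option[EventState] = {
  val prev = state.getOrElse(EventState(None, 0L))
  Some(prev.copy(total = prev.total + newCounts.sum))
}

// counts is a DStream[(String, Long)] of per-key counts for the current batch.
val totals = counts.updateStateByKey(updateTotals)
```

Note that updateStateByKey requires checkpointing to be enabled on the StreamingContext, and the state RDD grows with the number of distinct keys.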

One thing I have noticed is that my Kafka receiver is only on one machine and
I have not yet tried to increase the parallelism of reading out of Kafka,
with something like this solution:
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Kafka-Receivers-and-Union-td14901.html
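The approach from that thread amounts to creating several receivers and unioning their streams. A rough sketch under assumed names (the ZooKeeper address, group id, and receiver/thread counts are placeholders, not values from my setup):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Assumed: ssc is an existing StreamingContext.
val zkQuorum = "zk-host:2181"          // placeholder ZooKeeper quorum
val groupId  = "event-summarizer"      // placeholder consumer group
val topics   = Map("events" -> 3)      // topic -> consumer threads per receiver

// Each createStream call is a separate receiver, which Spark schedules on
// a (potentially different) executor; union merges them into one DStream.
val numReceivers = 4
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
}
val unified = ssc.union(streams)
```

One caveat: each receiver permanently occupies a core, so the cluster needs more total cores than receivers or no tasks will ever be scheduled, which could also produce the "0 tasks completing" symptom described below.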

So that's next on my list, but I'm still in need of insight into how to
figure out what's going on. When I watch the stages execute on the web UI, I
see occasional activity (a map stage processing), but most of the time it
looks like I'm stuck in some arbitrary stage (e.g. take and runJob will be
active for the entire life of the program with 0 tasks ever completing).
This is contrary to what I see when I watch the Kafka topics being consumed:
the program is always consuming messages, it just gives no indication it's
doing any actual processing on them.

On a somewhat related note, how does everyone capacity plan for building out
a Spark cluster? So far I've just been using trial and error, but I still
haven't found the right number of nodes that can handle our 4k msg/s topic.
I've tried up to six Amazon m3.larges (2 cores, 7.5 GB memory each), but even
that feels excessive to me, as currently we're processing this data load on a
single-node MapReduce cluster, an m3.xlarge (4 cores, 15 GB memory).

Thanks,
Patrick



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-unable-to-handle-production-Kafka-load-tp15077.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming unable to handle production Kafka load

Posted by maddenpj <ma...@gmail.com>.
Another update: actually, it just hit me that my problem is probably right
here:

https://gist.github.com/maddenpj/74a4c8ce372888ade92d#file-gistfile1-scala-L22

I'm creating a JDBC connection on every record, and that's probably what's
killing the performance. I assume the fix is to broadcast the connection
pool?
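(Follow-up sketch of the usual fix, with hypothetical names: JDBC connections aren't serializable, so they can't be broadcast from the driver; the standard pattern is to open one connection per partition inside foreachPartition, on the executor, and reuse it for all records in that partition.)

```scala
import java.sql.DriverManager

// Assumed: totals is a DStream[(String, Long)] of (key, runningTotal);
// the JDBC URL, credentials, and table schema are placeholders.
totals.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One connection per partition, created on the executor.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://db-host/analytics", "user", "pass")
    try {
      val stmt = conn.prepareStatement(
        "INSERT INTO totals (k, total) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE total = VALUES(total)")
      records.foreach { case (key, total) =>
        stmt.setString(1, key)
        stmt.setLong(2, total)
        stmt.executeUpdate()
      }
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```

This turns O(records) connection setups per batch into O(partitions), which is usually the difference between a batch that finishes and one that never does.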




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark Streaming unable to handle production Kafka load

Posted by maddenpj <ma...@gmail.com>.
Oh, I should add that I've tried a range of batch durations and
reduce-by-window durations to no effect. I'm not too sure how to choose
these.

Today I've been testing with batch durations of 1 to 10 minutes and reduce
window durations of 10 or 20 minutes.
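For reference, these two knobs are set in different places: the batch duration on the StreamingContext, the window and slide durations on the windowed operation. A sketch with placeholder values (not a recommendation for this workload):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("event-summarizer")
// Batch duration: how often a new micro-batch is formed.
val ssc = new StreamingContext(conf, Minutes(1))

// Window duration (10 min of data) and slide duration (recompute every 10 min);
// both must be multiples of the batch duration.
// Assumed: pairs is a DStream[(String, Long)].
// val windowed = pairs.reduceByKeyAndWindow(_ + _, Minutes(10), Minutes(10))
```

The window and slide durations must each be an integer multiple of the batch duration, which constrains which combinations are even valid.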



