Posted to user@spark.apache.org by Sourav Chandra <so...@livestream.com> on 2014/03/11 19:09:29 UTC

Spark usage patterns and questions

Hi,

I have some questions regarding usage patterns and debugging in Spark/Spark
Streaming.

1. What are some common design patterns for using broadcast variables? In my
application I created a few, along with a scheduled task that periodically
refreshes them. I would like to know how people generally achieve this
efficiently and in a modular way.
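
For context, my current refresh looks roughly like this (a simplified sketch;
RefreshableBroadcast, loadReferenceData and enrich are illustrative names of
mine, and it assumes a Spark version where Broadcast.unpersist is available):

    import java.util.concurrent.{Executors, TimeUnit}
    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    class RefreshableBroadcast[T: ClassTag](sc: SparkContext,
                                            load: () => T,
                                            intervalSec: Long) {
      @volatile private var current: Broadcast[T] = sc.broadcast(load())

      private val scheduler = Executors.newSingleThreadScheduledExecutor()
      scheduler.scheduleAtFixedRate(new Runnable {
        def run(): Unit = {
          val fresh = sc.broadcast(load()) // broadcast the freshly loaded value
          val old = current
          current = fresh                  // jobs submitted after this see the new value
          old.unpersist()                  // drop the stale blocks from the executors
        }
      }, intervalSec, intervalSec, TimeUnit.SECONDS)

      // Read this once per batch so each job picks up the latest broadcast.
      def get: Broadcast[T] = current
    }

    val refData = new RefreshableBroadcast(sc, () => loadReferenceData(), 300)
    stream.foreachRDD { rdd =>
      val b = refData.get // resolve on the driver, once per batch
      rdd.map(record => enrich(record, b.value)).count()
    }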

2. Sometimes an uncaught exception in the driver program or a worker does not
get logged anywhere. How can we debug this?
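
On the driver side, the only thing I know of that surfaces exceptions from
background threads is installing a default handler at startup; a minimal
sketch (writing to stderr stands in for whatever logger you use):

    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      def uncaughtException(t: Thread, e: Throwable): Unit = {
        // Log before the thread dies silently; otherwise the stack trace is lost.
        System.err.println(s"Uncaught exception on thread ${t.getName}")
        e.printStackTrace(System.err)
      }
    })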

3. In our use case we read from Kafka, do some mapping, and finally persist
the data to Cassandra as well as push it over a remote actor for realtime
updates in a dashboard. I tried the approaches below:
 - First, the very naive way: stream.map(...).foreachRDD(push to actor)
    It did not work; the stage failed with an Akka exception.
 - Second, the akka.serialization.JavaSerializer.currentSystem.withValue(system) {...} approach
     It did not work; the stage failed, but without any trace anywhere in the logs.
 - Finally, rdd.collect to bring the output to the driver, then push to the actor
     It worked (rough sketch below).

I would like to know whether there is a more efficient way of achieving this
sort of use case.
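
For reference, the version that worked is roughly this (simplified; transform
and dashboardActor are illustrative names, and dashboardActor is an ActorRef
created in the driver's ActorSystem):

    stream.map(transform).foreachRDD { rdd =>
      // collect() runs the job on the cluster and brings results to the driver,
      // so the ActorRef is only ever used from driver-side code and never has
      // to be serialized into a task closure.
      val batch = rdd.collect()
      batch.foreach(dashboardActor ! _)
    }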

4. Sometimes I see failed stages, but when I open the stage details it says
the stage did not start. What does this mean?

Looking forward to some interesting responses :)

Thanks,
-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chandra@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com

Re: Spark usage patterns and questions

Posted by Rohit Rai <ro...@tuplejump.com>.
>
> 3. In our use case we read from Kafka, do some mapping, and finally persist
> the data to Cassandra as well as push it over a remote actor for realtime
> updates in a dashboard. I tried the approaches below:
>  - First, the very naive way: stream.map(...).foreachRDD(push to actor)
>     It did not work; the stage failed with an Akka exception.
>  - Second, the akka.serialization.JavaSerializer.currentSystem.withValue(system) {...} approach
>      It did not work; the stage failed, but without any trace anywhere in the logs.
>  - Finally, rdd.collect to bring the output to the driver, then push to the actor
>      It worked.


Have you tried writing to Cassandra using the Hadoop output format instead of
pushing to an actor in the foreachRDD handler? I am sure that will be more
efficient than collecting and sending to an actor.

In Calliope we do a similar thing... You can check out the examples here -
http://tuplejump.github.io/calliope/streaming.html
https://github.com/tuplejump/calliope/blob/develop/src/main/scala/com/tuplejump/calliope/examples/PersistDStream.scala
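
If you want to see the bare Hadoop API shape without Calliope, it is roughly
the following; an untested sketch for the thrift-era output format, where the
host, port, keyspace, column family and the toKeyAndMutations helper (which
has to build the thrift Mutation list for each record) are all placeholders:

    import java.nio.ByteBuffer
    import org.apache.cassandra.hadoop.{ColumnFamilyOutputFormat, ConfigHelper}
    import org.apache.cassandra.thrift.Mutation
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._ // for saveAsNewAPIHadoopFile on pair RDDs

    val job = new Job()
    val conf = job.getConfiguration
    ConfigHelper.setOutputInitialAddress(conf, "cassandra-host")
    ConfigHelper.setOutputRpcPort(conf, "9160")
    ConfigHelper.setOutputColumnFamily(conf, "my_keyspace", "my_cf")
    ConfigHelper.setOutputPartitioner(conf, "Murmur3Partitioner")

    stream.map(toKeyAndMutations) // => (ByteBuffer, java.util.List[Mutation])
      .foreachRDD { rdd =>
        rdd.saveAsNewAPIHadoopFile(
          "my_keyspace",                       // the "path" is not used as a path here
          classOf[ByteBuffer],
          classOf[java.util.List[Mutation]],
          classOf[ColumnFamilyOutputFormat],
          conf)
      }

This writes directly from the executors, so nothing has to funnel through the
driver the way collect-and-send does.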

Hope that helps

Regards,
Rohit


Founder & CEO, Tuplejump, Inc.
____________________________
www.tuplejump.com
The Data Engineering Platform

