Posted to user@spark.apache.org by Lin Zhao <li...@exabeam.com> on 2016/01/07 18:34:05 UTC

Spark streaming routing

I have a need to route the DStream through the streaming pipeline by some key, such that data with the same key always goes through the same executor.

There doesn't seem to be a way to do manual routing with Spark Streaming. The closest I can come up with is:

stream.foreachRDD { rdd =>
  rdd.groupBy(_.key).flatMap { case (key, lines) => ... }.map(...).map(...)
}

Does this do what I expect? How about between batches? Does it guarantee the same key goes to the same executor in all batches?
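For reference, a minimal sketch of per-batch routing with an explicit partitioner (line.key is taken from the example above; the partition count and process are placeholders). Note that partitionBy co-locates all records sharing a key in the same partition within one batch, but it does not pin a key to one executor across batches:

import org.apache.spark.HashPartitioner

// Key the stream; line.key is assumed from the original example.
val keyed = stream.map(line => (line.key, line))

// Within each batch, records that share a key land in the same partition.
val routed = keyed.transform(_.partitionBy(new HashPartitioner(8)))

routed.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // All records for a given key in this batch arrive here together.
    iter.foreach { case (key, line) => process(key, line) } // process is a placeholder
  }
}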

Thanks,

Lin

Re: Spark streaming routing

Posted by Lin Zhao <li...@exabeam.com>.
Thanks for the reply, Tathagata. Our pipeline has a rather fat state, and that's why we have custom failure handling that kills all executors and rolls back to an earlier point in time.
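For contrast with such a custom rewind, Spark's built-in recovery path is checkpoint-based: StreamingContext.getOrCreate rebuilds the DStream graph and its state from a checkpoint directory after a failure. A minimal sketch (the path, app name, and batch interval are placeholder values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("pipeline")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/pipeline") // metadata and state go here
  // ... define the streaming computation on ssc here ...
  ssc
}

// On restart this restores from the checkpoint instead of starting fresh.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/pipeline", createContext _)
ssc.start()
ssc.awaitTermination()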

On a separate but related note, I noticed that in a chained map job, the entire pipeline runs on the same thread.

Say I have

dstream.map(func1).map(func2).map(func3)

For the same input, func1, func2 and func3 run on the same thread. Is there a way to configure Spark so that they run in the same executor but on different threads? Our pipeline has a high memory footprint and needs a low cpu:memory ratio.
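Chained narrow transformations are fused into a single task per partition, which is why func1 through func3 share a thread. A shuffle boundary is one way to split them into separate stages; a minimal sketch (the partition count is a placeholder, and this trades threading for shuffle cost):

dstream.map(func1)
  .repartition(16) // shuffle: func1 runs in one stage, func2/func3 in the next
  .map(func2)
  .map(func3)

The cpu:memory ratio itself is usually set through executor sizing, e.g. spark-submit --executor-cores 2 --executor-memory 16g gives few cores against a large heap.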


From: Tathagata Das <td...@databricks.com>
Date: Thursday, January 7, 2016 at 1:56 PM
To: Lin Zhao <li...@exabeam.com>
Cc: user <us...@spark.apache.org>
Subject: Re: Spark streaming routing

You cannot guarantee that each key will forever be on the same executor. That is a flawed approach to designing an application if you have to ensure fault tolerance toward executor failures.

On Thu, Jan 7, 2016 at 9:34 AM, Lin Zhao <li...@exabeam.com> wrote:
I have a need to route the DStream through the streaming pipeline by some key, such that data with the same key always goes through the same executor.

There doesn't seem to be a way to do manual routing with Spark Streaming. The closest I can come up with is:

stream.foreachRDD { rdd =>
  rdd.groupBy(_.key).flatMap { case (key, lines) => ... }.map(...).map(...)
}

Does this do what I expect? How about between batches? Does it guarantee the same key goes to the same executor in all batches?

Thanks,

Lin


Re: Spark streaming routing

Posted by Tathagata Das <td...@databricks.com>.
You cannot guarantee that each key will forever be on the same executor.
That is a flawed approach to designing an application if you have to
ensure fault tolerance toward executor failures.
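The usual reason for wanting key-to-executor affinity is per-key state; the fault-tolerant alternative is to let Spark manage that state. A minimal sketch with updateStateByKey (the events stream and its key field are placeholders, and ssc.checkpoint must be set):

// Running count per key; the state survives executor loss because Spark
// checkpoints it and rebuilds partitions wherever they are rescheduled.
val counts = events
  .map(e => (e.key, 1L))
  .updateStateByKey[Long] { (newValues: Seq[Long], running: Option[Long]) =>
    Some(running.getOrElse(0L) + newValues.sum)
  }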

On Thu, Jan 7, 2016 at 9:34 AM, Lin Zhao <li...@exabeam.com> wrote:

> I have a need to route the DStream through the streaming pipeline by some
> key, such that data with the same key always goes through the same
> executor.
>
> There doesn't seem to be a way to do manual routing with Spark Streaming.
> The closest I can come up with is:
>
> stream.foreachRDD { rdd =>
>   rdd.groupBy(_.key).flatMap { case (key, lines) => … }.map(…).map(…)
> }
>
> Does this do what I expect? How about between batches? Does it guarantee
> the same key goes to the same executor in all batches?
>
> Thanks,
>
> Lin
>