Posted to user@spark.apache.org by Ron's Yahoo! <zl...@yahoo.com.INVALID> on 2014/09/09 04:27:38 UTC

Spark streaming for synchronous API

Hi,
  I’m trying to figure out how I can run Spark Streaming like an API.
  The goal is to have a synchronous REST API that runs the Spark data flow on YARN.
  Has anyone done something like this? Can you share your architecture? To begin with, is it even possible to have Spark Streaming run as a YARN job?

Thanks,
Ron


Re: Spark streaming for synchronous API

Posted by Ron's Yahoo! <zl...@yahoo.com.INVALID>.
Tobias,
  Let me explain a little more.
  I want to create a synchronous REST API that will process some data that is passed in with each request.
  I would imagine that the Spark Streaming job on YARN is a long-running job that waits on requests from something. What that something is isn't clear to me yet, but I would imagine it's some queue. The goal is to be able to push a message onto a queue with some id, and then get the processed results back from Spark Streaming.
  The goal is for the REST API to be able to respond to lots of calls with low latency.
  Hope that clarifies things...

Thanks,
Ron


On Sep 8, 2014, at 7:41 PM, Tobias Pfeiffer <tg...@preferred.jp> wrote:

> Ron,
> 
> On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! <zl...@yahoo.com.invalid> wrote:
>   I’m trying to figure out how I can run Spark Streaming like an API.
>   The goal is to have a synchronous REST API that runs the spark data flow on YARN.
> 
> I guess I *may* develop something similar in the future.
> 
> By "a synchronous REST API", do you mean that submitting the job is synchronous and you would fetch the processing results via a different call? Or do you want to submit a job and get the processed data back as an HTTP stream?
> 
> To begin with, is it even possible to have Spark Streaming run as a yarn job?
> 
> I think it is very much possible to run Spark Streaming as a YARN job; at least it worked well with Mesos.
> 
> Tobias
> 


Re: Spark streaming for synchronous API

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi again,

On Tue, Sep 9, 2014 at 2:20 PM, Tobias Pfeiffer <tg...@preferred.jp> wrote:
>
> On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! <zl...@yahoo.com> wrote:
>>
>>   For example, let’s say there’s a particular topic T1 in a Kafka queue.
>> If I have a new set of requests coming from a particular client A, I was
>> wondering if I could create a partition A.
>>   The streaming job is submitted to listen to T1.A and will write to a
>> topic T2.A, which the REST endpoint would be listening on.
>>
>
> That doesn't seem like a good way to use Kafka. It may be possible, but I
> am pretty sure you should create a new topic T_A instead of a partition A
> in an existing topic. With some modifications of Spark Streaming's
> KafkaReceiver you *might* be able to get it to work as you imagine, but it
> was not meant to be that way, I think.
>

Maybe I was wrong about a new topic being the better way. Looking, for
example, at the way that Samza consumes Kafka streams
(http://samza.incubator.apache.org/learn/documentation/latest/introduction/concepts.html),
it seems like there is one task per partition and data can go into
partitions keyed by user ID. So maybe a new partition is actually the
conceptually better way.
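
To make that concrete, this is roughly what keying by client id does (just
an illustration of the idea; the exact hash the Kafka producer uses is
different, and the partition count here is made up):

  // Messages that carry the same key always land in the same partition,
  // so all requests from client "A" stay together.
  val numPartitions = 8                          // partitions of topic T1 (example value)
  def partitionFor(clientId: String): Int =
    math.abs(clientId.hashCode) % numPartitions

  println(partitionFor("A"))                     // same partition on every call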

Nonetheless, the built-in KafkaReceiver doesn't support assignment of
partitions to receivers AFAIK ;-)

Tobias

Re: Spark streaming for synchronous API

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! <zl...@yahoo.com> wrote:

>   So I guess where I was coming from was the assumption that starting up a
> new job to be listening on a particular queue topic could be done
> asynchronously.
>

No, with the current state of Spark Streaming, all data sources and the
processing pipeline must be fixed when you start your StreamingContext. You
cannot add new data sources dynamically at the moment, see
http://apache-spark-user-list.1001560.n3.nabble.com/Multi-tenancy-for-Spark-Streaming-Applications-td13398.html
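
To illustrate what "fixed" means, a minimal (untested) sketch; the topic
name, ZooKeeper address and group id are placeholders, and it assumes the
spark-streaming-kafka artifact is on the classpath:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("streaming-sketch")
  // The batch interval is also your latency floor: a record is processed
  // at the earliest when its batch closes, i.e. up to 1 second later here.
  val ssc = new StreamingContext(conf, Seconds(1))

  // Every source and every transformation must be declared before start().
  val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("T1" -> 1))
  messages.map(_._2).count().print()

  ssc.start()             // from here on the DStream graph is frozen;
  ssc.awaitTermination()  // new sources cannot be attached to a running context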


>   For example, let’s say there’s a particular topic T1 in a Kafka queue.
> If I have a new set of requests coming from a particular client A, I was
> wondering if I could create a partition A.
>   The streaming job is submitted to listen to T1.A and will write to a
> topic T2.A, which the REST endpoint would be listening on.
>

That doesn't seem like a good way to use Kafka. It may be possible, but I
am pretty sure you should create a new topic T_A instead of a partition A
in an existing topic. With some modifications of Spark Streaming's
KafkaReceiver you *might* be able to get it to work as you imagine, but it
was not meant to be that way, I think.

Also, you will not get "low latency", because Spark Streaming processes
data in batches of fixed interval length (say, 1 second) and in the worst
case your query will wait up to 1 second before processing even starts.

If I understand correctly what you are trying to do (which I am not sure
about), I would probably recommend choosing a somewhat different
architecture, in particular given that you cannot dynamically add data
sources.

Tobias

Re: Spark streaming for synchronous API

Posted by Ron's Yahoo! <zl...@yahoo.com.INVALID>.
Hi Tobias,
  So I guess where I was coming from was the assumption that starting up a new job to listen on a particular queue topic could be done asynchronously.
  For example, let’s say there’s a particular topic T1 in a Kafka queue. If I have a new set of requests coming from a particular client A, I was wondering if I could create a partition A.
  The streaming job is submitted to listen to T1.A and will write to a topic T2.A, which the REST endpoint would be listening on.
  It does seem a little contrived, but the ultimate goal here is to get a bunch of messages from a queue, distribute them to a bunch of Spark jobs that process them and write back to another queue, which the REST endpoint synchronously waits on. Storm might be a better fit, but the background behind this question is that I want to reuse the same set of transformations for both batch and streaming, with the streaming use case represented by a REST call.
  In other words, the job submission would not be part of the equation, so I would imagine the latency is limited to the processing, the write-back, and the consumption of the processed message by the original REST request.
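
  Roughly what I picture on the REST side, with the queue interaction stubbed out (all names here are made up, it's just to show the correlate-by-id part):

  import java.util.UUID
  import java.util.concurrent.ConcurrentHashMap
  import scala.concurrent.{Await, Promise}
  import scala.concurrent.duration._

  // requestId -> promise completed by the consumer reading topic T2.A
  val pending = new ConcurrentHashMap[String, Promise[String]]()

  // Called by the REST handler: publish to T1.A, then block until the
  // result with the same id shows up on T2.A (or give up after 5 s).
  def handleRequest(payload: String): String = {
    val id = UUID.randomUUID().toString
    val promise = Promise[String]()
    pending.put(id, promise)
    // publishToT1A(id, payload)        // stub: send (id, payload) to the request topic
    Await.result(promise.future, 5.seconds)
  }

  // Called by a background consumer thread for each message read from T2.A.
  def onResult(id: String, result: String): Unit = {
    val promise = pending.remove(id)
    if (promise != null) promise.success(result)
  }
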
  Let me know what you think…

Thanks,
Ron

On Sep 8, 2014, at 9:28 PM, Tobias Pfeiffer <tg...@preferred.jp> wrote:

> Hi,
> 
> On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! <zl...@yahoo.com> wrote:
>  I want to create a synchronous REST API that will process some data that is passed in as some request.
>  I would imagine that the Spark Streaming Job on YARN is a long running job that waits on requests from something. What that something is is still not clear to me, but I would imagine that it’s some queue. The goal is to be able to push a message onto a queue with some id, and then  get the processed results back from Spark Streaming.
> 
> That is not exactly a Spark Streaming use case, I think. Spark Streaming pulls data from some source (like a queue), then processes all data collected in a certain interval in a mini-batch, and stores that data somewhere. It is not well suited for handling request-response cycles in a synchronous way; you might consider using plain Spark (without Streaming) for that.
> 
> For example, you could use the unfiltered http://unfiltered.databinder.net/Unfiltered.html library and within request handling do some RDD operation, returning the output as HTTP response. This works fine as multiple threads can submit Spark jobs concurrently https://spark.apache.org/docs/latest/job-scheduling.html You could also check https://github.com/adobe-research/spindle -- that seems to be similar to what you are doing.
> 
>  The goal is for the REST API be able to respond to lots of calls with low latency.
>  Hope that clarifies things...
> 
> Note that "low latency" for "lots of calls" is maybe not something that Spark was built for. Even if you do close to nothing data processing, you may not get below 200ms or so due to the overhead of submitting jobs etc., from my experience.
> 
> Tobias
> 
> 


Re: Spark streaming for synchronous API

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! <zl...@yahoo.com> wrote:
>
>  I want to create a synchronous REST API that will process some data that
> is passed in as some request.
>  I would imagine that the Spark Streaming Job on YARN is a long
> running job that waits on requests from something. What that something is
> is still not clear to me, but I would imagine that it’s some queue.
> The goal is to be able to push a message onto a queue with some id, and
> then  get the processed results back from Spark Streaming.
>

That is not exactly a Spark Streaming use case, I think. Spark Streaming
pulls data from some source (like a queue), then processes all data
collected in a certain interval in a mini-batch, and stores that data
somewhere. It is not well suited for handling request-response cycles in a
synchronous way; you might consider using plain Spark (without Streaming)
for that.

For example, you could use the Unfiltered library
(http://unfiltered.databinder.net/Unfiltered.html) and within request
handling do some RDD operation, returning the output as an HTTP response.
This works fine, as multiple threads can submit Spark jobs concurrently
(see https://spark.apache.org/docs/latest/job-scheduling.html). You could
also check https://github.com/adobe-research/spindle -- that seems to be
similar to what you are doing.
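
Untested, but the shape of it would be something like this (port, path and
data location are made up; it assumes the unfiltered-filter and
unfiltered-jetty artifacts plus Spark on the classpath):

  import org.apache.spark.{SparkConf, SparkContext}
  import unfiltered.request._
  import unfiltered.response._

  object SyncApi {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("sync-api"))
      val data = sc.textFile("hdfs:///some/path").cache()   // shared, pre-loaded dataset

      // Each HTTP request triggers one Spark job; concurrent requests become
      // concurrent jobs (cf. the job scheduling docs linked above).
      val plan = unfiltered.filter.Planify {
        case GET(Path(Seg("count" :: word :: Nil))) =>
          ResponseString(data.filter(_.contains(word)).count().toString)
      }

      unfiltered.jetty.Http(8080).filter(plan).run()
    }
  }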

>  The goal is for the REST API be able to respond to lots of calls with low
> latency.
>  Hope that clarifies things...
>

Note that "low latency" for "lots of calls" is maybe not something that
Spark was built for. Even if you do close to no data processing, you may
not get below 200 ms or so due to the overhead of submitting jobs etc.,
from my experience.

Tobias

Re: Spark streaming for synchronous API

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Ron,

On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! <zl...@yahoo.com.invalid> wrote:
>
>   I’m trying to figure out how I can run Spark Streaming like an API.
>   The goal is to have a synchronous REST API that runs the spark data flow
> on YARN.


I guess I *may* develop something similar in the future.

By "a synchronous REST API", do you mean that submitting the job is
synchronous and you would fetch the processing results via a different
call? Or do you want to submit a job and get the processed data back as an
HTTP stream?

> To begin with, is it even possible to have Spark Streaming run as a yarn
> job?
>

I think it is very much possible to run Spark Streaming as a YARN job; at
least it worked well with Mesos.
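
(For the launch itself I would expect the usual spark-submit route to work,
something along the lines of "spark-submit --master yarn-cluster --class
your.pkg.StreamingApp your-app.jar" -- class and jar names made up; I have
only done the Mesos equivalent myself.)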

Tobias