Posted to user@spark.apache.org by Liam Stewart <li...@gmail.com> on 2014/02/03 20:03:20 UTC

spark streaming questions

I'm looking at adding Spark / Shark to our analytics pipeline and would
also like to use Spark Streaming for some incremental computations, but I
have some questions about the suitability of Spark Streaming.

Roughly, we have events that are generated by app servers based on user
interactions with our system; these events are protocol buffers and are sent
through Kafka for distribution. For certain events, we'd like to extract
information and update a data store - basically just take a value, check it
against what's already in the store, and update it. It seems like Storm is
more suitable for this task since the events can be processed individually.
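
For concreteness, the per-event path is basically just this (the event and
store types below are made-up placeholders, not our actual code):

// Sketch of the per-event update; UserEvent stands in for a decoded protobuf
// and ValueStore for whatever data store client we end up using.
case class UserEvent(userId: String, value: Long)

trait ValueStore {
  def get(userId: String): Option[Long]       // current stored value, if any
  def put(userId: String, value: Long): Unit  // overwrite with the new value
}

def handleEvent(event: UserEvent, store: ValueStore): Unit = {
  val current = store.get(event.userId)
  // Only write when the incoming value supersedes what is already stored.
  if (current.forall(_ < event.value)) {
    store.put(event.userId, event.value)
  }
}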

We'd also like to update some statistics incrementally. These are currently
spec'd as 30- or 90-day rolling windows but could end up being longer. We
have O(million) users generating events over these time frames, but the
individual streams are very sparse - generally only a few events per day
per user at most. Writing roll-ups ourselves with Storm is a pain, while
just running roll-up queries against the database is quite easy, but we're
moving away from that approach because it generates a non-negligible amount
of load that we'd rather avoid.

It seems like Spark Streaming's windows would fit the bill quite well, but
I'm curious how well (if at all) such a large number of partitions and such
long windows are supported.
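
To make that concrete, I'm picturing something roughly like the following
(simplified, and assuming the Kafka events have already been decoded into
(userId, value) pairs):

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations
import org.apache.spark.streaming.dstream.DStream

// Simplified sketch: a 30-day rolling sum per user, recomputed every hour.
// `events` stands in for our decoded Kafka stream of (userId, value) pairs.
def thirtyDayRollup(events: DStream[(String, Long)]): DStream[(String, Long)] =
  events.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,  // combine values that fall inside the window
    Minutes(60 * 24 * 30),        // window length: 30 days
    Minutes(60)                   // slide interval: recompute hourly
  )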

One other question I have right now is about the windows - whether or not
an input falls outside a window is based on Spark's notion of when the
input arrives, correct? If our Kafka cluster stopped sending data to Spark
for a non-negligible amount of time for some reason, it seems there would
be the possibility of including extra data in a window.

Thanks,

liam

-- 
Liam Stewart :: liam.stewart@gmail.com

Re: spark streaming questions

Posted by Tathagata Das <ta...@gmail.com>.
Responses inline.


On Mon, Feb 3, 2014 at 11:03 AM, Liam Stewart <li...@gmail.com> wrote:

> I'm looking at adding Spark / Shark to our analytics pipeline and would
> also like to use Spark Streaming for some incremental computations, but I
> have some questions about the suitability of Spark Streaming.
>
> Roughly, we have events that are generated by app servers based on user
> interactions with our system; these events are protocol buffers and are sent
> through Kafka for distribution. For certain events, we'd like to extract
> information and update a data store - basically just take a value, check it
> against what's already in the store, and update it. It seems like Storm is
> more suitable for this task since the events can be processed individually.
>
> We'd also like to update some statistics incrementally. These are currently
> spec'd as 30- or 90-day rolling windows but could end up being longer. We
> have O(million) users generating events over these time frames, but the
> individual streams are very sparse - generally only a few events per day
> per user at most. Writing roll-ups ourselves with Storm is a pain, while
> just running roll-up queries against the database is quite easy, but we're
> moving away from that approach because it generates a non-negligible amount
> of load that we'd rather avoid.
>
> It seems like Spark Streaming's windows would fit the bill quite well, but
> I'm curious how well (if at all) such a large number of partitions and such
> long windows are supported.
>

We have not tested Spark Streaming with very large windows. That said, a
window may not be the most suitable option here. You should take a look at
the updateStateByKey operation on DStreams, which allows you to maintain
arbitrary per-key state and update it with new data. That may be more
suitable for the purpose of roll-ups: you write the roll-up function for the
values (i.e., the statistics) for each key (i.e., for each user). Take a
look at the example -
https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala?source=c
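
A rough sketch of that pattern for per-user statistics (the stats type and
the roll-up itself are just placeholders for whatever you actually need to
compute):

import org.apache.spark.streaming.StreamingContext._  // pair DStream operations
import org.apache.spark.streaming.dstream.DStream

// Placeholder running statistics per user; replace with your real roll-up.
case class UserStats(count: Long, sum: Long)

// Called once per batch for each user: fold the batch's new values into the
// previous state. Returning None would drop that key's state entirely.
def updateStats(newValues: Seq[Long], state: Option[UserStats]): Option[UserStats] = {
  val prev = state.getOrElse(UserStats(0L, 0L))
  Some(UserStats(prev.count + newValues.size, prev.sum + newValues.sum))
}

// `events` is assumed to be a DStream of (userId, value) pairs from Kafka.
// Note: updateStateByKey needs checkpointing enabled, e.g. ssc.checkpoint(dir).
def userStats(events: DStream[(String, Long)]): DStream[(String, UserStats)] =
  events.updateStateByKey(updateStats _)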


>
> One other question I have right now is about the windows - whether or not
> an input falls outside a window is based on Spark's notion of when the
> input arrives, correct? If our Kafka cluster stopped sending data to Spark
> for a non-negligible amount of time for some reason, it seems there would
> be the possibility of including extra data in a window.
>
Yes, Spark Streaming's receiver decides which batch a particular record
falls into based on when the record arrived. Though I didn't quite get how
you could end up with *extra* data in a window if Kafka stops sending.

TD


> Thanks,
>
> liam
>
> --
> Liam Stewart :: liam.stewart@gmail.com
>