Posted to user@spark.apache.org by Ognen Duzlevski <og...@nengoiksvelzud.com> on 2014/01/09 20:07:21 UTC

General Spark question (streaming)

Hello,

I am new to Spark and have a few questions that are fairly general in
nature:

I am trying to set up a real-time data analysis pipeline where I have
clients sending events to a collection point (load balanced) and onward the
"collectors" send the data to a Spark cluster via zeromq pub/sub (just an
experiment).
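
Roughly, the receiving end I have in mind looks like this (untested
sketch using the spark-streaming-zeromq helper; the URL, topic, and
object name are made up):

  import akka.util.ByteString
  import akka.zeromq.Subscribe
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.zeromq.ZeroMQUtils

  object ZmqIngest {
    def main(args: Array[String]) {
      val ssc = new StreamingContext(
        new SparkConf().setAppName("ZmqIngest"), Seconds(10))

      // Each published frame is assumed to carry one UTF-8 encoded event.
      def bytesToEvents(frames: Seq[ByteString]): Iterator[String] =
        frames.map(_.utf8String).iterator

      // Publisher URL and topic are invented for illustration.
      val events = ZeroMQUtils.createStream(
        ssc, "tcp://collector-lb:5556", Subscribe("events"), bytesToEvents _)

      events.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }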

What do people generally do once they have the data in Spark to enable
real-time analytics? Do you store it in some persistent storage and analyze
it within some window (let's say the last five minutes) after enough has
been aggregated, or...?

If I want to count the number of occurrences of an event within a given
time frame within a streaming context - does Spark support this and how?
General guidelines are OK, and any experiences, knowledge, and advice are
greatly appreciated!

Thanks
Ognen

Re: General Spark question (streaming)

Posted by Khanderao kand <kh...@gmail.com>.
1."What do people generally do once they have the data in Spark to enable
real-time analytics. Do you store it in some persistent storage and analyze
it within some window (let's say the last five minutes) after enough has
been aggregated or...?"
>>>It is based on your application. If you have dash boarding / alerting
application then you would push the aggregated results to UI / message
queue. However, if you want these results to be available for later
queries, it would need to be persisted in some storage like HBase.
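
For example, a minimal sketch of the persistence route (saveAsTextFile
and the HDFS path are just stand-ins; a real HBase or message-queue
client call would go in their place):

  import org.apache.spark.streaming.dstream.DStream

  // Persist each batch of aggregated (event, count) pairs so that
  // later queries can read them back.
  def persistCounts(counts: DStream[(String, Long)]) {
    counts.foreachRDD { (rdd, time) =>
      // One output directory per batch, keyed by the batch timestamp.
      rdd.saveAsTextFile("hdfs:///events/counts-" + time.milliseconds)
    }
  }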

2. "If I want to count the number of occurrences of an event within a given
time frame within a streaming context - does Spark support this and how? "
  >>>Spark supporting windowing, as well as counter.

