You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@storm.apache.org by Rajasekar Elango <re...@salesforce.com> on 2015/08/28 03:59:10 UTC

Time series aggregation with storm Trident

We have time series data in kafka and we want to aggregate it in storm
using trident. I was able to get data aggregated using persistentAggregate
based onFAQ <https://storm.apache.org/documentation/FAQ.html>. But
aggregation is always done within small batches, I could not figure out a
way to detect when all events for a one minute time window is processed.
Calling each after persistentAggregate(...).newValuesStream() returns
results as soon as a batch is processed, but I want to aggregate values
across multiple batches for a time window. I could not find good answer or
example online. I also see mixed opinion, some people say it's not possible
to do time window aggregation in trident, some people say it's possible
(especially FAQ <https://storm.apache.org/documentation/FAQ.html> looks
promising). The alternate option seem to be using tick tuples with storm
basic, but would prefer to do it in trident as it has better guaranteed
processing semantics and abstraction for persistence.

Can some one provide more details or examples on how to do this?

-- 
Thanks,
Raja.

Re: Time series aggregation with storm Trident

Posted by Andrew Xor <an...@gmail.com>.

Hi,

 I am not aware of any public implementations that exist maybe others can
point you out to some. The computation could be performed in Storm as well
by fetching the items after the time slice is completed (how you resolve
where the end of a time-slice happens, is up to your particular domain
constraints).

Regards.

Kindly yours,

Andrew Grammenos

-- PGP PKey --
 <https://www.dropbox.com/s/2kcxe59zsi9nrdt/pgpsig.txt>
https://www.dropbox.com/s/yxvycjvlsc111bh/pgpsig.txt

On Mon, Sep 14, 2015 at 9:28 PM, Ajay Chander <aj...@gmail.com> wrote:

> Hi Andrew,
>
> Thank you for your response. If possible can you point me to any such
> implementation examples where they used trident.
>
> I have a couple of questions:
>
> If we persist the state to the external store like memsql, I assume that
> the computation is happening on the memsql side. In this case are we
> limiting ourselves with regards to the scalability provided by Apache
> storm? Here in this particular scenario are we using storm just as a router
> ?
>
> Thank you for your time,
> Ajay
>
> On Monday, September 14, 2015, Andrew Xor <an...@gmail.com>
> wrote:
>
>> I have, what you could do is have an external persistent store (something
>> fast, say for example Memcached or Haystack) that you have your aggregation
>> batches for a specific time-slice. For example have a 1 hour window with
>> 10-minute slices that are cleared and rotated as needed. Another problem
>> that you have to deal with is the fact that should a spout source fails
>> everything is delayed unless you have an opaque spout which of course has
>> some downsides as indicated here
>> <https://storm.apache.org/documentation/Trident-state>.
>>
>> Hope this helps.
>>
>> Kindly yours,
>>
>> Andrew Grammenos
>>
>> -- PGP PKey --
>>  <https://www.dropbox.com/s/2kcxe59zsi9nrdt/pgpsig.txt>
>> https://www.dropbox.com/s/yxvycjvlsc111bh/pgpsig.txt
>>
>> On Mon, Sep 14, 2015 at 7:40 PM, Ajay Chander <aj...@gmail.com>
>> wrote:
>>
>>> Hi Guys,
>>>
>>> Right now I am trying to implement the same as mentioned by Elango in
>>> the below email. I want to perform aggregations based on a time window
>>> using trident. Anyone have done this before using trident? Any help is
>>> highly appreciated.
>>>
>>> Thank you,
>>> Ajay
>>>
>>>
>>> On Thursday, August 27, 2015, Rajasekar Elango <re...@salesforce.com>
>>> wrote:
>>>
>>>> We have time series data in kafka and we want to aggregate it in storm
>>>> using trident. I was able to get data aggregated using persistentAggregate
>>>> based onFAQ <https://storm.apache.org/documentation/FAQ.html>. But
>>>> aggregation is always done within small batches, I could not figure out a
>>>> way to detect when all events for a one minute time window is processed.
>>>> Calling each after persistentAggregate(...).newValuesStream() returns
>>>> results as soon as a batch is processed, but I want to aggregate values
>>>> across multiple batches for a time window. I could not find good answer or
>>>> example online. I also see mixed opinion, some people say it's not possible
>>>> to do time window aggregation in trident, some people say it's possible
>>>> (especially FAQ <https://storm.apache.org/documentation/FAQ.html> looks
>>>> promising). The alternate option seem to be using tick tuples with storm
>>>> basic, but would prefer to do it in trident as it has better guaranteed
>>>> processing semantics and abstraction for persistence.
>>>>
>>>> Can some one provide more details or examples on how to do this?
>>>>
>>>> --
>>>> Thanks,
>>>> Raja.
>>>>
>>>
>>

Re: Time series aggregation with storm Trident

Posted by Ajay Chander <aj...@gmail.com>.

Hi Andrew,

Thank you for your response. If possible can you point me to any such
implementation examples where they used trident.

I have a couple of questions:

If we persist the state to the external store like memsql, I assume that
the computation is happening on the memsql side. In this case are we
limiting ourselves with regards to the scalability provided by Apache
storm? Here in this particular scenario are we using storm just as a router
?

Thank you for your time,
Ajay

On Monday, September 14, 2015, Andrew Xor <an...@gmail.com>
wrote:

> I have, what you could do is have an external persistent store (something
> fast, say for example Memcached or Haystack) that you have your aggregation
> batches for a specific time-slice. For example have a 1 hour window with
> 10-minute slices that are cleared and rotated as needed. Another problem
> that you have to deal with is the fact that should a spout source fails
> everything is delayed unless you have an opaque spout which of course has
> some downsides as indicated here
> <https://storm.apache.org/documentation/Trident-state>.
>
> Hope this helps.
>
> Kindly yours,
>
> Andrew Grammenos
>
> -- PGP PKey --
>  <https://www.dropbox.com/s/2kcxe59zsi9nrdt/pgpsig.txt>
> https://www.dropbox.com/s/yxvycjvlsc111bh/pgpsig.txt
>
> On Mon, Sep 14, 2015 at 7:40 PM, Ajay Chander <ajay.chevva@gmail.com
> <javascript:_e(%7B%7D,'cvml','ajay.chevva@gmail.com');>> wrote:
>
>> Hi Guys,
>>
>> Right now I am trying to implement the same as mentioned by Elango in the
>> below email. I want to perform aggregations based on a time window using
>> trident. Anyone have done this before using trident? Any help is highly
>> appreciated.
>>
>> Thank you,
>> Ajay
>>
>>
>> On Thursday, August 27, 2015, Rajasekar Elango <relango@salesforce.com
>> <javascript:_e(%7B%7D,'cvml','relango@salesforce.com');>> wrote:
>>
>>> We have time series data in kafka and we want to aggregate it in storm
>>> using trident. I was able to get data aggregated using persistentAggregate
>>> based onFAQ <https://storm.apache.org/documentation/FAQ.html>. But
>>> aggregation is always done within small batches, I could not figure out a
>>> way to detect when all events for a one minute time window is processed.
>>> Calling each after persistentAggregate(...).newValuesStream() returns
>>> results as soon as a batch is processed, but I want to aggregate values
>>> across multiple batches for a time window. I could not find good answer or
>>> example online. I also see mixed opinion, some people say it's not possible
>>> to do time window aggregation in trident, some people say it's possible
>>> (especially FAQ <https://storm.apache.org/documentation/FAQ.html> looks
>>> promising). The alternate option seem to be using tick tuples with storm
>>> basic, but would prefer to do it in trident as it has better guaranteed
>>> processing semantics and abstraction for persistence.
>>>
>>> Can some one provide more details or examples on how to do this?
>>>
>>> --
>>> Thanks,
>>> Raja.
>>>
>>
>

Re: Time series aggregation with storm Trident

Posted by Ajay Chander <aj...@gmail.com>.

Hi Indranil,

Thank you for the info. I was looking at some options where I can use
trident to perform times window aggregations. In the example which you
provided I see it was using core storm implementation. Thanks for your time.


On Monday, September 14, 2015, Indranil Roy <in...@gmail.com>
wrote:

> This might help : https://tomdzk.wordpress.com/2011/09/28/storm-esper/
>
> On Mon, Sep 14, 2015 at 10:19 PM Andrew Xor <andreas.grammenos@gmail.com
> <javascript:_e(%7B%7D,'cvml','andreas.grammenos@gmail.com');>> wrote:
>
>> I have, what you could do is have an external persistent store (something
>> fast, say for example Memcached or Haystack) that you have your aggregation
>> batches for a specific time-slice. For example have a 1 hour window with
>> 10-minute slices that are cleared and rotated as needed. Another problem
>> that you have to deal with is the fact that should a spout source fails
>> everything is delayed unless you have an opaque spout which of course has
>> some downsides as indicated here
>> <https://storm.apache.org/documentation/Trident-state>.
>>
>> Hope this helps.
>>
>> Kindly yours,
>>
>> Andrew Grammenos
>>
>> -- PGP PKey --
>>  <https://www.dropbox.com/s/2kcxe59zsi9nrdt/pgpsig.txt>
>> https://www.dropbox.com/s/yxvycjvlsc111bh/pgpsig.txt
>>
>> On Mon, Sep 14, 2015 at 7:40 PM, Ajay Chander <ajay.chevva@gmail.com
>> <javascript:_e(%7B%7D,'cvml','ajay.chevva@gmail.com');>> wrote:
>>
>>> Hi Guys,
>>>
>>> Right now I am trying to implement the same as mentioned by Elango in
>>> the below email. I want to perform aggregations based on a time window
>>> using trident. Anyone have done this before using trident? Any help is
>>> highly appreciated.
>>>
>>> Thank you,
>>> Ajay
>>>
>>>
>>> On Thursday, August 27, 2015, Rajasekar Elango <relango@salesforce.com
>>> <javascript:_e(%7B%7D,'cvml','relango@salesforce.com');>> wrote:
>>>
>>>> We have time series data in kafka and we want to aggregate it in storm
>>>> using trident. I was able to get data aggregated using persistentAggregate
>>>> based onFAQ <https://storm.apache.org/documentation/FAQ.html>. But
>>>> aggregation is always done within small batches, I could not figure out a
>>>> way to detect when all events for a one minute time window is processed.
>>>> Calling each after persistentAggregate(...).newValuesStream() returns
>>>> results as soon as a batch is processed, but I want to aggregate values
>>>> across multiple batches for a time window. I could not find good answer or
>>>> example online. I also see mixed opinion, some people say it's not possible
>>>> to do time window aggregation in trident, some people say it's possible
>>>> (especially FAQ <https://storm.apache.org/documentation/FAQ.html> looks
>>>> promising). The alternate option seem to be using tick tuples with storm
>>>> basic, but would prefer to do it in trident as it has better guaranteed
>>>> processing semantics and abstraction for persistence.
>>>>
>>>> Can some one provide more details or examples on how to do this?
>>>>
>>>> --
>>>> Thanks,
>>>> Raja.
>>>>
>>>
>> --
> Indranil RoyChowdhury
> +91-9830027560
>

Re: Time series aggregation with storm Trident

Posted by Indranil Roy <in...@gmail.com>.

This might help : https://tomdzk.wordpress.com/2011/09/28/storm-esper/

On Mon, Sep 14, 2015 at 10:19 PM Andrew Xor <an...@gmail.com>
wrote:

> I have, what you could do is have an external persistent store (something
> fast, say for example Memcached or Haystack) that you have your aggregation
> batches for a specific time-slice. For example have a 1 hour window with
> 10-minute slices that are cleared and rotated as needed. Another problem
> that you have to deal with is the fact that should a spout source fails
> everything is delayed unless you have an opaque spout which of course has
> some downsides as indicated here
> <https://storm.apache.org/documentation/Trident-state>.
>
> Hope this helps.
>
> Kindly yours,
>
> Andrew Grammenos
>
> -- PGP PKey --
>  <https://www.dropbox.com/s/2kcxe59zsi9nrdt/pgpsig.txt>
> https://www.dropbox.com/s/yxvycjvlsc111bh/pgpsig.txt
>
> On Mon, Sep 14, 2015 at 7:40 PM, Ajay Chander <aj...@gmail.com>
> wrote:
>
>> Hi Guys,
>>
>> Right now I am trying to implement the same as mentioned by Elango in the
>> below email. I want to perform aggregations based on a time window using
>> trident. Anyone have done this before using trident? Any help is highly
>> appreciated.
>>
>> Thank you,
>> Ajay
>>
>>
>> On Thursday, August 27, 2015, Rajasekar Elango <re...@salesforce.com>
>> wrote:
>>
>>> We have time series data in kafka and we want to aggregate it in storm
>>> using trident. I was able to get data aggregated using persistentAggregate
>>> based onFAQ <https://storm.apache.org/documentation/FAQ.html>. But
>>> aggregation is always done within small batches, I could not figure out a
>>> way to detect when all events for a one minute time window is processed.
>>> Calling each after persistentAggregate(...).newValuesStream() returns
>>> results as soon as a batch is processed, but I want to aggregate values
>>> across multiple batches for a time window. I could not find good answer or
>>> example online. I also see mixed opinion, some people say it's not possible
>>> to do time window aggregation in trident, some people say it's possible
>>> (especially FAQ <https://storm.apache.org/documentation/FAQ.html> looks
>>> promising). The alternate option seem to be using tick tuples with storm
>>> basic, but would prefer to do it in trident as it has better guaranteed
>>> processing semantics and abstraction for persistence.
>>>
>>> Can some one provide more details or examples on how to do this?
>>>
>>> --
>>> Thanks,
>>> Raja.
>>>
>>
> --
Indranil RoyChowdhury
+91-9830027560

Re: Time series aggregation with storm Trident

Posted by Andrew Xor <an...@gmail.com>.

I have, what you could do is have an external persistent store (something
fast, say for example Memcached or Haystack) that you have your aggregation
batches for a specific time-slice. For example have a 1 hour window with
10-minute slices that are cleared and rotated as needed. Another problem
that you have to deal with is the fact that should a spout source fails
everything is delayed unless you have an opaque spout which of course has
some downsides as indicated here
<https://storm.apache.org/documentation/Trident-state>.

Hope this helps.

Kindly yours,

Andrew Grammenos

-- PGP PKey --
 <https://www.dropbox.com/s/2kcxe59zsi9nrdt/pgpsig.txt>
https://www.dropbox.com/s/yxvycjvlsc111bh/pgpsig.txt

On Mon, Sep 14, 2015 at 7:40 PM, Ajay Chander <aj...@gmail.com> wrote:

> Hi Guys,
>
> Right now I am trying to implement the same as mentioned by Elango in the
> below email. I want to perform aggregations based on a time window using
> trident. Anyone have done this before using trident? Any help is highly
> appreciated.
>
> Thank you,
> Ajay
>
>
> On Thursday, August 27, 2015, Rajasekar Elango <re...@salesforce.com>
> wrote:
>
>> We have time series data in kafka and we want to aggregate it in storm
>> using trident. I was able to get data aggregated using persistentAggregate
>> based onFAQ <https://storm.apache.org/documentation/FAQ.html>. But
>> aggregation is always done within small batches, I could not figure out a
>> way to detect when all events for a one minute time window is processed.
>> Calling each after persistentAggregate(...).newValuesStream() returns
>> results as soon as a batch is processed, but I want to aggregate values
>> across multiple batches for a time window. I could not find good answer or
>> example online. I also see mixed opinion, some people say it's not possible
>> to do time window aggregation in trident, some people say it's possible
>> (especially FAQ <https://storm.apache.org/documentation/FAQ.html> looks
>> promising). The alternate option seem to be using tick tuples with storm
>> basic, but would prefer to do it in trident as it has better guaranteed
>> processing semantics and abstraction for persistence.
>>
>> Can some one provide more details or examples on how to do this?
>>
>> --
>> Thanks,
>> Raja.
>>
>

Re: Time series aggregation with storm Trident

Posted by Ajay Chander <aj...@gmail.com>.

Hi Guys,

Right now I am trying to implement the same as mentioned by Elango in the
below email. I want to perform aggregations based on a time window using
trident. Anyone have done this before using trident? Any help is highly
appreciated.

Thank you,
Ajay

On Thursday, August 27, 2015, Rajasekar Elango <re...@salesforce.com>
wrote:

> We have time series data in kafka and we want to aggregate it in storm
> using trident. I was able to get data aggregated using persistentAggregate
> based onFAQ <https://storm.apache.org/documentation/FAQ.html>. But
> aggregation is always done within small batches, I could not figure out a
> way to detect when all events for a one minute time window is processed.
> Calling each after persistentAggregate(...).newValuesStream() returns
> results as soon as a batch is processed, but I want to aggregate values
> across multiple batches for a time window. I could not find good answer or
> example online. I also see mixed opinion, some people say it's not possible
> to do time window aggregation in trident, some people say it's possible
> (especially FAQ <https://storm.apache.org/documentation/FAQ.html> looks
> promising). The alternate option seem to be using tick tuples with storm
> basic, but would prefer to do it in trident as it has better guaranteed
> processing semantics and abstraction for persistence.
>
> Can some one provide more details or examples on how to do this?
>
> --
> Thanks,
> Raja.
>