Posted to user@storm.apache.org by Raphael Hsieh <ra...@gmail.com> on 2014/04/25 05:27:19 UTC

Flush aggregated data every X seconds

Is there a way in Storm Trident to aggregate data over a certain time
period and have it flush the data out to an external data store after that
time period is up?

Trident does not have the functionality of tick tuples yet, so I cannot use
that. Everything I've been researching leads me to believe that this is not
possible in Storm/Trident. However, this seems to me to be a fairly standard
use case for any streaming map-reduce library.

For example, if I am receiving a stream of integers, I want to aggregate
all of those integers over a period of 1 second, then persist the result
to an external datastore.

The goal is not to count how much the values add up to over X amount of
time; rather, I would like to minimize the reads/writes/updates I do against
said datastore.
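
To make the request concrete, here is a minimal sketch in plain Java of the
behavior being asked for: sum the incoming integers and write one value per
window to a sink. It is an illustration only and uses no Storm or Trident
API. The class name WindowedSum, the injected clock, and the sink callback
(standing in for the external datastore write) are all assumptions made for
this example.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration; not part of Storm or Trident. */
final class WindowedSum {
    private final long windowMillis;
    private final LongSupplier clock; // injected so the window can be tested
    private final LongConsumer sink;  // stands in for the external datastore write
    private long windowStart;
    private long sum;
    private boolean hasData;

    WindowedSum(long windowMillis, LongSupplier clock, LongConsumer sink) {
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.sink = sink;
        this.windowStart = clock.getAsLong();
    }

    /** Adds one value, flushing the previous window first if it has expired. */
    void add(int value) {
        flushIfExpired();
        sum += value;
        hasData = true;
    }

    /** Writes the running sum to the sink once the window has elapsed. */
    void flushIfExpired() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            if (hasData) {
                sink.accept(sum); // one write per window instead of one per tuple
            }
            sum = 0;
            hasData = false;
            windowStart = now;
        }
    }
}
```

Note that in this sketch a flush only happens when a new value arrives or
when something else calls flushIfExpired(). That external trigger is exactly
the role a tick tuple would play.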

There are many ways to reduce these operations; however, all of them
force me to modify my schema in ways that are unpleasant. Also, I would
rather not have my final external datastore be my scratch space, where my
program is reading/updating/writing and checking to make sure that the
transaction IDs line up.
Instead, I want that scratch work to be done separately, with the final
result then stored in a final database that no longer needs constant
updating.

Thanks
-- 
Raphael Hsieh

Re: Flush aggregated data every X seconds

Posted by Raphael Hsieh <ra...@gmail.com>.
Thank you very much for your quick reply, Corey.
Unfortunately, I don't believe the TickSpout exists within Trident yet. I
have seen the threads discussing the implementation of a sliding window, and
I've read Michael Noll's blog post
<http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/>
about it as well. I don't need a sliding window so much as multiple
consecutive window chunks, if that makes sense, haha.

What I'm thinking about resorting to is increasing my batch size to be much
larger than the throughput of the spout, then, at the end of my topology,
doing an aggregation such that everything aggregates to a single tuple, and
running an ".each" on that single tuple with a function that just sleeps for
X amount of time.

My theory is that this should allow the stream to back up enough that
the next batch takes in (roughly) the entire next X time's worth of data.

Can anyone validate that this technique will work?


On Thu, Apr 24, 2014 at 8:36 PM, Corey Nolet <cj...@gmail.com> wrote:

> Raphael, in your case it sounds like a "TickSpout" could be useful where
> you emit a tuple every n time slices and then sleep until needing to emit
> another. I'm not sure how that'd work in a Trident aggregator, however.
>
> I'm not sure if this is something Nathan or the community would approve
> of, but I've been writing my own framework for doing sliding/tumbling
> windows in Storm that allows aggregations and triggering/eviction by count,
> time, and other policies like "when the time difference between the first
> item and the last item in a window is less than x". The bolts could easily
> be ripped out for doing your own aggregations.
>
> It's located here: https://github.com/calrissian/flowbox
>
> It's very much in the proof-of-concept stage. My other requirement (and
> the reason I cared so much to implement this) was that the rules need to be
> dynamic and the topology needs to be static, so as to make the best use of
> resources while users are defining what they need.
>
>
>


-- 
Raphael Hsieh

Re: Flush aggregated data every X seconds

Posted by Corey Nolet <cj...@gmail.com>.
Raphael, in your case it sounds like a "TickSpout" could be useful where
you emit a tuple every n time slices and then sleep until needing to emit
another. I'm not sure how that'd work in a Trident aggregator, however.

I'm not sure if this is something Nathan or the community would approve of,
but I've been writing my own framework for doing sliding/tumbling windows
in Storm that allows aggregations and triggering/eviction by count, time,
and other policies like "when the time difference between the first item
and the last item in a window is less than x". The bolts could easily be
ripped out for doing your own aggregations.

It's located here: https://github.com/calrissian/flowbox

It's very much in the proof-of-concept stage. My other requirement (and
the reason I cared so much to implement this) was that the rules need to be
dynamic and the topology needs to be static, so as to make the best use of
resources while users are defining what they need.
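
For readers unfamiliar with the terms, the idea of a tumbling window that
triggers by count or by time can be sketched in a few lines of plain Java.
This is a hypothetical illustration and is not taken from flowbox or from
any Storm API; the class name TumblingWindow, its parameters, and the
caller-supplied timestamps are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration; unrelated to the flowbox code base. */
final class TumblingWindow<T> {
    private final int maxCount;        // count trigger
    private final long maxAgeMillis;   // time trigger
    private final Consumer<List<T>> onTrigger;
    private final List<T> items = new ArrayList<>();
    private long firstTimestamp;

    TumblingWindow(int maxCount, long maxAgeMillis, Consumer<List<T>> onTrigger) {
        this.maxCount = maxCount;
        this.maxAgeMillis = maxAgeMillis;
        this.onTrigger = onTrigger;
    }

    /** Adds an item stamped with a caller-supplied time in milliseconds. */
    void add(T item, long timestampMillis) {
        // Time trigger: the open window is too old, so emit it before adding.
        if (!items.isEmpty() && timestampMillis - firstTimestamp >= maxAgeMillis) {
            trigger();
        }
        if (items.isEmpty()) {
            firstTimestamp = timestampMillis;
        }
        items.add(item);
        // Count trigger: the window is full, so emit it immediately.
        if (items.size() >= maxCount) {
            trigger();
        }
    }

    /** Hands a copy of the window to the callback and starts a fresh window. */
    private void trigger() {
        onTrigger.accept(new ArrayList<>(items));
        items.clear();
    }
}
```

Because the window is emptied every time it triggers, no item ever appears
in two windows, which is what distinguishes a tumbling window from a
sliding one.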


