Posted to user@storm.apache.org by Marco Nicolini <ma...@gmail.com> on 2016/07/05 14:33:23 UTC

"Coalescing" tuples?

Hello guys,

I'm evaluating Storm for a project at work.

I've just scratched the surface of Storm and I'd like to ask for opinions
on how you would tackle a typical issue that pops up in the domain I work
in (real-time financial market data).

I have real-time data pushed into Storm (I have written a spout for
that), and I need to group it on certain keys (fields grouping seems to
take care of that nicely), do some processing on it, and write it to a
queue somewhere (think JMS or other middleware).
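
Just to make the shape of the topology concrete, I have something like
this in mind (MarketDataSpout, ProcessingBolt, NUM_GROUPS and the "group"
field are placeholders for my actual components, not real Storm classes):

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class TopologySketch {
    static final int NUM_GROUPS = 8; // assume the number of groups is known up front

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("market-data", new MarketDataSpout());
        // fields grouping keeps all tuples for a given group on the same processor task
        builder.setBolt("processor", new ProcessingBolt(), NUM_GROUPS)
               .fieldsGrouping("market-data", new Fields("group"));
        // ProcessingBolt writes its results to the JMS/middleware queue itself
    }
}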

The thing is, the processing takes a long time relative to the frequency
of the input data, and I'm not interested in building up a queue of data
to process (I would never be able to keep up with it).
What I'm really after is a way to coalesce tuples at the bolt level so
that the processing bolt always works on the freshest possible data. In
other words, I'm looking for a way to skip tuples that came in while my
processor bolt was working.

Note that I'm not concerned about parallelizing the processing, as
processing data for the same group in parallel would break the ordering
requirement I have on processed outputs. At most I'm aiming at one
"processor bolt per group" (we can assume the number of groups is known
in advance).

I think I can implement this behaviour in the spout: emit a new tuple
for a given grouping only when the last emitted tuple for that grouping
has been fully processed (i.e. ack() has been called on the spout).
However, this complicates the spout code a bit (it would have to group
things before emitting them) and in general it ends up putting too much
complexity on one component, IMHO.
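
Roughly, I picture the spout looking something like this (just a sketch;
onUpdate() is a stand-in for however the real feed would push data in,
and I use the group key as the message id):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class CoalescingSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    // freshest unemitted payload per group, overwritten by the feed thread
    private final ConcurrentHashMap<String, String> latest = new ConcurrentHashMap<>();
    // groups with a tuple emitted but not yet acked/failed
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    // the market data feed thread would call this on every update (placeholder)
    public void onUpdate(String group, String payload) {
        latest.put(group, payload);
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // here I would wire onUpdate(...) into the real feed
    }

    @Override
    public void nextTuple() {
        for (String group : latest.keySet()) {
            if (inFlight.add(group)) {                  // at most one tuple in flight per group
                String payload = latest.remove(group);  // take the freshest value
                if (payload != null) {
                    collector.emit(new Values(group, payload), group); // message id = group
                } else {
                    inFlight.remove(group);
                }
            }
        }
    }

    @Override
    public void ack(Object msgId) {
        inFlight.remove(msgId);   // the next (freshest) value for this group may now be emitted
    }

    @Override
    public void fail(Object msgId) {
        inFlight.remove(msgId);   // on failure just move on to the freshest value
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("group", "payload"));
    }
}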

If a bolt had a way to know when a tuple it emitted has been fully
processed (similar to what ack() does for the spout), the problem would
be simpler and the complexity per component would be lower.

Alternatively, the processor bolt could process things asynchronously
(i.e. not in the thread where execute() is called on the bolt), always
picking up the "latest" data from some state local to the bolt. That
state would be overwritten every time a fresh tuple is received via
execute().
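
Concretely, I imagine something like this (just a rough sketch; it
assumes a single group ends up on each bolt instance, and process() /
writeToQueue() are placeholders for my real computation and JMS writer):

import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class LatestValueBolt extends BaseRichBolt {

    private OutputCollector collector;
    // holds only the freshest payload for this bolt's group; older values are overwritten
    private final AtomicReference<String> latest = new AtomicReference<>();
    private volatile boolean running = true;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        Thread worker = new Thread(() -> {
            while (running) {
                String payload = latest.getAndSet(null);  // take and clear the freshest value
                if (payload == null) {
                    try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                } else {
                    writeToQueue(process(payload));       // the slow part runs off the executor thread
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    @Override
    public void execute(Tuple input) {
        latest.set(input.getStringByField("payload"));  // coalesce: only the newest value survives
        collector.ack(input);                           // ack right away; skipped tuples are dropped on purpose
    }

    @Override
    public void cleanup() {
        running = false;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // nothing emitted downstream in this sketch; results go straight to the queue
    }

    private String process(String payload) {
        return payload;  // placeholder for the long-running computation
    }

    private void writeToQueue(String result) {
        // placeholder for the JMS/middleware write
    }
}

Acking immediately in execute() means skipped tuples are never replayed,
which is exactly the behaviour I'm after here.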

Or...

Is there a 'configurable way' to achieve that?

Any other way you would tackle this?

Sorry for the "wall of text". Any idea/help/advice is greatly
appreciated...

Regards,

Marco

Re: "Coalescing" tuples?

Posted by Nathan Leung <nc...@gmail.com>.
I would push the incoming tuples into a stack, and have a background thread
take the stack, process the top element, and discard the rest.
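Something along these lines, roughly (names are just illustrative; push()
is called from the bolt's execute() and takeAndProcessNewest() is called
in a loop from the background thread):

import java.util.ArrayDeque;
import java.util.Deque;

public class CoalescingWorker {

    private final Object lock = new Object();
    private Deque<String> stack = new ArrayDeque<>();

    // called from execute(): the newest element ends up on top of the stack
    public void push(String payload) {
        synchronized (lock) {
            stack.push(payload);
        }
    }

    // called from the background thread: take the whole stack,
    // process only its top (newest) element, discard the rest
    public void takeAndProcessNewest() {
        Deque<String> taken;
        synchronized (lock) {
            taken = stack;
            stack = new ArrayDeque<>();  // swap in an empty stack for new arrivals
        }
        String newest = taken.peek();    // top of the taken stack = most recent tuple
        if (newest != null) {
            process(newest);             // the slow processing step
        }                                // everything below the top is discarded
    }

    private void process(String payload) {
        // placeholder for the actual work + write to the output queue
    }
}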