Posted to user@storm.apache.org by Morten Pedersen <mo...@sneftrup.net> on 2016/05/23 22:37:11 UTC

Working around tuple timeouts in long running joins

Hi everyone,

I'm working on a system that is using Kafka and Storm to transcribe phone
calls that can be up to 4 hours long. The call is transcribed in chunks and
will slowly stream into Kafka as the phone call progresses.

Now to the problem. As we transcribe chunks, they get put into Kafka and
we're trying to use Storm to join all chunks for a given call into one
complete transcript. We do this by field grouping on the id of the call,
and keeping an in-memory map of all the chunks per call in the bolt, and
emitting the complete transcript once the call is done. In order to
guarantee that we process the complete call, I want to delay ack'ing of the
chunk tuples until we have received every single tuple for the call, at
which point we then emit the complete transcript anchored to every chunk
tuple for the call. Because the tuples could be waiting in the bolt for up
to 4 hours, we need to guarantee that Storm doesn't replay tuples that are
still legitimately in flight, while also guaranteeing that if something
fails further down the topology, all chunks for the call are replayed.

I've had a few ideas on how to solve this, but none of them seems quite
right.

1) Raise the tuple timeout to 4 hours. This would technically work, but
given the default 30 second timeout, changing it to 4 hours seems crazy.

2) Periodically call resetTimeout on the tuples to keep them alive. This
would also work, but there's supposedly a significant performance cost to
doing that.

3) Store chunks in a data store somewhere outside of storm and join from
that when a call is done.

Any help would be greatly appreciated.

Thanks,
Morten