You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flume.apache.org by "Gabriel Commeau (JIRA)" <ji...@apache.org> on 2013/09/02 08:12:52 UTC

[jira] [Commented] (FLUME-2173) Exactly once semantics for Flume

    [ https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755918#comment-13755918 ] 

Gabriel Commeau commented on FLUME-2173:
----------------------------------------

Hi Hari & team,

May I suggest the following idea: instead of assigning a UUID to the events, which I assume would be arbitrary if not random, what about enforcing ordering of events? Each "ingest" agent/client (i.e. first tier) would have a unique identifier (e.g. a random UUID , or host name + agent name), and a local counter, which would increment for every event generated/ingested by that agent. Consequently, each event has an "ingest" ID and a counter value. In ZooKeeper, instead of having a long list of UUID for the events recently gone once, we'd only have as many Z-nodes as ingest agents/clients (let it be N), which contain the highest counter value of events successfully passed through from the corresponding ingest agent/client. If an event has successfully been processed (i.e. the first time), the dedup channel increments the ZK counter to the counter value of that event. If the ZK counter is equal or greater than the counter value of the event, it's a duplicate.
The advantage is that a batch of M events successfully processed can be acknowledged in k <= N ZooKeeper operations, and not in M - usually much larger than N. The inconvenient is that if some events get stuck in process, the dedup channels will be waiting on them, and so we'd need a way to resend these events - and therefore, a channel seems like the appropriate place to do that. The "ingest" channel can clear the events that have a counter value below the ZK counter, as they have successfully been through once.

On another note, what about grouping the dedup channels for exactly-once-semantics? We would define a namespace for this group of channels, and would guaranty that the events come exactly once in that group of channels; but it could come twice in 2 distinct groups of channels - say one that goes to HDFS and one to HBase for instance. The ZK structure detailed above can be duplicated for each namespace (which would be a parent Z-node); and in order to clear the events, the "ingest" channel needs to check all existing namespaces.

I hope this helps.
                
> Exactly once semantics for Flume
> --------------------------------
>
>                 Key: FLUME-2173
>                 URL: https://issues.apache.org/jira/browse/FLUME-2173
>             Project: Flume
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>
> Currently Flume guarantees only at least once semantics. This jira is meant to track exactly once semantics for Flume. My initial idea is to include uuid event ids on events at the original source (use a config to mark a source an original source) and identify destination sinks. At the destination sinks, use a unique ZK Znode to track the events. If once seen (and configured), pull the duplicate out.
> This might need some refactoring, but my belief is we can do this in a backward compatible way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira