Posted to user@flume.apache.org by Thomas Vachon <va...@sessionm.com> on 2011/12/21 20:52:09 UTC

Duplication Possibilities in autoE2E

It seems to me that duplicated data is possible in a couple of scenarios.  I will outline the one that comes to mind first below.  My ultimate question is: how do we solve this without resorting to an expensive (resource-wise) de-dupe inside Hive?

Scenario:

Node1 sends data to collector1 using an autoE2EChain.  Before the roll.milli timer expires (10 minutes in our case), collector1 goes down.  This causes node1 to fail over and transmit to collector2.  Collector2 rolls its file after the roll.milli timer expires and starts a new log.  Later, collector1 comes back online and tries to flush its WAL files to its collectorSink (the same sink that collector2 writes to).  In theory, any event that reached both collectors is then written twice, so we get duplicate data.
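
To make the sequence concrete, here is a toy model of the scenario in Python.  It is only a sketch of my understanding, not Flume code: the event names, the lists standing in for the WAL and the shared sink, and the assumption that node1 resends every unacknowledged event to collector2 are all made up for illustration.

```python
from collections import Counter

# Hypothetical events produced by node1 (names are illustrative only).
events = [f"event-{i}" for i in range(6)]

shared_sink = []      # stands in for the storage both collectors write to
collector1_wal = []   # events collector1 received but had not rolled yet

# 1. node1 sends the first four events to collector1, which then goes
#    down before its roll timer expires, so nothing is acknowledged.
collector1_wal.extend(events[:4])

# 2. node1 fails over to collector2. Assumption: it resends the four
#    unacknowledged events and then continues with the remaining ones.
collector2_buffer = list(events)

# 3. collector2's roll timer expires and it writes its file to the sink.
shared_sink.extend(collector2_buffer)

# 4. collector1 comes back online and flushes its leftover WAL contents
#    to the same sink.
shared_sink.extend(collector1_wal)

# Any event written more than once is a duplicate.
counts = Counter(shared_sink)
duplicates = sorted(e for e, n in counts.items() if n > 1)
print(duplicates)
```

If this model matches what Flume actually does, the events sent before the failover are the ones that end up in the sink twice.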

So is my understanding correct, and if it is, can we avoid it?

Thanks,
Tom Vachon