Posted to user@flume.apache.org by Sandeep Khurana <sk...@gmail.com> on 2014/03/05 07:12:23 UTC

Flume duplicates

We have flume ingestion servers in a production environment which get
data from a scribe source. These servers sit behind a load balancer. We
observed that we get lots of duplicates (7-8 times the original events) when we:
a) take a flume server out of the load balancer,
b) wait for the channel size to reach zero, i.e. wait for all data to be
flushed out,
c) change some configuration in flume (e.g. one time we changed the batch
size), and
d) put the server back into the load balancer.
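Step b) above (waiting for the channel to drain) can be checked against the
agent's metrics rather than by waiting a fixed time. A minimal sketch, assuming
the agent was started with Flume's built-in JSON HTTP metrics reporter
(-Dflume.monitoring.type=http); the port and the exact key names here are
assumptions to adapt to your monitoring setup:

```python
import json
import urllib.request


def channel_drained(metrics: dict) -> bool:
    """Return True when every channel in a Flume metrics snapshot is empty.

    `metrics` is the JSON object served by Flume's HTTP metrics reporter;
    channel entries are keyed "CHANNEL.<name>" and report counters as strings.
    """
    channels = {k: v for k, v in metrics.items() if k.startswith("CHANNEL.")}
    return all(int(v.get("ChannelSize", "0")) == 0 for v in channels.values())


def fetch_metrics(host: str, port: int = 34545) -> dict:
    # Port 34545 is just the example value from the Flume docs; use whatever
    # -Dflume.monitoring.port was set to on your agents.
    with urllib.request.urlopen(f"http://{host}:{port}/metrics") as resp:
        return json.load(resp)
```

One could poll `channel_drained(fetch_metrics(host))` in a loop before taking
step c), so the reconfiguration only happens once the channel is verifiably
empty.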

As soon as the flume server is put back into the load balancer we see a
sudden surge of data being processed. These are duplicate records (events).
The questions are:

a) Why do we see 7-9 times the original events as duplicates when we add this
server back into the load balancer?
b) What is the best way to handle this type of change on flume production
boxes so that we don't see this many duplicates?

A few hundred or a couple of thousand duplicates we can live with. But if,
instead of 150,000 events, we get 900,000 events (mostly duplicates), then
our workflows will start having problems.