Posted to user@flume.apache.org by "Meyer, Dennis" <de...@adtech.com> on 2012/03/02 13:35:23 UTC

Flume Reliability Issues

Hi,

We encountered the following issues in our development with Flume. We are still investigating them, but it would be great if someone could send us some feedback on whether these are:

  *   Working as designed (but maybe misused by us)
  *   Known issues (in this version only?)
  *   Unsupported features

Here is the list of the four issues we have seen:



  *   Used Flume version 0.9.4+25.40-1
  *   1) Feature "Duplicate Data" works inconsistently
     *   Not all data is duplicated all the time (our use case: send data to a SAN for a full backup and also send it to HDFS; a rough config sketch follows below this list)
     *   If a receiving node goes out of service, the sending node stops sending data to all receiving nodes, not only the failed one
     *   When a failed receiving node reconnects
        *   The sending node's CPU goes up to 100% usage and it stops handling records; even when the CPU usage recovers, Flume does not
        *   There is also a chance that the sending source never notices the reconnect; this can only be fixed by a full restart of all involved sending/receiving nodes
  *   2) Flume is unable to recover failed/crashed/lost nodes reliably
     *   Failed nodes often come back up, but are no longer integrated into the data flow (e.g. a source does not know that its sink has reconnected)
     *   A node may be lost without either the master or any connected node knowing about it
     *   A failed node can only be reliably re-introduced into its flow if ALL nodes are restarted manually!
  *   3) Flume is unable to run the highest reliability mode for records crash-free
     *   If a node reconnects after a failure, there is a good chance that the master node crashes
  *   4) Losing records on node failure
     *   Flume sends up to one thousand records as a batch from source to sink. If the sink fails on the first record, the other 999 records sometimes get lost.
     *   In the highest reliability mode, Flume was unable to reroute records safely. When we send data to a node that is, or goes, out of service, Flume saves the data for later delivery once that node reconnects. What it should really do is take the events destined for the failed node and reroute them, according to the defined flow, to another node.
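For reference, here is a rough sketch of how a flow like the one in issue 1 could be written as a 0.9.x dataflow spec. The hostnames, port and paths are placeholders, not our real setup, and the end-to-end agent sink (agentE2ESink) stands in for the highest reliability mode mentioned in issues 3 and 4:

    host1 : tail("/var/log/app.log") | agentE2ESink("collector1", 35853) ;
    collector1 : collectorSource(35853) | [ collectorSink("hdfs://namenode/flume/events/", "evt-"), collectorSink("file:///mnt/san/backup/flume/", "evt-") ] ;

As we understand the documentation, agentE2ESink should keep events in the agent's write-ahead log until the collector acknowledges them, and the fan-out "[ ..., ... ]" on the collector is what should duplicate every event to both HDFS and the SAN mount.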

BIG THANKS!
Dennis

Re: Flume Reliability Issues

Posted by Matthew Rathbone <ma...@foursquare.com>.
Hey Dennis, 

We've had a lot of issues with any flume version < 1.0. We lose a lot of data and we get a lot of deadlocks.

Having spoken to Cloudera, I'd suggest you try out the Flume NG beta, labelled Flume 1.0. It's a total rewrite, and it looks like there are people working on it full time. We're going to be testing it out in the next few weeks.
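To give you a feel for the new model: an NG agent is wired together from a source, a channel and a sink in a plain properties file. A minimal sketch could look roughly like this (agent name, paths and sizes are made up, not a tested config):

    # agent "a1" = one source -> one channel -> one sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # source: tail a log file (example path)
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app.log
    a1.sources.r1.channels = c1

    # channel: in-memory buffer here; a disk-backed channel type is the thing to
    # look at if you need events to survive an agent crash
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # sink: write events out to HDFS (example path)
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
    a1.sinks.k1.channel = c1

You then start the agent with something like "bin/flume-ng agent -n a1 -f your.conf". The channel sits between the source and the sink and is where the durability guarantees live, which is the part most relevant to the data loss you're seeing.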

-- 
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matthew@foursquare.com (mailto:matthew@foursquare.com) | @rathboma (http://twitter.com/rathboma) | 4sq (http://foursquare.com/rathboma)


