Posted to user@flume.apache.org by Guy Peleg <gu...@gmail.com> on 2012/12/13 08:34:47 UTC

HDFSChannel?

Say I have a multi-hop flow, and let's say the last hop stores its data in
HDFS using the HDFS sink.

In the last agent, as in every agent, there is the source-channel-sink
trio. My question is: why do we need that channel if the only thing that
agent does is store the events in HDFS (or another data store)?

Wouldn't it be more efficient to have an 'HDFSChannel' that is part of the
transaction, and no sink at all? Otherwise I might need to use a persistent
channel (JDBC, File) to make sure that data is not lost before
it is moved to the sink, which again is redundant, since ideally I would
like the incoming events on the 'last agent' to be stored as quickly as
possible in their destination without paying the extra channel cost.

Re: HDFSChannel?

Posted by Hari Shreedharan <hs...@cloudera.com>.
There are several reasons we did not want a channel loading events into the next hop/final destination. 

One of the reasons is to clearly define the responsibilities of each component in the system. The responsibility of the channel is to be a buffer, and that is it; you can see this from the Channel interface. (It is the same reason classes and methods exist: in theory you could put everything into your main method and expect it to work, but in reality that is not something you want to do.)

Another important thing to consider is that such an architecture is going to hit issues because a transaction is owned by a source thread. Making the same transaction responsible for writing to HDFS creates a tight coupling between the hop 1 to hop 2 writes and the hop 2 to HDFS writes, which is exactly what Flume strives to remove by providing the channel as a buffer.

In addition to this, such a single-threaded source-sink coupling existed in Flume OG, where it caused major issues and introduced so much complexity that things became nearly impossible to debug.

In your case, if you have a channel that also does the writes within the same transaction, you are going to have complex issues when HDFS writes fail or time out (I guarantee you this is going to happen). Handling such issues is complex. Now, if you have an extra thread within the channel trying to clear the data out of the "HDFS channel," it is not any different from an HDFS Sink. Having no channel and just a source+sink is also going to make things quite complex, and you are going to have to do a lot of handling if and when you hit some failure.
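
To make the buffering argument concrete, here is a minimal toy sketch in Java. It is not Flume code and does not use the Flume API: `BufferedHopSketch`, `FlakyStore`, `receive` and `drainOne` are all hypothetical names, the "channel" is just an in-memory queue, and it assumes a single consumer. It only illustrates that the receiving side keeps succeeding while the store is failing, and that an event leaves the buffer only after a successful write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy model (not Flume code) of a buffer between a receiver and a flaky store. */
class BufferedHopSketch {

    /** Hypothetical stand-in for HDFS: fails the first N write attempts. */
    static class FlakyStore {
        private int failuresLeft;
        final List<String> written = new ArrayList<>();

        FlakyStore(int failuresLeft) {
            this.failuresLeft = failuresLeft;
        }

        void write(String event) throws Exception {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new Exception("simulated write timeout");
            }
            written.add(event);
        }
    }

    /** Source side: only puts into the buffer and never touches the store. */
    static void receive(BlockingQueue<String> channel, String event) {
        channel.add(event);
    }

    /**
     * Sink side: try to write the head of the buffer.
     * Returns true if an event was written and removed, false otherwise.
     * Assumes a single consumer, so peek-then-remove is safe here.
     */
    static boolean drainOne(BlockingQueue<String> channel, FlakyStore store) {
        String event = channel.peek();
        if (event == null) {
            return false; // nothing buffered
        }
        try {
            store.write(event);
            channel.remove(); // "commit": drop the event only after a successful write
            return true;
        } catch (Exception e) {
            return false; // "rollback": the event stays buffered for a later retry
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> channel = new LinkedBlockingQueue<>();
        FlakyStore store = new FlakyStore(2);
        receive(channel, "e1");
        receive(channel, "e2");
        // Keep retrying until the buffer is empty; the receiver was never blocked.
        while (!channel.isEmpty()) {
            drainOne(channel, store);
        }
        System.out.println(store.written);
    }
}
```

In this sketch the failure handling lives entirely on the draining side. If `receive` had to call `store.write` itself, every store timeout would surface in the upstream hop, which is the coupling described above.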

I don't recommend such an approach, and I don't think the File channel is going to hurt your performance too much, so that is what I'd recommend you use.
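
For reference, a last-hop agent along these lines might be configured roughly as below. This is only a sketch: the agent name `agent`, the component names, the port, the directories and the HDFS URL are placeholders, not values from this thread, and the Avro source is an assumption about how the previous hop delivers events.

```properties
# Last-hop agent: Avro source -> File channel -> HDFS sink
agent.sources = avroSrc
agent.channels = fileCh
agent.sinks = hdfsSink

# Receives events from the previous hop (placeholder bind address and port)
agent.sources.avroSrc.type = avro
agent.sources.avroSrc.bind = 0.0.0.0
agent.sources.avroSrc.port = 4141
agent.sources.avroSrc.channels = fileCh

# Durable buffer on local disk (placeholder directories)
agent.channels.fileCh.type = file
agent.channels.fileCh.checkpointDir = /var/flume/checkpoint
agent.channels.fileCh.dataDirs = /var/flume/data

# Drains the channel into HDFS (placeholder path)
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = fileCh
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
```

Putting the File channel's data directories on a disk separate from other heavy I/O is one common way to keep its overhead low.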


Hari
-- 
Hari Shreedharan

