Posted to user@flume.apache.org by ed <ed...@gmail.com> on 2014/01/06 15:22:23 UTC

How to improve File Channel performance with multiple sinks

I saw in an earlier list message that one can improve FileChannel
performance when writing to HDFS by using multiple sinks
(http://mail-archives.apache.org/mod_mbox/flume-user/201212.mbox/%3C8A87C252755D4F0AB9F5F17D6A7FA9D6@cloudera.com%3E).
Initially I thought this meant I should use a sink group to have
multiple sinks writing to HDFS. However, I read in another thread that in
a sink group you still only have one sink active at a time (can't find
that message at the moment).

If that's the case, how do you set up multiple sinks to improve FileChannel
performance? Is it as simple as assigning both sinks to the same
FileChannel, like this:

log.sinks = hdfsSink1 hdfsSink2
log.channels = fileChannel
log.sinks.hdfsSink1.channel = fileChannel
log.sinks.hdfsSink2.channel = fileChannel
# Assume hdfsSink1 and hdfsSink2 write to different directories.

I feel like the above would replicate data in HDFS. Or does the channel send
each event to only one of the two sinks?

Thank you for your assistance!

Best Regards,

Ed

Re: How to improve File Channel performance with multiple sinks

Posted by Devin Suiter RDX <ds...@rdx.com>.
This might help:

On Wed, Dec 12, 2012 at 12:53 PM, Hari Shreedharan
<hshreedharan@cloudera.com> wrote:

> Also note that having multiple sinks often improves performance - though you
> should have each sink write to a different directory on HDFS. Since each
> sink really uses only one thread at a time to write, having multiple sinks
> allows multiple threads to write to HDFS. Also if you can spare additional
> disks on your Flume agent machine for file channel data directories, that
> will also improve performance.
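A minimal sketch of the setup described in the quote above, reusing the agent
name "log" and the sink and channel names from the original question. The
HDFS paths, local disk paths, and NameNode host are hypothetical placeholders,
and the source definition is left out because the thread does not describe it.

```properties
# Sketch only: the agent name "log" and the component names come from the
# question; every path and host below is a placeholder, not a recommendation.
log.channels = fileChannel
log.sinks = hdfsSink1 hdfsSink2

# File channel with data directories spread across separate physical disks.
log.channels.fileChannel.type = file
log.channels.fileChannel.checkpointDir = /disk1/flume/checkpoint
log.channels.fileChannel.dataDirs = /disk2/flume/data,/disk3/flume/data

# Two HDFS sinks draining the same channel, each writing to its own directory.
log.sinks.hdfsSink1.type = hdfs
log.sinks.hdfsSink1.channel = fileChannel
log.sinks.hdfsSink1.hdfs.path = hdfs://namenode.example.com/flume/events/sink1

log.sinks.hdfsSink2.type = hdfs
log.sinks.hdfsSink2.channel = fileChannel
log.sinks.hdfsSink2.hdfs.path = hdfs://namenode.example.com/flume/events/sink2
```

A take removes the event from the channel, so with this layout each event
should reach only one of the two sinks rather than being copied to both.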

So basically, the sink group has a thread, and each sink in the group
has a thread, and there can only be one thread at a time writing to
HDFS. The HDFS write operation is really pretty quick though. I think
the sink group is the way to go - that should line up the sinks in
rotation for queuing "takes" from the channel.
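For comparison, a sink group that rotates takes between the two sinks could
be sketched as below. The group name "hdfsGroup" is hypothetical; the sink
names are the ones from the original question.

```properties
# Sketch only: "hdfsGroup" is a made-up group name.
log.sinkgroups = hdfsGroup
log.sinkgroups.hdfsGroup.sinks = hdfsSink1 hdfsSink2

# Spread takes across the member sinks in round-robin order.
log.sinkgroups.hdfsGroup.processor.type = load_balance
log.sinkgroups.hdfsGroup.processor.selector = round_robin
```

Keep in mind the concern raised in the original question: if a group keeps
only one sink active at a time, this spreads takes across the sinks but may
not add write parallelism. Leaving both sinks ungrouped on the same channel
is the alternative worth measuring against.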

The big thing with FileChannel is planning for an fsync() every time
your CHANNEL takes from your SOURCE, so you want a good batch size.
Reading the data from the channel, the "take", is pretty quick - this
is what your sink is doing. The immediately upstream source, however,
is pushing things to the channel, the "put", and if those puts are small
and frequent, you are making small and frequent fsyncs and locking
things down while each one happens.
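The batch-size advice above might translate into settings like the following.
This is a sketch under assumptions: the thread never names the source, so
"logSource", its exec type, and its command are invented for illustration,
and the numbers are starting points to tune rather than recommendations. For
sources that receive events from an upstream client, the put batch size is
usually decided by the sender instead.

```properties
# Sketch only: "logSource", its type, its command, and all numbers are assumptions.
log.sources = logSource
log.sources.logSource.type = exec
log.sources.logSource.command = tail -F /var/log/app.log
log.sources.logSource.channels = fileChannel
# Events written to the channel per put transaction.
log.sources.logSource.batchSize = 1000

# Upper bound on events per channel transaction; keep it at least as large
# as the biggest source or sink batch.
log.channels.fileChannel.transactionCapacity = 1000

# Events each sink takes from the channel before flushing to HDFS.
log.sinks.hdfsSink1.hdfs.batchSize = 1000
log.sinks.hdfsSink2.hdfs.batchSize = 1000
```

Larger batches mean fewer commits, and therefore fewer fsyncs, for the same
number of events, at the cost of more events in flight per transaction.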


*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

