Posted to user@flume.apache.org by Chris Lin <ch...@etudata.com> on 2012/02/08 04:50:17 UTC

Big throughput differences on different sinks.

Hi all,

We are using Flume 0.9.4 from CDH3u2. While testing it in our environment,
we found large throughput differences between sinks.

The data source is a plain text file, and we use exec("cat test.txt",
aggregate = true) to specify the source of the agent.
With customdfs or formatDfs as the sink, we get a throughput of around
40MB/s, which is comparable to a direct hadoop fs -put in the same
environment.
However, with escapedFormatDfs or collectorSink, which we would like to
use for their escape feature, the throughput drops to about 4MB/s for
escapedFormatDfs and 1MB/s for collectorSink.

Is there anything we can tune so that collectorSink achieves better
throughput? Or is this a limitation of collectorSink/escapedFormatDfs? We
would like to rotate the output written to HDFS at some time interval,
with an expected throughput of 10MB/s. Any comments are appreciated.
Thank you.

Regards,
Chris