Posted to dev@flume.apache.org by Juhani Connolly <ju...@cyberagent.co.jp> on 2013/09/02 07:00:20 UTC

Scaling HDFS writes when dealing with large numbers of files

Hey guys, I'm trying to figure out a way around what I personally feel 
is a bit of an abuse of the system. I'd appreciate your input, and 
perhaps we can come up with a solution to share with others.

Short version, see below for detail:
Lots of files are being written simultaneously. The number of HDFS 
connections from each data path is (files in that path) * (number of 
sinks), so with a lot of files, even going from 2 to 3 HDFS sinks 
doesn't scale, even though nothing (neither the Flume aggregators nor 
the HDFS datanodes) is fully loaded. Experimenting with alternate 
configs (more data paths, so fewer files per path, with 2 sinks per 
path) has shown that the number of open connections is the limiting 
factor. But what exactly might be causing the bottleneck, and is there 
a way to get rid of it?
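
For illustration with rough numbers from below: if all of the ~2,500 
files an aggregator writes in an hour are open at once, then

    2,500 files * 2 sinks = ~5,000 HDFS connections
    2,500 files * 3 sinks = ~7,500 HDFS connections

i.e. each added sink multiplies the connection count rather than just 
adding write capacity.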

Long version:
We've lately been running into a somewhat interesting issue: 
performance appears to be capped by the number of HDFS connections 
that are open, which is a consequence of the very large number of 
different files being streamed simultaneously.

I have serious doubts about the viability of this approach and have 
suggested consolidating the data into fewer files and post-processing, 
but at the moment that doesn't appear to be a possibility, so I'm 
looking for alternate ways to scale throughput.

Some basic stats: on each aggregator node, roughly 2,500 files are 
being written to every hour, a bit over 25,000,000 lines in that 
period of time, approx 10k events per second. There are multiple 
aggregators writing to the same HDFS cluster, though each one writes 
to its own separate files.

Originally we scaled individual aggregator nodes by increasing the 
HDFS sink count, but this stopped helping beyond the second sink, 
despite the results from Mike Percy's tests 
(https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30). 
By increasing sinks we were also increasing the number of HDFS 
connections for every single file, hitting some kind of bottleneck. 
Datanode transfer threads were my first suspect, but each datanode 
allows 4k of them and there are a lot more datanodes than aggregators, 
so there should be plenty available.
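
(For anyone who wants to check the same thing: the thread ceiling is 
dfs.datanode.max.transfer.threads in hdfs-site.xml on Hadoop 2.x, 
previously dfs.datanode.max.xcievers on 1.x:

    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>4096</value>
    </property>

4096 is the default, which matches the 4k figure above.)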

Afterwards, splitting the incoming data into multiple paths (separate 
sources and sinks) let each path have two sinks, and this did scale 
our throughput: it increases the total number of sinks without 
increasing the connections per file, because each sink only handles a 
fraction of the file paths. Now a single node has 5 avro sources set 
up, each with a file channel and 2 HDFS sinks attached, roughly as in 
the sketch below. With all of the transfer threads (resulting from all 
the different files), each node has about 4,000 threads running, 
though at peak it can be double that.
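
In case the layout isn't clear, here's a trimmed-down sketch of what 
one node's config looks like, with only 1 of the 5 paths shown and all 
names, ports and paths illustrative:

    agent.sources = src1 src2
    agent.channels = ch1 ch2
    agent.sinks = sink1a sink1b sink2a sink2b

    # path 1: one avro source -> one file channel -> two hdfs sinks
    agent.sources.src1.type = avro
    agent.sources.src1.bind = 0.0.0.0
    agent.sources.src1.port = 4141
    agent.sources.src1.channels = ch1

    agent.channels.ch1.type = file
    agent.channels.ch1.checkpointDir = /mnt/flume/ch1/checkpoint
    agent.channels.ch1.dataDirs = /mnt/flume/ch1/data

    # both sinks drain the same channel, sharing that path's events,
    # but every distinct file being written is its own connection
    agent.sinks.sink1a.type = hdfs
    agent.sinks.sink1a.channel = ch1
    agent.sinks.sink1a.hdfs.path = hdfs://nn/logs/%{category}/%Y%m%d/%H
    agent.sinks.sink1a.hdfs.useLocalTimeStamp = true
    # distinct prefixes so the two sinks never write the same file
    agent.sinks.sink1a.hdfs.filePrefix = sink1a
    agent.sinks.sink1b.type = hdfs
    agent.sinks.sink1b.channel = ch1
    agent.sinks.sink1b.hdfs.path = hdfs://nn/logs/%{category}/%Y%m%d/%H
    agent.sinks.sink1b.hdfs.useLocalTimeStamp = true
    agent.sinks.sink1b.hdfs.filePrefix = sink1b

    # paths 2-5 (src2/ch2/sink2a/sink2b etc.) are identical, each on
    # its own port and with its own subset of the file paths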

So we have a rough idea of what is wrong (too many connections are 
hitting some kind of bottleneck) but can't track down the exact cause: 
neither the aggregators nor the datanodes are at full load, and the 
only blocked waiting threads in Flume are those waiting on HDFS. I 
personally feel we just need to reduce the number of files being 
written simultaneously, since HDFS isn't really made to deal with such 
small files, and batches are not getting processed efficiently (each 
commit has to wait on dozens, possibly hundreds, of small transfers). 
That being said, can anyone provide specific insight into what may 
cause the bottleneck at high connection numbers, and whether there are 
ways around it (other than reducing the file count and the proportion 
of files going to each sink, which I'm already pushing for)?
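
For completeness, these are the sink-side knobs that look relevant to 
the batch-efficiency problem, in case someone can confirm whether 
tuning them helps at all at these connection counts (values are 
illustrative, sink names from the sketch above):

    # bigger batches, so each commit amortizes more per-file latency
    agent.sinks.sink1a.hdfs.batchSize = 1000
    # more IO worker threads per sink (default 10)
    agent.sinks.sink1a.hdfs.threadsPoolSize = 20
    # close files that go quiet instead of holding connections open
    agent.sinks.sink1a.hdfs.idleTimeout = 300
    # more headroom for slow flushes/closes before timing out (ms)
    agent.sinks.sink1a.hdfs.callTimeout = 30000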