Posted to common-user@hadoop.apache.org by Aaron Baff <Aa...@telescope.tv> on 2011/03/02 18:33:50 UTC

Small file Map performance

So, the problem is that we have a ton of small files and a small cluster (only 4 nodes, just up from 2, yay!), as we are just starting to use Hadoop. With our current hardware we have 32 Map slots and >1500 files. The Task startup time is, frankly, killing us, and at this time we can't easily concatenate the files into a single file, because they are still arriving and we want to run some analysis on them while they are inbound. Several months ago we played around with JVM re-use, but if I recall correctly a Task stays keyed to an individual MR Job until it hits its TTL, and only then does that slot become available for another Job. Is there a way to adjust this TTL? Or to re-use the JVM for a different Job? This is all with 0.21.0.
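
For a rough sense of scale, here is a back-of-the-envelope sketch of how many "waves" of Map tasks that works out to, assuming one Map task per file. The 32 slots and 1500 files are the numbers above (1500 is a lower bound); the per-task startup cost is a made-up placeholder, not something we have measured, and the class and method names are only illustrative.

```java
public class TaskWaves {
    // Number of scheduling waves needed to run one Map task per file
    // on a fixed number of Map slots (ceiling division).
    static int waves(int files, int slots) {
        return (files + slots - 1) / slots;
    }

    // Startup overhead paid by each slot if every task pays the startup
    // cost once. startupSecondsPerTask is an assumed value.
    static long startupOverheadSecondsPerSlot(int files, int slots, int startupSecondsPerTask) {
        return (long) waves(files, slots) * startupSecondsPerTask;
    }

    public static void main(String[] args) {
        int files = 1500;             // lower bound from the numbers above
        int slots = 32;               // Map slots on the 4-node cluster
        int assumedStartupSeconds = 3; // hypothetical, not measured

        System.out.println("waves = " + waves(files, slots));
        System.out.println("startup seconds per slot = "
                + startupOverheadSecondsPerSlot(files, slots, assumedStartupSeconds));
    }
}
```

The point is only that the startup cost is paid once per wave on every slot, so it grows with the file count rather than with the amount of data.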


--Aaron