Posted to common-user@hadoop.apache.org by Jeff Eastman <je...@collab.net> on 2008/01/16 18:32:41 UTC

Platform reliability with Hadoop

I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen
machines in our CUBiT array for the last month. During this time I have
experienced two major data-corruption losses on relatively small amounts
of data (<50 GB) that make me wonder about the suitability of this
platform for hosting Hadoop. CUBiT is one of our products for managing a
pool of development servers, allowing developers to check out machines,
install various OS profiles on them and monitor their utilization via
the web. With most machines reporting very low utilization, it seemed a
natural place to run Hadoop in the background. I have an NFS-mounted
account on all of the machines and have installed Hadoop there. The DFS
is stored in /tmp on each box. The developers who own the machines
occasionally reboot and reprofile them, but these events do not
clobber /tmp. Hadoop is designed to deal with slave failures of
this nature, though this platform may well be an acid test.
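
For context, storing the DFS in /tmp is just what the stock configuration
does: hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}, and dfs.data.dir
and dfs.name.dir default to subdirectories of it. The relevant part of
conf/hadoop-site.xml for a setup like mine looks roughly like the sketch
below (not a verbatim copy of my file; the only non-default value is the
replication factor, which I discuss further down):

  <configuration>
    <!-- the 0.15 default, shown only to make the /tmp dependency explicit;
         dfs.data.dir and dfs.name.dir live under ${hadoop.tmp.dir}/dfs -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/tmp/hadoop-${user.name}</value>
    </property>
    <!-- now 4, as described below -->
    <property>
      <name>dfs.replication</name>
      <value>4</value>
    </property>
  </configuration>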

 

My initial cloud was configured with a replication factor of 3, and I
have now increased that to 4 in hopes of improving data reliability in
the face of these more-prevalent slave outages. Ted Dunning has
suggested aggressive rebalancing in his recent posts, and I have done
this by raising the replication to 5 (from the original 3) and then
dropping it back to 4; a sketch of the commands is below. Are
there other rebalancing or configuration techniques that might improve
my data reliability? Or, is this platform just too unstable to be a good
fit for Hadoop?
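
For the record, the aggressive rebalancing I described amounts to
something like the following, run from the Hadoop home directory (a
sketch from memory; double-check the -setrep syntax against your
release, and point it at a narrower path than / if you only want to
touch part of the tree):

  # Bump every existing file up to 5 replicas, forcing fresh copies onto
  # other datanodes, then settle back down to 4.
  bin/hadoop dfs -setrep -R 5 /
  bin/hadoop dfs -setrep -R 4 /

  # fsck reports under-replicated and missing blocks; I use it to watch
  # the re-replication settle between the two steps.
  bin/hadoop fsck /

Note that -setrep only changes files that already exist; new files pick
up whatever dfs.replication is set to in hadoop-site.xml.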

 

Jeff