Posted to hdfs-user@hadoop.apache.org by Andy Isaacson <ad...@cloudera.com> on 2012/10/04 01:55:35 UTC

Re: hadoop disk selection

Moving this to user@ since it's not appropriate for general@.

On Fri, Sep 28, 2012 at 11:16 PM, Xiang Hua <be...@gmail.com> wrote:
> Hi,
>   I want to use 4 local 600GB disks combined with 3 x 800GB disks from
> a disk array in one datanode. Is there any problem with that? What
> about performance?

The recommended configuration is to partition and format each disk
with ext4, then set dfs.datanode.data.dir to point to the mountpoint
of each disk:

  <property>
     <name>dfs.datanode.data.dir</name>
     <value>/data/1/datadir,/data/2/datadir,/data/3/datadir,/data/4/datadir,/data/5/datadir,/data/6/datadir,/data/7/datadir</value>
  </property>
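
For the partition/format step, a rough sketch (the /dev/sdb1 device
name and the /data/N mountpoints are placeholders; substitute your
actual devices, and repeat for each of the 7 disks):

  mkfs.ext4 /dev/sdb1
  mkdir -p /data/1
  mount -o noatime /dev/sdb1 /data/1   # noatime is a common, optional tuning choice
  mkdir -p /data/1/datadir
  # add a matching /etc/fstab entry so the mount survives a reboot:
  #   /dev/sdb1  /data/1  ext4  noatime  0  0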

You may also want to set dfs.datanode.du.reserved to 1GB or thereabouts.
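
For example, in hdfs-site.xml (the value is in bytes per volume;
1073741824 bytes is 1 GiB):

  <property>
     <name>dfs.datanode.du.reserved</name>
     <value>1073741824</value>
  </property>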

With this configuration your DN will fill all 7 datadirs at the same
rate, choosing among them pseudorandomly, until the 600G disks are
nearly full; after that it will write any further blocks to the 800G
disks only. Performance will be OK, except that you will see
hot-spots on the larger disks once you write past the 600GB mark. See
https://issues.apache.org/jira/browse/HDFS-1564 for one missing
feature in this area.
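
To put rough numbers on it: raw capacity per DN is 4*600G + 3*800G =
4800G. Once every disk holds about 600G (4200G written in total), the
only free space left is the last ~200G on each of the three 800G
disks, so new block writes land on just 3 of the 7 spindles and,
assuming the disks have similar throughput, aggregate write bandwidth
drops to roughly 3/7 of what you had before.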

I would not recommend using RAID-0 for the datadirs. With independent
filesystems, a disk failure loses only the blocks on that one datadir,
and only those need to be rereplicated; with RAID-0, a single disk
failure loses all blocks stored on that DN. Also, RAID puts the disks
in performance lockstep: a single slow disk slows down access to all
blocks on that DN, while with independent filesystems a single slow
disk slows down only a fraction of the blocks on that DN.

-andy