Posted to general@hadoop.apache.org by Xiang Hua <be...@gmail.com> on 2012/09/29 08:16:19 UTC

hadoop disk selection

Hi,
  I want to combine four 600GB local disks with three 800GB disks from a
disk array in one datanode.
  Are there any problems with this setup? What about performance?


Best R.
beatls

Re: hadoop disk selection

Posted by Ted Dunning <td...@maprtech.com>.
If you use any sort of RAID-0, you may effectively either lose 200GB from
each of the larger disks or lose significant performance from the disk
array.

Make sure that the stripe size is sufficiently large.  The default for most
RAID setups is too small.
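To make the capacity loss concrete: RAID-0 can only use as much of each
member as the smallest disk offers, so for the disk mix from the original
question the arithmetic works out as follows (a sketch; the 600GB/800GB
sizes come from the question):

```python
# Capacity lost when striping mixed-size disks with RAID-0, which
# effectively truncates every member to min(disk sizes).
disks_gb = [600, 600, 600, 600, 800, 800, 800]

raw_total = sum(disks_gb)                # 4800 GB of raw capacity
usable = min(disks_gb) * len(disks_gb)   # RAID-0 usable: 600 * 7 = 4200 GB
lost = raw_total - usable                # 600 GB lost overall

per_large_disk = lost // disks_gb.count(800)
print(raw_total, usable, lost, per_large_disk)  # 4800 4200 600 200
```

That is where the "200GB from each of the larger disks" figure comes from:
each 800GB disk contributes only 600GB to the stripe set.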

On Sat, Sep 29, 2012 at 2:16 AM, Xiang Hua <be...@gmail.com> wrote:

> Hi,
>   I want to combine four 600GB local disks with three 800GB disks from a
> disk array in one datanode.
>   Are there any problems with this setup? What about performance?
>
>
> Best R.
> beatls
>

Re: hadoop disk selection

Posted by Andy Isaacson <ad...@cloudera.com>.
Moving this to user@ since it's not appropriate for general@.

On Fri, Sep 28, 2012 at 11:16 PM, Xiang Hua <be...@gmail.com> wrote:
> Hi,
>   I want to combine four 600GB local disks with three 800GB disks from a
> disk array in one datanode.
>   Are there any problems with this setup? What about performance?

The recommended configuration would be to partition and format each
disk with ext4, then set dfs.datanode.data.dir to point to the
mountpoints of each disk:

  <property>
     <name>dfs.datanode.data.dir</name>
     <value>/data/1/datadir,/data/2/datadir,/data/3/datadir</value>
  </property>
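With seven disks the value would list all seven mountpoints (the /data/N
paths here are illustrative; substitute whatever mountpoints you chose when
formatting the disks):

```
  <property>
     <name>dfs.datanode.data.dir</name>
     <value>/data/1/datadir,/data/2/datadir,/data/3/datadir,/data/4/datadir,/data/5/datadir,/data/6/datadir,/data/7/datadir</value>
  </property>
```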

You may also want to set dfs.datanode.du.reserved to 1GB or thereabouts.
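That setting also goes in hdfs-site.xml; the value is in bytes per volume
(1GB shown here):

```
  <property>
     <name>dfs.datanode.du.reserved</name>
     <value>1073741824</value>
  </property>
```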

With this configuration your DN will fill all 7 datadirs at the same
rate pseudorandomly, until the 600G disks are nearly full, then it
will write any further blocks to the 800G disks. Performance will be
OK except that you will see performance hot-spots on the larger disks
when writing past the 600GB mark. See
https://issues.apache.org/jira/browse/HDFS-1564 for one missing
feature in this area.
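The fill pattern described above can be sketched with a toy simulation
(sizes and the 1GB block granularity are illustrative; the DN's actual
volume-choosing behavior is approximated here by picking randomly among
volumes that still have space):

```python
import random

random.seed(0)
disks_gb = [600, 600, 600, 600, 800, 800, 800]
used = [0] * len(disks_gb)

# Write 1GB "blocks" until the node is full, always picking a random
# volume with room left -- roughly how the DN spreads new blocks.
while any(u < cap for u, cap in zip(used, disks_gb)):
    candidates = [i for i, (u, cap) in enumerate(zip(used, disks_gb))
                  if u < cap]
    used[random.choice(candidates)] += 1

# Every volume ends up full; but once the four 600GB disks fill, all
# remaining writes land on the three 800GB disks -- the hot-spot above.
print(used)  # [600, 600, 600, 600, 800, 800, 800]
```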

I would not recommend using RAID-0 for datadir because if you
experience a disk failure with independent filesystems, only the
blocks on one datadir are lost and need to be rereplicated. If you
experience a disk failure with RAID-0, all blocks stored on that DN
are lost and need to be rereplicated. Also, RAID results in
performance lockstep; a single slow disk will slow down access to all
blocks on that DN, while with independent filesystems a single slow
disk slows down only a fraction of the blocks on that DN.
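The re-replication cost difference is easy to quantify (a sketch using the
disk sizes from the question, assuming a full node):

```python
disks_gb = [600, 600, 600, 600, 800, 800, 800]

# Independent filesystems (JBOD): a failed disk loses only its own blocks.
jbod_worst_case = max(disks_gb)  # at most 800 GB to re-replicate

# RAID-0: one failed disk takes down the whole stripe set, so every block
# stored on the DataNode must be re-replicated.
raid0_loss = min(disks_gb) * len(disks_gb)  # all 4200 GB usable, all lost

print(jbod_worst_case, raid0_loss)  # 800 4200
```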

-andy
