Posted to common-user@hadoop.apache.org by Vasilis Liaskovitis <vl...@gmail.com> on 2010/02/09 17:49:43 UTC

using multiple disks for HDFS

Hi,

I am trying to use 4 SATA disks per node in my hadoop cluster. This is
a JBOD configuration; no RAID is involved. There is a single xfs
partition per disk, mounted as /local, /local2, /local3, and
/local4, each with sufficient privileges for running hadoop jobs. HDFS
is set up across the 4 disks for a single user (user2) with the
following comma-separated list in hadoop.tmp.dir:

<property>
  <name>dfs.data.dir</name>
  <value>${hadoop.tmp.dir}/dfs/data</value>
</property>

 <property>
    <name>hadoop.tmp.dir</name>
    <value>/local/user2/hdfs/hadoop-${user.name},/local2/user2/hdfs/hadoop-${user.name},/local3/user2/hdfs/hadoop-${user.name,/local4/user2/hdfs/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>

What I see is that most or all data is stored on /local and
/local4 across nodes. The directories on /local2 and /local3 are not
used, even though I have verified that those disks can be written to
and have free space.

Isn't HDFS supposed to use all disks in a round-robin fashion, provided
there is free space on all of them? Do I need to change another config
parameter for HDFS to spread I/O across all provided mount points?

- Vasilis

Re: using multiple disks for HDFS

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Vasilis,

Two things:

1) You're missing a matching } in your hadoop.tmp.dir setting (the
third path's ${user.name} is never closed).
2) When you use ${hadoop.tmp.dir}/dfs/data, the variable substitution
is a literal string interpolation: it doesn't append /dfs/data to each
of the hadoop.tmp.dir directories, only to the last one (see the
expansion sketched below).
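
For example, with a hypothetical two-disk value of /d1,/d2 in
hadoop.tmp.dir (shortened paths purely for illustration), the value
${hadoop.tmp.dir}/dfs/data expands to the single string

  /d1,/d2/dfs/data

which HDFS then splits on commas into the two directories /d1 and
/d2/dfs/data. Only the last entry picks up the /dfs/data suffix; the
first is used as-is.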

I'd recommend setting dfs.data.dir explicitly to the full
comma-separated list and ignoring hadoop.tmp.dir.
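
Roughly like this, reusing the paths from your message (an untested
sketch; note each entry now ends in /dfs/data explicitly):

<property>
  <name>dfs.data.dir</name>
  <value>/local/user2/hdfs/hadoop-${user.name}/dfs/data,/local2/user2/hdfs/hadoop-${user.name}/dfs/data,/local3/user2/hdfs/hadoop-${user.name}/dfs/data,/local4/user2/hdfs/hadoop-${user.name}/dfs/data</value>
</property>

With all four mounts listed explicitly, the datanode should spread new
blocks across them in the usual round-robin fashion.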

Thanks
-Todd

On Tue, Feb 9, 2010 at 8:49 AM, Vasilis Liaskovitis <vl...@gmail.com> wrote:
> [...]

Re: using multiple disks for HDFS

Posted by Allen Wittenauer <aw...@linkedin.com>.


On 2/9/10 8:49 AM, "Vasilis Liaskovitis" <vl...@gmail.com> wrote:
> [...]
> 
> Isn't HDFS supposed to use all disks in a round-robin fashion, provided
> there is free space on all of them? Do I need to change another config
> parameter for HDFS to spread I/O across all provided mount points?

You've fallen into a trap that the defaults lay for you.  You're not the
only one, and I think I'm going to file a JIRA to fix this.

What you really want is:

dfs.data.dir pointed to /local/user2/hdfs/dfs-data,
/local2/user2/hdfs/dfs-data, etc.

hadoop.tmp.dir pointed to /local/user2/tmp/hadoop-${user.name},
/local2/user2/tmp/hadoop-${user.name}, etc.
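
In config form that would be something like this (a sketch of the
layout above, not a tested snippet):

<property>
  <name>dfs.data.dir</name>
  <value>/local/user2/hdfs/dfs-data,/local2/user2/hdfs/dfs-data,/local3/user2/hdfs/dfs-data,/local4/user2/hdfs/dfs-data</value>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/local/user2/tmp/hadoop-${user.name},/local2/user2/tmp/hadoop-${user.name},/local3/user2/tmp/hadoop-${user.name},/local4/user2/tmp/hadoop-${user.name}</value>
</property>

The key point is that dfs.data.dir no longer derives from
hadoop.tmp.dir at all, so the data directories and the temporary
directories are configured independently.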


The hadoop.tmp.dir expansion is meant for a really quick QA and not for Real
Work (TM).