Posted to common-user@hadoop.apache.org by Vasilis Liaskovitis <vl...@gmail.com> on 2010/02/09 17:49:43 UTC
using multiple disks for HDFS
Hi,
I am trying to use 4 SATA disks per node in my hadoop cluster. This is
a JBOD configuration; no RAID is involved. There is a single xfs
partition per disk, mounted as /local/, /local2/, /local3/, and
/local4/, with sufficient privileges for running hadoop jobs. HDFS is
set up across the 4 disks for a single user (user2) with the
following comma-separated list in hadoop.tmp.dir:
<property>
<name>dfs.data.dir</name>
<value>${hadoop.tmp.dir}/dfs/data</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/local/user2/hdfs/hadoop-${user.name},/local2/user2/hdfs/hadoop-${user.name},/local3/user2/hdfs/hadoop-${user.name,/local4/user2/hdfs/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
What I see is that most or all data is stored on disks /local and
/local4 across nodes. Directories local2 and local3 from the other
disks are not used. I have verified that these disks can be written to
and have free space.
Isn't HDFS supposed to use all disks in a round-robin fashion, provided
there is free space on each? Do I need to change another config
parameter for HDFS to spread I/O across all provided mount points?
- Vasilis
Re: using multiple disks for HDFS
Posted by Todd Lipcon <to...@cloudera.com>.
Hi Vasilis,
Two things:
1) You're missing a matching } in your hadoop.tmp.dir setting (the
/local3 entry has ${user.name without its closing brace).
2) When you use ${hadoop.tmp.dir}/dfs/data, it does a literal string
interpolation. It's not adding dfs/data to each of the
hadoop.tmp.dir directories, but rather only to the last one.
I'd recommend setting dfs.data.dir explicitly to the full
comma-separated list and ignoring hadoop.tmp.dir.
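Something along these lines (the paths follow your existing layout; just a sketch, adjust to your mounts):
<property>
<name>dfs.data.dir</name>
<value>/local/user2/hdfs/dfs/data,/local2/user2/hdfs/dfs/data,/local3/user2/hdfs/dfs/data,/local4/user2/hdfs/dfs/data</value>
</property>
That way each directory is spelled out once, with no variable expansion involved.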
Thanks
-Todd
On Tue, Feb 9, 2010 at 8:49 AM, Vasilis Liaskovitis <vl...@gmail.com> wrote:
> Hi,
>
> I am trying to use 4 SATA disks per node in my hadoop cluster. This is
> a JBOD configuration; no RAID is involved. There is a single xfs
> partition per disk, mounted as /local/, /local2/, /local3/, and
> /local4/, with sufficient privileges for running hadoop jobs. HDFS is
> set up across the 4 disks for a single user (user2) with the
> following comma-separated list in hadoop.tmp.dir:
>
> <property>
> <name>dfs.data.dir</name>
> <value>${hadoop.tmp.dir}/dfs/data</value>
> </property>
>
> <property>
> <name>hadoop.tmp.dir</name>
> <value>/local/user2/hdfs/hadoop-${user.name},/local2/user2/hdfs/hadoop-${user.name},/local3/user2/hdfs/hadoop-${user.name,/local4/user2/hdfs/hadoop-${user.name}</value>
> <description>A base for other temporary directories.</description>
> </property>
>
> What I see is that most or all data is stored on disks /local and
> /local4 across nodes. Directories local2 and local3 from the other
> disks are not used. I have verified that these disks can be written to
> and have free space.
>
> Isn't HDFS supposed to use all disks in a round-robin fashion, provided
> there is free space on each? Do I need to change another config
> parameter for HDFS to spread I/O across all provided mount points?
>
> - Vasilis
>
Re: using multiple disks for HDFS
Posted by Allen Wittenauer <aw...@linkedin.com>.
On 2/9/10 8:49 AM, "Vasilis Liaskovitis" <vl...@gmail.com> wrote:
> <property>
> <name>dfs.data.dir</name>
> <value>${hadoop.tmp.dir}/dfs/data</value>
> </property>
>
> <property>
> <name>hadoop.tmp.dir</name>
>
> <value>/local/user2/hdfs/hadoop-${user.name},/local2/user2/hdfs/hadoop-${user.name},/local3/user2/hdfs/hadoop-${user.name,/local4/user2/hdfs/hadoop-${user.name}</value>
> <description>A base for other temporary directories.</description>
> </property>
>
> What I see is that most or all data is stored on disks /local and
> /local4 across nodes. Directories local2 and local3 from the other
> disks are not used. I have verified that these disks can be written to
> and have free space.
>
> Isn't HDFS supposed to use all disks in a round-robin fashion, provided
> there is free space on each? Do I need to change another config
> parameter for HDFS to spread I/O across all provided mount points?
You've fallen into a trap that the defaults lay for you. You're not the
only one, and I think I'm going to file a JIRA to fix this.
What you really want is:
dfs.data.dir pointed to /local/user2/hdfs/dfs-data,
/local2/user2/hdfs/dfs-data, etc
hadoop.tmp.dir pointed to /local/user2/tmp/hadoop-${user.name},
/local2/user2/tmp/hadoop-${user.name}, etc
The hadoop.tmp.dir expansion is meant for really quick QA, not for Real
Work (TM).
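Concretely, that would look something like this (illustrative paths following the layout above, on the same four mounts):
<property>
<name>dfs.data.dir</name>
<value>/local/user2/hdfs/dfs-data,/local2/user2/hdfs/dfs-data,/local3/user2/hdfs/dfs-data,/local4/user2/hdfs/dfs-data</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/local/user2/tmp/hadoop-${user.name},/local2/user2/tmp/hadoop-${user.name},/local3/user2/tmp/hadoop-${user.name},/local4/user2/tmp/hadoop-${user.name}</value>
</property>
This keeps datanode storage and scratch space in separate trees, so the datanode directories never depend on hadoop.tmp.dir expansion.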