Posted to common-user@hadoop.apache.org by yang song <ha...@gmail.com> on 2009/08/19 16:21:31 UTC

How does hadoop deal with hadoop-site.xml?

Hello, everybody
    I feel puzzled about setting properties in hadoop-site.xml.
    Suppose I submit a job from machine A, and the JobTracker runs on machine
B, so there are two hadoop-site.xml files. Now, I increase
"mapred.reduce.parallel.copies" (e.g., to 10) on machine B since I want to
make the copy phase faster. However, "mapred.reduce.parallel.copies" in the
WebUI is still 5. When I increase it on machine A, it changes. So I feel
very puzzled. Why doesn't it work when I change it on B? What's more, when I
add some new properties on B, those properties do show up in the WebUI. So
why can't I change existing properties through machine B? Must some
properties be changed through A and others through B?
    Thank you!
    Inifok

Re: How does hadoop deal with hadoop-site.xml?

Posted by yang song <ha...@gmail.com>.
Thank you very much! I'm clear about it now.

2009/8/20 Aaron Kimball <aa...@cloudera.com>

> On Wed, Aug 19, 2009 at 8:39 PM, yang song <ha...@gmail.com>
> wrote:
>
> >    Thank you, Aaron. I've benefited a lot. "per-node" means some settings
> > associated with the node, e.g., "fs.default.name", "mapred.job.tracker",
> > etc. "per-job" means some settings associated with the jobs which are
> > submitted from the node, e.g., "mapred.reduce.tasks". That means, if I set
> > "per-job" properties on the JobTracker, it won't work. Is my
> > understanding right?
>
>
> It will work if you submit your job (run "hadoop jar ....") from the
> JobTracker node :) It won't if you submit your job from elsewhere.
>
>
> >
> >    In addition, when I add some new properties, e.g.,
> > "mapred.inifok.setting" on JobTracker, I can find it in every job.xml
> > from WebUI. I think all jobs will use the new properties. Is it right?
>
>
> If you set a property programmatically when configuring your job, that will
> be available in the JobConf on all machines for that job only. If you set a
> property in your hadoop-site.xml on the submitting machine, then I think
> that will also be available for the job on all nodes.
>
> - Aaron
>
>
> >
> >    Thanks again.
> >    Inifok
> >
> > 2009/8/20 Aaron Kimball <aa...@cloudera.com>
> >
> > > Hi Inifok,
> > >
> > > This is a confusing aspect of Hadoop, I'm afraid.
> > >
> > > Settings are divided into two categories: "per-job" and "per-node."
> > > Unfortunately, which are which isn't documented.
> > >
> > > Some settings are applied to the node that is being used. So for
> > > example, if you set fs.default.name on a node to be
> > > "hdfs://some.namenode:8020/", then any FS connections you make from
> > > that node will go to some.namenode. If a different machine in your
> > > cluster has fs.default.name set to hdfs://other.namenode, then that
> > > machine will connect to a different namenode.
> > >
> > > Another example of a per-machine setting is
> > > mapred.tasktracker.map.tasks.maximum; this tells a tasktracker the
> > > maximum number of tasks it should run in parallel. Each tasktracker is
> > > free to configure this value differently, e.g., if you have some
> > > quad-core and some eight-core machines. dfs.data.dir tells a datanode
> > > where its data directories should be kept. Naturally, this can vary
> > > machine-to-machine.
> > >
> > > Other settings are applied to a job as a whole. These settings are
> > > configured when you submit the job. So if you write
> > > conf.setInt("mapred.reduce.parallel.copies", 20) in your code, this
> > > will be the setting for the job. Settings that you don't explicitly put
> > > in your code are drawn from the hadoop-site.xml file on the machine the
> > > job is submitted from.
> > >
> > > In general, I strongly recommend you save yourself some pain by keeping
> > > your configuration files as identical as possible :)
> > > Good luck,
> > > - Aaron
> > >
> > >
> > > On Wed, Aug 19, 2009 at 7:21 AM, yang song <ha...@gmail.com>
> > > wrote:
> > >
> > > > Hello, everybody
> > > >    I feel puzzled about setting properties in hadoop-site.xml.
> > > >    Suppose I submit a job from machine A, and the JobTracker runs on
> > > > machine B, so there are two hadoop-site.xml files. Now, I increase
> > > > "mapred.reduce.parallel.copies" (e.g., to 10) on machine B since I
> > > > want to make the copy phase faster. However,
> > > > "mapred.reduce.parallel.copies" in the WebUI is still 5. When I
> > > > increase it on machine A, it changes. So I feel very puzzled. Why
> > > > doesn't it work when I change it on B? What's more, when I add some
> > > > new properties on B, those properties do show up in the WebUI. So why
> > > > can't I change existing properties through machine B? Must some
> > > > properties be changed through A and others through B?
> > > >    Thank you!
> > > >    Inifok
> > > >
> > >
> >
>

Re: Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?)

Posted by stephen mulcahy <st...@deri.org>.
Hi Amogh,

Thanks for your reply. Some comments below.

Amogh Vasekar wrote:
> AFAIK,
> hadoop.tmp.dir : Used by NN and DN for directory listings and metadata (don't have much info on this)

I've been running some test jobs against a local hadoop cluster from
eclipse using the eclipse plugin. The eclipse plugin manages the client
equivalent of hadoop-site.xml in its settings. One of the settings
there is hadoop.tmp.dir. When running a hadoop job through eclipse,
items are sometimes created on the client machine in the designated
hadoop.tmp.dir, so there is a client notion of a hadoop.tmp.dir as well.

Some of my confusion arose from trying to get client jobs working on our
cluster while running as someone other than the superuser - what I thought
were permission errors may have been caused by blindly copying the
hadoop.tmp.dir setting from a cluster node for client use instead of
setting it to something client-specific like /tmp/hadoop-${user}.
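
(For the record, by "client specific" I mean something like this in the
client's hadoop-site.xml - just a sketch, the path is my own choice, and
${user.name} is expanded by Hadoop's configuration machinery.)

   <property>
     <name>hadoop.tmp.dir</name>
     <!-- client-local scratch space, owned by the submitting user -->
     <value>/tmp/hadoop-${user.name}</value>
   </property>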

> java.opts & ulimit : ulimit defines the maximum amount of virtual memory for a launched task. java.opts is the amount of memory reserved for a task.
> When setting these you need to account for memory set aside for hadoop daemons like the tasktracker etc.

Right. This is the one tunable I mostly understand :)

> mapred.map.tasks and mapred.reduce.tasks : these are job-wide configurations, not per-task configurations for a node. They act as a hint to the hadoop framework, and explicitly setting them is not always recommended, unless you want to run a no-reducer job.

I notice in the Hadoop samples like WordCount, the mappers and reducers 
are being explicitly set in the code. Is there any standard approach to 
this? Is it better to set this in the client's hadoop-site.xml after 
understanding the capacity of the cluster or do developers normally make 
their own call on this? As a hadoop cluster admin, should I let 
developers worry about this themselves and concentrate on the 
per-machine limits below?
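
For context, the explicit setting I'm referring to looks roughly like this
in the samples (old mapred API; the numbers here are ones I've made up):

   import org.apache.hadoop.mapred.JobConf;

   // Inside the job driver:
   JobConf conf = new JobConf(WordCount.class);
   conf.setNumMapTasks(10);   // only a hint - input splits decide the real count
   conf.setNumReduceTasks(4); // honoured exactly; 0 gives a map-only job

so whatever a developer hard-codes there wins over the config files.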

> mapred.tasktracker.(map|reduce).tasks.maximum : Limit on concurrent tasks running on a machine, typically set according to the cores & memory each map/reduce task will be using.

Right, so as an admin, these are probably the more interesting ones to
worry about.

> Also, typically client and datanodes will be the same.

Given my comments above, is this correct?

Thanks,

-stephen

-- 
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com

RE: Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?)

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
AFAIK,
hadoop.tmp.dir : Used by NN and DN for directory listings and metadata (don't have much info on this)

java.opts & ulimit : ulimit defines the maximum amount of virtual memory for a launched task. java.opts is the amount of memory reserved for a task.
When setting these you need to account for memory set aside for hadoop daemons like the tasktracker etc.

mapred.map.tasks and mapred.reduce.tasks : these are job-wide configurations, not per-task configurations for a node. They act as a hint to the hadoop framework, and explicitly setting them is not always recommended, unless you want to run a no-reducer job.

mapred.tasktracker.(map|reduce).tasks.maximum : Limit on concurrent tasks running on a machine, typically set according to the cores & memory each map/reduce task will be using.
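
To illustrate the java.opts and per-machine maximums (values invented here
for a hypothetical 8-core node running ~1GB tasks), these would sit in that
machine's hadoop-site.xml:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value> <!-- concurrent map slots on this node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value> <!-- concurrent reduce slots on this node -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value> <!-- java.opts: heap for each task JVM -->
  </property>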

Also, typically client and datanodes will be the same.

Thanks,
Amogh
-----Original Message-----
From: stephen mulcahy [mailto:stephen.mulcahy@deri.org] 
Sent: Thursday, August 20, 2009 3:22 PM
To: common-user@hadoop.apache.org
Subject: Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?)

Hi folks,

Sorry to cut across this discussion but I'm experiencing some similar 
confusion about where to change some parameters.

In particular, I'm not entirely clear on how the following should be 
used - clarification welcome (I'm happy to pull some of this together on 
a blog once I get some clarity).

In hadoop/conf/hadoop-site.xml

hadoop.tmp.dir - when submitting a job from a client (not one of the
hadoop cluster machines), does this specify a directory local to the
client in which hadoop creates temporary files, or is it a directory used
on each hadoop machine on which the job runs? I notice that the Cloudera
configurator specifies this as /tmp/hadoop-${user.name} - this seems
like a nice approach to use; is it safe for this tmp.dir to be blown
away when a machine is rebooted?

mapred.child.java.opts (-Xmx) and mapred.child.ulimit

Presumably these should be set totally differently on the namenode, data
nodes and client machine (assuming they are different?). In the case of 
the namenode and data nodes, I assume they should be set quite large. In 
the case of the client, should they be set so that the number of tasks * 
allocated memory is roughly equal to the amount of memory free on each 
data node?

mapred.map.tasks and mapred.reduce.tasks

My understanding is that on the namenode and data nodes these should be
set to the number of cores or less. Is that correct? For the
client, should these be bumped closer to the total number of cores that 
are available in the overall cluster?

mapred.tasktracker.tasks.maximum

Does this work as a cap on mapred.map.tasks and mapred.reduce.tasks? Is 
it necessary to use this as well as mapred.map.tasks and
mapred.reduce.tasks?


Finally, in hadoop/conf/hadoop-env.sh

export HADOOP_HEAPSIZE=xxxx

Should this be changed normally? If so, how large should it normally be? 
50% of total system memory?

Thanks for any input,

-stephen

-- 
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com

Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?)

Posted by stephen mulcahy <st...@deri.org>.
Hi folks,

Sorry to cut across this discussion but I'm experiencing some similar 
confusion about where to change some parameters.

In particular, I'm not entirely clear on how the following should be 
used - clarification welcome (I'm happy to pull some of this together on 
a blog once I get some clarity).

In hadoop/conf/hadoop-site.xml

hadoop.tmp.dir - when submitting a job from a client (not one of the
hadoop cluster machines), does this specify a directory local to the
client in which hadoop creates temporary files, or is it a directory used
on each hadoop machine on which the job runs? I notice that the Cloudera
configurator specifies this as /tmp/hadoop-${user.name} - this seems
like a nice approach to use; is it safe for this tmp.dir to be blown
away when a machine is rebooted?

mapred.child.java.opts (-Xmx) and mapred.child.ulimit

Presumably these should be set totally differently on the namenode, data
nodes and client machine (assuming they are different?). In the case of 
the namenode and data nodes, I assume they should be set quite large. In 
the case of the client, should they be set so that the number of tasks * 
allocated memory is roughly equal to the amount of memory free on each 
data node?

mapred.map.tasks and mapred.reduce.tasks

My understanding is that on the namenode and data nodes these should be
set to the number of cores or less. Is that correct? For the
client, should these be bumped closer to the total number of cores that 
are available in the overall cluster?

mapred.tasktracker.tasks.maximum

Does this work as a cap on mapred.map.tasks and mapred.reduce.tasks? Is 
it necessary to use this as well as mapred.map.tasks and
mapred.reduce.tasks?


Finally, in hadoop/conf/hadoop-env.sh

export HADOOP_HEAPSIZE=xxxx

Should this be changed normally? If so, how large should it normally be? 
50% of total system memory?

Thanks for any input,

-stephen

-- 
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com

Re: How does hadoop deal with hadoop-site.xml?

Posted by Aaron Kimball <aa...@cloudera.com>.
On Wed, Aug 19, 2009 at 8:39 PM, yang song <ha...@gmail.com> wrote:

>    Thank you, Aaron. I've benefited a lot. "per-node" means some settings
> associated with the node, e.g., "fs.default.name", "mapred.job.tracker",
> etc. "per-job" means some settings associated with the jobs which are
> submitted from the node, e.g., "mapred.reduce.tasks". That means, if I set
> "per-job" properties on the JobTracker, it won't work. Is my
> understanding right?


It will work if you submit your job (run "hadoop jar ....") from the
JobTracker node :) It won't if you submit your job from elsewhere.


>
>    In addition, when I add some new properties, e.g.,
> "mapred.inifok.setting" on JobTracker, I can find it in every job.xml from
> WebUI. I think all jobs will use the new properties. Is it right?


If you set a property programmatically when configuring your job, that will
be available in the JobConf on all machines for that job only. If you set a
property in your hadoop-site.xml on the submitting machine, then I think
that will also be available for the job on all nodes.
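
A tiny sketch of the task-side half (the class name and values are
invented; "mapred.inifok.setting" is borrowed from your example, and the
driver would have done conf.set("mapred.inifok.setting", "some-value")):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class InifokMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String setting;

    // The framework calls configure() with the job's conf on every node,
    // so anything the driver conf.set() into the JobConf is visible here.
    public void configure(JobConf job) {
      setting = job.get("mapred.inifok.setting", "a-default");
    }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text(setting), value);
    }
  }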

- Aaron


>
>    Thanks again.
>    Inifok
>
> 2009/8/20 Aaron Kimball <aa...@cloudera.com>
>
> > Hi Inifok,
> >
> > This is a confusing aspect of Hadoop, I'm afraid.
> >
> > Settings are divided into two categories: "per-job" and "per-node."
> > Unfortunately, which are which isn't documented.
> >
> > Some settings are applied to the node that is being used. So for
> > example, if you set fs.default.name on a node to be
> > "hdfs://some.namenode:8020/", then any FS connections you make from
> > that node will go to some.namenode. If a different machine in your
> > cluster has fs.default.name set to hdfs://other.namenode, then that
> > machine will connect to a different namenode.
> >
> > Another example of a per-machine setting is
> > mapred.tasktracker.map.tasks.maximum; this tells a tasktracker the
> > maximum number of tasks it should run in parallel. Each tasktracker is
> > free to configure this value differently, e.g., if you have some
> > quad-core and some eight-core machines. dfs.data.dir tells a datanode
> > where its data directories should be kept. Naturally, this can vary
> > machine-to-machine.
> >
> > Other settings are applied to a job as a whole. These settings are
> > configured when you submit the job. So if you write
> > conf.setInt("mapred.reduce.parallel.copies", 20) in your code, this
> > will be the setting for the job. Settings that you don't explicitly put
> > in your code are drawn from the hadoop-site.xml file on the machine the
> > job is submitted from.
> >
> > In general, I strongly recommend you save yourself some pain by keeping
> > your configuration files as identical as possible :)
> > Good luck,
> > - Aaron
> >
> >
> > On Wed, Aug 19, 2009 at 7:21 AM, yang song <ha...@gmail.com>
> > wrote:
> >
> > > Hello, everybody
> > >    I feel puzzled about setting properties in hadoop-site.xml.
> > >    Suppose I submit a job from machine A, and the JobTracker runs on
> > > machine B, so there are two hadoop-site.xml files. Now, I increase
> > > "mapred.reduce.parallel.copies" (e.g., to 10) on machine B since I
> > > want to make the copy phase faster. However,
> > > "mapred.reduce.parallel.copies" in the WebUI is still 5. When I
> > > increase it on machine A, it changes. So I feel very puzzled. Why
> > > doesn't it work when I change it on B? What's more, when I add some
> > > new properties on B, those properties do show up in the WebUI. So why
> > > can't I change existing properties through machine B? Must some
> > > properties be changed through A and others through B?
> > >    Thank you!
> > >    Inifok
> > >
> >
>

Re: How does hadoop deal with hadoop-site.xml?

Posted by yang song <ha...@gmail.com>.
    Thank you, Aaron. I've benefited a lot. "per-node" means some settings
associated with the node, e.g., "fs.default.name", "mapred.job.tracker",
etc. "per-job" means some settings associated with the jobs which are
submitted from the node, e.g., "mapred.reduce.tasks". That means, if I set
"per-job" properties on the JobTracker, it won't work. Is my
understanding right?
    In addition, when I add some new properties, e.g.,
"mapred.inifok.setting" on JobTracker, I can find it in every job.xml from
WebUI. I think all jobs will use the new properties. Is it right?
    Thanks again.
    Inifok

2009/8/20 Aaron Kimball <aa...@cloudera.com>

> Hi Inifok,
>
> This is a confusing aspect of Hadoop, I'm afraid.
>
> Settings are divided into two categories: "per-job" and "per-node."
> Unfortunately, which are which isn't documented.
>
> Some settings are applied to the node that is being used. So for example,
> if you set fs.default.name on a node to be "hdfs://some.namenode:8020/",
> then any FS connections you make from that node will go to some.namenode.
> If a different machine in your cluster has fs.default.name set to
> hdfs://other.namenode, then that machine will connect to a different
> namenode.
>
> Another example of a per-machine setting is
> mapred.tasktracker.map.tasks.maximum; this tells a tasktracker the maximum
> number of tasks it should run in parallel. Each tasktracker is free to
> configure this value differently, e.g., if you have some quad-core and
> some eight-core machines. dfs.data.dir tells a datanode where its data
> directories should be kept. Naturally, this can vary machine-to-machine.
>
> Other settings are applied to a job as a whole. These settings are
> configured when you submit the job. So if you write
> conf.setInt("mapred.reduce.parallel.copies", 20) in your code, this will
> be the setting for the job. Settings that you don't explicitly put in your
> code are drawn from the hadoop-site.xml file on the machine the job is
> submitted from.
>
> In general, I strongly recommend you save yourself some pain by keeping
> your configuration files as identical as possible :)
> Good luck,
> - Aaron
>
>
> On Wed, Aug 19, 2009 at 7:21 AM, yang song <ha...@gmail.com>
> wrote:
>
> > Hello, everybody
> >    I feel puzzled about setting properties in hadoop-site.xml.
> >    Suppose I submit a job from machine A, and the JobTracker runs on
> > machine B, so there are two hadoop-site.xml files. Now, I increase
> > "mapred.reduce.parallel.copies" (e.g., to 10) on machine B since I want
> > to make the copy phase faster. However, "mapred.reduce.parallel.copies"
> > in the WebUI is still 5. When I increase it on machine A, it changes. So
> > I feel very puzzled. Why doesn't it work when I change it on B? What's
> > more, when I add some new properties on B, those properties do show up
> > in the WebUI. So why can't I change existing properties through machine
> > B? Must some properties be changed through A and others through B?
> >    Thank you!
> >    Inifok
> >
>

Re: How does hadoop deal with hadoop-site.xml?

Posted by Aaron Kimball <aa...@cloudera.com>.
Hi Inifok,

This is a confusing aspect of Hadoop, I'm afraid.

Settings are divided into two categories: "per-job" and "per-node."
Unfortunately, which are which isn't documented.

Some settings are applied to the node that is being used. So for example, if
you set fs.default.name on a node to be "hdfs://some.namenode:8020/", then
any FS connections you make from that node will go to some.namenode. If a
different machine in your cluster has fs.default.name set to
hdfs://other.namenode, then that machine will connect to a different
namenode.

Another example of a per-machine setting is
mapred.tasktracker.map.tasks.maximum; this tells a tasktracker the maximum
number of tasks it should run in parallel. Each tasktracker is free to
configure this value differently, e.g., if you have some quad-core and some
eight-core machines. dfs.data.dir tells a datanode where its data
directories should be kept. Naturally, this can vary machine-to-machine.
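
To make that concrete, here is a sketch of the per-node entries in one
machine's conf/hadoop-site.xml (the hostname is from the example above;
the paths and slot count are invented):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://some.namenode:8020/</value> <!-- this node's default FS -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/dfs/data,/disk2/dfs/data</value> <!-- this datanode's disks -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value> <!-- concurrent map tasks on this tasktracker -->
  </property>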

Other settings are applied to a job as a whole. These settings are
configured when you submit the job. So if you write
conf.set("mapred.reduce.parallel.copies", 20) in your code, this will be the
setting for the job. Settings that you don't explicitly put in your code,
are drawn from the hadoop-site.xml file on the machine where the job is
submitted from.
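
Here's a minimal driver showing that per-job override (the class name and
paths are placeholders; the defaults supply an identity mapper, so this
sketch would actually run):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class ParallelCopiesDemo {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ParallelCopiesDemo.class);
      // Per-job: overrides hadoop-site.xml everywhere, for this job only.
      conf.setInt("mapred.reduce.parallel.copies", 20);
      FileInputFormat.setInputPaths(conf, new Path("/in"));
      FileOutputFormat.setOutputPath(conf, new Path("/out"));
      JobClient.runJob(conf);
    }
  }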

In general, I strongly recommend you save yourself some pain by keeping your
configuration files as identical as possible :)
Good luck,
- Aaron


On Wed, Aug 19, 2009 at 7:21 AM, yang song <ha...@gmail.com> wrote:

> Hello, everybody
>    I feel puzzled about setting properties in hadoop-site.xml.
>    Suppose I submit a job from machine A, and the JobTracker runs on
> machine B, so there are two hadoop-site.xml files. Now, I increase
> "mapred.reduce.parallel.copies" (e.g., to 10) on machine B since I want
> to make the copy phase faster. However, "mapred.reduce.parallel.copies"
> in the WebUI is still 5. When I increase it on machine A, it changes. So
> I feel very puzzled. Why doesn't it work when I change it on B? What's
> more, when I add some new properties on B, those properties do show up
> in the WebUI. So why can't I change existing properties through machine
> B? Must some properties be changed through A and others through B?
>    Thank you!
>    Inifok
>