Posted to common-user@hadoop.apache.org by Vasilis Liaskovitis <vl...@gmail.com> on 2009/08/18 02:17:51 UTC

utilizing all cores on single-node hadoop

Hi,

I am a beginner trying to set up a few simple hadoop tests on a single
node before moving on to a cluster. I am just using the simple
wordcount example for now. My question is: what's the best way to
guarantee utilization of all cores on a single node? So, assuming a
single node with 16 cores, what are the suggested values for:

mapred.map.tasks
mapred.reduce.tasks
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

I found an old, similar thread,
http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg00152.html
and I have followed similar settings for my 16-core system (e.g.
map.tasks=reduce.tasks=90 and map.tasks.maximum=100); however, I always
see only 3-4 cores utilized in top.

- The description for mapred.map.tasks says "Ignored when
mapred.job.tracker is 'local'", and in my case
mapred.job.tracker=hdfs://localhost:54311. Is it possible that the
map.tasks and reduce.tasks values I am setting are being ignored? How
can I verify this? Is there a way to enforce my values even in a
localhost scenario like this?

- Are there other config options/values that I need to set besides the
four I mentioned above?

- Also, is it possible that for short tasks I won't see full
utilization of all cores anyway? Something along those lines is
mentioned in an issue from a year ago,
http://issues.apache.org/jira/browse/HADOOP-3136:
"If the individual tasks are very short i.e. run for less than the
heartbeat interval the TaskTracker serially runs one task at a time"

I am using hadoop-0.19.2.

thanks for any guidance,

- Vasilis

Re: utilizing all cores on single-node hadoop

Posted by Vasilis Liaskovitis <vl...@gmail.com>.
Hi,

thanks to everyone for the valuable suggestions.

What would be the default number of map and reduce tasks for the
sort-rand example described at:
http://wiki.apache.org/hadoop/Sort
This is one of the simplest possible examples, and it uses the
identity mapper/reducer.

I am seeing 160 map tasks and 27 reduce tasks on my jobtracker web UI
for a single-node test. The number of map tasks seems particularly
odd, because my tasktracker.reduce.tasks.maximum=30 and
mapred.map.tasks=24 settings are well below 160.

In general, is the number of map/reduce tasks for a specific job set
by the job-specific Mapper/Reducer Java classes, or is it inferred
somehow by the framework?
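
For example, with the 0.20 JobConf API I would have expected a driver
along these lines to control the counts directly (just a sketch; the
class name is made up, and the paths assume the rand/rand-sort layout
from the wiki page):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TaskCountSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TaskCountSketch.class);
        conf.setJobName("task-count-sketch");
        // No mapper/reducer classes set: the old API then defaults to the
        // identity mapper and reducer, like the sort example.
        conf.setNumMapTasks(24);    // honored, or just a hint to the InputFormat?
        conf.setNumReduceTasks(27);
        FileInputFormat.setInputPaths(conf, new Path("rand"));
        FileOutputFormat.setOutputPath(conf, new Path("rand-sort"));
        JobClient.runJob(conf);
      }
    }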

Also, cores may be idle because the job is I/O-bound. What are the
config parameters related to memory/disk buffering of map outputs and
reduce merges? With the default io.sort.mb and io.sort.factor, would
you expect the sort example to be I/O-bound? Some profiling runs
should help investigate this soon, but at this point I am just asking
for any intuition from more experienced users.
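
For concreteness, this is the kind of per-job override I have in mind
(a sketch with made-up values):

    import org.apache.hadoop.mapred.JobConf;

    public class SortBufferTuning {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Map-side sort buffer: map outputs are collected here and
        // spilled to disk whenever it fills (default 100 MB).
        conf.setInt("io.sort.mb", 200);
        // Streams merged at once during the sort/merge phases (default
        // 10); a larger factor means fewer merge passes over the disk.
        conf.setInt("io.sort.factor", 50);
        System.out.println("io.sort.mb = " + conf.getInt("io.sort.mb", 100));
      }
    }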

I have switched to using hadoop-0.20.0. (I believe this version has
split the site-specific overrides file, conf/hadoop-site.xml, into
conf/mapred-site.xml and several other conf/ files; let me know if the
site overrides don't work or should go somewhere else for this
version.)
Does 0.20.0 have a different job scheduler or different default
settings than 0.19.2? I am getting higher core utilizations with
0.20.0 for some jobs, e.g. the wordcount examples.

thanks,

- Vasilis

On Wed, Aug 19, 2009 at 9:09 AM, Jason Venner <ja...@gmail.com> wrote:
> Another reason you may not see full utilization of your map task slots per
> tracker is if the mean run time of a task is very short. All the slots are
> being used, but the setup and teardown for each task take long enough,
> compared to the task's run time, that it appears not all the task slots
> are in use.

Re: utilizing all cores on single-node hadoop

Posted by Jason Venner <ja...@gmail.com>.
Another reason you may not see full utilization of your map task slots per
tracker is if the mean run time of a task is very short. All the slots are
being used, but the setup and teardown for each task take long enough,
compared to the task's run time, that it appears not all the task slots
are in use.
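
As a back-of-envelope model (the numbers below are purely illustrative),
if per-task overhead rivals the useful run time, a fully loaded tracker
still looks mostly idle in top:

    public class ShortTaskUtilization {
      public static void main(String[] args) {
        // Illustrative numbers only: a 2s task that pays ~3s of JVM
        // startup, setup and teardown does useful work 40% of the time.
        double runSeconds = 2.0;
        double overheadSeconds = 3.0;
        double busyFraction = runSeconds / (runSeconds + overheadSeconds);
        System.out.printf("useful work fraction: %.0f%%%n", busyFraction * 100);
      }
    }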


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals

RE: utilizing all cores on single-node hadoop

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
While setting mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, please consider the memory usage your application might have, since all tasks will be competing for the same memory, which might reduce overall performance.
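
As a rough sanity check (the numbers are only an example), the worst-case
child-JVM heap demand is roughly the slot count times the -Xmx value from
mapred.child.java.opts:

    public class SlotMemoryCheck {
      public static void main(String[] args) {
        int mapSlots = 12;      // mapred.tasktracker.map.tasks.maximum
        int reduceSlots = 8;    // mapred.tasktracker.reduce.tasks.maximum
        int childHeapMb = 512;  // e.g. mapred.child.java.opts = -Xmx512m
        int worstCaseMb = (mapSlots + reduceSlots) * childHeapMb;
        // Keep this comfortably below physical RAM (minus the daemons and
        // OS page cache) or the node will start swapping.
        System.out.println("worst-case child heap: " + worstCaseMb + " MB");
      }
    }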

Thanks,
Amogh

Re: utilizing all cores on single-node hadoop

Posted by Harish Mallipeddi <ha...@gmail.com>.
Hi Vasilis,

Here's some info that I know:

mapred.map.tasks - this is a job-specific setting. It is just a hint to
the InputFormat as to how many InputSplits (and hence MapTasks) you want
for your job. The default InputFormat classes usually cap each split at
the HDFS block size (64 MB by default), so if your input data is less
than 64 MB it will result in just one split, and hence only one MapTask.
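
To make that concrete, here is a simplified sketch of the split sizing
the old-API FileInputFormat does (exact behavior depends on the
InputFormat; the input size below is just an example):

    public class SplitSizeSketch {
      public static void main(String[] args) {
        long blockSize = 64L << 20;  // HDFS block size, 64 MB default
        long minSize   = 1;          // mapred.min.split.size default
        long totalSize = 10L << 30;  // say, 10 GB of input
        long numMaps   = 24;         // the mapred.map.tasks hint
        // The hint only enters through goalSize; the block size caps it.
        long goalSize  = totalSize / numMaps;  // ~426 MB, bigger than a block
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
        System.out.println("split size = " + splitSize);         // 64 MB wins
        System.out.println("maps ~= " + totalSize / splitSize);  // ~160, not 24
      }
    }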

mapred.reduce.tasks - this is also a job-specific setting.

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

The above two are tasktracker-specific config options and determine how
many "simultaneous" MapTasks and ReduceTasks run on each TT. Ideally, on
an 8-core box you would want to set map.tasks.maximum to something like 6
and reduce.tasks.maximum to 4 to utilize all 8 cores to the maximum
(there's a little bit of over-subscription to account for tasks idling
while doing I/O).
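
Scaled up to a 16-core node like yours, the same heuristic might look
something like this (my rule of thumb, not an official formula):

    public class SlotHeuristic {
      public static void main(String[] args) {
        // Oversubscribe the cores a little, biased toward maps, to cover
        // tasks idling on I/O (the 8-core case above gives 6 and 4).
        int cores = 16;
        int mapSlotMax = (cores * 3) / 4;  // 12 -> mapred.tasktracker.map.tasks.maximum
        int reduceSlotMax = cores / 2;     //  8 -> mapred.tasktracker.reduce.tasks.maximum
        System.out.println(mapSlotMax + " map slots, " + reduceSlotMax + " reduce slots");
        // Note these are daemon-side settings: they go in the TaskTracker's
        // own config and take effect on restart, not per job.
      }
    }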

In the web admin console, how many map-tasks and reduce-tasks are reported
to have been launched for your job?

Cheers,
Harish

-- 
Harish Mallipeddi
http://blog.poundbang.in