Posted to user@hadoop.apache.org by Elaine Gan <el...@gmo.jp> on 2012/09/11 03:56:10 UTC

Understanding of the hadoop distribution system (tuning)

Hi,

I'm new to Hadoop and I've just played around with MapReduce.
I would like to check whether my understanding of Hadoop is correct,
and I would appreciate it if anyone could correct me where I'm wrong.

I have about 518 MB of data, and I wrote an MR program to process it.
Here are some of my settings in mapred-site.xml:
---------------------------------------------------------------
mapred.tasktracker.map.tasks.maximum = 20
mapred.tasktracker.reduce.tasks.maximum = 20
---------------------------------------------------------------
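For reference, in an actual mapred-site.xml these settings are written as XML property elements; a sketch assuming the MRv1 property names shown above:

```xml
<!-- mapred-site.xml (MRv1); values taken from the settings above -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>20</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>20</value>
</property>
```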
My block size is the default, 64 MB.
With my data size of 518 MB, I guess setting the maximum for MR tasks to 20
is far more than enough (518/64 ≈ 8); did I get that right?
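As an aside on the arithmetic: 518/64 is really about 8.1, yet the job shows 8 maps. That is consistent with FileInputFormat's split logic, which folds a small remainder into the last split when it fits within a 1.1 "slop" factor. A rough Python sketch of that behaviour (an illustration, not Hadoop's actual code):

```python
# Sketch of how MRv1's FileInputFormat decides split counts, assuming
# the standard 1.1 slop factor. Sizes are in MB for readability.
SPLIT_SLOP = 1.1

def num_splits(file_size_mb, block_size_mb=64):
    splits = 0
    remaining = file_size_mb
    # Cut full-block splits while the remainder is more than 1.1 blocks.
    while remaining / block_size_mb > SPLIT_SLOP:
        splits += 1
        remaining -= block_size_mb
    if remaining > 0:  # the last, possibly slightly oversized, split
        splits += 1
    return splits

print(num_splits(518))  # 8: the final split is 70 MB, within the slop
```

So a 518 MB file yields 8 map tasks rather than 9, because the trailing 6 MB is merged into the last split.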

When I run the MR program, I can see on the Map/Reduce Administration
page that Maps Total = 8, so I assume everything is going well here;
once again, please correct me if I'm wrong.
(Sometimes it shows only Maps Total = 3.)

There's one thing I'm uncertain about regarding Hadoop's distribution.
Does Maps Total = 8 mean that there are 8 map tasks split among all
the datanodes (tasktrackers)?
Is there any way I can check whether the tasks are shared among the
datanodes (where the tasktrackers are running)?
When I click each link under a Task Id, I can see "Input Split
Locations" listed under the task details; if the inputs are split
between datanodes, does that mean everything is working well?

I need to make sure everything is running well, because my MR job took
around 6 hours to finish even though the input is small. (Well, I know
Hadoop is not meant for small data.) I'm not sure whether my
configuration is wrong or Hadoop is just not suitable for my case.
I'm actually running a Mahout k-means analysis.

Thank you for your time.





Re: Understanding of the hadoop distribution system (tuning)

Posted by Jagat Singh <ja...@gmail.com>.
Hello Elaine,

You did not tell us your cluster size: the number of nodes, and the cores
in each node.

What sort of work are you doing? Six hours for 518 MB of data is a huge
amount of time.

The number of map tasks would be 518/64, so roughly that many map tasks
need to run to process your data.

They can run on a single node or on multiple nodes, depending on the
available slots. Did you check the JobTracker page while the job was
executing? There you can see on which node each task is being processed;
you can go to the Running Tasks page.

Regards,

Jagat Singh


On Tue, Sep 11, 2012 at 11:56 AM, Elaine Gan <el...@gmo.jp> wrote:

> Hi,
>
> I'm new to hadoop and i've just played around with map reduce.
> I would like to check if my understanding to hadoop is correct and i
> would appreciate if anyone could correct me if i'm wrong.
>
> I have a data of around 518MB, and i wrote a MR program to process it.
> Here are some of my settings in my mapred-site.xml.
> ---------------------------------------------------------------
> mapred.tasktracker.map.tasks.maximum = 20
> mapred.tasktracker.reduce.tasks.maximum = 20
> ---------------------------------------------------------------
> My block size is default, 64MB
> With my data size = 518MB, i guess setting the maximum for MR task to 20
> is far more than enough (518/64 = 8) , did i get it correctly?
>
> When i run the MR program, i could see in the Map/Reduce Administration
> page that the number of Maps Total = 8, so i assume that everything is
> going well here, once again if i'm wrong please correct me.
> (Sometimes it shows only Maps Total = 3)
>
> There's one thing which i'm uncertain about hadoop distribution.
> Is the Maps Total = 8 means that there are 8 map tasks split among all
> the data nodes (task trackers)?
> Is there anyway i can checked whether all the tasks are shared among
> datanodes (where task trackers are working).
> When i clicked on each link under that Task Id, i can see there's "Input
> Split Locations" stated under each task details, if the inputs are
> splitted between data nodes, does that means that everything is working
> well?
>
> I need to make sure i got everything running well because my MR took
> around 6 hours to finish despite the input size is small.. (Well, i know
> hadoop is not meant for small data), I'm not sure whether it's my
> configuration that goes wrong or hadoop is just not suitable for my case.
> I'm actually running a mahout kmeans analysis.
>
> Thank you for your time.
>
>
>
>
>

Re: Understanding of the hadoop distribution system (tuning)

Posted by Bejoy Ks <be...@gmail.com>.
Hi Elaine

Slots (mapred.tasktracker.[map/reduce].tasks.maximum) are configured at
the cluster/node/TaskTracker level, not at the job level. You configure
them based on the resources available on each node. For this you should
count cores, not CPUs: say you have 4 quad-core processors, then you have
16 cores, and if they are hyper-threaded you can treat the effective
number of cores as 1 to 1.5 times the actual number. You also need to
consider memory when sizing slots: if a task JVM
(mapred.child.java.opts) is configured with 2 GB and you have just 16 GB
of memory at your disposal, then you can have only 16/2 = 8 slots. If
you configure more slots than that, it can lead to swapping and OOM
issues when all the slots are used in parallel.
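That sizing rule can be sketched as a quick calculation (the helper below is hypothetical, not an official Hadoop formula; the numbers come from the example above):

```python
# Rough slot-sizing sketch: take the smaller of the CPU-based and
# memory-based limits. Names and the exact rule are illustrative.
def max_slots(cores, ht_factor, mem_gb, task_heap_gb):
    by_cpu = int(cores * ht_factor)       # e.g. 16 cores * 1.5 = 24
    by_mem = int(mem_gb // task_heap_gb)  # e.g. 16 GB / 2 GB per JVM = 8
    return min(by_cpu, by_mem)            # memory is the binding limit here

print(max_slots(cores=16, ht_factor=1.5, mem_gb=16, task_heap_gb=2))  # 8
```

With 16 GB of RAM and 2 GB per task JVM, memory caps you at 8 slots even though the cores could handle more.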

When map tasks run, you get a good proportion of data-local map tasks if
you have a good replication factor; the default of 3 is good.

Map tasks are scheduled on nodes by the JobTracker largely based on data
locality and available slots, so you cannot assume the map tasks will be
uniformly distributed across the cluster. If a TaskTracker has 8 map
slots, and a MapReduce job has 8 map tasks whose data all resides on
that node, then all 8 map tasks can end up on that same node.

A few responses inline.

Case (2)
Launched map tasks 0 0 2
Data-local map tasks 0 0 1

Hmm.. I don't quite understand this; in case (2), does it mean two map
tasks are actually reading data from the same datanode?
[Bejoy] Of the 2 launched map tasks, 1 had its data on the same node
where it executed (data-local), and the other ran on one node while
pulling its data from another node.

But anyway, is this monitoring needed for performance tuning?
[Bejoy] If you see a low number of data-local map tasks, you need to
look into it seriously, as it can degrade performance to a large extent.
With larger data volumes, a few non-data-local map tasks are common.

Regards
Bejoy KS

On Tue, Sep 11, 2012 at 11:37 AM, Elaine Gan <el...@gmo.jp> wrote:

> Hi Hermanth
>
> Thank you for your detailed answered. Your answers helped me much in
> understanding, especially on the Job UI.
>
> Sorry, i missed out my specs.
> NameNode (JobTracker) : CPUx4
> DataNode (TaskTracker) : CPUx4
>
> I am replying inline too.
>
> > > I have a data of around 518MB, and i wrote a MR program to process it.
> > > Here are some of my settings in my mapred-site.xml.
> > > ---------------------------------------------------------------
> > > mapred.tasktracker.map.tasks.maximum = 20
> > > mapred.tasktracker.reduce.tasks.maximum = 20
> > > ---------------------------------------------------------------
> > >
> >
> > These two configurations essentially tell the tasktrackers that they can
> > run 20 maps and 20 reduces in parallel on a machine. Is this what you
> > intended ? (Generally the sum of these two values should equal the number
> > of cores on your tasktracker node, or a little more).
> >
> > Also, would help if you can tell us your cluster size - i.e. number of
> > slaves.
>
> Cluster size (No of slaves) = 4
>
> Yes, i meant the maximum tasks that could be run in A machine is 20
> tasks, both map & reduce.
>
> > > My block size is default, 64MB
> > > With my data size = 518MB, i guess setting the maximum for MR task to
> 20
> > > is far more than enough (518/64 = 8) , did i get it correctly?
> > >
> > >
> > I suppose what you want is to run all the maps in parallel. For that, the
> > number of map slots in your cluster should be more than the number of
> maps
> > of your job (assuming there's a single job running). If the number of
> slots
> > is less than number of maps, the maps would be scheduled in multiple
> waves.
> > On your jobtracker main page, the Cluster Summary > Map Task Capacity
> gives
> > you the total slots available in your cluster.
>
> My Map Task Capacity = 80%
> So, from the explanation and from my data size and configuration,
> Data size = 518MB
> Number of map tasks required =  518/64 = 8 tasks
> This 8 tasks should be spread among 4 slaves, which means each nodes
> should be able to handle at least 2 tasks.
> And my settings was mapred.tasktracker.map.tasks.maximum = 20, which is
> more than enough, so it means the approach is correct?
> (Well i have CPUx4 in my machine, so in case of large data, i should
> divide it by 4 in order to determine the smallest figure for
> mapred.tasktracker.map.tasks.maximum)
>
> > > When i run the MR program, i could see in the Map/Reduce Administration
> > > page that the number of Maps Total = 8, so i assume that everything is
> > > going well here, once again if i'm wrong please correct me.
> > > (Sometimes it shows only Maps Total = 3)
> > >
> > This value tells us the number of maps that will run for the job.
>
> OK
>
>
> > > There's one thing which i'm uncertain about hadoop distribution.
> > > Is the Maps Total = 8 means that there are 8 map tasks split among all
> > > the data nodes (task trackers)?
> > > Is there anyway i can checked whether all the tasks are shared among
> > > datanodes (where task trackers are working).
> > >
> > There's no easy way to check this. The task page for every task shows the
> > attempts that ran for each task and where they ran under the 'Machine'
> > column.
> >
>
> Thank you, i see that they're processed on different "Machine", so i
> guess it's working correctly :)
>
> >
> > > When i clicked on each link under that Task Id, i can see there's
> "Input
> > > Split Locations" stated under each task details, if the inputs are
> > > splitted between data nodes, does that means that everything is working
> > > well?
> > >
> > >
> > I think this is just the location of the splits, including the replicas.
> > What you could see is if enough data local maps ran - which means that
> the
> > tasks mostly got their inputs from datanodes running on the same machine
> as
> > themselves. This is given by the counter "Data-local map tasks" on the
> job
> > UI page.
> >
> There are two cases under the Job UI.
> Counter                   Map Reduce Total
> -----------------------------------------
> Case (1)
> Launched map tasks 0 0 4
> Data-local map tasks 0 0 4
>
> Case (2)
> Launched map tasks 0 0 2
> Data-local map tasks 0 0 1
>
> Hmm.. not quite understand this, if case (2) it means two map tasks are
> actually reading data from same datanode?
>
> But anyway, is this monitoring needed for tuning performance?
>
>
> Thank you.
>
>
>

Re: Understanding of the hadoop distribution system (tuning)

Posted by Elaine Gan <el...@gmo.jp>.
Hi Hemanth

Thank you for your detailed answers. They helped me a lot, especially in
understanding the Job UI.

Sorry, I left out my specs:
NameNode (JobTracker): 4 CPUs
DataNode (TaskTracker): 4 CPUs each

I am replying inline too.

> > I have a data of around 518MB, and i wrote a MR program to process it.
> > Here are some of my settings in my mapred-site.xml.
> > ---------------------------------------------------------------
> > mapred.tasktracker.map.tasks.maximum = 20
> > mapred.tasktracker.reduce.tasks.maximum = 20
> > ---------------------------------------------------------------
> >
> 
> These two configurations essentially tell the tasktrackers that they can
> run 20 maps and 20 reduces in parallel on a machine. Is this what you
> intended ? (Generally the sum of these two values should equal the number
> of cores on your tasktracker node, or a little more).
> 
> Also, would help if you can tell us your cluster size - i.e. number of
> slaves.

Cluster size (No of slaves) = 4

Yes, I meant that the maximum number of tasks that can run on a single
machine is 20 each for map and reduce.

> > My block size is default, 64MB
> > With my data size = 518MB, i guess setting the maximum for MR task to 20
> > is far more than enough (518/64 = 8) , did i get it correctly?
> >
> >
> I suppose what you want is to run all the maps in parallel. For that, the
> number of map slots in your cluster should be more than the number of maps
> of your job (assuming there's a single job running). If the number of slots
> is less than number of maps, the maps would be scheduled in multiple waves.
> On your jobtracker main page, the Cluster Summary > Map Task Capacity gives
> you the total slots available in your cluster.

My Map Task Capacity = 80 (i.e. 4 slaves x 20 map slots).
So, from that explanation and from my data size and configuration:
Data size = 518MB
Number of map tasks required = 518MB / 64MB = 8 tasks
These 8 tasks should be spread among the 4 slaves, so each node only
needs to handle about 2 tasks.
And my setting mapred.tasktracker.map.tasks.maximum = 20 is more than
enough, so does that mean the approach is correct?
(Well, I have 4 CPUs per machine, so for larger data I should divide the
number of map tasks by 4 to determine the smallest reasonable figure for
mapred.tasktracker.map.tasks.maximum.)
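By the way, if it helps to sanity-check why 518MB with 64MB blocks comes
out as 8 maps rather than 9, here is a rough Python sketch of how
FileInputFormat sizes splits. The 1.1 "slop" factor is an assumption on my
part based on the Hadoop 1.x source: the last split may be up to 10%
larger than a block, so a tiny tail does not become its own map.

```python
SPLIT_SLOP = 1.1  # assumed from Hadoop 1.x FileInputFormat

def num_splits(file_size_mb, block_size_mb=64):
    # Approximate how many map tasks Hadoop creates for one file.
    splits = 0
    remaining = file_size_mb
    while remaining / block_size_mb > SPLIT_SLOP:
        splits += 1
        remaining -= block_size_mb
    if remaining > 0:
        splits += 1  # the tail (up to 1.1 blocks) becomes the final split
    return splits

# 518MB = 8 x 64MB + 6MB; the 6MB tail folds into the 8th split
print(num_splits(518))  # -> 8
```

So, if the slop factor applies in your version, 518/64 giving 8 maps
instead of 9 is expected behaviour and not a misconfiguration.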

> > When i run the MR program, i could see in the Map/Reduce Administration
> > page that the number of Maps Total = 8, so i assume that everything is
> > going well here, once again if i'm wrong please correct me.
> > (Sometimes it shows only Maps Total = 3)
> >
> This value tells us the number of maps that will run for the job.

OK


> > There's one thing which i'm uncertain about hadoop distribution.
> > Is the Maps Total = 8 means that there are 8 map tasks split among all
> > the data nodes (task trackers)?
> > Is there anyway i can checked whether all the tasks are shared among
> > datanodes (where task trackers are working).
> >
> There's no easy way to check this. The task page for every task shows the
> attempts that ran for each task and where they ran under the 'Machine'
> column.
> 

Thank you. I see that they ran on different machines under the 'Machine'
column, so I guess it's working correctly :)

> 
> > When i clicked on each link under that Task Id, i can see there's "Input
> > Split Locations" stated under each task details, if the inputs are
> > splitted between data nodes, does that means that everything is working
> > well?
> >
> >
> I think this is just the location of the splits, including the replicas.
> What you could see is if enough data local maps ran - which means that the
> tasks mostly got their inputs from datanodes running on the same machine as
> themselves. This is given by the counter "Data-local map tasks" on the job
> UI page.
> 
There are two cases under the Job UI.

Counter                 Map  Reduce  Total
------------------------------------------
Case (1)
Launched map tasks        0       0      4
Data-local map tasks      0       0      4

Case (2)
Launched map tasks        0       0      2
Data-local map tasks      0       0      1

Hmm, I don't quite understand this. In case (2), does it mean that only
one of the two map tasks read its data from a local datanode?

But anyway, is monitoring this necessary for tuning performance?


Thank you.



Re: Understanding of the hadoop distribution system (tuning)

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hi,

Responses inline to some points.

On Tue, Sep 11, 2012 at 7:26 AM, Elaine Gan <el...@gmo.jp> wrote:

> Hi,
>
> I'm new to hadoop and i've just played around with map reduce.
> I would like to check if my understanding to hadoop is correct and i
> would appreciate if anyone could correct me if i'm wrong.
>
> I have a data of around 518MB, and i wrote a MR program to process it.
> Here are some of my settings in my mapred-site.xml.
> ---------------------------------------------------------------
> mapred.tasktracker.map.tasks.maximum = 20
> mapred.tasktracker.reduce.tasks.maximum = 20
> ---------------------------------------------------------------
>

These two configurations essentially tell the tasktrackers that they can
run 20 maps and 20 reduces in parallel on a machine. Is this what you
intended? (Generally the sum of these two values should equal the number
of cores on your tasktracker node, or a little more.)

Also, it would help if you could tell us your cluster size - i.e. the
number of slaves.


> My block size is default, 64MB
> With my data size = 518MB, i guess setting the maximum for MR task to 20
> is far more than enough (518/64 = 8) , did i get it correctly?
>
>
I suppose what you want is to run all the maps in parallel. For that, the
number of map slots in your cluster should be more than the number of maps
of your job (assuming there's a single job running). If the number of slots
is less than number of maps, the maps would be scheduled in multiple waves.
On your jobtracker main page, the Cluster Summary > Map Task Capacity gives
you the total slots available in your cluster.
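As a back-of-the-envelope illustration of the waves idea (a sketch only;
the numbers are simply the ones from this thread, and "total slots" here
just means nodes times per-node map slots):

```python
import math

def map_waves(num_maps, nodes, map_slots_per_node):
    # Maps are scheduled in ceil(num_maps / total_map_slots) waves,
    # assuming a single job and all slots free.
    total_slots = nodes * map_slots_per_node
    return math.ceil(num_maps / total_slots)

# 4 slaves x 20 map slots = 80 slots, so 8 maps fit in a single wave
print(map_waves(8, 4, 20))    # -> 1
# A hypothetical 200-map job on the same cluster would need 3 waves
print(map_waves(200, 4, 20))  # -> 3
```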



> When i run the MR program, i could see in the Map/Reduce Administration
> page that the number of Maps Total = 8, so i assume that everything is
> going well here, once again if i'm wrong please correct me.
> (Sometimes it shows only Maps Total = 3)
>
>
This value tells us the number of maps that will run for the job.


> There's one thing which i'm uncertain about hadoop distribution.
> Is the Maps Total = 8 means that there are 8 map tasks split among all
> the data nodes (task trackers)?
> Is there anyway i can checked whether all the tasks are shared among
> datanodes (where task trackers are working).
>

There's no easy way to check this. The task page for every task shows the
attempts that ran for each task and where they ran under the 'Machine'
column.


> When i clicked on each link under that Task Id, i can see there's "Input
> Split Locations" stated under each task details, if the inputs are
> splitted between data nodes, does that means that everything is working
> well?
>
>
I think this is just the location of the splits, including the replicas.
What you could see is if enough data local maps ran - which means that the
tasks mostly got their inputs from datanodes running on the same machine as
themselves. This is given by the counter "Data-local map tasks" on the job
UI page.
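For what it's worth, here is a small sketch of how you could turn those
counters into a locality ratio worth watching. The counter names are
assumptions on my part and vary a little between Hadoop versions (there
is also a "Rack-local map tasks" counter on multi-rack clusters):

```python
def locality_ratio(counters):
    # Fraction of launched maps that read their split from a datanode on
    # the same machine (higher is better; 1.0 means fully data-local).
    launched = counters["Launched map tasks"]
    local = counters.get("Data-local map tasks", 0)
    return local / launched if launched else 0.0

# All maps data-local: nothing read over the network
print(locality_ratio({"Launched map tasks": 4, "Data-local map tasks": 4}))  # -> 1.0
# Only 1 of 2 maps data-local: the other pulled its input from another node
print(locality_ratio({"Launched map tasks": 2, "Data-local map tasks": 1}))  # -> 0.5
```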


> I need to make sure i got everything running well because my MR took
> around 6 hours to finish despite the input size is small.. (Well, i know
> hadoop is not meant for small data), I'm not sure whether it's my
> configuration that goes wrong or hadoop is just not suitable for my case.
> I'm actually running a mahout kmeans analysis.
>
> Thank you for your time.
>
>
>
>
>

Re: Understanding of the hadoop distribution system (tuning)

Posted by Jagat Singh <ja...@gmail.com>.
Hello Elaine,

You did not mention your cluster size: the number of nodes, and the cores in each node.

What sort of work are you doing? Six hours for 518 MB of data is a very long time.

The number of map tasks would be roughly 518/64, i.e. one per HDFS block.

So about 8 map tasks need to run to process your data.

Now they can run on a single node or on multiple nodes, depending on the
available slots. Did you check the job tracker page while the job was
executing? On the Running tasks page you can see which node each task is
being processed on.

Regards,

Jagat Singh


On Tue, Sep 11, 2012 at 11:56 AM, Elaine Gan <el...@gmo.jp> wrote:

> Hi,
>
> I'm new to hadoop and i've just played around with map reduce.
> I would like to check if my understanding to hadoop is correct and i
> would appreciate if anyone could correct me if i'm wrong.
>
> I have a data of around 518MB, and i wrote a MR program to process it.
> Here are some of my settings in my mapred-site.xml.
> ---------------------------------------------------------------
> mapred.tasktracker.map.tasks.maximum = 20
> mapred.tasktracker.reduce.tasks.maximum = 20
> ---------------------------------------------------------------
> My block size is default, 64MB
> With my data size = 518MB, i guess setting the maximum for MR task to 20
> is far more than enough (518/64 = 8) , did i get it correctly?
>
> When i run the MR program, i could see in the Map/Reduce Administration
> page that the number of Maps Total = 8, so i assume that everything is
> going well here, once again if i'm wrong please correct me.
> (Sometimes it shows only Maps Total = 3)
>
> There's one thing which i'm uncertain about hadoop distribution.
> Is the Maps Total = 8 means that there are 8 map tasks split among all
> the data nodes (task trackers)?
> Is there anyway i can checked whether all the tasks are shared among
> datanodes (where task trackers are working).
> When i clicked on each link under that Task Id, i can see there's "Input
> Split Locations" stated under each task details, if the inputs are
> splitted between data nodes, does that means that everything is working
> well?
>
> I need to make sure i got everything running well because my MR took
> around 6 hours to finish despite the input size is small.. (Well, i know
> hadoop is not meant for small data), I'm not sure whether it's my
> configuration that goes wrong or hadoop is just not suitable for my case.
> I'm actually running a mahout kmeans analysis.
>
> Thank you for your time.
>
>
>
>
>
