Posted to mapreduce-dev@hadoop.apache.org by Samaneh Shokuhi <sa...@gmail.com> on 2013/04/07 16:08:24 UTC

configuring number of mappers and reducers

Hi All,
I am doing some experiments with the WordCount example on Hadoop.
I have a cluster with 7 nodes. I want to run WordCount with 3 mappers
and 3 reducers and compare the response time against later experiments
in which the number of mappers and reducers is increased to 6, 12, and
so on.
For the first experiment I set the number of mappers and reducers to 3
in the WordCount source code, set the replication factor to 3 in the
Hadoop configuration, and set the maximum number of tasks per node to 1.
But when I run the example on a large input of about 2.5 GB, I see 44
map tasks and 3 reduce tasks running!
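
For reference, this is roughly how the driver sets those counts; a
minimal sketch assuming the classic org.apache.hadoop.mapred API, with
the stock example's Map/Reduce classes left out:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Honored exactly by the framework: the number of reduce tasks.
        conf.setNumReduceTasks(3);

        // Only a hint to the InputFormat: the framework still launches
        // roughly one map task per input split (per HDFS block).
        conf.setNumMapTasks(3);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // conf.setMapperClass(...) / conf.setReducerClass(...) as in
        // the stock WordCount example.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }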

What parameters do I need to set to get (3 mappers, 3 reducers),
(6M, 6R), and (12M, 12R)? As I mentioned, my cluster has 1 namenode
and 6 datanodes.
Is the replication factor related to the number of mappers and reducers?
Regards,
Samaneh

Re: configuring number of mappers and reducers

Posted by Samaneh Shokuhi <sa...@gmail.com>.
Sudhakara, thanks again for the information.

The reason I am focused on response time is that I am going to modify
Hadoop to skip the sort phase in the map task, run a sample such as
WordCount on the modified Hadoop, and compare its performance with that
of unmodified Hadoop. In short, I need to know how the sorting step
affects performance, and whether there are cases where skipping the
sort in the map phase gives better performance.
To run this experiment I need a way to measure performance, and I
wonder whether response time is a suitable metric in this case. Can you
suggest a way to measure performance for this experiment?
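
One option would be to time the job end to end from the driver; a
minimal sketch, assuming the classic org.apache.hadoop.mapred API and a
fully configured JobConf like the one earlier in the thread:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class TimedRun {
      // conf is assumed to be a fully configured JobConf.
      public static long timeJob(JobConf conf) throws Exception {
        long start = System.currentTimeMillis();
        RunningJob job = JobClient.runJob(conf); // blocks until the job finishes
        long elapsedMs = System.currentTimeMillis() - start;
        System.out.println("response time: " + elapsedMs + " ms");

        // The built-in job counters give a per-phase breakdown as well
        // (exact counter names vary across Hadoop versions).
        Counters counters = job.getCounters();
        System.out.println(counters);
        return elapsedMs;
      }
    }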

Samaneh





Re: configuring number of mappers and reducers

Posted by sudhakara st <su...@gmail.com>.
Hi Samaneh,

Increasing the number of reducers for a job will not help as much as
you are expecting. In most MR jobs, more than 60% of the time is spent
in the map phase (it depends on what kind of operations the map and
reduce phases perform on the data).

Increasing the number of reduces increases the framework overhead, but
it also improves load balancing and the utilization of the available
map-reduce slots and system resources. By taking the job's processing
requirements into account, we can tune jobs for the best performance
while lowering the cost of failures.

One thing I do not understand is why you are so worried about response
time. Response time depends on how much data the job processes, what
kind of operations are performed on the data, how the data is
distributed across the cluster, and the capacity of your cluster. An MR
job can be called optimized when it uses a balanced number of mappers
and reducers. For typical MR applications like word count, I suggest a
mapper-to-reducer ratio of about 4:1 if the job runs without a
combiner; for a word-count-like program with a combiner defined, I
would suggest 10:1.

When tuning MR jobs we cannot treat response time as the only parameter
to optimize; there are many other factors to consider, and response
time does not depend only on the number of reducers configured for the
job, but on the numerous other factors mentioned above.
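
To put rough, illustrative numbers on it (assumed figures, not
measurements from your cluster): if the reduce phase accounts for, say,
30% of a job's total runtime T, then doubling the reducers can at best
halve that portion, giving about 0.70T + 0.15T = 0.85T, i.e. at most a
15% improvement. The other 70% (map, shuffle, and framework overhead)
is untouched, which is why doubling the reducers rarely comes close to
halving response time.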






-- 

Regards,
.....  Sudhakara.st

Re: configuring number of mappers and reducers

Posted by Samaneh Shokuhi <sa...@gmail.com>.
Thanks, Sudhakara, for your reply.
I ran my experiments varying the number of reducers, doubling the count
each time. I have a question about the response time. Suppose there are
6 cluster nodes; in the first experiment I have 3 reducers, doubled to
6 in the second experiment and to 12 in the third. What should we
expect to see in the response time? Should it change approximately like
T, T/2, T/4, ...?
The response time I measured does not change like that; the decrease is
more like 2% or 3%. So I want to know how much of a decrease in
response time we should normally expect as the number of reducers is
increased.

Samaneh



Re: configuring number of mappers and reducers

Posted by sudhakara st <su...@gmail.com>.
Hi Samaneh,

You can experiment with:

1. Varying the number of reducers (mapred.reduce.tasks).

Configure these parameters according to your system's capacity:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

Tasktrackers have a fixed number of slots for map tasks and for reduce
tasks. The precise number depends on the number of cores and the amount
of memory on the tasktracker nodes; for example, a quad-core node with
8 GB of memory may be able to run 3 map tasks and 2 reduce tasks
simultaneously (not a precise rule; it depends on the type of job you
are running).

The right number of reduces seems to be 0.95 or 1.75 * (nodes *
mapred.tasktracker.reduce.tasks.maximum). With 0.95, all of the reduces
can launch immediately and start transferring map outputs as the maps
finish. With 1.75, the faster nodes finish their first round of reduces
and launch a second round, doing a much better job of load balancing.

2. These are some of the main job tuning factors in terms of cluster
resource utilization (CPU, memory, I/O, network) and response time; a
combined sketch follows the list.
   A) Map-side sort and shuffle
          io.sort.mb
          io.sort.record.percent
          io.sort.spill.percent
          io.sort.factor
          mapred.reduce.parallel.copies

   B) Compression of mapper and reducer outputs
          mapred.map.output.compression.codec

   C) Enabling/disabling speculative task execution
          mapred.map.tasks.speculative.execution
          mapred.reduce.tasks.speculative.execution

   D) Enabling JVM reuse
          mapred.job.reuse.jvm.num.tasks
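
For your cluster (6 datanodes with 1 task slot per node), the 0.95 rule
suggests about 0.95 * 6 * 1 ≈ 5-6 reducers. Below is a minimal sketch
of setting some of the properties above from the driver, using the old
mapred API; the values are purely illustrative assumptions, not
recommendations for your hardware:

    import org.apache.hadoop.mapred.JobConf;

    public class TuningSketch {
      // Illustrative values only; tune for your own cluster and job.
      public static void applyTuning(JobConf conf) {
        conf.setNumReduceTasks(6);                            // mapred.reduce.tasks
        conf.setInt("io.sort.mb", 200);                       // map-side sort buffer, in MB
        conf.setFloat("io.sort.spill.percent", 0.90f);        // buffer fill level that triggers a spill
        conf.setInt("io.sort.factor", 25);                    // streams merged at once
        conf.setInt("mapred.reduce.parallel.copies", 10);     // shuffle copier threads per reduce
        conf.setBoolean("mapred.compress.map.output", true);  // compress intermediate map output
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.DefaultCodec");
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);    // -1 = unlimited reuse per job
      }
    }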





-- 

Regards,
.....  Sudhakara.st

Re: configuring number of mappers and reducers

Posted by Samaneh Shokuhi <sa...@gmail.com>.
Thanks, Sudhakara, for your reply.
So, since the number of mappers depends on the data size, maybe the
best way to do my experiments is to increase the number of reducers
based on the estimated number of blocks in the data file. What I
actually want to know is how the response time changes as the number of
mappers and reducers changes.
Any idea how to do this kind of experiment?

Samaneh



Re: configuring number of mappers and reducers

Posted by sudhakara st <su...@gmail.com>.
Hi Samaneh,

            The number of map tasks for a given job is driven by the
number of input splits in the input data; in the default configuration,
one map task is spawned per input split (one per block). Your 2.5 GB of
data spans 44 blocks, so your job runs 44 map tasks. At minimum, with
FileInputFormat derivatives, a job will have at least one map per file,
and it can have multiple maps per file when a file extends beyond a
single block (file size greater than block size). The *mapred.map.tasks*
parameter is just a hint to the InputFormat for the number of maps; it
has no effect when the number of blocks in the input data is greater
than the specified value. It is not possible to specify exactly how
many mappers will run for a job, but it is possible to specify the
number of reducers explicitly with the *mapred.reduce.tasks* property.

The replication factor is not related in any way to the number of
mappers and reducers.
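
As a rough, illustrative check (your actual dfs.block.size and file
layout determine the exact count): with the common 64 MB default block
size, 2.5 GB ≈ 2560 MB / 64 MB = 40 full blocks, and input spread
across several files, each contributing at least one split plus a
partial final block, can easily bring that to the 44 map tasks you
observed.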










-- 

Regards,
.....  Sudhakara.st