Posted to common-user@hadoop.apache.org by Sandy <sn...@gmail.com> on 2009/03/04 23:46:04 UTC

wordcount getting slower with more mappers and reducers?

Hello all,

For the sake of benchmarking, I ran the standard hadoop wordcount example on
an input file using 2, 4, and 8 mappers and reducers for my job.
In other words,  I do:

time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
sample.txt output
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
sample.txt output2
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
sample.txt output3

Strangely enough, this increase in mappers and reducers results in
slower running times!
- On 2 mappers and reducers it ran for 40 seconds
- On 4 mappers and reducers it ran for 60 seconds
- On 8 mappers and reducers it ran for 90 seconds!

Please note that the "sample.txt" file is identical in each of these runs.

I have the following questions:
- Shouldn't wordcount get -faster- with additional mappers and reducers,
instead of slower?
- If it does get faster for other people, why does it become slower for me?
  I am running Hadoop in pseudo-distributed mode on a single 64-bit Mac Pro
with 2 quad-core processors, 16 GB of RAM, and 4 1TB HDs.

I would greatly appreciate it if someone could explain this behavior to me,
and tell me if I'm running this wrong. How can I change my settings (if at
all) to get wordcount running faster when I increase the number of maps
and reduces?

Thanks,
-SM

Re: wordcount getting slower with more mappers and reducers?

Posted by Sandy <sn...@gmail.com>.
I specified a directory containing my 428MB file split into 8 files. Same
results.

I should summarize my hadoop-site.xml file:

mapred.tasktracker.tasks.maximum = 4
mapred.line.input.format.linespermap = 1
mapred.task.timeout = 0
mapred.min.split.size = 1
mapred.child.java.opts = -Xmx20000M
io.sort.factor = 200
io.sort.mb = 100
fs.inmemory.size.mb = 200
mapred.inmem.merge.threshold = 1000
dfs.replication = 1
mapred.reduce.parallel.copies = 5

I know the mapred.child.java.opts parameter is a little ridiculous, but I
was just playing around and seeing what could possibly make things faster.
For some reason, that did.
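
All of these live in conf/hadoop-site.xml as the usual <property> entries; as
a rough sketch of the format (using two of the values above):

  <!-- inside <configuration> in conf/hadoop-site.xml -->
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>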

Nick, I'm going to try larger files and get back to you.

-SM

On Thu, Mar 5, 2009 at 10:37 AM, Nick Cen <ce...@gmail.com> wrote:

> Try to split your sample.txt into multi files.  and try it again.
> For text input format , the number of task is equals to the input size.
>
>
> 2009/3/6 Sandy <sn...@gmail.com>
>
> > I used three different sample.txt files, and was able to replicate the
> > error. The first was 1.5MB, the second 66MB, and the last 428MB. I get
> the
> > same problem despite what size of input file I use: the running time of
> > wordcount increases with the number of mappers and reducers specified. If
> > it
> > is the problem of the input file, how big do I have to go before it
> > disappears entirely?
> >
> > If it is psuedo-distributed mode that's the issue, what mode should I be
> > running on my machine, given it's specs? Once again, it is a SINGLE
> MacPro
> > with 16GB of RAM, 4  1TB hard disks, and 2 quad-core processors.
> >
> > I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what
> > seems to be taking the longest:
> > 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
> > 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
> > 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, 1 sec
> >
> > To make sure it's not because of the combiner, I removed it and reran
> > everything again, and got the same bottom-line: With increasing maps and
> > reducers, running time goes up, with majority of time seeming to be in
> > sort/merge.
> >
> > Also, another thing we noticed is that the CPUs seem to be very active
> > during the map phase, but when the map phase reaches 100%, and only
> reduce
> > appears to be running, the CPUs all become idle. Furthermore, despite the
> > number of mappers I specify, all the CPUs become very active when a job
> is
> > running. Why is this so? If I specify 2 mappers and 2 reducers, won't
> there
> > be just 2 or 4 CPUs that should be active? Why are all 8 active?
> >
> > Since I can reproduce this error using Hadoop's standard word count
> > example,
> > I was hoping that someone else could tell me if they can reproduce this
> > too.
> > Is it true that when you increase the number of mappers and reducers on
> > your
> > systems, the running time of wordcount goes up?
> >
> > Thanks for the help! I'm looking forward to your responses.
> >
> > -SM
> >
> > On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
> > amarsri@yahoo-inc.com> wrote:
> >
> > > Are you hitting HADOOP-2771?
> > > -Amareshwari
> > >
> > > Sandy wrote:
> > >
> > >> Hello all,
> > >>
> > >> For the sake of benchmarking, I ran the standard hadoop wordcount
> > example
> > >> on
> > >> an input file using 2, 4, and 8 mappers and reducers for my job.
> > >> In other words,  I do:
> > >>
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> > >> sample.txt output
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> > >> sample.txt output2
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> > >> sample.txt output3
> > >>
> > >> Strangely enough, when this increase in mappers and reducers result in
> > >> slower running times!
> > >> -On 2 mappers and reducers it ran for 40 seconds
> > >> on 4 mappers and reducers it ran for 60 seconds
> > >> on 8 mappers and reducers it ran for 90 seconds!
> > >>
> > >> Please note that the "sample.txt" file is identical in each of these
> > runs.
> > >>
> > >> I have the following questions:
> > >> - Shouldn't wordcount get -faster- with additional mappers and
> reducers,
> > >> instead of slower?
> > >> - If it does get faster for other people, why does it become slower
> for
> > >> me?
> > >>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
> > Pro
> > >> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> > >>
> > >> I would greatly appreciate it if someone could explain this behavior
> to
> > >> me,
> > >> and tell me if I'm running this wrong. How can I change my settings
> (if
> > at
> > >> all) to get wordcount running faster when i increases that number of
> > maps
> > >> and reduces?
> > >>
> > >> Thanks,
> > >> -SM
> > >>
> > >>
> > >>
> > >
> > >
> >
>
>
>
> --
> http://daily.appspot.com/food/
>

Re: wordcount getting slower with more mappers and reducers?

Posted by Nick Cen <ce...@gmail.com>.
Try splitting your sample.txt into multiple files and try it again.
For the text input format, the number of map tasks is determined by the input
size.
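
For example, something along these lines (just a sketch; it assumes sample.txt
is on the local disk, and the 64 MB piece size is arbitrary):

  # split the local file into ~64 MB pieces and load the directory into HDFS
  mkdir sample_parts
  split -b 64m sample.txt sample_parts/part_
  bin/hadoop dfs -put sample_parts sample_parts
  # run wordcount over the whole directory
  time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 sample_parts output_split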


2009/3/6 Sandy <sn...@gmail.com>

> I used three different sample.txt files, and was able to replicate the
> error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the
> same problem despite what size of input file I use: the running time of
> wordcount increases with the number of mappers and reducers specified. If
> it
> is the problem of the input file, how big do I have to go before it
> disappears entirely?
>
> If it is psuedo-distributed mode that's the issue, what mode should I be
> running on my machine, given it's specs? Once again, it is a SINGLE MacPro
> with 16GB of RAM, 4  1TB hard disks, and 2 quad-core processors.
>
> I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what
> seems to be taking the longest:
> 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
> 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
> 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, 1 sec
>
> To make sure it's not because of the combiner, I removed it and reran
> everything again, and got the same bottom-line: With increasing maps and
> reducers, running time goes up, with majority of time seeming to be in
> sort/merge.
>
> Also, another thing we noticed is that the CPUs seem to be very active
> during the map phase, but when the map phase reaches 100%, and only reduce
> appears to be running, the CPUs all become idle. Furthermore, despite the
> number of mappers I specify, all the CPUs become very active when a job is
> running. Why is this so? If I specify 2 mappers and 2 reducers, won't there
> be just 2 or 4 CPUs that should be active? Why are all 8 active?
>
> Since I can reproduce this error using Hadoop's standard word count
> example,
> I was hoping that someone else could tell me if they can reproduce this
> too.
> Is it true that when you increase the number of mappers and reducers on
> your
> systems, the running time of wordcount goes up?
>
> Thanks for the help! I'm looking forward to your responses.
>
> -SM
>
> On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
> amarsri@yahoo-inc.com> wrote:
>
> > Are you hitting HADOOP-2771?
> > -Amareshwari
> >
> > Sandy wrote:
> >
> >> Hello all,
> >>
> >> For the sake of benchmarking, I ran the standard hadoop wordcount
> example
> >> on
> >> an input file using 2, 4, and 8 mappers and reducers for my job.
> >> In other words,  I do:
> >>
> >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> >> sample.txt output
> >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> >> sample.txt output2
> >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> >> sample.txt output3
> >>
> >> Strangely enough, when this increase in mappers and reducers result in
> >> slower running times!
> >> -On 2 mappers and reducers it ran for 40 seconds
> >> on 4 mappers and reducers it ran for 60 seconds
> >> on 8 mappers and reducers it ran for 90 seconds!
> >>
> >> Please note that the "sample.txt" file is identical in each of these
> runs.
> >>
> >> I have the following questions:
> >> - Shouldn't wordcount get -faster- with additional mappers and reducers,
> >> instead of slower?
> >> - If it does get faster for other people, why does it become slower for
> >> me?
> >>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
> Pro
> >> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> >>
> >> I would greatly appreciate it if someone could explain this behavior to
> >> me,
> >> and tell me if I'm running this wrong. How can I change my settings (if
> at
> >> all) to get wordcount running faster when i increases that number of
> maps
> >> and reduces?
> >>
> >> Thanks,
> >> -SM
> >>
> >>
> >>
> >
> >
>



-- 
http://daily.appspot.com/food/

Re: wordcount getting slower with more mappers and reducers?

Posted by Matt Ingenthron <Ma...@Sun.COM>.
Sandy wrote:
> I used three different sample.txt files, and was able to replicate the
> error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the
> same problem despite what size of input file I use: the running time of
> wordcount increases with the number of mappers and reducers specified. If it
> is the problem of the input file, how big do I have to go before it
> disappears entirely?
>   

Keep in mind that as long as the file fits in memory, it's likely coming
straight out of the filesystem cache.  In your kind of system configuration,
a core or two running as fast as possible can saturate a memory controller,
and then there would be contention and no speedup with more mappers.

If you really want a feel for what this would be like, you should 
probably have much more input data.  It will entirely change as soon as 
you have to wait on disk IO.
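
One crude way to get there, as a sketch (the repeat count is arbitrary; 48
copies of the 428 MB file is roughly 20 GB):

  # build a large input by concatenating the existing sample, then load it
  i=0
  while [ $i -lt 48 ]; do
    cat sample.txt >> big_sample.txt
    i=$((i+1))
  done
  bin/hadoop dfs -put big_sample.txt big_sample.txt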

Hope that helps,

- Matt
> If it is psuedo-distributed mode that's the issue, what mode should I be
> running on my machine, given it's specs? Once again, it is a SINGLE MacPro
> with 16GB of RAM, 4  1TB hard disks, and 2 quad-core processors.
>
> I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what
> seems to be taking the longest:
> 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
> 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
> 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, 1 sec
>
> To make sure it's not because of the combiner, I removed it and reran
> everything again, and got the same bottom-line: With increasing maps and
> reducers, running time goes up, with majority of time seeming to be in
> sort/merge.
>
> Also, another thing we noticed is that the CPUs seem to be very active
> during the map phase, but when the map phase reaches 100%, and only reduce
> appears to be running, the CPUs all become idle. Furthermore, despite the
> number of mappers I specify, all the CPUs become very active when a job is
> running. Why is this so? If I specify 2 mappers and 2 reducers, won't there
> be just 2 or 4 CPUs that should be active? Why are all 8 active?
>
> Since I can reproduce this error using Hadoop's standard word count example,
> I was hoping that someone else could tell me if they can reproduce this too.
> Is it true that when you increase the number of mappers and reducers on your
> systems, the running time of wordcount goes up?
>
> Thanks for the help! I'm looking forward to your responses.
>
> -SM
>
> On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
> amarsri@yahoo-inc.com> wrote:
>
>   
>> Are you hitting HADOOP-2771?
>> -Amareshwari
>>
>> Sandy wrote:
>>
>>     
>>> Hello all,
>>>
>>> For the sake of benchmarking, I ran the standard hadoop wordcount example
>>> on
>>> an input file using 2, 4, and 8 mappers and reducers for my job.
>>> In other words,  I do:
>>>
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
>>> sample.txt output
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
>>> sample.txt output2
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
>>> sample.txt output3
>>>
>>> Strangely enough, when this increase in mappers and reducers result in
>>> slower running times!
>>> -On 2 mappers and reducers it ran for 40 seconds
>>> on 4 mappers and reducers it ran for 60 seconds
>>> on 8 mappers and reducers it ran for 90 seconds!
>>>
>>> Please note that the "sample.txt" file is identical in each of these runs.
>>>
>>> I have the following questions:
>>> - Shouldn't wordcount get -faster- with additional mappers and reducers,
>>> instead of slower?
>>> - If it does get faster for other people, why does it become slower for
>>> me?
>>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
>>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>>>
>>> I would greatly appreciate it if someone could explain this behavior to
>>> me,
>>> and tell me if I'm running this wrong. How can I change my settings (if at
>>> all) to get wordcount running faster when i increases that number of maps
>>> and reduces?
>>>
>>> Thanks,
>>> -SM
>>>
>>>
>>>
>>>       
>>     
>
>   


Re: wordcount getting slower with more mappers and reducers?

Posted by Sandy <sn...@gmail.com>.
I used three different sample.txt files, and was able to replicate the
error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the
same problem regardless of what size of input file I use: the running time of
wordcount increases with the number of mappers and reducers specified. If the
problem is the input file, how big do I have to go before it disappears
entirely?

If it is pseudo-distributed mode that's the issue, what mode should I be
running on my machine, given its specs? Once again, it is a SINGLE Mac Pro
with 16GB of RAM, 4 1TB hard disks, and 2 quad-core processors.

I'm not sure if it's HADOOP-2771, since the sort/merge (shuffle) is what
seems to be taking the longest:
2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, reduce: 1 sec

To make sure it's not because of the combiner, I removed it and reran
everything again, and got the same bottom line: with increasing maps and
reducers, running time goes up, with the majority of the time seeming to be
in sort/merge.

Another thing we noticed is that the CPUs seem to be very active during the
map phase, but when the map phase reaches 100% and only reduce appears to be
running, the CPUs all become idle. Furthermore, regardless of the number of
mappers I specify, all the CPUs become very active when a job is running. Why
is this so? If I specify 2 mappers and 2 reducers, shouldn't just 2 or 4 CPUs
be active? Why are all 8 active?

Since I can reproduce this error using Hadoop's standard word count example,
I was hoping that someone else could tell me if they can reproduce this too.
Is it true that when you increase the number of mappers and reducers on your
systems, the running time of wordcount goes up?

Thanks for the help! I'm looking forward to your responses.

-SM

On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
amarsri@yahoo-inc.com> wrote:

> Are you hitting HADOOP-2771?
> -Amareshwari
>
> Sandy wrote:
>
>> Hello all,
>>
>> For the sake of benchmarking, I ran the standard hadoop wordcount example
>> on
>> an input file using 2, 4, and 8 mappers and reducers for my job.
>> In other words,  I do:
>>
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
>> sample.txt output
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
>> sample.txt output2
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
>> sample.txt output3
>>
>> Strangely enough, when this increase in mappers and reducers result in
>> slower running times!
>> -On 2 mappers and reducers it ran for 40 seconds
>> on 4 mappers and reducers it ran for 60 seconds
>> on 8 mappers and reducers it ran for 90 seconds!
>>
>> Please note that the "sample.txt" file is identical in each of these runs.
>>
>> I have the following questions:
>> - Shouldn't wordcount get -faster- with additional mappers and reducers,
>> instead of slower?
>> - If it does get faster for other people, why does it become slower for
>> me?
>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>>
>> I would greatly appreciate it if someone could explain this behavior to
>> me,
>> and tell me if I'm running this wrong. How can I change my settings (if at
>> all) to get wordcount running faster when i increases that number of maps
>> and reduces?
>>
>> Thanks,
>> -SM
>>
>>
>>
>
>

Re: wordcount getting slower with more mappers and reducers?

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Are you hitting HADOOP-2771?
-Amareshwari
Sandy wrote:
> Hello all,
>
> For the sake of benchmarking, I ran the standard hadoop wordcount example on
> an input file using 2, 4, and 8 mappers and reducers for my job.
> In other words,  I do:
>
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> sample.txt output
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> sample.txt output2
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> sample.txt output3
>
> Strangely enough, when this increase in mappers and reducers result in
> slower running times!
> -On 2 mappers and reducers it ran for 40 seconds
> on 4 mappers and reducers it ran for 60 seconds
> on 8 mappers and reducers it ran for 90 seconds!
>
> Please note that the "sample.txt" file is identical in each of these runs.
>
> I have the following questions:
> - Shouldn't wordcount get -faster- with additional mappers and reducers,
> instead of slower?
> - If it does get faster for other people, why does it become slower for me?
>   I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>
> I would greatly appreciate it if someone could explain this behavior to me,
> and tell me if I'm running this wrong. How can I change my settings (if at
> all) to get wordcount running faster when i increases that number of maps
> and reduces?
>
> Thanks,
> -SM
>
>   


Re: wordcount getting slower with more mappers and reducers?

Posted by Nick Cen <ce...@gmail.com>.
I think this may not be related to whether you are using pseudo-distributed
mode or truly distributed mode.

The speed is related not only to the number of mappers and reducers, but also
to the problem size and problem type.

A simple example is word count: assume we have only 1 line in sample.txt.
Then 1 mapper will be enough; additional mappers contribute nothing to the
overall process but still cost time to create and destroy.


2009/3/5 haizhou zhao <ra...@gmail.com>

> Since you are running hadoop on psuedo-distributed mode, it is possible
> that
> just 1 reduce task will bing better performance, and this will depend on
> your input's size and content.
> 2009/3/5 Sandy <sn...@gmail.com>
>
> > Hello all,
> >
> > For the sake of benchmarking, I ran the standard hadoop wordcount example
> > on
> > an input file using 2, 4, and 8 mappers and reducers for my job.
> > In other words,  I do:
> >
> > time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> > sample.txt output
> > time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> > sample.txt output2
> > time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> > sample.txt output3
> >
> > Strangely enough, when this increase in mappers and reducers result in
> > slower running times!
> > -On 2 mappers and reducers it ran for 40 seconds
> > on 4 mappers and reducers it ran for 60 seconds
> > on 8 mappers and reducers it ran for 90 seconds!
> >
> > Please note that the "sample.txt" file is identical in each of these
> runs.
> >
> > I have the following questions:
> > - Shouldn't wordcount get -faster- with additional mappers and reducers,
> > instead of slower?
> > - If it does get faster for other people, why does it become slower for
> me?
> >  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
> Pro
> > with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> >
> > I would greatly appreciate it if someone could explain this behavior to
> me,
> > and tell me if I'm running this wrong. How can I change my settings (if
> at
> > all) to get wordcount running faster when i increases that number of maps
> > and reduces?
> >
> > Thanks,
> > -SM
> >
>



-- 
http://daily.appspot.com/food/

Re: wordcount getting slower with more mappers and reducers?

Posted by haizhou zhao <ra...@gmail.com>.
Since you are running Hadoop in pseudo-distributed mode, it is possible that
just 1 reduce task will bring better performance; this will depend on your
input's size and content.
2009/3/5 Sandy <sn...@gmail.com>

> Hello all,
>
> For the sake of benchmarking, I ran the standard hadoop wordcount example
> on
> an input file using 2, 4, and 8 mappers and reducers for my job.
> In other words,  I do:
>
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> sample.txt output
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> sample.txt output2
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> sample.txt output3
>
> Strangely enough, when this increase in mappers and reducers result in
> slower running times!
> -On 2 mappers and reducers it ran for 40 seconds
> on 4 mappers and reducers it ran for 60 seconds
> on 8 mappers and reducers it ran for 90 seconds!
>
> Please note that the "sample.txt" file is identical in each of these runs.
>
> I have the following questions:
> - Shouldn't wordcount get -faster- with additional mappers and reducers,
> instead of slower?
> - If it does get faster for other people, why does it become slower for me?
>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>
> I would greatly appreciate it if someone could explain this behavior to me,
> and tell me if I'm running this wrong. How can I change my settings (if at
> all) to get wordcount running faster when i increases that number of maps
> and reduces?
>
> Thanks,
> -SM
>

Re: wordcount getting slower with more mappers and reducers?

Posted by Jim Twensky <ji...@gmail.com>.
Sandy,

Correct me if I'm wrong, but if you have only two cores and you are running
your jobs in pseudo-distributed mode, what is the point of having more than
2 mappers/reducers? Any number larger than 2 would make the mapper/reducer
threads serialize. That serialization would certainly be an overhead and
would increase the running times. Do you get a performance increase if you run

map1 reduce1
map1 reduce2

and

map1 reduce2
map2 reduce2

in those particular orders?
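
In other words, the same invocation as before with just the -m/-r pair
varied; the output directory names here are only illustrative:

  time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 1 -r 1 sample.txt out_m1r1
  time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 1 -r 2 sample.txt out_m1r2
  time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 sample.txt out_m2r2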

-jim



On Fri, Mar 6, 2009 at 1:34 PM, Sandy <sn...@gmail.com> wrote:

> I'm trying to make sense of the results, but running it like this working
> at
> least a little better.
>
> map4  reduce1
> map4 reduce2
> map4 reduce4
> map4 reduce8
>
> I tried keeping the reduces constant, while varying the maps.. this results
> in an increase of running time.
>
> When I tried keeping the maps constant, and varying the reduces, I got
> something better, though when it hit something like map4 reduce4, the
> running time shoots up, even though previously it had been decreasing.
>
> This has been very helpful... though I am very curious: Is the reason one
> worked better than the other a function of the input only? Or what about
> pseudo-distibuted mode makes one way work better than the other?
>
> Thanks again!
>
> -SM
>
>
>
> On Thu, Mar 5, 2009 at 9:04 PM, haizhou zhao <ra...@gmail.com> wrote:
>
> > As I metioned above, you should at least try like this:
> > map2 reduce1
> > map4 reduce1
> > map8 reduce1
> >
> > map4 reduce1
> > map4 reduce2
> > map4 reduce4
> >
> > instead of :
> > map2 reduce2
> > map4 reduce4
> > map8 reduce8
> >
> > 2009/3/6 Sandy <sn...@gmail.com>
> >
> > > I was trying to control the maximum number of tasks per tasktracker by
> > > using
> > > the
> > > mapred.tasktracker.tasks.maximum parameter
> > >
> > > I am interpreting your comment to mean that maybe this parameter is
> > > malformed and should read:
> > > mapred.tasktracker.map.tasks.maximum = 8
> > > mapred.tasktracker.map.tasks.maximum = 8
> > >
> > > I did that, and reran on a 428MB input, and got the same results as
> > before.
> > > I also ran it on a 3.3G dataset, and got the same pattern.
> > >
> > > I am still trying to run it on a 20 GB input. This should confirm if
> the
> > > filesystem cache thing is true.
> > >
> > > -SM
> > >
> > > On Thu, Mar 5, 2009 at 12:22 PM, Sandy <sn...@gmail.com>
> > wrote:
> > >
> > > > Arun,
> > > >
> > > > How can I check the number of slots per tasktracker? Which parameter
> > > > controls that?
> > > >
> > > > Thanks,
> > > > -SM
> > > >
> > > >
> > > > On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy <ac...@yahoo-inc.com>
> > > wrote:
> > > >
> > > >> I assume you have only 2 map and 2 reduce slots per tasktracker -
> > which
> > > >> totals to 2 maps/reduces for you cluster. This means with more
> > > maps/reduces
> > > >> they are serialized to 2 at a time.
> > > >>
> > > >> Also, the -m is only a hint to the JobTracker, you might see
> less/more
> > > >> than the number of maps you have specified on the command line.
> > > >> The -r however is followed faithfully.
> > > >>
> > > >> Arun
> > > >>
> > > >>
> > > >> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
> > > >>
> > > >>  Hello all,
> > > >>>
> > > >>> For the sake of benchmarking, I ran the standard hadoop wordcount
> > > example
> > > >>> on
> > > >>> an input file using 2, 4, and 8 mappers and reducers for my job.
> > > >>> In other words,  I do:
> > > >>>
> > > >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r
> 2
> > > >>> sample.txt output
> > > >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r
> 4
> > > >>> sample.txt output2
> > > >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r
> 8
> > > >>> sample.txt output3
> > > >>>
> > > >>> Strangely enough, when this increase in mappers and reducers result
> > in
> > > >>> slower running times!
> > > >>> -On 2 mappers and reducers it ran for 40 seconds
> > > >>> on 4 mappers and reducers it ran for 60 seconds
> > > >>> on 8 mappers and reducers it ran for 90 seconds!
> > > >>>
> > > >>> Please note that the "sample.txt" file is identical in each of
> these
> > > >>> runs.
> > > >>>
> > > >>> I have the following questions:
> > > >>> - Shouldn't wordcount get -faster- with additional mappers and
> > > reducers,
> > > >>> instead of slower?
> > > >>> - If it does get faster for other people, why does it become slower
> > for
> > > >>> me?
> > > >>>  I am running hadoop on psuedo-distributed mode on a single 64-bit
> > Mac
> > > >>> Pro
> > > >>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> > > >>>
> > > >>> I would greatly appreciate it if someone could explain this
> behavior
> > to
> > > >>> me,
> > > >>> and tell me if I'm running this wrong. How can I change my settings
> > (if
> > > >>> at
> > > >>> all) to get wordcount running faster when i increases that number
> of
> > > maps
> > > >>> and reduces?
> > > >>>
> > > >>> Thanks,
> > > >>> -SM
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: wordcount getting slower with more mappers and reducers?

Posted by Sandy <sn...@gmail.com>.
I'm trying to make sense of the results, but running it like this is working
at least a little better:

map4 reduce1
map4 reduce2
map4 reduce4
map4 reduce8

I tried keeping the reduces constant while varying the maps; this results
in an increase in running time.

When I tried keeping the maps constant and varying the reduces, I got
something better, though when it hit something like map4 reduce4, the
running time shot up, even though previously it had been decreasing.

This has been very helpful... though I am very curious: is the reason one
worked better than the other a function of the input only? Or is it something
about pseudo-distributed mode that makes one way work better than the other?

Thanks again!

-SM



On Thu, Mar 5, 2009 at 9:04 PM, haizhou zhao <ra...@gmail.com> wrote:

> As I metioned above, you should at least try like this:
> map2 reduce1
> map4 reduce1
> map8 reduce1
>
> map4 reduce1
> map4 reduce2
> map4 reduce4
>
> instead of :
> map2 reduce2
> map4 reduce4
> map8 reduce8
>
> 2009/3/6 Sandy <sn...@gmail.com>
>
> > I was trying to control the maximum number of tasks per tasktracker by
> > using
> > the
> > mapred.tasktracker.tasks.maximum parameter
> >
> > I am interpreting your comment to mean that maybe this parameter is
> > malformed and should read:
> > mapred.tasktracker.map.tasks.maximum = 8
> > mapred.tasktracker.map.tasks.maximum = 8
> >
> > I did that, and reran on a 428MB input, and got the same results as
> before.
> > I also ran it on a 3.3G dataset, and got the same pattern.
> >
> > I am still trying to run it on a 20 GB input. This should confirm if the
> > filesystem cache thing is true.
> >
> > -SM
> >
> > On Thu, Mar 5, 2009 at 12:22 PM, Sandy <sn...@gmail.com>
> wrote:
> >
> > > Arun,
> > >
> > > How can I check the number of slots per tasktracker? Which parameter
> > > controls that?
> > >
> > > Thanks,
> > > -SM
> > >
> > >
> > > On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy <ac...@yahoo-inc.com>
> > wrote:
> > >
> > >> I assume you have only 2 map and 2 reduce slots per tasktracker -
> which
> > >> totals to 2 maps/reduces for you cluster. This means with more
> > maps/reduces
> > >> they are serialized to 2 at a time.
> > >>
> > >> Also, the -m is only a hint to the JobTracker, you might see less/more
> > >> than the number of maps you have specified on the command line.
> > >> The -r however is followed faithfully.
> > >>
> > >> Arun
> > >>
> > >>
> > >> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
> > >>
> > >>  Hello all,
> > >>>
> > >>> For the sake of benchmarking, I ran the standard hadoop wordcount
> > example
> > >>> on
> > >>> an input file using 2, 4, and 8 mappers and reducers for my job.
> > >>> In other words,  I do:
> > >>>
> > >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> > >>> sample.txt output
> > >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> > >>> sample.txt output2
> > >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> > >>> sample.txt output3
> > >>>
> > >>> Strangely enough, when this increase in mappers and reducers result
> in
> > >>> slower running times!
> > >>> -On 2 mappers and reducers it ran for 40 seconds
> > >>> on 4 mappers and reducers it ran for 60 seconds
> > >>> on 8 mappers and reducers it ran for 90 seconds!
> > >>>
> > >>> Please note that the "sample.txt" file is identical in each of these
> > >>> runs.
> > >>>
> > >>> I have the following questions:
> > >>> - Shouldn't wordcount get -faster- with additional mappers and
> > reducers,
> > >>> instead of slower?
> > >>> - If it does get faster for other people, why does it become slower
> for
> > >>> me?
> > >>>  I am running hadoop on psuedo-distributed mode on a single 64-bit
> Mac
> > >>> Pro
> > >>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> > >>>
> > >>> I would greatly appreciate it if someone could explain this behavior
> to
> > >>> me,
> > >>> and tell me if I'm running this wrong. How can I change my settings
> (if
> > >>> at
> > >>> all) to get wordcount running faster when i increases that number of
> > maps
> > >>> and reduces?
> > >>>
> > >>> Thanks,
> > >>> -SM
> > >>>
> > >>
> > >>
> > >
> >
>

Re: wordcount getting slower with more mappers and reducers?

Posted by haizhou zhao <ra...@gmail.com>.
As I mentioned above, you should at least try it like this:
map2 reduce1
map4 reduce1
map8 reduce1

map4 reduce1
map4 reduce2
map4 reduce4

instead of :
map2 reduce2
map4 reduce4
map8 reduce8

2009/3/6 Sandy <sn...@gmail.com>

> I was trying to control the maximum number of tasks per tasktracker by
> using
> the
> mapred.tasktracker.tasks.maximum parameter
>
> I am interpreting your comment to mean that maybe this parameter is
> malformed and should read:
> mapred.tasktracker.map.tasks.maximum = 8
> mapred.tasktracker.map.tasks.maximum = 8
>
> I did that, and reran on a 428MB input, and got the same results as before.
> I also ran it on a 3.3G dataset, and got the same pattern.
>
> I am still trying to run it on a 20 GB input. This should confirm if the
> filesystem cache thing is true.
>
> -SM
>
> On Thu, Mar 5, 2009 at 12:22 PM, Sandy <sn...@gmail.com> wrote:
>
> > Arun,
> >
> > How can I check the number of slots per tasktracker? Which parameter
> > controls that?
> >
> > Thanks,
> > -SM
> >
> >
> > On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy <ac...@yahoo-inc.com>
> wrote:
> >
> >> I assume you have only 2 map and 2 reduce slots per tasktracker - which
> >> totals to 2 maps/reduces for you cluster. This means with more
> maps/reduces
> >> they are serialized to 2 at a time.
> >>
> >> Also, the -m is only a hint to the JobTracker, you might see less/more
> >> than the number of maps you have specified on the command line.
> >> The -r however is followed faithfully.
> >>
> >> Arun
> >>
> >>
> >> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
> >>
> >>  Hello all,
> >>>
> >>> For the sake of benchmarking, I ran the standard hadoop wordcount
> example
> >>> on
> >>> an input file using 2, 4, and 8 mappers and reducers for my job.
> >>> In other words,  I do:
> >>>
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> >>> sample.txt output
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> >>> sample.txt output2
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> >>> sample.txt output3
> >>>
> >>> Strangely enough, when this increase in mappers and reducers result in
> >>> slower running times!
> >>> -On 2 mappers and reducers it ran for 40 seconds
> >>> on 4 mappers and reducers it ran for 60 seconds
> >>> on 8 mappers and reducers it ran for 90 seconds!
> >>>
> >>> Please note that the "sample.txt" file is identical in each of these
> >>> runs.
> >>>
> >>> I have the following questions:
> >>> - Shouldn't wordcount get -faster- with additional mappers and
> reducers,
> >>> instead of slower?
> >>> - If it does get faster for other people, why does it become slower for
> >>> me?
> >>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
> >>> Pro
> >>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> >>>
> >>> I would greatly appreciate it if someone could explain this behavior to
> >>> me,
> >>> and tell me if I'm running this wrong. How can I change my settings (if
> >>> at
> >>> all) to get wordcount running faster when i increases that number of
> maps
> >>> and reduces?
> >>>
> >>> Thanks,
> >>> -SM
> >>>
> >>
> >>
> >
>

Re: wordcount getting slower with more mappers and reducers?

Posted by Sandy <sn...@gmail.com>.
I was trying to control the maximum number of tasks per tasktracker by using
the mapred.tasktracker.tasks.maximum parameter.

I am interpreting your comment to mean that maybe this parameter is
malformed and should read:
mapred.tasktracker.map.tasks.maximum = 8
mapred.tasktracker.reduce.tasks.maximum = 8

I did that, and reran on a 428MB input, and got the same results as before.
I also ran it on a 3.3G dataset, and got the same pattern.

I am still trying to run it on a 20 GB input. This should confirm if the
filesystem cache thing is true.

-SM

On Thu, Mar 5, 2009 at 12:22 PM, Sandy <sn...@gmail.com> wrote:

> Arun,
>
> How can I check the number of slots per tasktracker? Which parameter
> controls that?
>
> Thanks,
> -SM
>
>
> On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>
>> I assume you have only 2 map and 2 reduce slots per tasktracker - which
>> totals to 2 maps/reduces for you cluster. This means with more maps/reduces
>> they are serialized to 2 at a time.
>>
>> Also, the -m is only a hint to the JobTracker, you might see less/more
>> than the number of maps you have specified on the command line.
>> The -r however is followed faithfully.
>>
>> Arun
>>
>>
>> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
>>
>>  Hello all,
>>>
>>> For the sake of benchmarking, I ran the standard hadoop wordcount example
>>> on
>>> an input file using 2, 4, and 8 mappers and reducers for my job.
>>> In other words,  I do:
>>>
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
>>> sample.txt output
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
>>> sample.txt output2
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
>>> sample.txt output3
>>>
>>> Strangely enough, when this increase in mappers and reducers result in
>>> slower running times!
>>> -On 2 mappers and reducers it ran for 40 seconds
>>> on 4 mappers and reducers it ran for 60 seconds
>>> on 8 mappers and reducers it ran for 90 seconds!
>>>
>>> Please note that the "sample.txt" file is identical in each of these
>>> runs.
>>>
>>> I have the following questions:
>>> - Shouldn't wordcount get -faster- with additional mappers and reducers,
>>> instead of slower?
>>> - If it does get faster for other people, why does it become slower for
>>> me?
>>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
>>> Pro
>>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>>>
>>> I would greatly appreciate it if someone could explain this behavior to
>>> me,
>>> and tell me if I'm running this wrong. How can I change my settings (if
>>> at
>>> all) to get wordcount running faster when i increases that number of maps
>>> and reduces?
>>>
>>> Thanks,
>>> -SM
>>>
>>
>>
>

Re: wordcount getting slower with more mappers and reducers?

Posted by Sandy <sn...@gmail.com>.
Arun,

How can I check the number of slots per tasktracker? Which parameter
controls that?

Thanks,
-SM

On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:

> I assume you have only 2 map and 2 reduce slots per tasktracker - which
> totals to 2 maps/reduces for you cluster. This means with more maps/reduces
> they are serialized to 2 at a time.
>
> Also, the -m is only a hint to the JobTracker, you might see less/more than
> the number of maps you have specified on the command line.
> The -r however is followed faithfully.
>
> Arun
>
>
> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
>
>  Hello all,
>>
>> For the sake of benchmarking, I ran the standard hadoop wordcount example
>> on
>> an input file using 2, 4, and 8 mappers and reducers for my job.
>> In other words,  I do:
>>
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
>> sample.txt output
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
>> sample.txt output2
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
>> sample.txt output3
>>
>> Strangely enough, when this increase in mappers and reducers result in
>> slower running times!
>> -On 2 mappers and reducers it ran for 40 seconds
>> on 4 mappers and reducers it ran for 60 seconds
>> on 8 mappers and reducers it ran for 90 seconds!
>>
>> Please note that the "sample.txt" file is identical in each of these runs.
>>
>> I have the following questions:
>> - Shouldn't wordcount get -faster- with additional mappers and reducers,
>> instead of slower?
>> - If it does get faster for other people, why does it become slower for
>> me?
>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>>
>> I would greatly appreciate it if someone could explain this behavior to
>> me,
>> and tell me if I'm running this wrong. How can I change my settings (if at
>> all) to get wordcount running faster when i increases that number of maps
>> and reduces?
>>
>> Thanks,
>> -SM
>>
>
>

Re: wordcount getting slower with more mappers and reducers?

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
I assume you have only 2 map and 2 reduce slots per tasktracker, which
totals to 2 maps/reduces for your cluster. This means that with more
maps/reduces they are serialized to 2 at a time.

Also, the -m is only a hint to the JobTracker; you might see fewer or more
maps than the number you have specified on the command line.
The -r, however, is followed faithfully.
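
If that is what's happening, the per-tasktracker limits can be raised in
conf/hadoop-site.xml. A minimal sketch (property names as in 0.18; pick the
values to match your hardware):

  <!-- inside <configuration>: raise the per-tasktracker task slots -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
  </property>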

Arun

On Mar 4, 2009, at 2:46 PM, Sandy wrote:

> Hello all,
>
> For the sake of benchmarking, I ran the standard hadoop wordcount  
> example on
> an input file using 2, 4, and 8 mappers and reducers for my job.
> In other words,  I do:
>
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> sample.txt output
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> sample.txt output2
> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> sample.txt output3
>
> Strangely enough, when this increase in mappers and reducers result in
> slower running times!
> -On 2 mappers and reducers it ran for 40 seconds
> on 4 mappers and reducers it ran for 60 seconds
> on 8 mappers and reducers it ran for 90 seconds!
>
> Please note that the "sample.txt" file is identical in each of these  
> runs.
>
> I have the following questions:
> - Shouldn't wordcount get -faster- with additional mappers and  
> reducers,
> instead of slower?
> - If it does get faster for other people, why does it become slower  
> for me?
>  I am running hadoop on psuedo-distributed mode on a single 64-bit  
> Mac Pro
> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>
> I would greatly appreciate it if someone could explain this behavior  
> to me,
> and tell me if I'm running this wrong. How can I change my settings  
> (if at
> all) to get wordcount running faster when i increases that number of  
> maps
> and reduces?
>
> Thanks,
> -SM