Posted to common-dev@hadoop.apache.org by 牛兆捷 <nz...@gmail.com> on 2013/09/06 17:27:47 UTC

hadoop1.2.1 speedup model

Hi all:

I varied the number of computational nodes in the cluster and got the speedup
results in the attachment.

As I understand it, there are three types of speedup models: linear, sub-linear, and
super-linear. However, the curve of my result seems a little strange. I have
attached it.
[image: inline image 2]
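
For reference, the speedup here can be computed relative to the smallest cluster
measured and compared against the ideal linear curve; a minimal sketch (the job
times below are made-up placeholders, not the measured results):

    # Sketch: classify speedup relative to the smallest cluster size (2 nodes).
    # The timings are hypothetical placeholders, NOT the measured numbers.
    job_time = {2: 1000.0, 4: 620.0, 8: 260.0}  # nodes -> job time in seconds

    base_nodes = min(job_time)
    base_time = job_time[base_nodes]

    for nodes in sorted(job_time):
        speedup = base_time / job_time[nodes]
        ideal = nodes / base_nodes  # linear speedup w.r.t. the 2-node baseline
        if abs(speedup - ideal) < 0.05 * ideal:
            kind = "roughly linear"
        elif speedup < ideal:
            kind = "sub-linear"
        else:
            kind = "super-linear"
        print(f"{nodes} nodes: speedup {speedup:.2f} (ideal {ideal:.2f}) -> {kind}")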

This is the sort program in example.jar; it only uses the default
map-reduce mechanism of Hadoop.

I use hadoop-1.2.1 with 8 map slots and 8 reduce slots per node (12 CPUs, 20 GB
memory); io.sort.mb = 512, block size = 512 MB, heap size = 1024 MB,
reduce.slowstart = 0.05, and the others are defaults.

Input data: 20 GB, divided into 64 files

Sort example: 64 map tasks, 64 reduce tasks

Computational nodes: varying from 2 to 9
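
For reference, a sketch of the Hadoop 1.x configuration properties these settings
correspond to (the file placement and the exact dfs.block.size value are assumptions
based on the description above):

    <!-- mapred-site.xml (sketch based on the settings described above) -->
    <property><name>mapred.tasktracker.map.tasks.maximum</name><value>8</value></property>
    <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>8</value></property>
    <property><name>io.sort.mb</name><value>512</value></property>
    <property><name>mapred.child.java.opts</name><value>-Xmx1024m</value></property>
    <property><name>mapred.reduce.slowstart.completed.maps</name><value>0.05</value></property>

    <!-- hdfs-site.xml: 512 MB block size, assuming it was set in bytes -->
    <property><name>dfs.block.size</name><value>536870912</value></property>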

Why does the speedup behave like this? How can I model it properly?

Thanks~

-- 
Sincerely,
Zhaojie

Re: hadoop1.2.1 speedup model

Posted by 牛兆捷 <nz...@gmail.com>.
From 2 to 4 nodes the performance increases sub-linearly; however, from 4 to 8 it
seems super-linear.

Is it caused by some disk contention bottleneck?


2013/9/6 牛兆捷 <nz...@gmail.com>

> Hi all:
>
> I vary the computational nodes of cluster and get the speedup result in
> attachment.
>
> In my mind, there are three type of speedup model: linear, sub-linear and
> super-linear. However the curve of my result seems a little strange. I have
> attached it.
> [image: 内嵌图片 2]
>
> This is sort in example.jar, actually it is done only using the default
> map-reduce mechanism of Hadoop.
>
> I use hadoop-1.2.1, set 8 map slots and 8 reduce slots per node(12 cpu,
> 20g men)
>  io.sort.mb = 512, block size = 512mb, heap size = 1024mb,
>  reduce.slowstart = 0.05, the others are default.
>
> Input data: 20g, I divide it to 64 files
>
> Sort example: 64 map tasks, 64 reduce tasks
>
> Computational nodes: varying from 2 to 9
>
> Why the speedup mechanism is like this? How can I model it properly?
>
> Thanks~
>
> --
> *Sincerely,*
> *Zhaojie*
> *
> *
>



-- 
Sincerely,
Zhaojie

Re: hadoop1.2.1 speedup model

Posted by 牛兆捷 <nz...@gmail.com>.
Thanks, Bobby.

I will run it more times.

Are there any more fine-grained profiling tools for each task? For example,
CPU utilization and disk and network I/O per task.






2013/9/9 Robert Evans <ev...@yahoo-inc.com>

> How many times did you run the experiment at each setting?  What is the
> standard deviation for each of these settings.  It could be that you are
> simply running into the error bounds of Hadoop.  Hadoop is far from
> consistent in it's performance.  For our benchmarking we typically will
> run the test 5 times, throw out the top and bottom result, as possibly
> outliers and then average the other runs.  Even with that we have to be
> very careful that we weed out bad nodes or the numbers are useless for
> comparison.  The other thing to look at is where was all of the time spent
> for each of these settings.  The map portion should be very close to
> linear with the number of tasks, assuming that there is no disk or network
> contention.  The shuffle is far from linear as the number of fetches is a
> function of the number of maps and the number of reducers.  The reduce
> phase itself should be close to linear assuming that there isn't much skew
> to your data.
>
> --Bobby
>
> On 9/7/13 3:33 AM, "牛兆捷" <nz...@gmail.com> wrote:
>
> >But I still want to fine the most efficient assignment and scale both data
> >and nodes as you said, for example in my result, 2 is the best, and 8 is
> >better than 4.
> >
> >Why is it sub-linear from 2 to 4, super-linear from 4 to 8. I find it is
> >hard to model this result. Can you give me some hint about this kind of
> >trend?
> >
> >
> >2013/9/7 Vinod Kumar Vavilapalli <vi...@hortonworks.com>
> >
> >>
> >> Clearly your input size isn't changing. And depending on how they are
> >> distributed on the nodes, there could be Datanode/disks contention.
> >>
> >> The better way to model this is by scaling the input data also linearly.
> >> More nodes should process more data in the same amount of time.
> >>
> >> Thanks,
> >> +Vinod
> >>
> >> On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:
> >>
> >> > Hi all:
> >> >
> >> > I vary the computational nodes of cluster and get the speedup result
> >>in
> >> attachment.
> >> >
> >> > In my mind, there are three type of speedup model: linear, sub-linear
> >> and super-linear. However the curve of my result seems a little
> >>strange. I
> >> have attached it.
> >> > <speedup.png>
> >> >
> >> > This is sort in example.jar, actually it is done only using the
> >>default
> >> map-reduce mechanism of Hadoop.
> >> >
> >> > I use hadoop-1.2.1, set 8 map slots and 8 reduce slots per node(12
> >>cpu,
> >> 20g men)
> >> >  io.sort.mb = 512, block size = 512mb, heap size = 1024mb,
> >>  reduce.slowstart = 0.05, the others are default.
> >> >
> >> > Input data: 20g, I divide it to 64 files
> >> >
> >> > Sort example: 64 map tasks, 64 reduce tasks
> >> >
> >> > Computational nodes: varying from 2 to 9
> >> >
> >> > Why the speedup mechanism is like this? How can I model it properly?
> >> >
> >> > Thanks~
> >> >
> >> > --
> >> > Sincerely,
> >> > Zhaojie
> >> >
> >>
> >>
> >>
> >
> >
> >
> >--
> >*Sincerely,*
> >*Zhaojie*
> >*
> >*
>
>


-- 
Sincerely,
Zhaojie

Re: hadoop1.2.1 speedup model

Posted by Robert Evans <ev...@yahoo-inc.com>.
How many times did you run the experiment at each setting, and what is the
standard deviation for each of those settings? It could be that you are
simply running into the error bounds of Hadoop. Hadoop is far from
consistent in its performance. For our benchmarking we typically
run the test 5 times, throw out the top and bottom results as possible
outliers, and average the other runs. Even with that, we have to be
very careful to weed out bad nodes, or the numbers are useless for
comparison. The other thing to look at is where all of the time was spent
for each of these settings. The map portion should be very close to
linear in the number of tasks, assuming there is no disk or network
contention. The shuffle is far from linear, since the number of fetches is a
function of the number of maps and the number of reducers. The reduce
phase itself should be close to linear, assuming there isn't much skew
in your data.
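
A minimal sketch of that aggregation (the run times are hypothetical placeholders):

    import statistics

    # Five hypothetical runs of the same job at one cluster size, in seconds.
    runs = [612.0, 598.0, 731.0, 605.0, 587.0]

    # Drop the best and worst runs as possible outliers, average the rest,
    # and keep the standard deviation across all runs as an error bound.
    trimmed = sorted(runs)[1:-1]
    mean = statistics.mean(trimmed)
    stdev = statistics.stdev(runs)

    print(f"trimmed mean = {mean:.1f} s, stdev = {stdev:.1f} s")

    # Note on the shuffle: with 64 maps and 64 reducers, each reducer fetches
    # output from each map, i.e. on the order of 64 * 64 = 4096 fetches, so the
    # shuffle cost does not scale linearly with the number of nodes.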

--Bobby

On 9/7/13 3:33 AM, "牛兆捷" <nz...@gmail.com> wrote:

>But I still want to fine the most efficient assignment and scale both data
>and nodes as you said, for example in my result, 2 is the best, and 8 is
>better than 4.
>
>Why is it sub-linear from 2 to 4, super-linear from 4 to 8. I find it is
>hard to model this result. Can you give me some hint about this kind of
>trend?
>
>
>2013/9/7 Vinod Kumar Vavilapalli <vi...@hortonworks.com>
>
>>
>> Clearly your input size isn't changing. And depending on how they are
>> distributed on the nodes, there could be Datanode/disks contention.
>>
>> The better way to model this is by scaling the input data also linearly.
>> More nodes should process more data in the same amount of time.
>>
>> Thanks,
>> +Vinod
>>
>> On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:
>>
>> > Hi all:
>> >
>> > I vary the computational nodes of cluster and get the speedup result
>>in
>> attachment.
>> >
>> > In my mind, there are three type of speedup model: linear, sub-linear
>> and super-linear. However the curve of my result seems a little
>>strange. I
>> have attached it.
>> > <speedup.png>
>> >
>> > This is sort in example.jar, actually it is done only using the
>>default
>> map-reduce mechanism of Hadoop.
>> >
>> > I use hadoop-1.2.1, set 8 map slots and 8 reduce slots per node(12
>>cpu,
>> 20g men)
>> >  io.sort.mb = 512, block size = 512mb, heap size = 1024mb,
>>  reduce.slowstart = 0.05, the others are default.
>> >
>> > Input data: 20g, I divide it to 64 files
>> >
>> > Sort example: 64 map tasks, 64 reduce tasks
>> >
>> > Computational nodes: varying from 2 to 9
>> >
>> > Why the speedup mechanism is like this? How can I model it properly?
>> >
>> > Thanks~
>> >
>> > --
>> > Sincerely,
>> > Zhaojie
>> >
>>
>>
>>
>
>
>
>-- 
>*Sincerely,*
>*Zhaojie*
>*
>*


Re: hadoop1.2.1 speedup model

Posted by 牛兆捷 <nz...@gmail.com>.
But I still want to find the most efficient assignment and to scale both data
and nodes as you said. For example, in my result 2 nodes is the best, and 8 is
better than 4.

Why is it sub-linear from 2 to 4 and super-linear from 4 to 8? I find it
hard to model this result. Can you give me some hints about this kind of
trend?


2013/9/7 Vinod Kumar Vavilapalli <vi...@hortonworks.com>

>
> Clearly your input size isn't changing. And depending on how they are
> distributed on the nodes, there could be Datanode/disks contention.
>
> The better way to model this is by scaling the input data also linearly.
> More nodes should process more data in the same amount of time.
>
> Thanks,
> +Vinod
>
> On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:
>
> > Hi all:
> >
> > I vary the computational nodes of cluster and get the speedup result in
> attachment.
> >
> > In my mind, there are three type of speedup model: linear, sub-linear
> and super-linear. However the curve of my result seems a little strange. I
> have attached it.
> > <speedup.png>
> >
> > This is sort in example.jar, actually it is done only using the default
> map-reduce mechanism of Hadoop.
> >
> > I use hadoop-1.2.1, set 8 map slots and 8 reduce slots per node(12 cpu,
> 20g men)
> >  io.sort.mb = 512, block size = 512mb, heap size = 1024mb,
>  reduce.slowstart = 0.05, the others are default.
> >
> > Input data: 20g, I divide it to 64 files
> >
> > Sort example: 64 map tasks, 64 reduce tasks
> >
> > Computational nodes: varying from 2 to 9
> >
> > Why the speedup mechanism is like this? How can I model it properly?
> >
> > Thanks~
> >
> > --
> > Sincerely,
> > Zhaojie
> >
>
>
>



-- 
Sincerely,
Zhaojie

Re: hadoop1.2.1 speedup model

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Clearly your input size isn't changing, and depending on how the blocks are distributed across the nodes, there could be DataNode/disk contention.

The better way to model this is to scale the input data linearly as well: more nodes should process more data in the same amount of time.
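
In other words, that would be a weak-scaling measurement rather than the strong-scaling one above. As a sketch, using the standard definitions (not specific to this thread), with T(W, n) the job time for input of size W on n nodes:

    \[
    S_{\mathrm{strong}}(n) = \frac{T(W,\, n_0)}{T(W,\, n)},
    \qquad
    E_{\mathrm{weak}}(n) = \frac{T(W,\, 1)}{T(nW,\, n)} .
    \]

Ideal weak scaling keeps E_weak(n) close to 1 as n grows, i.e. more nodes process proportionally more data in roughly the same time.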

Thanks,
+Vinod

On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:

> Hi all:
> 
> I vary the computational nodes of cluster and get the speedup result in attachment.
> 
> In my mind, there are three type of speedup model: linear, sub-linear and super-linear. However the curve of my result seems a little strange. I have attached it.
> <speedup.png>
> 
> This is sort in example.jar, actually it is done only using the default map-reduce mechanism of Hadoop.
> 
> I use hadoop-1.2.1, set 8 map slots and 8 reduce slots per node(12 cpu, 20g men)
>  io.sort.mb = 512, block size = 512mb, heap size = 1024mb,  reduce.slowstart = 0.05, the others are default.
> 
> Input data: 20g, I divide it to 64 files
> 
> Sort example: 64 map tasks, 64 reduce tasks
> 
> Computational nodes: varying from 2 to 9
> 
> Why the speedup mechanism is like this? How can I model it properly?
> 
> Thanks~
> 
> -- 
> Sincerely,
> Zhaojie
> 


