Posted to yarn-dev@hadoop.apache.org by 牛兆捷 <nz...@gmail.com> on 2013/05/08 17:25:47 UTC

question about cpu utilization

I looked at the application container log to trace a map-reduce application.

For a map task, I find there are mainly 3 phases: split input, sort, and spill
out.
I set enough memory to make sure the input can stay in memory.

Initially, I thought the highest CPU utilization would appear in the sort phase
because the other two phases focus on IO. However, it doesn't behave as
I expected. On the contrary, the CPU utilization during the other phases is
higher.

Does anyone know the reason?

-- 
*Sincerely,*
*Zhaojie*

Re: question about cpu utilization

Posted by 牛兆捷 <nz...@gmail.com>.
Thanks~


2013/5/9 Robert Evans <ev...@yahoo-inc.com>

> Then I am really not sure what is happening.  Try profiling your task.
>
> --Bobby


-- 
*Sincerely,*
*Zhaojie*

Re: question about cpu utilization

Posted by Robert Evans <ev...@yahoo-inc.com>.
The CPU scheduling is still kind of fuzzy.  Your request is done in
virtual cores, which do not necessarily correspond to actual physical
cores.  In some cases Linux cgroups may be used to guarantee that you will
get at least a certain level of CPU time, but nothing I am aware of right
now will actually bind the process to a given core.

--Bobby
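
A rough way to picture the virtual-core model described above is as a proportional share of the node's CPU. The sketch below is only an illustration under that assumption: the function name `cpu_share_under_contention`, the container names, and the numbers are hypothetical, and this is not YARN code or configuration. It computes the CPU each container would be left with if every container on a node were busy at once and CPU time were divided in proportion to the requested virtual cores.

```python
def cpu_share_under_contention(vcores_by_container, physical_cores):
    """Split a node's physical cores in proportion to requested vcores.

    Assumption of this sketch: under full contention, CPU time is divided
    by weight (vcores), similar in spirit to a cgroups-style minimum share.
    The result is a floor, not a cap, and nothing here pins a container
    to a particular core.
    """
    total = sum(vcores_by_container.values())
    if total <= 0:
        raise ValueError("at least one virtual core must be requested")
    return {
        name: physical_cores * vcores / total
        for name, vcores in vcores_by_container.items()
    }


if __name__ == "__main__":
    # Hypothetical node: 4 physical cores, 8 vcores requested in total.
    requests = {"map_a": 1, "map_b": 1, "reduce_a": 2, "app_master": 4}
    for name, share in cpu_share_under_contention(requests, 4).items():
        print(f"{name}: at least {share} physical core(s) when all are busy")
```

With these made-up numbers, the two one-vcore containers are each left with half a physical core when everything is busy, which is one way several containers can end up sharing a core. When other containers are idle, a container may use more than this share.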

On 5/8/13 11:55 PM, "牛兆捷" <nz...@gmail.com> wrote:

>By the way, if I set the container CPU to less than 1, what will happen? Can many
>containers share one core?


Re: question about cpu utilization

Posted by 牛兆捷 <nz...@gmail.com>.
By the way, if I set the container CPU to less than 1, what will happen? Can many
containers share one core?


2013/5/9 Robert Evans <ev...@yahoo-inc.com>

> Then I am really not sure what is happening.  Try profiling your task.
>
> --Bobby


-- 
*Sincerely,*
*Zhaojie*

Re: question about cpu utilization

Posted by Robert Evans <ev...@yahoo-inc.com>.
Then I am really not sure what is happening.  Try profiling your task.

--Bobby

On 5/8/13 11:48 AM, "牛兆捷" <nz...@gmail.com> wrote:

>Just for simplicity, I run only one map task on about 256 MB of input, and I set
>my io.sort.memory to more than 512 MB to make sure all input can stay in
>memory. I also checked the log to make sure there is just one spill, which
>happens for flushing.
>
>So I think the different parts run one after another, but the CPU utilization is
>not what I expected.


Re: question about cpu utilization

Posted by 牛兆捷 <nz...@gmail.com>.
Just for simplicity, I run only one map task on about 256 MB of input, and I set
my io.sort.memory to more than 512 MB to make sure all input can stay in
memory. I also checked the log to make sure there is just one spill, which
happens for flushing.

So I think the different parts run one after another, but the CPU utilization is
not what I expected.


2013/5/9 牛兆捷 <nz...@gmail.com>

> I have enough memory, so there will be only one sort and spill. Why would
> they happen in parallel?



-- 
*Sincerely,*
*Zhaojie*

Re: question about cpu utilization

Posted by 牛兆捷 <nz...@gmail.com>.
I have enough memory, so there will be only one sort and spill. Why would they
happen in parallel?


2013/5/9 Robert Evans <ev...@yahoo-inc.com>

> Yes, it all happens in parallel, even on a single task.


-- 
*Sincerely,*
*Zhaojie*

Re: question about cpu utilization

Posted by Robert Evans <ev...@yahoo-inc.com>.
Yes, it all happens in parallel, even on a single task.

On 5/8/13 11:17 AM, "牛兆捷" <nz...@gmail.com> wrote:

>I forgot to say: to see the behavior of a single task, I just ran one map
>task on a 1 GB input split (I set the block size to 1 GB).


Re: question about cpu utilization

Posted by 牛兆捷 <nz...@gmail.com>.
I forgot to say: to see the behavior of a single task, I just ran one map
task on a 1 GB input split (I set the block size to 1 GB).


2013/5/9 Robert Evans <ev...@yahoo-inc.com>

> Deciding on the input split happens in the client.  Each map process just
> opens up the input file and seeks to the appropriate offset in the file.
> At that point it reads each entry one at a time and sends it to the map
> task.  The output of the map task is placed in a buffer.  When the buffer
> gets close to full, the data is sorted and spilled out to disk in parallel
> with the map task still running.  It is hard to get CPU time for the
> different parts because they are all happening in parallel. If you do have
> enough RAM to store the entire output in memory and you have configured
> your sort buffer to be able to hold it all, then you will probably only
> sort/spill once.
>
> --Bobby


-- 
*Sincerely,*
*Zhaojie*

Re: question about cpu utilization

Posted by Robert Evans <ev...@yahoo-inc.com>.
Deciding on the input split happens in the client.  Each map process just
opens up the input file and seeks to the appropriate offset in the file.
At that point it reads each entry one at a time and sends it to the map
task.  The output of the map task is placed in a buffer.  When the buffer
gets close to full, the data is sorted and spilled out to disk in parallel
with the map task still running.  It is hard to get CPU time for the
different parts because they are all happening in parallel. If you do have
enough RAM to store the entire output in memory and you have configured
your sort buffer to be able to hold it all, then you will probably only
sort/spill once.

--Bobby
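
The pipeline described above can be sketched as a toy model. This is a simplified illustration, not Hadoop's actual implementation: `run_map_task`, `buffer_capacity`, and `spill_percent` are hypothetical names, the "map" step just copies records, a spill is a sorted in-memory list rather than a file, and the hand-off queue is unbounded, so the mapping loop never has to wait for a spill to finish. The point it shows is that the mapping loop and the sort/spill work run on separate threads, and that a buffer large enough for all the output leads to a single spill at the end.

```python
import queue
import threading


def run_map_task(records, buffer_capacity, spill_percent=80):
    """Toy model of a map task whose output is sorted and spilled by a
    background thread while the mapping loop keeps running."""
    spills = []              # each entry stands in for one sorted spill file
    pending = queue.Queue()  # filled buffers handed to the spill thread

    def spill_worker():
        while True:
            chunk = pending.get()
            if chunk is None:             # sentinel: mapping is finished
                break
            spills.append(sorted(chunk))  # sort, then "write" the spill

    spiller = threading.Thread(target=spill_worker)
    spiller.start()

    # Start a spill once the buffer is "close to full" (assumed threshold).
    limit = max(1, buffer_capacity * spill_percent // 100)
    buffer = []
    for record in records:
        buffer.append(record)    # map output is collected in the buffer
        if len(buffer) >= limit:
            pending.put(buffer)  # hand off; the loop continues in parallel
            buffer = []
    if buffer:
        pending.put(buffer)      # final flush when the input is exhausted
    pending.put(None)
    spiller.join()
    return spills


if __name__ == "__main__":
    records = list(range(1000, 0, -1))
    print(len(run_map_task(records, buffer_capacity=2000)))  # buffer holds everything
    print(len(run_map_task(records, buffer_capacity=100)))   # many small spills
```

In this model the first call produces one spill and the second produces thirteen. Because the two threads overlap, CPU time measured for the whole process cannot be cleanly attributed to "map" or "sort/spill" without per-thread measurements.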

On 5/8/13 10:25 AM, "牛兆捷" <nz...@gmail.com> wrote:

>I looked at the application container log to trace a map-reduce application.
>
>For a map task, I find there are mainly 3 phases: split input, sort, and spill
>out.
>I set enough memory to make sure the input can stay in memory.
>
>Initially, I thought the highest CPU utilization would appear in the sort phase
>because the other two phases focus on IO. However, it doesn't behave as
>I expected. On the contrary, the CPU utilization during the other phases is
>higher.
>
>Does anyone know the reason?