Posted to common-user@hadoop.apache.org by Amar Kamat <am...@yahoo-inc.com> on 2011/12/03 08:58:17 UTC

Re: Capturing Map and Reduce I/O time

Arun,
> I see that Hadoop doesn't capture the map task I/O time and reduce task I/O time, and captures only map runtime
> and reduce runtime. Am I right?
For maps, the framework doesn't explicitly capture the read time. For reduce, maybe shuffle time is a good metric to start with.

> What does the runtime of map and reduce tasks mean?
Time to finish the entire map task (not just the map method). It includes data read, data processing, sort and spill.

> Which files do I need to look at and modify in Hadoop if I want to capture the map and reduce I/O times?
For the old codebase (pre-YARN), see MapTask.java and ReduceTask.java.

Roughly, a map task is divided into 2 phases, i.e. map and sort. In the map phase, reading and processing happen in parallel: while the user code processes the current key-value pair, the framework reads and caches the next key-value pair. Hence it's tough to distinguish between the read and process phases.
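
One rough way to approximate the read side is to accumulate the time the consumer spends inside the reader calls. The sketch below is a minimal illustration of that idea only: a plain java.util.Iterator stands in for the record reader, and TimedIterator, readNanos and recordCount are hypothetical names, not Hadoop APIs. Because the framework may prefetch the next record, a counter like this captures the time the task waits on the reader, not the true disk or network I/O time.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical wrapper (not a Hadoop class): accumulates the wall-clock time
// spent inside hasNext()/next() of the wrapped iterator.
final class TimedIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private long nanosInRead; // total time spent in the delegate's calls
    private long records;     // number of records successfully returned

    TimedIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        long t0 = System.nanoTime();
        try {
            return delegate.hasNext();
        } finally {
            nanosInRead += System.nanoTime() - t0;
        }
    }

    @Override
    public T next() {
        long t0 = System.nanoTime();
        try {
            T value = delegate.next();
            records++;
            return value;
        } finally {
            nanosInRead += System.nanoTime() - t0;
        }
    }

    long readNanos() {
        return nanosInRead;
    }

    long recordCount() {
        return records;
    }

    public static void main(String[] args) {
        // An in-memory list stands in for the input split (assumption).
        TimedIterator<String> in =
                new TimedIterator<>(Arrays.asList("a b", "c d", "e f").iterator());
        long t0 = System.nanoTime();
        while (in.hasNext()) {
            String line = in.next();
            line.split(" "); // placeholder for the user's map logic
        }
        long totalNanos = System.nanoTime() - t0;
        System.out.println("records=" + in.recordCount()
                + " readNanos=" + in.readNanos()
                + " totalNanos=" + totalNanos);
    }
}
```

The difference between totalNanos and readNanos is then a rough estimate of the processing time for the same loop.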

A reduce task is divided into 3 phases, i.e. shuffle, sort (final) and reduce. The shuffle phase has data copy (over the network) and sort (or rather, merge) happening in parallel. Once all the data has been copied, a final merge happens; this is captured under the sort phase. Still, the shuffle phase time (recorded in the job history) is a good indicator of the time it takes to read the data off the network.
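
The three reduce phases above can be turned into durations once the four phase boundaries of a reduce attempt are known. The sketch below assumes those boundaries are available as millisecond timestamps. I believe the pre-YARN job history logs them for reduce attempts as START_TIME, SHUFFLE_FINISHED, SORT_FINISHED and FINISH_TIME, but treat those field names as an assumption and check the history files of your version. ReducePhaseTimes is a hypothetical helper, not a Hadoop class, and the timestamps in main are made up.

```java
// Hypothetical helper (not a Hadoop class): splits a reduce attempt's runtime
// into shuffle, sort and reduce durations from four boundary timestamps.
final class ReducePhaseTimes {
    final long shuffleMillis; // start of attempt -> end of shuffle (copy + merge)
    final long sortMillis;    // end of shuffle -> end of final merge
    final long reduceMillis;  // end of final merge -> end of attempt

    ReducePhaseTimes(long startTime, long shuffleFinished,
                     long sortFinished, long finishTime) {
        if (startTime > shuffleFinished
                || shuffleFinished > sortFinished
                || sortFinished > finishTime) {
            throw new IllegalArgumentException("timestamps must be non-decreasing");
        }
        this.shuffleMillis = shuffleFinished - startTime;
        this.sortMillis = sortFinished - shuffleFinished;
        this.reduceMillis = finishTime - sortFinished;
    }

    long totalMillis() {
        return shuffleMillis + sortMillis + reduceMillis;
    }

    public static void main(String[] args) {
        // Made-up timestamps in milliseconds, for illustration only.
        ReducePhaseTimes t = new ReducePhaseTimes(1000L, 61000L, 63000L, 93000L);
        System.out.println("shuffle=" + t.shuffleMillis + "ms"
                + " sort=" + t.sortMillis + "ms"
                + " reduce=" + t.reduceMillis + "ms"
                + " total=" + t.totalMillis() + "ms");
    }
}
```

Note that the reduce duration computed this way also includes writing the reduce output, so it is not a pure compute time.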

Amar

On 11/29/11 7:56 PM, "ArunKumar" <ar...@gmail.com> wrote:

Hi guys!

I see that Hadoop doesn't capture the map task I/O time and reduce task I/O
time, and captures only map runtime and reduce runtime. Am I right?

By I/O time for a map task I mean the time taken by the map task to read the
input chunk allocated to it for processing, plus the time for it to write the
output data to the local disk.
By I/O time for a reduce task I mean the time for the reduce task to fetch the
map outputs (shuffle phase), plus the time to write the reduce outputs to DFS.

> What does the runtime of map and reduce tasks mean?
   Does it mean the time taken to execute the map method and reduce method
respectively? (or)
   Does it mean the time taken from the start of the map/reduce task to the
completion of the map/reduce task (i.e. including time to read, sort, compute
map or reduce, merge, etc.)?

> Which files do I need to look at and modify in Hadoop if I want to capture
> the map and reduce I/O times?

> If I want to capture these values for a few jobs of applications like
> wordcount, sort, etc., what is the best way to do it?

Can anyone guide me in this regard?

Thanks,
Arun
