Posted to common-user@hadoop.apache.org by Atish Kathpal <at...@gmail.com> on 2014/02/19 10:24:59 UTC

How to list the order in which file splits will be processed by Maps in Hadoop 2.2.0?

Hello

I would like to know the order in which input files are processed by the
map tasks of a given job.

*Example*: I am running Wordcount on an input directory /ebooks/ containing,
say, 10 .txt files.
While this job runs, I would like to know, at any point in time, which
map tasks (map task IDs) on which nodes (IP addresses) are processing
which file splits (actual file, range of offsets).

Is it possible to hook into the MR source code to obtain such details? If
so, please point me to the section of code where I can get them.

After logging and analyzing these details, I may want to add some
pre-fetching to improve map task performance. (I am not using HDFS, but a
different FS whose performance needs improving through pre-fetching or
other techniques.)
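
To make the logging idea concrete, here is a minimal sketch in plain Java
of the kind of record I have in mind and how I would turn it into an access
order. It is only an illustration: SplitAccessLog, Entry and all field names
are hypothetical, the sample values in main are made up, and nothing here
calls a Hadoop API. How to fill these fields from a running job is exactly
what I am asking about.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical log of which map task read which file split, and when. */
final class SplitAccessLog {

    /** One observation: a map task on a node starting on a byte range of a file. */
    static final class Entry {
        final String taskId;     // map task ID as a string (assumed format)
        final String host;       // IP address or hostname of the node
        final String path;       // file the split belongs to
        final long offset;       // first byte of the split
        final long length;       // number of bytes in the split
        final long startMillis;  // when the task began reading the split

        Entry(String taskId, String host, String path,
              long offset, long length, long startMillis) {
            this.taskId = taskId;
            this.host = host;
            this.path = path;
            this.offset = offset;
            this.length = length;
            this.startMillis = startMillis;
        }

        /** Half-open byte range [offset, offset + length) as text. */
        String range() {
            return "[" + offset + ", " + (offset + length) + ")";
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void record(Entry e) {
        entries.add(e);
    }

    /** Entries sorted by start time; ties broken by path, then offset. */
    List<Entry> inAccessOrder() {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.startMillis)
                .thenComparing(e -> e.path)
                .thenComparingLong(e -> e.offset));
        return sorted;
    }

    /** Distinct file paths in the order they were first touched. */
    List<String> fileOrder() {
        List<String> files = new ArrayList<>();
        for (Entry e : inAccessOrder()) {
            if (!files.contains(e.path)) {
                files.add(e.path);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        SplitAccessLog log = new SplitAccessLog();
        // Made-up sample observations, recorded out of order on purpose.
        log.record(new Entry("map-1", "10.0.0.2", "/ebooks/b.txt", 0L, 4096L, 2000L));
        log.record(new Entry("map-0", "10.0.0.1", "/ebooks/a.txt", 0L, 4096L, 1000L));
        for (Entry e : log.inAccessOrder()) {
            System.out.println(e.taskId + " on " + e.host + " reads "
                    + e.path + " " + e.range());
        }
    }
}
```

With a log like this collected over a few runs, fileOrder() would give me
the per-file sequence I would try to pre-fetch against.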

TL;DR
I want to know the sequence/order in which the different input files will
be accessed by map tasks once a job is submitted to a Hadoop v2 cluster. I
am assuming some kind of FIFO scheduler module might be able to give me
this information at the file level?

Looking forward to your reply.

Thanks.