Posted to common-dev@hadoop.apache.org by Michel Tourn <mi...@yahoo-inc.com> on 2006/02/10 01:01:24 UTC

MapRed status information

Can somebody comment on the feasibility of this?

Currently the MapRed status information looks like:

060208 223641  map 0%  reduce 100%
060208 223641 Job complete: job_ityg9w

i.e. it only shows the percentage complete for a given job.

This is nice but I would also like to see some absolute numbers:

060208 223641  map 50% (2456) reduce 100% (123)
060208 223641 Job complete: job_ityg9w

which tells me:
2456 input records are 50% of the Map job
123 output records are 100% of the Reduce job
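
A minimal sketch of how such a line could be formatted, in Java since that is what MapRed is written in. The class name StatusLine and the method format are hypothetical, not existing MapRed code, and the sketch assumes that progress is available as a fraction between 0 and 1 and that the record counts are already known to the caller:

```java
import java.util.Locale;

// Hypothetical helper, not part of MapRed: builds the proposed status text.
final class StatusLine {
    // mapProgress and reduceProgress are assumed to be fractions in [0, 1];
    // mapRecords and reduceRecords are the absolute counts processed so far.
    static String format(float mapProgress, long mapRecords,
                         float reduceProgress, long reduceRecords) {
        return String.format(Locale.ROOT, " map %d%% (%d) reduce %d%% (%d)",
                Math.round(mapProgress * 100), mapRecords,
                Math.round(reduceProgress * 100), reduceRecords);
    }
}
```

The timestamp prefix is left to the existing log formatter, so only the part after it is built here.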

In particular this is useful info when you get
your DFS paths or file wildcards wrong:
MapRed finds zero input files and happily processes
an empty job and completes with:

060208 223641  map 100% reduce 100%
..which is 100% of zero records.
Completing successfully for an empty job is the right behaviour.
But we need a more informative status.


This could be combined with work on:
 ETA estimation for the Job:

060208 223641  map 50% (2456) ETA: +3h20m 060209 015641
..
060209 010000 Job complete: job_ityg9w


Can somebody comment on the feasibility of this?
1. Is it easy to get the absolute number of records processed?
2. Is it easy to get (an estimate(*) of) the total number of records
   to be processed in the job?
3. Where should this be computed?
   Should it be computed by a client polling for status?
   Can the information also be made available to the job.tracker.info Web UI?

(*)extrapolation based on file fragments size and bytes/rec so far.
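
To make the extrapolation concrete, here is a rough sketch in Java. Everything in it is an assumption for illustration: the class ProgressEstimate and its methods are hypothetical, and it presumes that the total input size in bytes, the bytes read so far, the records read so far and the elapsed time are all available wherever this is computed. It also assumes records are processed at a roughly constant rate.

```java
// Hypothetical sketch, not existing MapRed code.
final class ProgressEstimate {
    // Estimate total records as totalBytes / (bytes per record so far).
    // Returns -1 when nothing has been read yet and no estimate is possible.
    static long estimateTotalRecords(long totalBytes, long bytesRead, long recordsRead) {
        if (bytesRead <= 0 || recordsRead <= 0) {
            return -1;
        }
        double bytesPerRecord = (double) bytesRead / recordsRead;
        return Math.round(totalBytes / bytesPerRecord);
    }

    // Estimate remaining time from the average time per record so far.
    // Returns -1 when no estimate is possible.
    static long estimateRemainingMillis(long elapsedMillis, long recordsDone, long totalRecords) {
        if (recordsDone <= 0 || totalRecords < recordsDone) {
            return -1;
        }
        double millisPerRecord = (double) elapsedMillis / recordsDone;
        return Math.round(millisPerRecord * (totalRecords - recordsDone));
    }

    // Render the remaining time in the "+3h20m" style used in the example above.
    static String formatEta(long remainingMillis) {
        long totalMinutes = remainingMillis / 60000L;
        return "+" + (totalMinutes / 60) + "h" + (totalMinutes % 60) + "m";
    }
}
```

With zero input files, estimateTotalRecords returns -1, which a status line could report explicitly instead of showing 100% of nothing.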


Thanks,
Michel