Posted to mapreduce-user@hadoop.apache.org by Jane Chen <jx...@yahoo.com> on 2011/03/29 01:10:29 UTC

RecordReader Progress Reporting.

I'd like to understand how the task scheduler relies on RecordReader.getProgress() in version 0.20.2.

There are times when I don't have an accurate count of the total records to be processed, and I wonder what impact returning an inaccurate progress percentage has on task scheduling. I found that returning 0 when not done and 1 when done makes the job hang.

Any advice is greatly appreciated.

Thanks,
Jane

Re: RecordReader Progress Reporting.

Posted by Harsh J <ha...@cloudera.com>.
Hello Jane,

On Tue, Mar 29, 2011 at 4:40 AM, Jane Chen <jx...@yahoo.com> wrote:
> There are times when I don't have an accurate count of the total records to be processed, and I wonder the impact on task scheduling when returning an inaccurate progress percentage.  I found that when I return either 0 when not done and 1 when done will make the job hang.

What do you mean when you say the job 'hangs' when you statically set
it to 0 or 1 always? Do you mean the task gets killed and restarted?

When the progress or status message changes, a task status report is
sent back via the reporter to the TIP (TaskInProgress) object held by
the parent TaskTracker. If a TIP has not received a task report for a
while, the task is declared hung or unresponsive, purged, and
rescheduled. The limit is mapred.task.timeout, 600000 ms (10 minutes)
by default; set it to 0 to disable the purge.
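
To make the timeout rule concrete, here is a toy model of the check in
plain Java. It is only a sketch of the behaviour described above, not
Hadoop's actual code: the class TaskTimeoutModel and its methods are
hypothetical names, and the only thing taken from Hadoop is the meaning
of mapred.task.timeout (milliseconds, 0 disables the check).

```java
// Toy model of the task timeout rule; NOT Hadoop's implementation.
// All names here are hypothetical and exist only for illustration.
final class TaskTimeoutModel {
    // Mirrors mapred.task.timeout: milliseconds, 0 means "never time out".
    private final long timeoutMillis;
    // Time of the most recent status report from the task.
    private long lastReportMillis;

    TaskTimeoutModel(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastReportMillis = startMillis;
    }

    // Called whenever the task reports a progress or status change.
    void onStatusReport(long nowMillis) {
        lastReportMillis = nowMillis;
    }

    // True when no report has arrived within the timeout window.
    boolean isTimedOut(long nowMillis) {
        return timeoutMillis > 0 && nowMillis - lastReportMillis > timeoutMillis;
    }
}
```

In this model a task that reports at least once per timeout window is
never purged, whatever progress value it reports.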

If you're not sure what your progress is while processing records in
the RecordReader, set progress to a random value; it shouldn't matter
to the framework if the progress decreases in value.
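
As an alternative to a random value, a RecordReader could return a
rough estimate from getProgress(). The sketch below is plain Java with
no Hadoop dependency; ProgressEstimate, fromBytes, fromRecordCount and
the halfwayGuess parameter are hypothetical names for illustration, not
part of the Hadoop API. It assumes you can count either bytes consumed
from the split or records read so far.

```java
// Hypothetical helper for estimating progress; not part of Hadoop.
final class ProgressEstimate {
    private ProgressEstimate() {}

    // Known total size: fraction consumed, clamped to [0, 1].
    static float fromBytes(long bytesRead, long totalBytes) {
        if (totalBytes <= 0) {
            return 0.0f; // unknown or empty total: report no progress
        }
        float fraction = (float) bytesRead / (float) totalBytes;
        return Math.min(1.0f, Math.max(0.0f, fraction));
    }

    // Unknown total: n / (n + k) rises toward 1 as more records are read.
    // halfwayGuess (k, assumed > 0) is the record count reported as 50%.
    static float fromRecordCount(long recordsRead, long halfwayGuess) {
        if (recordsRead <= 0) {
            return 0.0f;
        }
        double n = (double) recordsRead;
        return (float) (n / (n + (double) halfwayGuess));
    }
}
```

For example, getProgress() could return
ProgressEstimate.fromRecordCount(recordsRead, 100000L) while reading and
1.0f once the input is exhausted.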

-- 
Harsh J