Posted to mapreduce-issues@hadoop.apache.org by "Xuan Gong (JIRA)" <ji...@apache.org> on 2012/12/04 01:47:58 UTC

[jira] [Commented] (MAPREDUCE-4835) AM job metrics can double-count a job if it errors after entering a completion state

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509385#comment-13509385 ] 

Xuan Gong commented on MAPREDUCE-4835:
--------------------------------------

The method "Have JobImpl.finished ignore incrementing any metrics if the job is already in a terminal state (SUCCEEDED/FAILED/KILLED) to avoid double-counting a job." may not work. But before we call the finished, the current states is already changed. So, it is very difficult to check previous status is terminal states or not.
For example, somehow we did InternalErrorTransition, it will change to state from succeeded to error. From the code at InternalErrorTransition, 
    public void transition(JobImpl job, JobEvent event) {
      //TODO Is this JH event required.
      job.setFinishTime();
      JobUnsuccessfulCompletionEvent failedEvent =
          new JobUnsuccessfulCompletionEvent(job.oldJobId,
              job.finishTime, 0, 0,
              JobStateInternal.ERROR.toString());
      job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent)); // <-- this is where the state is actually changed
      job.finished(JobStateInternal.ERROR); // <-- this increments the failure count a second time, producing the duplicate
    }
So, what we can do is capture JobStateInternal previousState = job.getInternalState() before the call to job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent)), and then check previousState to decide whether the count should be incremented.
For example, if we do not want to increment the count when a job moves from a terminal state to the ERROR state, we can change InternalErrorTransition as follows:
    public void transition(JobImpl job, JobEvent event) {
      //TODO Is this JH event required.
      job.setFinishTime();
      JobUnsuccessfulCompletionEvent failedEvent =
          new JobUnsuccessfulCompletionEvent(job.oldJobId,
              job.finishTime, 0, 0,
              JobStateInternal.ERROR.toString());
      JobStateInternal previousState = job.getInternalState();
      job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent));
      // Only count the job if the previous state was not already a terminal
      // state (SUCCEEDED/FAILED/KILLED) and not ERROR; in those cases the job
      // has already been counted and must not be counted again.
      if (previousState != JobStateInternal.SUCCEEDED
          && previousState != JobStateInternal.KILLED
          && previousState != JobStateInternal.FAILED
          && previousState != JobStateInternal.ERROR) {
        job.finished(JobStateInternal.ERROR);
      }
    }
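For reference, the same guard could also be expressed with a small helper built on java.util.EnumSet (assuming it is imported). This is only a sketch; the alreadyCounted method below is a hypothetical helper, not existing JobImpl code:
    // Sketch only: a hypothetical helper (not in JobImpl today) that reports
    // whether the job has already been counted in the AM metrics.
    private static boolean alreadyCounted(JobStateInternal state) {
      return EnumSet.of(JobStateInternal.SUCCEEDED, JobStateInternal.FAILED,
          JobStateInternal.KILLED, JobStateInternal.ERROR).contains(state);
    }

    // ... and in InternalErrorTransition.transition():
    if (!alreadyCounted(previousState)) {
      job.finished(JobStateInternal.ERROR);
    }
Either form expresses the same check; the EnumSet version simply keeps the list of already-counted states in one place.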
                
> AM job metrics can double-count a job if it errors after entering a completion state
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4835
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4835
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.6
>            Reporter: Jason Lowe
>            Priority: Minor
>
> If JobImpl enters the SUCCEEDED, FAILED, or KILLED state but then encounters an invalid state transition, it could double-count the job since jobs that encounter an error are considered failed jobs.  Therefore the job could be counted initially as a successful, failed, or killed job, respectively, then counted again as a failed job due to the internal error afterwards.
