Posted to mapreduce-issues@hadoop.apache.org by "Junping Du (JIRA)" <ji...@apache.org> on 2015/05/11 17:44:00 UTC

[jira] [Commented] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538071#comment-14538071 ] 

Junping Du commented on MAPREDUCE-6361:
---------------------------------------

The NPE is thrown in copyFailed() at ShuffleSchedulerImpl.java:267:
{code}
boolean hostFail = hostFailures.get(hostname).get() > getMaxHostFailures() ? true : false;
{code} 
This means hostFailures does not contain the hostname that just failed. That is unexpected, because we always call hostFailed() to put the host into hostFailures before calling copyFailed():
{code}
        scheduler.hostFailed(host.getHostName());
        for(TaskAttemptID left: failedTasks) {
          scheduler.copyFailed(left, host, true, false);
        }
{code}
Although hostFailed() and copyFailed() are both synchronized methods (as is copySucceeded()), the NPE is still possible, and the following looks like the only explanation: while this thread is handling one map output failure, another thread calls copySucceeded() on the same host (for a different map output) between this thread's calls to hostFailed() and copyFailed(). Each method holds the lock only for its own duration, so the hostFailed()/copyFailed() pair is not atomic, and copySucceeded() can drop the host's entry from hostFailures in between.
We need to fix this concurrency issue to get rid of the NPE, which currently fails the map output copy directly without any retry.
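
To make the interleaving concrete, here is a minimal, self-contained sketch that replays it in a single thread. It is not the ShuffleSchedulerImpl code: HostFailureTracker, copyFailedUnguarded() and copyFailedGuarded() are hypothetical names, AtomicInteger stands in for whatever counter type the real map uses, and the sketch assumes that copySucceeded() removes the host's entry from hostFailures. The guarded variant shows one possible way to avoid the NPE (re-check the entry inside the same synchronized method); it is not necessarily the fix that will be committed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the per-host bookkeeping in the shuffle scheduler. */
class HostFailureTracker {
    private final Map<String, AtomicInteger> hostFailures = new HashMap<>();
    private final int maxHostFailures;

    HostFailureTracker(int maxHostFailures) {
        this.maxHostFailures = maxHostFailures;
    }

    /** Records one more failure for the host, creating the entry if needed. */
    synchronized void hostFailed(String hostname) {
        hostFailures.computeIfAbsent(hostname, h -> new AtomicInteger()).incrementAndGet();
    }

    /** Assumption: a successful copy clears the host's failure entry. */
    synchronized void copySucceeded(String hostname) {
        hostFailures.remove(hostname);
    }

    /** Mirrors the failing line: dereferences the entry without a null check. */
    synchronized boolean copyFailedUnguarded(String hostname) {
        return hostFailures.get(hostname).get() > maxHostFailures;
    }

    /** Guarded variant: restores a missing entry while still holding the lock. */
    synchronized boolean copyFailedGuarded(String hostname) {
        AtomicInteger count = hostFailures.get(hostname);
        if (count == null) {
            // Entry was cleared by a concurrent copySucceeded(); count this failure.
            count = new AtomicInteger(1);
            hostFailures.put(hostname, count);
        }
        return count.get() > maxHostFailures;
    }
}

class ShuffleRaceSketch {
    public static void main(String[] args) {
        HostFailureTracker tracker = new HostFailureTracker(3);

        // Replay the suspected interleaving of two fetcher threads, step by step.
        tracker.hostFailed("host-a");      // fetcher A: a copy from host-a failed
        tracker.copySucceeded("host-a");   // fetcher B: another copy from host-a succeeded
        try {
            tracker.copyFailedUnguarded("host-a"); // fetcher A: entry is gone
            System.out.println("no exception");
        } catch (NullPointerException e) {
            System.out.println("NPE reproduced");
        }

        // Same interleaving with the guarded variant: no NPE.
        tracker.hostFailed("host-a");
        tracker.copySucceeded("host-a");
        System.out.println("guarded hostFail = " + tracker.copyFailedGuarded("host-a"));
    }
}
```

The point of the sketch is that making each method synchronized protects the map itself but not the two-step sequence in the caller; any fix has to either tolerate a missing entry inside copyFailed() or perform both steps under a single lock acquisition.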

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6361
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#25
>          at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:415)
>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>          at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>          at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>          at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)


