Posted to mapreduce-issues@hadoop.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2009/09/11 01:57:59 UTC

[jira] Updated: (MAPREDUCE-969) NullPointerException during reduce freezes job

     [ https://issues.apache.org/jira/browse/MAPREDUCE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated MAPREDUCE-969:
----------------------------------

    Attachment: reduce_task_logs
                bad_job_jt_logs
                bad_job_events

Attaching sanitized logs from this incident.

The event that seems to be a red flag is the lost task tracker xx05.

The NullPointerException is caused by u.getHost() returning null; this URI is the taskTrackerHttpAddress in a TaskTrackerStatus. None of the events in the job event output has a malformed URL, so I suspect some kind of race.
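
To illustrate the mechanism only (this is not the Hadoop code, and it does not explain why the host was null in this incident), here is a minimal sketch. java.net.URI.getHost() returns null when the authority cannot be parsed as a server address, and ConcurrentHashMap.get(null) throws a NullPointerException. The class name, the map, and the sample addresses are hypothetical.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullHostDemo {
    // Hypothetical stand-in for a map keyed by task tracker host name.
    static final Map<String, Integer> EVENTS_PER_HOST = new ConcurrentHashMap<>();

    // Returns the host component of the address, or null if URI cannot extract one.
    static String hostOf(String address) {
        return URI.create(address).getHost();
    }

    // True if looking up the address's host in the concurrent map throws an NPE.
    static boolean lookupThrowsNpe(String address) {
        try {
            EVENTS_PER_HOST.get(hostOf(address)); // ConcurrentHashMap rejects null keys
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Well-formed address: the host is extracted as expected.
        System.out.println(hostOf("http://xx05:50060"));
        // Underscore in the host name: parsing succeeds, but getHost() is null.
        System.out.println(hostOf("http://bad_host:50060"));
        // Missing scheme: "xx05" is read as the scheme, so there is no host.
        System.out.println(hostOf("xx05:50060"));
        System.out.println(lookupThrowsNpe("http://bad_host:50060"));
    }
}
```

A null check on the host before the map lookup would turn the NullPointerException into an error that can be reported explicitly.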

Aside from this issue, I find it odd that GetMapEventsThread ignores exceptions. In cases like this, that causes the ReduceTask to spin forever, still reporting progress, until the user intervenes.
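
As a rough sketch of the alternative (hypothetical names throughout, not the actual GetMapEventsThread code or a proposed patch), a polling loop can count consecutive failures and escalate once a limit is reached, rather than swallowing every exception and retrying indefinitely:

```java
import java.util.function.BooleanSupplier;

public class PollerSketch {
    // Hypothetical limit on consecutive failures before the poller gives up.
    static final int MAX_CONSECUTIVE_FAILURES = 3;

    /**
     * Calls fetch until it returns true and returns the number of attempts used.
     * Throws IllegalStateException after MAX_CONSECUTIVE_FAILURES exceptions in
     * a row, or when maxAttempts is exhausted, so the caller can fail the task.
     */
    static int pollUntilDone(BooleanSupplier fetch, int maxAttempts) {
        int failures = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (fetch.getAsBoolean()) {
                    return attempt;
                }
                failures = 0; // a successful poll resets the failure streak
            } catch (RuntimeException e) {
                failures++;
                if (failures >= MAX_CONSECUTIVE_FAILURES) {
                    // Escalate with the original exception attached as the cause.
                    throw new IllegalStateException(
                        "polling failed " + failures + " times in a row", e);
                }
            }
        }
        throw new IllegalStateException("not done after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        try {
            pollUntilDone(() -> { throw new NullPointerException(); }, 100);
        } catch (IllegalStateException e) {
            System.out.println("escalated: " + e.getMessage());
        }
    }
}
```

With a limit like this, a persistent exception surfaces as a task failure that the framework can retry elsewhere, instead of a reducer that appears healthy but never finishes.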

> NullPointerException during reduce freezes job
> ----------------------------------------------
>
>                 Key: MAPREDUCE-969
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-969
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker, task, tasktracker
>    Affects Versions: 0.20.2
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: bad_job_events, bad_job_jt_logs, reduce_task_logs
>
>
> We experienced several jobs stuck in the reduce phase on a cluster. All of the stuck reduce tasks were stuck at "Need another 2 map output(s) where 0 is already in progress" despite all of the mappers having completed and 0 being scheduled. The stuck reducers had experienced the following exception early in the shuffle:
> java.lang.NullPointerException
> 	at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2747)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2670)
> Will attach more information and logs momentarily.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.