Posted to common-dev@hadoop.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2007/03/08 01:00:27 UTC

[jira] Created: (HADOOP-1087) Reducer hangs pulling from incorrect file.out.index path. (when one of the mapred.local.dir is not accessible but becomes available later at reduce time)

Reducer hangs pulling from incorrect file.out.index path. (when one of the mapred.local.dir is not accessible but becomes available later at reduce time)
---------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-1087
                 URL: https://issues.apache.org/jira/browse/HADOOP-1087
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.10.1
            Reporter: Koji Noguchi



2007-03-07 23:14:23,431 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Server returned HTTP response code: 500 for URL: http://____:____/mapOutput?map=task_7810_m_000897_0&reduce=397
  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1149)
  at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:121)
  at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:236)
  at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:199)
2007-03-07 23:14:23,431 WARN org.apache.hadoop.mapred.TaskRunner: task_7810_r_000397_0 adding host ____.com to penalty box, next contact in 279 seconds

This happened when one of the drives was full and therefore not accessible at map time.

One mapper, in

    // map side: resolve where the final file.out.index will be written
    public void mergeParts() throws IOException {
      ...
      Path finalIndexFile = mapOutputFile.getOutputIndexFile(getTaskId());

failed on the first hashed entry in mapred.local.dir and used the second entry instead.

Afterwards, the first directory entry became available again, and when the reducer tried to pull the map output through
    // servlet side (serving the reducer's HTTP fetch): resolve where
    // file.out.index is read from
    public static class MapOutputServlet extends HttpServlet {
      ...
      Path indexFileName = conf.getLocalPath(mapId+"/file.out.index");

the servlet resolved the path against the first entry.

As a result, that directory was empty, and the reducer kept trying to pull from the incorrect path and hung.

(I wasn't sure whether this is a duplicate of HADOOP-895, since it is not reproducible unless I hit a disk failure.)
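
To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual Hadoop 0.10.1 code; the class, method names, and drive layout are invented for illustration). It shows how two independent resolutions of the same relative path over mapred.local.dir can disagree when a drive is unusable at map time but usable again at reduce time:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative sketch only. The write-time resolution (map side) skips a
    // local dir that is unusable, but the read-time resolution (servlet side)
    // re-resolves the same relative path later and lands on the recovered
    // dir, where the index file was never written.
    public class LocalDirDivergence {

      static final List<String> LOCAL_DIRS =
          Arrays.asList("/hadoop1/mapred/local", "/hadoop4/mapred/local");

      // Walk mapred.local.dir and return a path under the first usable entry,
      // skipping entries whose disk is full or missing. (The real allocator
      // hashes the path to pick a starting entry; we start at the first one
      // to keep the demo deterministic.)
      static String resolve(String relPath, Set<String> usableDirs) {
        for (String dir : LOCAL_DIRS) {
          if (usableDirs.contains(dir)) {
            return dir + "/" + relPath;
          }
        }
        throw new IllegalStateException("no usable dir in mapred.local.dir");
      }

      public static void main(String[] args) {
        String rel = "task_0198_m_000251_0/file.out.index";

        // Map time: /hadoop1 is full, so the index file lands under /hadoop4.
        Set<String> atMapTime =
            new HashSet<>(Arrays.asList("/hadoop4/mapred/local"));
        String written = resolve(rel, atMapTime);

        // Reduce time: /hadoop1 is available again, so the servlet-side
        // resolution of the same relative path returns the /hadoop1 path,
        // a directory the map never wrote to.
        Set<String> atReduceTime = new HashSet<>(LOCAL_DIRS);
        String read = resolve(rel, atReduceTime);

        System.out.println("map wrote:     " + written);
        System.out.println("servlet reads: " + read);
      }
    }

The point is that nothing ties the read-side resolution to the directory the map actually used, so a change in drive availability between the two steps silently redirects the reducer to an empty directory, and it keeps retrying the same wrong path.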



[jira] Commented: (HADOOP-1087) Reducer hangs pulling from incorrect file.out.index path. (when one of the mapred.local.dir is not accessible but becomes available later at reduce time)

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488364 ] 

Devaraj Das commented on HADOOP-1087:
-------------------------------------

This is strange, actually. If file.out.index does not exist in the directory, we should get an IOException, and the IOException handler in the doGet method sends an SC_GONE status in the HTTP response. The client should see this as 410, not the 500 reported in the exception ('Server returned HTTP response code: 500'). A 500 means there was an internal error in the servlet, which could have been caused by some other problem such as an NPE (possibly something like HADOOP-1123). Another thing pointing in this direction is that the map output is declared 'lost' in the same exception-handler code in the doGet method, and the JobTracker then re-executes the map, so the job should not hang.
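
In outline, the behaviour described here looks like the following hedged sketch (not the actual MapOutputServlet code; streamIndexAndData() is a hypothetical placeholder):

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Sketch of the error-handling pattern only. A missing file.out.index
    // surfaces as an IOException, which the handler maps to 410 (SC_GONE).
    // An unchecked exception (e.g. an NPE) is not caught here, so it reaches
    // the servlet container and comes back to the client as a generic 500.
    public class MapOutputServletSketch extends HttpServlet {
      @Override
      protected void doGet(HttpServletRequest request,
                           HttpServletResponse response) {
        String mapId = request.getParameter("map");
        try {
          streamIndexAndData(mapId, response);   // hypothetical helper
        } catch (IOException ie) {
          // Missing/unreadable map output: report it as gone so the framework
          // can declare the output lost and re-execute the map.
          response.setStatus(HttpServletResponse.SC_GONE);
        }
      }

      private void streamIndexAndData(String mapId,
                                      HttpServletResponse response)
          throws IOException {
        // Placeholder: read file.out.index and file.out and write the
        // requested partition to the response.
        throw new FileNotFoundException(mapId + "/file.out.index");
      }
    }

So a 410 on the reducer indicates the "file is gone" path above, while a 500 points to something escaping that handler entirely.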



[jira] Resolved: (HADOOP-1087) Reducer hangs pulling from incorrect file.out.index path. (when one of the mapred.local.dir is not accessible but becomes available later at reduce time)

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das resolved HADOOP-1087.
---------------------------------

    Resolution: Won't Fix

Resolving this for now; the earlier comments explain the reason. HADOOP-1252 should also take care of this situation if it ever happens. If the problem appears even with the fix for HADOOP-1252, we can reopen this.



[jira] Commented: (HADOOP-1087) Reducer hangs pulling from incorrect file.out.index path. (when one of the mapred.local.dir is not accessible but becomes available later at reduce time)

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488773 ] 

Koji Noguchi commented on HADOOP-1087:
--------------------------------------

I tried a random mapOutput link on 0.10.1 and 0.12.3.

In 0.10.1, it returned 500.
In 0.12.3, it returned 410.

On the web (in 0.10.1) 
==============================
HTTP ERROR: 500

/hadoop1/mapred/local/task_0198_m_000251_0/file.out.index

RequestURI=/mapOutput

Powered by Jetty://

==============================
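
For reference, the kind of status check described above takes only a few lines (host, port, and task IDs below are placeholders, not values from this cluster; 50060 is the usual TaskTracker HTTP port, adjust as needed):

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Probe a TaskTracker's mapOutput servlet and print the raw HTTP status.
    // getResponseCode() returns the code even for error responses, so we see
    // 500 vs. 410 directly without triggering the IOException that the
    // reducer's copier hits when it calls getInputStream().
    public class MapOutputProbe {
      public static void main(String[] args) throws Exception {
        URL url = new URL("http://tasktracker.example.com:50060/mapOutput"
            + "?map=task_0198_m_000251_0&reduce=0");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
      }
    }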

When the original error happened, I found the file.out.index in a different directory: /hadoop4/mapred/local/task_0198_m_000251_0/file.out.index instead of /hadoop1. That's why I thought it had something to do with the full drive.

> Another thing pointing in this direction is that the map output is declared 'lost' in the same exception-handler code in the doGet method, and the JobTracker then re-executes the map, so the job should not hang.
>
Was this fixed after 0.10.1?
If so, we can change this to 'Won't Fix'.

