You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2007/03/23 17:31:32 UTC

[jira] Created: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Reduce task hang failing in MapOutputCopier.copyOutput
------------------------------------------------------

                 Key: HADOOP-1152
                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
            Reporter: Koji Noguchi


We had couple of reduce tasks hang repeating the output below.

2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
  at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
  at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
  at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
  at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds

===============================
Before the above output, there was 

2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
  at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
  at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
  at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
  at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
  at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490323 ] 

Hadoop QA commented on HADOOP-1152:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12355904/1152.patch against trunk revision r530556.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/65/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/65/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490292 ] 

Devaraj Das commented on HADOOP-1152:
-------------------------------------

+1, regardless of the race condition theory, it makes sense to not call getLength on a renamed ramfs file (that the merge thread might have already deleted in the time interval between rename and getLength). But it seems very likely the race condition is caused due to this.

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch, 1152.workaround.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483658 ] 

Devaraj Das commented on HADOOP-1152:
-------------------------------------

Dug the logs a bit more (for the hung reduces).. Here is another relevant "rename" exception, I found for all the hung reduces; this is for the .crc file. I looked at the code also - and it seems like the first rename is somehow not really updating (or the updates are not visible) the datastructures in the ramfs. So, later on the first getLength() is failing on the final file that rename just attempted to create (in the ramfs datastructure). However, later on, the final filenames become visible in the ramfs and then the renames really fail leading to the chain of IOExceptions... Some weird behavior...

2007-03-22 18:17:28,670 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 copy failed: task_0026_m_007854_0 from __
2007-03-22 18:17:28,670 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /export/crawlspace3/kryptonite/hadoop/mapred/local/task_0026_r_000307_0/.map_7854.out.crc already exists
        at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
        at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:480)
        at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
        at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)


> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Reopened: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White reopened HADOOP-1152:
-------------------------------


> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Tahir Hashmi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tahir Hashmi reassigned HADOOP-1152:
------------------------------------

    Assignee: Tahir Hashmi

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White resolved HADOOP-1152.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.13.0

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>             Fix For: 0.13.0
>
>         Attachments: 1152.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Johan Oskarson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12489998 ] 

Johan Oskarson commented on HADOOP-1152:
----------------------------------------

This is happening on a daily basis on our 25 node cluster running 0.12.3 causing serious delays.
I noticed there's no news for about a month so thought I'd ask if a hack would be to just make it overwrite the file that already exists?
I've not had time to look at the code but as long as it doesn't hang the job I'm fine with a minor performance hit until a real solution can be found.

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1152:
------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I've just committed this. Thanks Tahir!

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1152:
--------------------------------

    Attachment: 1152.workaround.patch

Actually the problem might not be in the rename. The NPE should not happen in the first place.
Here is a patch that will kill the task if it ever encounters a NPE in the affected part of the code. It will also log the state of the ramfs in terms of the files it currently has and their lengths. It will also log a diagnostic message which you would see in the web ui for the killed task. Please upload those log messages whenever you see tasks failing with this exception.

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.workaround.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Tahir Hashmi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tahir Hashmi updated HADOOP-1152:
---------------------------------

    Attachment:     (was: 1152.workaround.patch)

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Tahir Hashmi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tahir Hashmi updated HADOOP-1152:
---------------------------------

    Attachment: 1152.patch

Looked at this with Devaraj yesterday and our theory about why this fails is that in MapTask.MapOutputCopier.copyOutput(), there's a call to rename a temporary file to the actual .out file. After the rename, another call is made to get the length of the actual file (which is same as that of the temporary file, obviously). Between these calls, if a MergeThread flushes the file to .out file to disk and deletes it, the call to getLength() will fail.

Sameer's suggestion was to simply invoke getLength() on the temporary file before the rename and discard the value in case the rename fails. After the file is renamed, it should be assumed that the local code no longer owns it. 1152.patch has these changes incorporated.

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch, 1152.workaround.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490706 ] 

Hadoop QA commented on HADOOP-1152:
-----------------------------------

Integrated in Hadoop-Nightly #66 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/66/)

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>             Fix For: 0.13.0
>
>         Attachments: 1152.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput

Posted by "Tahir Hashmi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tahir Hashmi updated HADOOP-1152:
---------------------------------

    Status: Patch Available  (was: Open)

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch, 1152.workaround.patch
>
>
> We had couple of reduce tasks hang repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.