Posted to common-dev@hadoop.apache.org by "Richard Lee (JIRA)" <ji...@apache.org> on 2007/10/19 20:17:50 UTC

[jira] Created: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

ChecksumFileSystem checksum file size incorrect.
------------------------------------------------

                 Key: HADOOP-2080
                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
             Project: Hadoop
          Issue Type: Bug
          Components: fs
    Affects Versions: 0.14.2, 0.14.1, 0.14.0
         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
            Reporter: Richard Lee


Periodically, reduce tasks hang. When the log for the task is consulted, you see a stacktrace that looks like this:

2007-10-18 17:02:04,227 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Insufficient space
	at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryOutputStream.write(InMemoryFileSystem.java:174)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:39)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:326)
	at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
	at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:122)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:310)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
	at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:253)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:685)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:637)

The problem stems from a miscalculation of the size of the checksum file created in the InMemoryFileSystem for the data being copied from a completed map task to the reduce task.

The method used for calculating checksum file size is the following (ChecksumFileSystem:318):

((long)(Math.ceil((float)size/bytesPerSum)) + 1) * 4 + CHECKSUM_VERSION.length;

The issue here is the cast to float. A float has only 24 bits of mantissa precision, so the division returns values that are too small once size exceeds 0x1000000 (16 MB). The fix is to replace this calculation with something that doesn't cast to float.

(((size+1)/bytesPerSum) + 2) * 4 + CHECKSUM_VERSION.length
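A quick way to see the effect (scratch code, not from the Hadoop tree; the 4-byte checksum header length and bytesPerSum = 512 are assumptions for the example):

// Scratch demonstration of the precision loss; not Hadoop source.
public class ChecksumSizeDemo {
    static final int HEADER_LEN = 4;  // assumed length of CHECKSUM_VERSION

    // Expression currently at ChecksumFileSystem:318 (float-based).
    static long floatBased(long size, int bytesPerSum) {
        return ((long) (Math.ceil((float) size / bytesPerSum)) + 1) * 4 + HEADER_LEN;
    }

    // Integer-only replacement proposed above.
    static long intBased(long size, int bytesPerSum) {
        return (((size + 1) / bytesPerSum) + 2) * 4 + HEADER_LEN;
    }

    public static void main(String[] args) {
        int bytesPerSum = 512;
        long size = 0x1000001L;                               // just past 16 MB
        long chunks = (size + bytesPerSum - 1) / bytesPerSum; // exact ceiling: 32769
        long expected = (chunks + 1) * 4 + HEADER_LEN;        // 131084
        System.out.println("expected   = " + expected);
        System.out.println("floatBased = " + floatBased(size, bytesPerSum)); // 131080, 4 bytes short
        System.out.println("intBased   = " + intBased(size, bytesPerSum));   // 131084
    }
}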



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-2080:
--------------------------------

    Priority: Blocker  (was: Major)

Making this a blocker since a job would be badly affected if the tasks run into the situation described.

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, TestInternalFilesystem.java
>



[jira] Updated: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Richard Lee (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Lee updated HADOOP-2080:
--------------------------------

    Attachment: TestInternalFilesystem.java

Here is a JUnit test case that reproduces this issue. Most of the code this bug touches is not testable from the outside, so the relevant values and method contents were copied and pasted into the test case, with the corresponding line numbers noted.
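For illustration, a rough sketch of the kind of check such a test can make (this is not the attached TestInternalFilesystem.java; bytesPerSum = 512 and the 4-byte header length are assumptions):

// Sketch only; not the attached TestInternalFilesystem.java.
import junit.framework.TestCase;

public class ChecksumSizeSketchTest extends TestCase {

    // Copy of the size expression from ChecksumFileSystem:318.
    private static long checksumFileSize(long size, int bytesPerSum) {
        return ((long) (Math.ceil((float) size / bytesPerSum)) + 1) * 4 + 4;
    }

    public void testSizeJustPastSixteenMegabytes() {
        int bytesPerSum = 512;
        long size = 0x1000001L;                               // just past 2^24 bytes
        long chunks = (size + bytesPerSum - 1) / bytesPerSum; // exact ceiling division
        long expected = (chunks + 1) * 4 + 4;                 // sums + bytesPerSum int + header
        // Fails against the float-based expression, which comes out 4 bytes short here.
        assertEquals(expected, checksumFileSize(size, bytesPerSum));
    }
}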

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>         Attachments: TestInternalFilesystem.java
>



[jira] Updated: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-2080:
----------------------------------

    Attachment: hadoop-2080.patch

This patch fixes the checksum file system to use integer math instead of floating point math and includes a test case. It also updates the checksum filesystem to cache the value of bytes per checksum rather than continually refetching it from the config.
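A minimal sketch of those two ideas (illustrative names only, not the actual patch; the io.bytes.per.checksum key and 512 default are assumptions here):

import org.apache.hadoop.conf.Configuration;

// Illustrative sketch of the two changes described above; not the patch itself.
class ChecksumSizing {
    private final int bytesPerSum;

    ChecksumSizing(Configuration conf) {
        // Cached once instead of being re-fetched from the config on every call.
        this.bytesPerSum = conf.getInt("io.bytes.per.checksum", 512);
    }

    // Integer math only: one 4-byte CRC per bytesPerSum bytes of data,
    // plus the 4-byte version header and the stored bytesPerSum int.
    long checksumFileSize(long dataSize) {
        long chunks = (dataSize + bytesPerSum - 1) / bytesPerSum;
        return chunks * 4 + 4 + 4;
    }
}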

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, hadoop-2080.patch, TestInternalFilesystem.java
>



[jira] Updated: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-2080:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, hadoop-2080.patch, TestInternalFilesystem.java
>



[jira] Commented: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536857 ] 

Raghu Angadi commented on HADOOP-2080:
--------------------------------------

What if size is zero? Another way to write it would be:

((size + bytesPerChecksum - 1) / bytesPerChecksum) * 4 + CHECKSUM_VERSION.length + 4
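For a concrete check (these numbers are mine, with bytesPerSum = 512, not from the thread): at size = 0 the expression proposed in the description gives ((0+1)/512 + 2) * 4 + 4 = 12, while the ceiling form above gives ((0 + 511)/512) * 4 + 4 + 4 = 8, which is exactly the version header plus the stored bytesPerSum int for an empty file.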



> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, TestInternalFilesystem.java
>



[jira] Commented: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536842 ] 

Hairong Kuang commented on HADOOP-2080:
---------------------------------------

(((size-1)/bytesPerSum) + 1) * 4 + CHECKSUM_VERSION.length + 4 seems right to me.

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, TestInternalFilesystem.java
>



[jira] Updated: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-2080:
----------------------------------

    Status: Patch Available  (was: Open)

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.2, 0.14.1, 0.14.0
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, hadoop-2080.patch, TestInternalFilesystem.java
>



[jira] Commented: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536976 ] 

Hudson commented on HADOOP-2080:
--------------------------------

Integrated in Hadoop-Nightly #280 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/280/])

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, hadoop-2080.patch, TestInternalFilesystem.java
>



[jira] Assigned: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley reassigned HADOOP-2080:
-------------------------------------

    Assignee: Owen O'Malley

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, hadoop-2080.patch, TestInternalFilesystem.java
>



[jira] Commented: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Richard Lee (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536772 ] 

Richard Lee commented on HADOOP-2080:
-------------------------------------

I have a +2 in my calculation because there's another int that's written right after the header in the checksum file at ChecksumFileSystem:306. That int is the bytesPerSum value. I guess it would be more explicit to add a separate term of 4 instead of adding the +2.

As to whether the size should be +1 or -1 when being divided by bytesPerSum... I don't think it matters, since we add 1 to the result.

So maybe the final computation should be:

(((size-1)/bytesPerSum) + 1) * 4 + CHECKSUM_VERSION.length + 4
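A quick scratch comparison over a few boundary sizes (my own code, assuming bytesPerSum = 512 and a 4-byte header) makes the differences easy to see:

// Scratch comparison of the proposed size expressions; not Hadoop source.
public class CompareChecksumSizes {
    static final int BPS = 512;  // assumed bytesPerSum

    // Reference: ceil(size/BPS) checksums, plus the bytesPerSum int and the header.
    static long exact(long size)    { return ((size + BPS - 1) / BPS + 1) * 4 + 4; }
    // Expression from the issue description.
    static long plusOne(long size)  { return ((size + 1) / BPS + 2) * 4 + 4; }
    // Expression proposed in this comment.
    static long minusOne(long size) { return ((size - 1) / BPS + 1) * 4 + 4 + 4; }

    public static void main(String[] args) {
        for (long size : new long[] {0, 1, BPS - 1, BPS, 2L * BPS - 1, 0x1000001L}) {
            System.out.printf("size=%d exact=%d plusOne=%d minusOne=%d%n",
                    size, exact(size), plusOne(size), minusOne(size));
        }
    }
}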

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, TestInternalFilesystem.java
>



[jira] Updated: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-2080:
--------------------------------

    Fix Version/s: 0.15.0

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, TestInternalFilesystem.java
>



[jira] Commented: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536860 ] 

Raghu Angadi commented on HADOOP-2080:
--------------------------------------

> As to whether the size should be +1 or -1 when being divided by bytesPerSum... I don't think it matters, since we add 1 to the result.
+1 gives a wrong value when (size == n*bytesPerChecksum - 1).
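Concretely (an example of mine, with bytesPerSum = 512): for size = 511 only one checksum is needed, so the file should be 1*4 + 4 + 4 = 12 bytes, but (((511+1)/512) + 2) * 4 + 4 = 16, because the +1 pushes the division up into the next chunk.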


> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, hadoop-2080.patch, TestInternalFilesystem.java
>



[jira] Updated: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Richard Lee (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Lee updated HADOOP-2080:
--------------------------------

    Attachment: ChecksumFileSystem.java.patch

Here's an svn diff done from the release-xxx dir. I let my cluster run with this patch in place for the day and I haven't seen any occurrences of the out-of-space exception.

> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>         Attachments: ChecksumFileSystem.java.patch, TestInternalFilesystem.java
>



[jira] Commented: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536863 ] 

Raghu Angadi commented on HADOOP-2080:
--------------------------------------

+1 to Owen's patch.

Thanks Richard. That was a good catch with the floats. findbugs catches similar non-obvious math errors; maybe we should inform the findbugs folks.


> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, hadoop-2080.patch, TestInternalFilesystem.java
>



[jira] Commented: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536764 ] 

Hairong Kuang commented on HADOOP-2080:
---------------------------------------

Should the computation be (((size-1)/bytesPerSum) + 1) * 4 + CHECKSUM_VERSION.length?



> ChecksumFileSystem checksum file size incorrect.
> ------------------------------------------------
>
>                 Key: HADOOP-2080
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2
>         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
>            Reporter: Richard Lee
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: ChecksumFileSystem.java.patch, TestInternalFilesystem.java
>
