Posted to common-dev@hadoop.apache.org by "Brian Bockelman (JIRA)" <ji...@apache.org> on 2008/09/29 06:22:44 UTC

[jira] Created: (HADOOP-4298) File corruption when reading with fuse-dfs

File corruption when reading with fuse-dfs
------------------------------------------

                 Key: HADOOP-4298
                 URL: https://issues.apache.org/jira/browse/HADOOP-4298
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/fuse-dfs
    Affects Versions: 0.18.1
         Environment: CentOS 4.6 final; kernel 2.6.9-67.ELsmp; FUSE 2.7.4; Hadoop 0.18.1; 64-bit

I hand-altered the fuse-dfs makefile to use 64-bit instead of the hardcoded -m32.
            Reporter: Brian Bockelman
            Priority: Critical
             Fix For: 0.18.1



I pulled a 5GB data file into Hadoop using the following command:
hadoop fs -put /scratch/886B9B3D-6A85-DD11-A9AB-000423D6CA6E.root /user/brian/testfile
I have HDFS mounted in /mnt/hadoop using fuse-dfs.

However, when I try to md5sum the file in place (md5sum /mnt/hadoop) or copy the file back to local disk using "cp" then md5sum it, the checksum is incorrect.

When I pull the file using normal hadoop means (hadoop fs -get /user/brian/testfile /scratch), the md5sum is correct.

When I repeat the test with a smaller file (512MB, on the theory that there is a problem with some 2GB limit somewhere), the problem remains.
When I repeat the test, the md5sum is consistently wrong - i.e., some part of the corruption is deterministic, and not the apparent fault of a bad disk.

CentOS 4.6 is, unfortunately, not the culprit: when checking on CentOS 5.x, I could recreate the corruption issue. The second node was also a 64-bit build, running CentOS 5.2 (`uname -r` returns 2.6.18-92.1.10.el5).

Thanks for looking into this,
Brian

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638925#action_12638925 ] 

Pete Wyckoff commented on HADOOP-4298:
--------------------------------------

Is this reading from the same file handle?

I have looked some more at the DFSClient, and it looks like everything is completely thread-safe (but that was in trunk around the time of the 0.19 release).

The libhdfs library should also be thread-safe by design, I believe, but I am not completely positive that all calls actually are.

pete





[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635845#action_12635845 ] 

Pete Wyckoff commented on HADOOP-4298:
--------------------------------------

It looks like FUSE itself doesn't like this. Let me look into this further.




[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635392#action_12635392 ] 

Brian Bockelman commented on HADOOP-4298:
-----------------------------------------

I wrote a small program this morning to do the file copies in C via libhdfs, thus bypassing fuse-dfs.

This resulted in no file corruption, which seems to indicate that the bug lies in fuse-dfs, not libhdfs.
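
For reference, a sketch of the kind of libhdfs copy test described above; this is a reconstruction, not Brian's actual program, and the connection settings, paths, and buffer size are illustrative placeholders:

    #include <stdio.h>
    #include <fcntl.h>
    #include "hdfs.h"

    /* Copy an HDFS file to local disk through libhdfs, bypassing fuse-dfs,
     * so the local copy can be md5sum'ed against the original. */
    int main(void)
    {
        hdfsFS fs = hdfsConnect("default", 0);   /* picks up fs.default.name */
        if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

        hdfsFile in = hdfsOpenFile(fs, "/user/brian/testfile", O_RDONLY, 0, 0, 0);
        FILE *out = fopen("/scratch/testfile.libhdfs-copy", "wb");
        if (!in || !out) { fprintf(stderr, "open failed\n"); return 1; }

        static char buf[1 << 20];
        tSize n;
        while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, out);

        fclose(out);
        hdfsCloseFile(fs, in);
        hdfsDisconnect(fs);
        return n < 0 ? 1 : 0;
    }

Checksumming the resulting copy and comparing it with both the fuse-dfs copy and the hadoop fs -get copy isolates which layer introduces the corruption.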



[jira] Resolved: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff resolved HADOOP-4298.
----------------------------------

    Resolution: Duplicate

This fix is included in HADOOP-4397, which fixes all read corruptions - both from the buffer itself and from concurrency on the buffer.




[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638928#action_12638928 ] 

Pete Wyckoff commented on HADOOP-4298:
--------------------------------------

bq. Can you confirm that this patch is only for 0.18 branch?
Yes - the one for 0.19 has already been committed, and I think I opened one for 0.20.

This is ready to commit.

thanks, pete




[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Craig Macdonald (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637682#action_12637682 ] 

Craig Macdonald commented on HADOOP-4298:
-----------------------------------------

Pete,

Can you confirm that this patch is only for 0.18 branch?
HADOOP-4319 had the equivalent patch for trunk/0.20, as far as I can see.

C



[jira] Updated: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4298:
---------------------------------

    Attachment: HADOOP-4298.1.txt

This fixes the problem essentially as Brian did, and adds a test case.




[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636078#action_12636078 ] 

Hadoop QA commented on HADOOP-4298:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12391246/HADOOP-4298.1.txt
  against trunk revision 700628.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3413/console

This message is automatically generated.



[jira] Updated: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4298:
---------------------------------

    Priority: Blocker  (was: Critical)



[jira] Updated: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-4298:
----------------------------------

    Fix Version/s:     (was: 0.18.1)
                   0.18.2

0.18.1 has already been released. Pushing to 0.18.2.



[jira] Updated: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4298:
---------------------------------

    Status: Patch Available  (was: Open)



[jira] Reopened: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff reopened HADOOP-4298:
----------------------------------

      Assignee: Pete Wyckoff



[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635675#action_12635675 ] 

Pete Wyckoff commented on HADOOP-4298:
--------------------------------------

Hi Brian,

Thanks for looking into this. I will look at the code and try the change and test case you showed. In 0.18, you should be able to comment out #define OPTIMIZED_READS. Have you tried that, and did it help?

I didn't write this specific code but am fairly familiar with it and will look more carefully tomorrow.

One question - is this a single-threaded application that is doing the reads? If not, this may be a problem with fuse-dfs not correctly interacting with the FileSystem filehandle cache.

pete
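
For context, a minimal sketch of what that toggle might look like; the layout of fuse_dfs.c around OPTIMIZED_READS is assumed here, not quoted from the 0.18 source:

    /* Assumed layout only: the suggestion above is that fuse_dfs.c in 0.18
     * guards its buffered-read path with this define, so commenting it out
     * falls back to one unbuffered read per dfs_read request. */
    /* #define OPTIMIZED_READS */

    #ifdef OPTIMIZED_READS
        /* serve reads from a cached per-handle buffer, refilling as needed */
    #else
        /* read straight from HDFS into the caller's buffer on every request */
    #endif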






[jira] Updated: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4298:
---------------------------------

    Status: Open  (was: Patch Available)

HADOOP-4397 supersedes this one, as it contains the read fixes along with the other concurrency fixes.




[jira] Updated: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4298:
---------------------------------

    Attachment: 4298-patch-test.txt

The patch command succeeds and the tests pass for this component; there is one unrelated failure:
 hadoop-0.18.1 > grep FAIL ../4298-patch-test.txt 
    [junit] Test org.apache.hadoop.mapred.jobcontrol.TestLocalJobControl FAILED
BUILD FAILED






[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635833#action_12635833 ] 

Pete Wyckoff commented on HADOOP-4298:
--------------------------------------

bq. if (fh->sizeBuffer == 0 || offset < fh->startOffset || offset >= (fh->startOffset + fh->sizeBuffer))

I think this is correct. It's just very conservative: if there's any data in the current buffer that can satisfy even one byte of the current request, then use that data before refilling the buffer. The next call to read would then trigger offset > fh->startOffset.

This of course means that fuse-dfs often returns fewer than the requested number of bytes.  Is this a problem for your application?

pete










[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635853#action_12635853 ] 

Brian Bockelman commented on HADOOP-4298:
-----------------------------------------

Hey Pete - 

Regarding the comment on "29/Sep/08 11:10 PM": Yes, I misspoke.  I had adjusted -m64 in the libhdfs makefile.

Regarding the later comment, "30/Sep/08 01:09 PM":  As you mention, FUSE appears to be unhappy with this behavior - I always get the asked-for number of bytes returned.

Besides, the POSIX definition of read() (http://www.opengroup.org/onlinepubs/000095399/functions/read.html) states, regarding the return value:

"""
This number shall never be greater than nbyte. The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading. For example, a read() from a file associated with a terminal may return one typed line of data.
"""

To me, that means that if there are more bytes in the file, you should return exactly nbyte bytes; I guess I can see that if you have fewer than nbyte bytes in the buffer, you could claim it meets the condition "fewer than nbyte bytes immediately available for reading".

I guess you still end up being stuck with the FUSE problem.
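
As an aside, the usual way for an application to stay correct under that reading of the spec is to loop until it has the bytes it needs; this is a generic sketch, not code from this issue, just an illustration of handling short reads:

    #include <errno.h>
    #include <unistd.h>

    /* Keep calling read() until nbyte bytes arrive, end of file is hit, or a
     * real error occurs; POSIX allows each individual call to return fewer bytes. */
    static ssize_t read_fully(int fd, void *buf, size_t nbyte)
    {
        size_t done = 0;
        while (done < nbyte) {
            ssize_t n = read(fd, (char *)buf + done, nbyte - done);
            if (n == 0)
                break;                      /* end of file */
            if (n < 0) {
                if (errno == EINTR)
                    continue;               /* interrupted, retry */
                return -1;                  /* real error */
            }
            done += (size_t)n;
        }
        return (ssize_t)done;
    }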

Regarding exceptionally large reads (> 10MB): it might be a problem, but I can't trigger it locally.

Finally, I think that when my application last failed, I hadn't remounted the FS with the new code. I tried it again this morning and was pleased to see things work all the way through, with no segfaults.

Thanks for the help.  This goes far in "selling" the idea of a new distributed file system to our sysadmins.

Brian





[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635679#action_12635679 ] 

Pete Wyckoff commented on HADOOP-4298:
--------------------------------------

bq. I hand-altered the fuse-dfs makefile to use 64-bit instead of the hardcoded -m32.

You mean src/c++/libhdfs/Makefile, right?



[jira] Resolved: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff resolved HADOOP-4298.
----------------------------------

    Resolution: Duplicate

Reopened by accident.



[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Craig Macdonald (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637673#action_12637673 ] 

Craig Macdonald commented on HADOOP-4298:
-----------------------------------------

The original code was mine.
+1 on the patch.

C



[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635637#action_12635637 ] 

Brian Bockelman commented on HADOOP-4298:
-----------------------------------------

Hey Owen, all,

In fuse_dfs.c, I replaced this line in function dfs_read:

  if (fh->sizeBuffer == 0  || offset < fh->startOffset || offset > (fh->startOffset + fh->sizeBuffer)  )

with the following:

  if (fh->sizeBuffer == 0  || offset < fh->startOffset || offset >= (fh->startOffset + fh->sizeBuffer) || (offset+size) >= (fh->startOffset + fh->sizeBuffer) )

This covers the bug I mentioned below.  I can now md5sum files successfully.  However, my application still complains of data corruption on reads (although it does make it further through the file!); the application, unlike md5sum, has a very random read pattern.

One possibility is that it is issuing a huge read that breaks through the buffer; another is an as-yet-undiscovered bug.  When I figure things out, I'll turn the above fix into a proper patch (although you are welcome to do it for me if you have time).

However, I would prefer it if the expert or original author took a peek at this code.
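
For concreteness, here is a minimal, self-contained sketch of the containment test being discussed.  The parameter names mirror the startOffset/sizeBuffer fields quoted above, but the helper itself is an illustration, not the actual fuse-dfs source:

  #include <stddef.h>     /* size_t */
  #include <sys/types.h>  /* off_t */

  /* Returns nonzero only if the whole request [offset, offset + size)
     lies inside the cached range [startOffset, startOffset + sizeBuffer).
     Only then is it safe to serve the read out of the buffer. */
  static int request_fits_in_cache(off_t startOffset, size_t sizeBuffer,
                                   off_t offset, size_t size)
  {
    return sizeBuffer != 0
        && offset >= startOffset
        && offset + (off_t)size <= startOffset + (off_t)sizeBuffer;
  }

The condition that was replaced only compared offset itself against the end of the buffer, so a request whose start was inside the cache but whose end ran past it was still treated as a hit, and the tail of the copy came from memory past the cached data.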


[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638292#action_12638292 ] 

Brian Bockelman commented on HADOOP-4298:
-----------------------------------------

Bad news!

I guess that, somewhere in our "single-threaded physics application", a physicist stuck in a separate thread that does file-system access.

This absolutely does cause problems with fuse-dfs: it appears that all the problems go away when I surround calls to dfs_read with a pthread mutex (still stress-testing things, however...).
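
For illustration, here is a minimal sketch of the kind of serialization described above: a single process-wide lock, with the existing read logic elided.  The lock name and its placement inside dfs_read are assumptions, not the actual fuse-dfs change:

  #include <stddef.h>
  #include <sys/types.h>
  #include <pthread.h>

  struct fuse_file_info;  /* comes from <fuse.h> in the real code */

  /* One global lock; a per-file-handle lock would cut contention,
     but this is the simplest way to rule out concurrent readers. */
  static pthread_mutex_t read_mutex = PTHREAD_MUTEX_INITIALIZER;

  static int dfs_read(const char *path, char *buf, size_t size,
                      off_t offset, struct fuse_file_info *fi)
  {
    int ret;
    pthread_mutex_lock(&read_mutex);
    /* ... existing buffer check / hdfsPread / memcpy logic runs here,
       now executed by at most one thread at a time ... */
    ret = 0;  /* placeholder for the number of bytes actually returned */
    pthread_mutex_unlock(&read_mutex);
    return ret;
  }

Taking the lock inside dfs_read, as sketched, serializes the same critical section as wrapping the calls to it.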

Should we add this to the task at hand, or open a new bug report?

Brian


[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636172#action_12636172 ] 

Pete Wyckoff commented on HADOOP-4298:
--------------------------------------

     [exec] +1 overall.  

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.




[jira] Commented: (HADOOP-4298) File corruption when reading with fuse-dfs

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635566#action_12635566 ] 

Brian Bockelman commented on HADOOP-4298:
-----------------------------------------

More debugging info.  This is definitely bad logic in the optimization block of fuse-dfs, in function dfs_read.  This most definitely would not work correctly on any architecture.

I turned on some of the debugging statements, and got the following:

Sep 29 17:27:03 node182 fuse_dfs: Cache bounds for /user/brian/test_helloworld4: 0 -> 10485760 (10485760 bytes). Check for offset 10199040 
Sep 29 17:27:03 node182 fuse_dfs: FUSE requested 131072 bytes of /user/brian/test_helloworld4 for offset 10199040 in file 
Sep 29 17:27:03 node182 fuse_dfs: Cache bounds for /user/brian/test_helloworld4: 0 -> 10485760 (10485760 bytes). Check for offset 10330112 
Sep 29 17:27:03 node182 fuse_dfs: FUSE requested 131072 bytes of /user/brian/test_helloworld4 for offset 10330112 in file 
Sep 29 17:27:03 node182 fuse_dfs: Cache bounds for /user/brian/test_helloworld4: 0 -> 10485760 (10485760 bytes). Check for offset 10461184 
Sep 29 17:27:03 node182 fuse_dfs: FUSE requested 131072 bytes of /user/brian/test_helloworld4 for offset 10461184 in file 

Pause here.  The cache covers bytes [0, 10485760); however, we are trying to read from offset 10461184 (inside the cache) through 10461184 + 131072 = 10592256 (past the end of the cache!).

Note that fuse-dfs serves this out of cache, and *does not* trigger a read!

Sep 29 17:27:03 node182 fuse_dfs: Cache bounds for /user/brian/test_helloworld4: 0 -> 10485760 (10485760 bytes). Check for offset 10592256 
Sep 29 17:27:03 node182 fuse_dfs: Reading /user/brian/test_helloworld4 from HDFS, offset 10592256, amount 10485760 
Sep 29 17:27:03 node182 fuse_dfs: FUSE requested 131072 bytes of /user/brian/test_helloworld4 for offset 10592256 in file 

Here the cache is only refilled starting at offset 10592256, so the bytes in the range (10485760, 10592256) were never actually read from HDFS; in the previous request they were served as random junk from memory past the end of the cached data.

Sep 29 17:27:03 node182 fuse_dfs: Cache bounds for /user/brian/test_helloworld4: 10592256 -> 15600000 (5007744 bytes). Check for offset 10723328 
Sep 29 17:27:03 node182 fuse_dfs: FUSE requested 131072 bytes of /user/brian/test_helloworld4 for offset 10723328 in file 
Sep 29 17:27:03 node182 fuse_dfs: Cache bounds for /user/brian/test_helloworld4: 10592256 -> 15600000 (5007744 bytes). Check for offset 10854400 
Sep 29 17:27:03 node182 fuse_dfs: FUSE requested 131072 bytes of /user/brian/test_helloworld4 for offset 10854400 in file 
Sep 29 17:27:03 node182 fuse_dfs: Cache bounds for /user/brian/test_helloworld4: 10592256 -> 15600000 (5007744 bytes). Check for offset 10985472 
Sep 29 17:27:03 node182 fuse_dfs: FUSE requested 131072 bytes of /user/brian/test_helloworld4 for offset 10985472 in file 

I'm also pretty sure that reads larger than 10MB (the size of the cache) are going to bust right through the cache and again return random junk.

Someone seriously needs to review the logic and corner cases in this code.  I'll try to fix it myself and do a patch later, but no promises.
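
To make the overrun concrete, here are the numbers from the log above plugged into the offset-only test quoted elsewhere in this issue (a worked check, not program output):

  startOffset = 0          sizeBuffer = 10485760
  offset      = 10461184   size       = 131072

  offset > startOffset + sizeBuffer   ->   10461184 > 10485760   ->   false, so the request is served from the cache
  offset + size = 10592256 > 10485760, so the final 106496 bytes of the reply come from past the cached data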

