Posted to dev@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2012/08/05 23:40:02 UTC
[jira] [Created] (ACCUMULO-716) Corrupt WAL file
Josh Elser created ACCUMULO-716:
-----------------------------------
Summary: Corrupt WAL file
Key: ACCUMULO-716
URL: https://issues.apache.org/jira/browse/ACCUMULO-716
Project: Accumulo
Issue Type: Bug
Components: tserver
Affects Versions: 1.5.0
Environment: java version "1.6.0_33", hadoop-0.20.2-cdh3u3
Reporter: Josh Elser
Assignee: Eric Newton
Ran wikisearch-ingest, which ended up filling a drive used by HDFS, and things failed not-so-gracefully. Upon restart, log recovery started and appeared to finish (one WAL entry failed an HDFS checksum), but left Accumulo in a state where no tablets were assigned.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ACCUMULO-716) Corrupt WAL file
Posted by "Josh Elser (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/ACCUMULO-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428916#comment-13428916 ]
Josh Elser commented on ACCUMULO-716:
-------------------------------------
Looking at my tserver's debug log, I can see that processing one of the WAL entries failed with the exception below:
{noformat}
05 17:09:05,604 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,706 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,707 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,728 [fs.FSInputChecker] INFO : Found checksum error: b[0, 512]=0000012c99d6b958000000000e080010011a08313936363039323400000001000000000764657461636f6c0000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000006676e69766f6d0000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000004657375660000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a08313936363039323400000001000000000565676e61720000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a08313936363039323400000001000000000665726f6665620000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000004656e6f740000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630
...
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.read(DFSClient.java:1385)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2121)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2173)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.accumulo.core.data.Mutation.readFields(Mutation.java:443)
at org.apache.accumulo.server.logger.LogFileValue.readFields(LogFileValue.java:42)
at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.sort(LogSorter.java:122)
at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.process(LogSorter.java:87)
at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:101)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47)
at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
at java.lang.Thread.run(Thread.java:662)
05 17:09:05,729 [hdfs.DFSClient] WARN : Found Checksum error for blk_-7588599586768272616_79756 from 127.0.0.1:50010 at 1170472960
05 17:09:05,731 [hdfs.DFSClient] INFO : Could not obtain block blk_-7588599586768272616_79756 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
... [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,810 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,913 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
...
...
05 17:09:08,805 [log.LogSorter] ERROR: org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-7588599586768272616:of:/accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66 at 1170472960
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-7588599586768272616:of:/accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66 at 1170472960
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.read(DFSClient.java:1385)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2121)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2173)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.accumulo.core.data.Mutation.readFields(Mutation.java:443)
at org.apache.accumulo.server.logger.LogFileValue.readFields(LogFileValue.java:42)
at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.sort(LogSorter.java:122)
at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.process(LogSorter.java:87)
at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:101)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47)
at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
at java.lang.Thread.run(Thread.java:662)
{noformat}
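The ChecksumException above comes from HDFS verifying its per-chunk checksums on read; the b[0, 512] range in the log is one 512-byte checksum chunk. As a rough illustration of the mechanism only (this is not HDFS's actual code; the class and method names below are made up), per-chunk verification looks something like this:

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Minimal sketch of per-chunk checksum verification, loosely modeled on what
// HDFS's FSInputChecker does: data is checksummed in fixed-size chunks
// (512 bytes here, matching the b[0, 512] range in the log), and a mismatch
// between the stored and recomputed checksum fails the read.
public class ChunkChecksum {
    static final int CHUNK_SIZE = 512;

    // Compute one CRC32 value per 512-byte chunk of the data.
    static long[] checksum(byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32 crc = new CRC32();
            int from = i * CHUNK_SIZE;
            int to = Math.min(from + CHUNK_SIZE, data.length);
            crc.update(data, from, to - from);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Verify data against previously stored chunk checksums; returns the
    // index of the first corrupt chunk, or -1 if everything matches.
    static int firstBadChunk(byte[] data, long[] stored) {
        long[] now = checksum(data);
        for (int i = 0; i < stored.length; i++) {
            if (now[i] != stored[i]) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        Arrays.fill(data, (byte) 'x');
        long[] stored = checksum(data);
        data[600] ^= 1; // flip one bit in the second chunk
        System.out.println(firstBadChunk(data, stored)); // prints 1
    }
}
```

In real HDFS the stored checksums live in the block's companion .meta file (visible in the `find` output later in this thread), which is why a block whose data and metadata get out of sync fails every read of the affected chunk.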
The offending entry in HDFS is:
{noformat}
$ hadoop fs -ls /accumulo-1.5/wal/127.0.0.1+9997 (svn)-[trunk:1369523]
12/08/05 17:42:28 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Found 6 items
...
-rw-r--r-- 2 elserj supergroup 0 2012-08-05 16:42 /accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66
...
{noformat}
The fun part is that if I `hadoop fs -text` that file, there is definitely content in it. To make this slightly more confusing, I had moved some block files from one dfs.data.dir to another to reclaim space. In the spirit of full disclosure, I did something along the lines of the following (which, to my knowledge, shouldn't have broken anything):
{noformat}
$ mv /dfs1/blocks/subdir44/* /dfs2/blocks/subdir44/
{noformat}
I also see that the block whose checksum failed in the stack trace above is still present on the drive that filled up:
{noformat}
$ find /dfs* -name "blk_-7588599586768272616*"
/dfs1/blocksBeingWritten/blk_-7588599586768272616_79756.meta
/dfs1/blocksBeingWritten/blk_-7588599586768272616
{noformat}
Not sure if this is a bug inside Accumulo itself or in HDFS. Given past experience with ensuring files in HDFS are fully written (closing the file, re-opening it, and seeking to the end), I thought there may be something Accumulo could do to prevent this from happening.
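The check described above (close the file, re-open it, and seek to the end) can be sketched as follows. This is an illustrative local-filesystem version using plain java.io; an Accumulo/HDFS version would go through FileSystem and FileStatus instead, and the class and method names here are hypothetical, not anything in the Accumulo codebase:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of a "close, re-open, seek to the end" sanity check for a
// just-written file. Hypothetical names; local files stand in for HDFS.
public class VerifyFullyWritten {

    // Returns true if the file can be re-opened and its reported length
    // is actually readable: seeking to the end should land exactly at EOF.
    static boolean readableToEnd(File f) {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(raf.length()); // seek to the reported end of the file
            return raf.read() == -1; // EOF exactly where the length says
        } catch (IOException e) {
            return false; // truncated or unreadable file
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("wal-check", ".tmp");
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("some wal entries".getBytes());
        } // close first, then re-open to verify, as suggested above
        System.out.println(readableToEnd(f)); // prints true
        f.delete();
    }
}
```

On HDFS the interesting failure mode is different from local disk: a file left in blocksBeingWritten (as in the `find` output above) can report a length that the datanode cannot actually serve, which is exactly what a re-open-and-seek check would surface at recovery time rather than mid-sort.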
[jira] [Commented] (ACCUMULO-716) Corrupt WAL file
Posted by "Keith Turner (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/ACCUMULO-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429162#comment-13429162 ]
Keith Turner commented on ACCUMULO-716:
---------------------------------------
Josh,
Did you have dfs.support.append set? If you have a newer version, the changes for ACCUMULO-623 should force you to set it.
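For reference, on CDH3-era Hadoop this flag is set in hdfs-site.xml; a minimal fragment (assuming the stock dfs.support.append property name used by hadoop-0.20.2-cdh3) looks like:

```xml
<!-- hdfs-site.xml: enable append/sync support, required for Accumulo WALs -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```

The setting needs to be in place on the NameNode and DataNodes (followed by a restart) before it takes effect.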
[jira] [Updated] (ACCUMULO-716) Corrupt WAL file
Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/ACCUMULO-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christopher Tubbs updated ACCUMULO-716:
---------------------------------------
Affects Version/s: (was: 1.5.0)
[jira] [Commented] (ACCUMULO-716) Corrupt WAL file
Posted by "Josh Elser (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/ACCUMULO-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429171#comment-13429171 ]
Josh Elser commented on ACCUMULO-716:
-------------------------------------
Oops, no I did not, Keith. I thought I had the changes from that ticket in my locally deployed instance, but since I wasn't forced to set it, I'll have to double-check that my instance is, in fact, up to date.
[jira] [Commented] (ACCUMULO-716) Corrupt WAL file
Posted by "Josh Elser (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/ACCUMULO-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450256#comment-13450256 ]
Josh Elser commented on ACCUMULO-716:
-------------------------------------
Geez, I'm terrible at responding to comments: Hadoop had dfs.support.append turned on by default, so Keith's configuration check for append support did work as intended.
I wonder if it's possible to reproduce this on a VM with a small partition. Does anyone have cycles to quickly bootstrap that and see if they can reproduce the issue?
[jira] [Commented] (ACCUMULO-716) Corrupt WAL file
Posted by "Keith Turner (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/ACCUMULO-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429180#comment-13429180 ]
Keith Turner commented on ACCUMULO-716:
---------------------------------------
Let me know what happens with that setting. If it's not forcing you to set it, there could be a bug in the changes I made for ACCUMULO-623.