Posted to notifications@accumulo.apache.org by "Eric Newton (JIRA)" <ji...@apache.org> on 2014/02/07 22:14:21 UTC
[jira] [Created] (ACCUMULO-2339) WAL recovery fails
Eric Newton created ACCUMULO-2339:
-------------------------------------
Summary: WAL recovery fails
Key: ACCUMULO-2339
URL: https://issues.apache.org/jira/browse/ACCUMULO-2339
Project: Accumulo
Issue Type: New Feature
Components: tserver
Environment: testing 1.5.1rc1 on a 10 node cluster, hadoop 2.2.0, zk 3.4.5
Reporter: Eric Newton
Priority: Critical
I was running Accumulo 1.5.1rc1 on a 10 node cluster. After two days, I saw that several tservers had died with an OOME, and several hundred tablets were offline.
The master was attempting to recover the write lease on the dead tserver's WAL file, and this was repeatedly failing.
Attempts to examine the log file failed:
{noformat}
$ hadoop fs -cat /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
Cannot obtain block length for LocatedBlock{BP-901421341-192.168.1.3-1389719663617:blk_1076582460_2869891; getBlockSize()=0; corrupt=false; offset=0; locs=[192.168.1.5:50010]}
{noformat}
Looking at the DN logs, I see this:
{noformat}
2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at host2/192.168.1.3:9000 calls recoverBlock(BP-901421341-192.168.1.3-1389719663617:blk_1076582290_2869721, targets=[192.168.1.5:50010], newGenerationStamp=2880680)
2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1076582290_2869721, recoveryId=2880680, replica=ReplicaBeingWritten, blk_1076582290_2869721, RBW
getNumBytes() = 634417185
getBytesOnDisk() = 634417113
getVisibleLength()= 634417113
getVolume() = /srv/hdfs4/hadoop/dn/current
getBlockFile() = /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290
bytesAcked=634417113
bytesOnDisk=634417113
{noformat}
I'm guessing that the /srv/hdfs4 partition filled up, and that the disagreement between the replica's expected length (getNumBytes() = 634417185) and the length actually acked and flushed to disk (634417113) is what causes recovery to fail.
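If that guess is right, it should be checkable directly on the DN host. A small sketch below; the cluster paths are copied from the log excerpt above and the cluster-only commands are commented out since they only work on that host:

```shell
# On the affected datanode host (cluster-only checks, paths from the DN log):
#   df -h /srv/hdfs4    # was the partition full when the replica was written?
#   ls -l /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290

# The replica metadata itself shows the mismatch: the DN believes the block
# should be getNumBytes() long, but only bytesOnDisk were ever flushed/acked.
echo "missing bytes: $((634417185 - 634417113))"   # -> missing bytes: 72
```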
Restarting HDFS made no difference.
To make any progress, I manually copied the block file back up into HDFS in place of the WAL.
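That manual workaround can be sketched roughly as below. Treat the block/WAL pairing as an assumption: the paths are copied from the logs above, and the DN log shows recovery of blk_1076582290 while the unreadable WAL reported blk_1076582460, so the right block id has to be matched via fsck or the DN logs first. This is also destructive: any unflushed tail of the log is lost.

```shell
# Sketch of the manual fix, assuming this rbw block file really backs the
# unreadable WAL (verify the block id before doing anything like this).
BLK=/srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290
WAL=/accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14

# Cluster-only steps (commented out): discard the broken HDFS entry, then
# re-upload the raw on-disk replica as a normal, finalized file.
#   hadoop fs -rm "$WAL"
#   hadoop fs -put "$BLK" "$WAL"
echo "would replace $WAL with local block file ${BLK##*/}"
```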
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)