Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2014/02/07 22:57:23 UTC
[jira] [Commented] (ACCUMULO-2339) WAL recovery fails
[ https://issues.apache.org/jira/browse/ACCUMULO-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895099#comment-13895099 ]
Josh Elser commented on ACCUMULO-2339:
--------------------------------------
Interesting stuff, [~ecn]. I inadvertently ran into a similar situation when I filled up the local partition that the DNs were writing to. The difference, though, is that after I freed up some space on disk, log recovery completed successfully and things happily recovered.
I assume you were running with the dfs.datanode.synconclose option set to true?
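For anyone wanting to confirm that setting, here is a minimal sketch of reading dfs.datanode.synconclose out of an hdfs-site.xml. The sample XML and the get_property helper are illustrative assumptions, not taken from the cluster in this issue; a real file would be read from wherever the Hadoop configuration directory lives on that node.

```python
import xml.etree.ElementTree as ET

# Hypothetical hdfs-site.xml content, inlined so the sketch is self-contained.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.datanode.synconclose</name>
    <value>true</value>
  </property>
</configuration>
"""


def get_property(xml_text, name, default=None):
    """Return the value of a Hadoop-style <property> by name, or default if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return default


# Falls back to "false" when the property is not set in the file (assumed default).
sync_on_close = get_property(SAMPLE_HDFS_SITE, "dfs.datanode.synconclose", "false")
print(sync_on_close)
```

If the property is missing from the file, the helper returns the supplied fallback, so the effective value should still be checked against the defaults of the Hadoop version in use.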
> WAL recovery fails
> ------------------
>
> Key: ACCUMULO-2339
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2339
> Project: Accumulo
> Issue Type: New Feature
> Components: tserver
> Environment: testing 1.5.1rc1 on a 10 node cluster, hadoop 2.2.0, zk 3.4.5
> Reporter: Eric Newton
> Priority: Critical
>
> I was running accumulo 1.5.1rc1 on a 10 node cluster. After two days, I saw that several tservers had died with OOME. Several hundred tablets were offline.
> The master was attempting to recover the write lease on the file, and this was failing.
> Attempts to examine the log file failed:
> {noformat}
> $ hadoop fs -cat /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
> Cannot obtain block length for LocatedBlock{BP-901421341-192.168.1.3-1389719663617:blk_1076582460_2869891; getBlockSize()=0; corrupt=false; offset=0; locs=[192.168.1.5:50010]}
> {noformat}
> Looking at the DN logs, I see this:
> {noformat}
> 2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at host2/192.168.1.3:9000 calls recoverBlock(BP-901421341-192.168.1.3-1389719663617:blk_1076582290_2869721, targets=[192.168.1.5:50010], newGenerationStamp=2880680)
> 2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1076582290_2869721, recoveryId=2880680, replica=ReplicaBeingWritten, blk_1076582290_2869721, RBW
> getNumBytes() = 634417185
> getBytesOnDisk() = 634417113
> getVisibleLength()= 634417113
> getVolume() = /srv/hdfs4/hadoop/dn/current
> getBlockFile() = /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290
> bytesAcked=634417113
> bytesOnDisk=634417113
> {noformat}
> I'm guessing that the /srv/hdfs4 partition filled up, and that the disagreement between the length the DN expects for the replica (getNumBytes() = 634417185) and the bytes actually on disk (634417113) is causing the recovery failures.
> Restarting HDFS made no difference.
> I manually copied the block up into HDFS as the WAL to make any progress.
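The size disagreement in the DN log quoted above can be checked with a short script. This is only a sketch: the parse_replica_fields helper and its regular expression are assumptions about the log layout shown in this issue, not a Hadoop API, and the excerpt is pasted in with the quote markers removed.

```python
import re

# Replica fields copied from the DN log in the issue description.
DN_LOG = """\
getNumBytes() = 634417185
getBytesOnDisk() = 634417113
getVisibleLength()= 634417113
bytesAcked=634417113
bytesOnDisk=634417113
"""

# Matches lines such as "getNumBytes() = 123" or "bytesAcked=123".
FIELD_RE = re.compile(r"^[ \t]*(\w+)(?:\(\))?[ \t]*=[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def parse_replica_fields(text):
    """Collect 'name = integer' pairs from replica recovery log lines."""
    return {m.group(1): int(m.group(2)) for m in FIELD_RE.finditer(text)}


fields = parse_replica_fields(DN_LOG)
# Bytes the DN expects in the replica minus bytes it reports on disk.
gap = fields["getNumBytes"] - fields["getBytesOnDisk"]
print(gap)  # 72
```

For this block the DN expects 72 more bytes than it reports on disk, which fits the guess that the final write did not fully land once the partition filled up.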
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)