Posted to notifications@accumulo.apache.org by "John Vines (JIRA)" <ji...@apache.org> on 2013/01/05 00:20:13 UTC

[jira] [Created] (ACCUMULO-939) WAL gets stuck when datanode dies

John Vines created ACCUMULO-939:
-----------------------------------

             Summary: WAL gets stuck when datanode dies
                 Key: ACCUMULO-939
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-939
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
    Affects Versions: 1.5.0
            Reporter: John Vines
            Assignee: Eric Newton


Attempting to test ACCUMULO-575 with the following test framework:

Test bench:
1 node running the hadoop namenode and 1 datanode
1 slave node running 1 datanode and the accumulo stack, with an 8GB in-memory map
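
(For reference, an 8GB in-memory map is configured along these lines in accumulo-site.xml; tserver.memory.maps.max is the standard property, and the snippet below is a sketch of this setup rather than a copy of the actual bench config.)
{code}
<property>
  <name>tserver.memory.maps.max</name>
  <value>8G</value>
</property>
{code}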
Running a patched version of accumulo with the following patch to provide helper debug logging:
{code}Index: server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java
===================================================================
--- server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(revision 1429057)
+++ server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(working copy)
@@ -81,6 +81,7 @@
   private FileSystem fs;
   protected KeyExtent extent;
   private List<IteratorSetting> iterators;
+  protected boolean minor= false;
   
   Compactor(Configuration conf, FileSystem fs, Map<String,DataFileValue> files, InMemoryMap imm, String outputFile, boolean propogateDeletes,
       TableConfiguration acuTableConf, KeyExtent extent, CompactionEnv env, List<IteratorSetting> iterators) {
@@ -158,7 +159,7 @@
         log.error("Verification of successful compaction fails!!! " + extent + " " + outputFile, ex);
         throw ex;
       }
-      
+      log.info("Just completed minor? " + minor + " for table " + extent.getTableId());
       log.debug(String.format("Compaction %s %,d read | %,d written | %,6d entries/sec | %6.3f secs", extent, majCStats.getEntriesRead(),
           majCStats.getEntriesWritten(), (int) (majCStats.getEntriesRead() / ((t2 - t1) / 1000.0)), (t2 - t1) / 1000.0));
       
Index: server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java
===================================================================
--- server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(revision 1429057)
+++ server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(working copy)
@@ -88,6 +88,7 @@
     
     do {
       try {
+        this.minor = true;
         CompactionStats ret = super.call();
         
         // log.debug(String.format("MinC %,d recs in | %,d recs out | %,d recs/sec | %6.3f secs | %,d bytes ",map.size(), entriesCompacted,
{code}
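
With this patch applied, every compaction logs a line like the one below (table 2 being the id of the test table in this instance; log4j prefix omitted). The ifttt.sh script further down keys on exactly this string:
{code}Just completed minor? true for table 2{code}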

I stood up a new instance and created a table named test, then ran the following:
{code}tail -f accumulo-1.5.0-SNAPSHOT/logs/tserver_slave.debug.log | ./ifttt.sh {code}
where ifttt.sh is:
{code}#!/bin/sh

# Find the pid of the local DataNode
dnpid=`jps -m | grep DataNode | awk '{print $1}'`

# Read tserver log lines from stdin (piped in from tail -f); when the
# patched log message reports a completed minor compaction for table 2
# (the test table), kill -9 the datanode to simulate an abrupt death
while read str; do
  if [ -n "`echo $str | grep "Just completed minor? true for table 2"`" ]; then
    echo "I'm gonna kill datanode, pid $dnpid"
    kill -9 $dnpid
  fi
done
{code}

Then I ran the following:
{code}accumulo org.apache.accumulo.server.test.TestIngest --table test --rows 65536 --cols 100 --size 8192 -z 172.16.101.220:2181 --batchMemory 100000000 --batchThreads 10 {code}
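
For context, that invocation pushes 65536 rows x 100 columns of 8KB values through a batch writer. Roughly the equivalent client-side loop, as a sketch against the 1.5 client API (instance name, credentials, and key layout are placeholders; TestIngest's real key format differs):
{code}
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class IngestSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("test-instance", "172.16.101.220:2181")
        .getConnector("root", new PasswordToken("secret"));
    BatchWriterConfig cfg = new BatchWriterConfig()
        .setMaxMemory(100000000L)  // --batchMemory
        .setMaxWriteThreads(10);   // --batchThreads
    BatchWriter bw = conn.createBatchWriter("test", cfg);  // --table
    byte[] value = new byte[8192];                         // --size
    for (int row = 0; row < 65536; row++) {                // --rows
      Mutation m = new Mutation(new Text(String.format("row_%08d", row)));
      for (int col = 0; col < 100; col++) {                // --cols
        m.put(new Text("colf"), new Text(String.format("col_%05d", col)), new Value(value));
      }
      bw.addMutation(m);
    }
    bw.close();
  }
}
{code}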

Eventually the in-memory map filled, a minor compaction ran, the local datanode was killed, and things died. The logs filled with:
{code}org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /accumulo/wal/172.16.101.219+9997/08b9f1b4-26d5-4b07-a260-3334c2013576 could only be replicated to 0 nodes, instead of 1
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1556)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
{code}
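
The "replicated to 0 nodes" error indicates the namenode could not place a new WAL block on any live datanode after the kill. On a bench like this, datanode liveness can be double-checked with the stock HDFS admin report:
{code}hadoop dfsadmin -report{code}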

and

{code}
Unexpected error writing to log, retrying attempt 1
	java.io.IOException: DFSOutputStream is closed
		at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3666)
		at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
		at org.apache.accumulo.server.tabletserver.log.DfsLogger.defineTablet(DfsLogger.java:295)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger$4.write(TabletServerLogger.java:333)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:273)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330)
... repeats...
{code}
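
The cycling defineTablet/write frames suggest a retry loop that keeps hammering the same closed DFSOutputStream. A minimal sketch of that failure mode (class and method names follow the trace; the loop body is an illustration, not the actual TabletServerLogger code):
{code}
import java.io.IOException;

public class WalRetrySketch {
  // Stand-in for the WAL writer; the real chain ends in
  // DfsLogger.defineTablet() -> DFSOutputStream.sync()
  interface WalWriter {
    void defineTablet() throws IOException;
  }

  static void write(WalWriter wal) {
    int attempt = 0;
    while (true) {
      try {
        wal.defineTablet();
        return;
      } catch (IOException ex) {
        attempt++;
        System.err.println("Unexpected error writing to log, retrying attempt " + attempt);
        // The stream is already closed, so every retry against the same
        // logger fails identically and the loop never makes progress --
        // not even after the datanode comes back, matching the report.
      }
    }
  }

  public static void main(String[] args) {
    // Simulate a permanently closed stream: this deliberately never
    // returns, which is exactly the "stuck WAL" symptom
    write(() -> { throw new IOException("DFSOutputStream is closed"); });
  }
}
{code}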

Bringing the datanode back up did NOT fix it, either.
