Posted to notifications@accumulo.apache.org by "John Vines (JIRA)" <ji...@apache.org> on 2013/03/11 19:37:13 UTC

[jira] [Commented] (ACCUMULO-939) WAL get stuck for unknown reasons

    [ https://issues.apache.org/jira/browse/ACCUMULO-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599105#comment-13599105 ] 

John Vines commented on ACCUMULO-939:
-------------------------------------

Ran with the latest code; still reproducible.

{code}
ERROR
	

Unexpected error writing to log, retrying attempt 767
	java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /accumulo/wal/172.16.101.71+9997/4f85c383-8386-4220-8eed-364bf9b48495 could only be replicated to 0 nodes, instead of 1
		at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
		at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
		at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:616)
		at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
		at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
		at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
		at java.security.AccessController.doPrivileged(Native Method)
		at javax.security.auth.Subject.doAs(Subject.java:416)
		at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
		at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
	
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.createLoggers(TabletServerLogger.java:189)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.access$300(TabletServerLogger.java:51)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger$1.withWriteLock(TabletServerLogger.java:147)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.testLockAndRun(TabletServerLogger.java:114)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.initializeLoggers(TabletServerLogger.java:136)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:244)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.log(TabletServerLogger.java:342)
		at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.update(TabletServer.java:1676)
		at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:616)
		at org.apache.accumulo.trace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63)
		at $Proxy2.update(Unknown Source)
		at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(TabletClientService.java:2125)
		at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(TabletClientService.java:2109)
		at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
		at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
		at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:156)
		at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:478)
		at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:208)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
		at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
		at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
		at java.lang.Thread.run(Thread.java:679)
	Caused by: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /accumulo/wal/172.16.101.71+9997/4f85c383-8386-4220-8eed-364bf9b48495 could only be replicated to 0 nodes, instead of 1
		at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
		at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
		at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:616)
		at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
		at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
		at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
		at java.security.AccessController.doPrivileged(Native Method)
		at javax.security.auth.Subject.doAs(Subject.java:416)
		at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
		at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
	
		at org.apache.hadoop.ipc.Client.call(Client.java:1070)
		at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
		at $Proxy1.addBlock(Unknown Source)
		at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:616)
		at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
		at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
		at $Proxy1.addBlock(Unknown Source)
		at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3510)
		at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3373)
		at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2589)
		at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2829)
{code}

This repeated until the tserver died from OOM (tserver Xmx 512m, 50m data cache, 100m index cache, using native maps).

Additionally, this was blocking all ingest, so if we can't fix the underlying failure, it would be nice to incorporate this condition into hold time somehow.
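A rough illustration of what "incorporating this into hold time" might look like: cap the retry count with exponential backoff, and once the cap is hit, surface the failure (and hold ingest) instead of looping forever while queued mutations pin the heap. This is only a sketch; the class name, methods, and constants below are hypothetical and not the actual TabletServerLogger code.

{code}
// Hypothetical sketch, not the real TabletServerLogger retry loop.
// Caps WAL-creation retries with exponential backoff so a persistent HDFS
// failure (e.g. "could only be replicated to 0 nodes") fails fast instead
// of reaching "retrying attempt 767" while buffered mutations exhaust a
// 512m heap.
public class BoundedRetry {
  static final int MAX_ATTEMPTS = 10;
  static final long BASE_DELAY_MS = 100;
  static final long MAX_DELAY_MS = 30_000;

  /** Backoff delay before the given 1-based retry attempt, capped. */
  static long delayFor(int attempt) {
    long d = BASE_DELAY_MS << Math.min(attempt - 1, 20); // bound the shift to avoid overflow
    return Math.min(d, MAX_DELAY_MS);
  }

  public static void main(String[] args) {
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      System.out.println("attempt " + attempt + ": wait " + delayFor(attempt) + " ms");
    }
    // After MAX_ATTEMPTS, the writer would put the tserver into hold
    // (stop accepting mutations) rather than retry indefinitely.
  }
}
{code}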
                
> WAL get stuck for unknown reasons
> ---------------------------------
>
>                 Key: ACCUMULO-939
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-939
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.5.0
>         Environment: hadoop-1.0.1
>            Reporter: John Vines
>            Assignee: Eric Newton
>             Fix For: 1.5.0
>
>         Attachments: tserver_jstack
>
>
> Attempting to test ACCUMULO-575 with the following test framework:
> Test bench-
> 1 node running hadoop namenode and 1 datanode
> slave node running 1 datanode and the accumulo stack, with an 8GB in-memory map
> Running a patched version of accumulo with the following patch to provide helper debug output
> {code}Index: server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java
> ===================================================================
> --- server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(revision 1429057)
> +++ server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(working copy)
> @@ -81,6 +81,7 @@
>    private FileSystem fs;
>    protected KeyExtent extent;
>    private List<IteratorSetting> iterators;
> +  protected boolean minor= false;
>    
>    Compactor(Configuration conf, FileSystem fs, Map<String,DataFileValue> files, InMemoryMap imm, String outputFile, boolean propogateDeletes,
>        TableConfiguration acuTableConf, KeyExtent extent, CompactionEnv env, List<IteratorSetting> iterators) {
> @@ -158,7 +159,7 @@
>          log.error("Verification of successful compaction fails!!! " + extent + " " + outputFile, ex);
>          throw ex;
>        }
> -      
> +      log.info("Just completed minor? " + minor + " for table " + extent.getTableId());
>        log.debug(String.format("Compaction %s %,d read | %,d written | %,6d entries/sec | %6.3f secs", extent, majCStats.getEntriesRead(),
>            majCStats.getEntriesWritten(), (int) (majCStats.getEntriesRead() / ((t2 - t1) / 1000.0)), (t2 - t1) / 1000.0));
>        
> Index: server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java
> ===================================================================
> --- server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(revision 1429057)
> +++ server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(working copy)
> @@ -88,6 +88,7 @@
>      
>      do {
>        try {
> +        this.minor = true;
>          CompactionStats ret = super.call();
>          
>          // log.debug(String.format("MinC %,d recs in | %,d recs out | %,d recs/sec | %6.3f secs | %,d bytes ",map.size(), entriesCompacted,
> {code}
> I stood up a new instance and created a table named test. Ran the following -
> {code}tail -f accumulo-1.5.0-SNAPSHOT/logs/tserver_slave.debug.log | ./ifttt.sh {code}
> where ifttt.sh is
> {code} #!/bin/sh
> dnpid=`jps -m | grep DataNode | awk '{print $1}'`
> while [ -z "" ]; do
>   if [ -e $1 ] ;then read str; else str=$1;fi
>   if [ -n "`echo $str | grep "Just completed minor? true for table 2"`" ]; then
>     echo "I'm gonna kill datanode, pid $dnpid"
>     kill -9 $dnpid
>   fi
> done
> {code}
> Then I ran the following -
> {code}accumulo org.apache.accumulo.server.test.TestIngest --table test --rows 65536 --cols 100 --size 8192 -z 172.16.101.220:2181 --batchMemory 100000000 --batchThreads 10 {code}
> Eventually the in-memory map filled, a minor compaction happened, the local datanode was killed, and things died. Logs filled with -
> {code} org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /accumulo/wal/172.16.101.219+9997/08b9f1b4-26d5-4b07-a260-3334c2013576 could only be replicated to 0 nodes, instead of 1
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1556)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
> 	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:416)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
> {code}
> and
> {code}
> Unexpected error writing to log, retrying attempt 1
> 	java.io.IOException: DFSOutputStream is closed
> 		at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3666)
> 		at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 		at org.apache.accumulo.server.tabletserver.log.DfsLogger.defineTablet(DfsLogger.java:295)
> 		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger$4.write(TabletServerLogger.java:333)
> 		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:273)
> 		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
> 		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330)
> 		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254)
> 		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
> 		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330)
> ... repeats...
> {code}.
> Bringing the datanode back up did NOT fix it, either.
> UPDATE: reran and never killed the datanode, and it still died. So this isn't an issue with my killing the datanode; it's something with hadoop 1.0.1 and the new write-ahead logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira