Posted to notifications@accumulo.apache.org by "John Vines (JIRA)" <ji...@apache.org> on 2013/01/05 00:14:14 UTC

[jira] [Commented] (ACCUMULO-575) Potential data loss when datanode fails immediately after minor compaction

    [ https://issues.apache.org/jira/browse/ACCUMULO-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544338#comment-13544338 ] 

John Vines commented on ACCUMULO-575:
-------------------------------------

Test bench:
1 node running the Hadoop namenode and 1 datanode
1 slave node running 1 datanode and the Accumulo stack, with an 8GB in-memory map
Running a patched version of Accumulo with the following patch to provide helper debug output:
{code}Index: server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java
===================================================================
--- server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(revision 1429057)
+++ server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(working copy)
@@ -81,6 +81,7 @@
   private FileSystem fs;
   protected KeyExtent extent;
   private List<IteratorSetting> iterators;
+  protected boolean minor= false;
   
   Compactor(Configuration conf, FileSystem fs, Map<String,DataFileValue> files, InMemoryMap imm, String outputFile, boolean propogateDeletes,
       TableConfiguration acuTableConf, KeyExtent extent, CompactionEnv env, List<IteratorSetting> iterators) {
@@ -158,7 +159,7 @@
         log.error("Verification of successful compaction fails!!! " + extent + " " + outputFile, ex);
         throw ex;
       }
-      
+      log.info("Just completed minor? " + minor + " for table " + extent.getTableId());
       log.debug(String.format("Compaction %s %,d read | %,d written | %,6d entries/sec | %6.3f secs", extent, majCStats.getEntriesRead(),
           majCStats.getEntriesWritten(), (int) (majCStats.getEntriesRead() / ((t2 - t1) / 1000.0)), (t2 - t1) / 1000.0));
       
Index: server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java
===================================================================
--- server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(revision 1429057)
+++ server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(working copy)
@@ -88,6 +88,7 @@
     
     do {
       try {
+        this.minor = true;
         CompactionStats ret = super.call();
         
         // log.debug(String.format("MinC %,d recs in | %,d recs out | %,d recs/sec | %6.3f secs | %,d bytes ",map.size(), entriesCompacted,
{code}

I stood up a new instance and created a table named test, then ran the following:
{code}tail -f accumulo-1.5.0-SNAPSHOT/logs/tserver_slave.debug.log | ./ifttt.sh {code}
where ifttt.sh watches for the log line added by the patch above and kills the local datanode:
{code}#!/bin/sh

# pid of the local DataNode process
dnpid=`jps -m | grep DataNode | awk '{print $1}'`

# loop forever over log lines piped in from tail -f
while [ -z "" ]; do
  # no file argument is passed in the pipeline above, so read the next line from stdin
  if [ -e $1 ] ;then read str; else str=$1;fi
  # as soon as the log line added by the patch shows up for table 2, kill the local datanode
  if [ -n "`echo $str | grep "Just completed minor? true for table 2"`" ]; then
    echo "I'm gonna kill datanode, pid $dnpid"
    kill -9 $dnpid
  fi
done
{code}

Then I ran the following:
{code}accumulo org.apache.accumulo.server.test.TestIngest --table test --rows 65536 --cols 100 --size 8192 -z 172.16.101.220:2181 --batchMemory 100000000 --batchThreads 10 {code}
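
Assuming --size is the per-value size in bytes, that ingest writes roughly 65,536 rows × 100 columns × 8,192 bytes ≈ 50 GiB of value data, so the 8GB in-memory map should fill and minor-compact several times over the run.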

Eventually the in-memory map filled, a minor compaction happened, the local datanode was killed, and things died. Unfortunately, I didn't hit the bug I was shooting for. I'm documenting my testing here so that once the WAL is fixed I can look into this more.

                
> Potential data loss when datanode fails immediately after minor compaction
> --------------------------------------------------------------------------
>
>                 Key: ACCUMULO-575
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-575
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.4.1, 1.4.0
>            Reporter: John Vines
>            Assignee: John Vines
>             Fix For: 1.5.0
>
>
> So this one popped into my head a few days ago, and I've done some research.
> Context-
> 1. In memory map is written to an RFile.
> 2. yadda yadda yadda, FSOutputStream.close() is called.
> 3. close() calls complete(), which will not return until dfs.replication.min is reached. dfs.replication.min defaults to 1 and I don't think it is frequently changed.
> 4. We read the file to make sure that it was written correctly (this has probably been a mitigating factor as to why we haven't run into this potential issue)
> 5. We write the file to the !METADATA table
> 6. We write minor compaction to the walog
> If the datanode goes down after step 6 but before the file is replicated further, then we'll have data loss. The file will be known to the namenode as corrupted, but we can't restore it automatically, because the walog records the file as complete. Step 4 has probably provided enough of a time buffer to significantly decrease the possibility of this happening.
> I have not explicitly tested this, but I want to test to validate the potential scenario of losing data by dropping a datanode in a multi-node system immediately after closing the FSOutputStream. If this is the case, then we may want to consider adding a wait between steps 4 and 5 that polls the namenode for replication reaching at least the max(2, # nodes).
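
A minimal sketch of what that wait between steps 4 and 5 could look like, using only Hadoop's public FileSystem/BlockLocation API. The helper name, timeout, and polling interval below are hypothetical and not existing Accumulo code; a caller would probably want the target capped at the number of datanodes (e.g. min(2, # datanodes)) so a single-node instance doesn't block forever.
{code}import java.io.IOException;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationWait {

  // Hypothetical helper: block until every block of the freshly closed RFile is
  // reported on at least minReplicas datanodes, or the timeout expires.
  public static boolean waitForReplication(FileSystem fs, Path file, int minReplicas, long timeoutMs)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      FileStatus status = fs.getFileStatus(file);
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      int minSeen = Integer.MAX_VALUE;
      for (BlockLocation block : blocks)
        minSeen = Math.min(minSeen, block.getHosts().length);
      // done once the least-replicated block has enough replicas (trivially true for an empty file)
      if (blocks.length == 0 || minSeen >= minReplicas)
        return true;
      Thread.sleep(250); // give the namenode time to report new replicas
    }
    return false; // timed out; the caller decides whether to proceed anyway
  }
}
{code}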

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira