Posted to notifications@accumulo.apache.org by "Eric Newton (JIRA)" <ji...@apache.org> on 2013/01/08 03:10:12 UTC

[jira] [Created] (ACCUMULO-942) accumulo should be more resilient in the face of NN failures

Eric Newton created ACCUMULO-942:
------------------------------------

             Summary: accumulo should be more resilient in the face of NN failures
                 Key: ACCUMULO-942
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-942
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
            Reporter: Eric Newton
            Assignee: Keith Turner
            Priority: Critical


We experienced a NameNode (NN) failure on a large cluster.  The edit log was written to a RAIDed file system, but data sent to the edit log was still lost.  We suspect the drivers made promises they did not keep.

This left Accumulo in a slightly corrupt state: it held a few references to files that were missing.

We have also tried archiving backup images of HDFS for disaster recovery.  This has not helped, because Accumulo needs a highly consistent set of metadata, and a slightly older version of the file system confuses it.

One defense is to use snapshots.  However, this works at the table level, and it is hard to coordinate with the HDFS snapshot.

Another approach is to leave a short history of the files in the !METADATA table.  The Google paper hints at keeping historical information:

{quote}
We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis.
{quote}

I think it would also be helpful for disaster recovery.  It may require the GC to be more sensitive to historical information about compactions.
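
To make the history idea a bit more concrete, here is a minimal sketch in Java.  It is not Accumulo code: the class TabletFileHistory, its methods, and the file names are all hypothetical, and it models a tablet's file entries as in-memory sets rather than real !METADATA rows.  It only illustrates the rule that the GC would treat a file as deletable once it is absent from both the current file set and the last few recorded generations.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of one tablet's file list plus a short history of it. */
class TabletFileHistory {
    private final int maxGenerations;                 // how many older file sets to keep
    private Set<String> current;                      // files the tablet references now
    private final Deque<Set<String>> history = new ArrayDeque<>(); // newest first

    TabletFileHistory(int maxGenerations, Set<String> initialFiles) {
        this.maxGenerations = maxGenerations;
        this.current = new HashSet<>(initialFiles);
    }

    /** Record a compaction that replaces the input files with one output file. */
    void recordCompaction(Set<String> inputs, String output) {
        // Remember the file set as it was before this compaction.
        history.addFirst(new HashSet<>(current));
        while (history.size() > maxGenerations) {
            history.removeLast();                     // forget the oldest generation
        }
        Set<String> next = new HashSet<>(current);
        next.removeAll(inputs);
        next.add(output);
        current = next;
    }

    /** True if the file appears in the current set or any retained generation. */
    boolean isReferenced(String file) {
        if (current.contains(file)) {
            return true;
        }
        for (Set<String> generation : history) {
            if (generation.contains(file)) {
                return true;
            }
        }
        return false;
    }

    /** A history-aware GC would delete only files that nothing retained refers to. */
    boolean safeToDelete(String file) {
        return !isReferenced(file);
    }

    public static void main(String[] args) {
        TabletFileHistory tablet =
            new TabletFileHistory(1, new HashSet<>(Arrays.asList("A.rf", "B.rf")));
        tablet.recordCompaction(new HashSet<>(Arrays.asList("A.rf", "B.rf")), "C.rf");
        System.out.println("A.rf deletable after 1 compaction: " + tablet.safeToDelete("A.rf"));
        tablet.recordCompaction(new HashSet<>(Arrays.asList("C.rf")), "D.rf");
        System.out.println("A.rf deletable after 2 compactions: " + tablet.safeToDelete("A.rf"));
    }
}
```

With one retained generation, the inputs of a compaction stay on disk until the next compaction of that tablet, so a file system image that is one generation stale would still find the files it references.  The cost is extra disk usage, and choosing the number of generations is exactly the kind of tuning this ticket would need to work out.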

Alternatively, we should start looking into high-availability NNs and BookKeeper for high-performance logging.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira