Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2014/07/11 21:43:04 UTC
[jira] [Created] (ACCUMULO-2990) BatchWriter never unsets somethingFailed

Josh Elser created ACCUMULO-2990:
------------------------------------

             Summary: BatchWriter never unsets somethingFailed
                 Key: ACCUMULO-2990
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2990
             Project: Accumulo
          Issue Type: Bug
          Components: client
    Affects Versions: 1.6.0, 1.5.1
            Reporter: Josh Elser
            Priority: Critical
             Fix For: 1.5.2, 1.6.1, 1.7.0


In trying to understand what's happening in ACCUMULO-2964, I noticed that I had similar exceptions from two different threads. One of the threads started working again after the unexplained thrift exceptions from a tserver restart, and the other continued to fail repeatedly for the lifetime of the test.

I repeatedly saw this exception: 

{noformat}
2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]}  # server errors 0 # exceptions 0
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
        at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:45)
        at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:184)
        at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
        at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
{noformat}

The part that struck me as odd was that the BatchWriter wasn't writing to the metadata table, but to the replication table.

I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException, that BatchWriter becomes useless as the internal member {{somethingFailed}} is never reset back to {{false}} after the failure is reported. Same goes for {{serverSideErrors}}, {{unknownErrors}}, {{lastUnknownErrors}}, too.
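
To illustrate the behavior described above, here is a minimal, self-contained sketch. It is not the actual TabletServerBatchWriter code: {{SketchWriter}}, {{RejectedException}}, {{recordBackgroundFailure}} and the {{resetOnReport}} switch are hypothetical names invented for this example. Only the idea of a {{somethingFailed}} flag that is checked on every {{addMutation}} call is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

public class StickyFailureSketch {

    /** Hypothetical stand-in for MutationsRejectedException. */
    static class RejectedException extends Exception {
        RejectedException(String message) {
            super(message);
        }
    }

    /** Hypothetical, heavily simplified model of a batch writer. */
    static class SketchWriter {
        private final boolean resetOnReport;
        private boolean somethingFailed = false;
        private final List<String> accepted = new ArrayList<>();

        SketchWriter(boolean resetOnReport) {
            this.resetOnReport = resetOnReport;
        }

        /** Simulates a background send failing once. */
        void recordBackgroundFailure() {
            somethingFailed = true;
        }

        void addMutation(String mutation) throws RejectedException {
            checkForFailures();
            accepted.add(mutation);
        }

        private void checkForFailures() throws RejectedException {
            if (somethingFailed) {
                if (resetOnReport) {
                    // Clear the flag once the failure has been reported.
                    somethingFailed = false;
                }
                throw new RejectedException("mutations rejected");
            }
        }
    }

    /** Counts how many of the given number of addMutation calls fail. */
    static int failedAttempts(SketchWriter writer, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                writer.addMutation("mutation-" + i);
            } catch (RejectedException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        SketchWriter sticky = new SketchWriter(false);
        sticky.recordBackgroundFailure();
        // Flag is never cleared, so every later call fails.
        System.out.println("sticky: " + failedAttempts(sticky, 3));       // prints "sticky: 3"

        SketchWriter resetting = new SketchWriter(true);
        resetting.recordBackgroundFailure();
        // Flag is cleared when reported, so only the first call fails.
        System.out.println("resetting: " + failedAttempts(resetting, 3)); // prints "resetting: 1"
    }
}
```

With the flag left set, a single earlier failure makes every subsequent write fail, which matches the repeated WARN messages in the log above.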

If this is the case, this is a bug because the BatchWriter should be resilient in this regard and not force the client to create a new Instance. If that's infeasible to do, we should add exceptions to the BatchWriter that fail fast when a BatchWriter is used after it has already reported a failure, instead of reporting the same failure over and over again.
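
A rough sketch of the fail-fast alternative, again using hypothetical names ({{FailFastWriter}}, {{recordBackgroundFailure}}, {{outcome}}) rather than the real client API, and assuming {{IllegalStateException}} as the fail-fast signal: the first call after a failure reports the rejection, and any later use raises a distinct exception telling the caller that the writer can no longer be used.

```java
public class FailFastSketch {

    /** Hypothetical, heavily simplified model of a batch writer. */
    static class FailFastWriter {
        private boolean somethingFailed = false;
        private boolean failureReported = false;

        /** Simulates a background send failing once. */
        void recordBackgroundFailure() {
            somethingFailed = true;
        }

        void addMutation(String mutation) throws Exception {
            if (failureReported) {
                // The failure was already reported: fail fast with a distinct error.
                throw new IllegalStateException(
                    "writer already reported a failure; create a new writer");
            }
            if (somethingFailed) {
                failureReported = true;
                // Stand-in for MutationsRejectedException.
                throw new Exception("mutations rejected");
            }
            // A real writer would queue the mutation here.
        }
    }

    /** Classifies the result of a single addMutation call. */
    static String outcome(FailFastWriter writer) {
        try {
            writer.addMutation("mutation");
            return "ok";
        } catch (IllegalStateException e) {
            return "unusable";
        } catch (Exception e) {
            return "rejected";
        }
    }

    public static void main(String[] args) {
        FailFastWriter writer = new FailFastWriter();
        System.out.println(outcome(writer)); // prints "ok"
        writer.recordBackgroundFailure();
        System.out.println(outcome(writer)); // prints "rejected"
        System.out.println(outcome(writer)); // prints "unusable"
    }
}
```

The caller can then distinguish "these mutations were rejected" from "this writer is dead" and react to the second case by creating a new writer instead of retrying forever.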


