Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2014/09/22 21:02:34 UTC
[jira] [Created] (ACCUMULO-3159) BatchScanner very aggressive after Connection Refused

Josh Elser created ACCUMULO-3159:
------------------------------------

             Summary: BatchScanner very aggressive after Connection Refused
                 Key: ACCUMULO-3159
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3159
             Project: Accumulo
          Issue Type: Improvement
          Components: client
    Affects Versions: 1.6.0, 1.5.2
            Reporter: Josh Elser
            Priority: Minor


Running the replication tests, I tend to find a lot of spam in the Master's log of the following:

{noformat}
[impl.TabletServerBatchReaderIterator] DEBUG: Server : hostname:port msg : java.net.ConnectException: Connection refused
{noformat}

Most of the replication tests will restart a tabletserver to trigger log recovery (to ultimately make sure that a file gets pushed through the replication process). As part of the bookkeeping the Master is doing, it's reading the metadata and replication table(s) to figure out if it needs to assign any work, clean up old work, etc. It uses a BatchScanner to do this.

What I believe to be happening is that the BatchScanner tries to get the TabletClientService client object for a tabletserver which is dead (the one we killed). This throws a TTransportException, which we wrap in an IOException and throw up the stack.

{code}
client = ThriftUtil.getTServerClient(server, conf, timeoutTracker.getTimeOut());
{code}

{code}
} catch (TTransportException e) {
      log.debug("Server : " + server + " msg : " + e.getMessage());
      timeoutTracker.errorOccured(e);
      throw new IOException(e);
}
{code}

The caller (the threadpool inside the batchscanner) catches the IOException, tracks the failure that happened, invalidates the cached tablets for the tserver (the one we got the connection refused from) and repeats (re-bin the ranges to tablets, re-submit the query task).

When this is the only thing happening, this occurs in a really tight loop (each iteration takes ones to tens of milliseconds). It seems excessive to be repeatedly bashing the same tserver that we already got a connection refused from. Perhaps the catch on TTransportException can be enhanced to introduce some backoff on connection refused? Alternatively, we could be a little smarter when processing failures so that we are less aggressive.

The converse is that, in some cases, we likely want to spin quickly. For the cases where a client has stale tablet information and another tablet server has already picked up the tablets, we want the client to retry immediately so it can (hopefully) get its results from the new server. Any change made would definitely need to back off on the retries only when the same server is chosen again.
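
To make that constraint concrete, here is a rough sketch of per-server backoff. Everything in it is an assumption for illustration: `ServerBackoff` is a hypothetical helper, not an existing Accumulo class, and the method names, the string server key, and the doubling policy are all made up here. The idea is just that the delay is keyed by server, so a server with no recent failures (for example, the one that picked up the tablets) is tried immediately.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Accumulo): tracks consecutive connection
 * failures per server and suggests a capped, doubling delay before the same
 * server is contacted again.
 */
final class ServerBackoff {
  private final long initialDelayMs;
  private final long maxDelayMs;
  // Consecutive failures keyed by server, e.g. "hostname:port" (assumed key format).
  private final Map<String, Integer> consecutiveFailures = new HashMap<>();

  ServerBackoff(long initialDelayMs, long maxDelayMs) {
    if (initialDelayMs <= 0 || maxDelayMs < initialDelayMs) {
      throw new IllegalArgumentException("need 0 < initialDelayMs <= maxDelayMs");
    }
    this.initialDelayMs = initialDelayMs;
    this.maxDelayMs = maxDelayMs;
  }

  /** Record one more connection failure (e.g. connection refused) for this server. */
  synchronized void recordFailure(String server) {
    consecutiveFailures.merge(server, 1, Integer::sum);
  }

  /** Forget the failure history once the server answers successfully. */
  synchronized void recordSuccess(String server) {
    consecutiveFailures.remove(server);
  }

  /**
   * Delay to wait before contacting this server: 0 if it has no recorded
   * failures, otherwise initialDelayMs doubled once per additional failure,
   * capped at maxDelayMs.
   */
  synchronized long delayFor(String server) {
    Integer failures = consecutiveFailures.get(server);
    if (failures == null) {
      return 0L;
    }
    long delay = initialDelayMs;
    for (int i = 1; i < failures && delay < maxDelayMs; i++) {
      // Check against half the cap first so the doubling cannot overflow.
      delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
    }
    return delay;
  }
}
```

In this sketch, the code that catches the IOException would call `recordFailure(server)`, and after re-binning the ranges it would wait `delayFor(server)` milliseconds before re-submitting the query task. If the re-bin picks a different server, the delay is 0 and the retry stays immediate.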



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)