Posted to dev@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2022/11/15 22:09:00 UTC

[jira] [Created] (HBASE-27487) Slow meta can create pathological feedback loop with multigets

Bryan Beaudreault created HBASE-27487:
-----------------------------------------

             Summary: Slow meta can create pathological feedback loop with multigets
                 Key: HBASE-27487
                 URL: https://issues.apache.org/jira/browse/HBASE-27487
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.4.15, 2.5.1
            Reporter: Bryan Beaudreault


This only affects the Table implementation in 2.x releases.
h4. Call stack

When Table.batch is called, an AsyncProcessTask is created with SubmittedRows.ALL, which is sent to AsyncProcess.submit(). For the ALL case, this goes to submitAll which creates an AsyncRequestFutureImpl and then calls groupAndSendMultiAction on that.
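
For reference, the submit path looks roughly like this (paraphrased from the 2.x client, not the verbatim source; field names like pool and multiAp are abbreviated and some setup is omitted):

{code:java}
// Paraphrase of HTable.batch(...) in 2.x: build a task covering ALL rows and hand it to
// AsyncProcess.
AsyncProcessTask task = AsyncProcessTask.newBuilder()
    .setPool(pool)
    .setTableName(tableName)
    .setRowAccess(actions)
    .setResults(results)
    .setOperationTimeout(operationTimeoutMs)
    .setRpcTimeout(rpcTimeoutMs)
    .setSubmittedRows(AsyncProcessTask.SubmittedRows.ALL)
    .build();
AsyncRequestFuture ars = multiAp.submit(task);
ars.waitUntilDone();

// AsyncProcess.submit(task) sees SubmittedRows.ALL and delegates to submitAll(task), which
// creates the AsyncRequestFutureImpl and calls groupAndSendMultiAction(actions, 1) on it.
{code}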

When an AsyncRequestFutureImpl is created, a RetryingTimeTracker is created and started as the last step of the constructor.

In groupAndSendMultiAction, the first thing that has to happen is resolving the HRegionLocation for every action in the batch. This is currently done sequentially, with no timeout check on the overall process.
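
That resolution loop looks approximately like this (paraphrased from AsyncRequestFutureImpl.groupAndSendMultiAction; replica handling and error paths elided):

{code:java}
// Paraphrase, not the verbatim source. Note that nothing in this loop consults the
// RetryingTimeTracker or the operation timeout.
Map<ServerName, MultiAction> actionsByServer = new HashMap<>();
for (Action action : currentActions) {
  // May hit the meta cache, or go to meta for a fresh lookup, one action at a time.
  RegionLocations locs = findAllLocationsOrFail(action, true);
  if (locs == null) {
    continue;
  }
  HRegionLocation loc = locs.getRegionLocation(action.getReplicaId());
  addAction(loc.getServerName(), loc.getRegion().getRegionName(), action, actionsByServer, nonceGroup);
}
// Only once every action has a location do we fan out to the per-server runnables.
sendMultiAction(actionsByServer, numAttempt, null, reuseThread);
{code}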

Once all actions have been resolved, they are passed into sendMultiAction, which creates a SingleServerRequestRunnable per server. When that runnable executes, the first thing it does is create a new MultiServerCallable using the same RetryingTimeTracker that was created back in the AsyncRequestFutureImpl constructor.

That callable extends CancellableRegionServerCallable, whose call method checks tracker.getRemainingTime() before doing any actual work. If the operation timeout has already been exceeded, it throws an exception.
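
The relevant check looks approximately like this (paraphrased; the exact message and bookkeeping vary by version):

{code:java}
// Paraphrase of CancellableRegionServerCallable.call(int operationTimeout), not verbatim.
@Override
public T call(int operationTimeout) throws IOException {
  if (isCancelled()) {
    return null;
  }
  if (Thread.interrupted()) {
    throw new InterruptedIOException();
  }
  // The tracker was started back in the AsyncRequestFutureImpl constructor, so all of the
  // time spent sequentially resolving locations has already been charged against it.
  int remainingTime = tracker.getRemainingTime(operationTimeout);
  if (remainingTime <= 1) {
    // This is the cache-clearing exception discussed below.
    throw new DoNotRetryIOException("Operation timeout exceeded before the RPC was attempted");
  }
  return super.call(remainingTime);
}
{code}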
h4. Problem

If meta is overloaded, or you send any sufficiently large batch of actions, the resolving of HRegionLocations (which happens sequentially) may take a while.

Depending on the operation timeout configured for the client, the location lookups alone may already exceed that timeout before CancellableRegionServerCallable.call() is ever reached.
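
To put hypothetical numbers on it: a batch of 10,000 actions whose locations all miss the cache, with meta lookups averaging 10 ms under load, spends roughly 100 seconds in the resolution loop alone. Against a 60 second operation timeout, the tracker is already exhausted before the first RPC is even attempted.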

When the timeout is exceeded there, a DoNotRetryIOException is thrown. This is considered a cache clearing exception, so any locations that may have been slowly resolved earlier up the chain will be thrown away.

If done with enough concurrency, this can create a feedback loop that is impossible to recover from.
h4. Potential Solutions
 # Change the thrown exception type from DoNotRetryIOException to something more appropriate for the actual error (some sort of timeout exception). We'd have to make that exception a "special" exception in ClientExceptionsUtil so that it doesn't clear the cache (see the sketch after this list).
 # Make DoNotRetryIOException itself a "special" exception. The point of clearing the cache is to make retries more likely to succeed when the failure was caused by a stale location. But a DoNotRetryIOException is explicitly not supposed to be retried, so arguably it shouldn't clear the cache either. There are many usages of this exception, though, so it's hard to say for sure that this would be universally safe.
 # Reset the RetryingTimeTracker after resolving region locations.
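
For illustration only, option 1 might look something like the sketch below. OperationTimeoutExceededException is a hypothetical name, and the special-exception check is a paraphrase of ClientExceptionsUtil rather than a quote:

{code:java}
// Hypothetical sketch of option 1, not a patch.

// In CancellableRegionServerCallable.call, throw a timeout-flavored exception instead of a
// bare DoNotRetryIOException when the tracker is already exhausted:
if (remainingTime <= 1) {
  throw new OperationTimeoutExceededException(
      "Operation timeout exceeded before the RPC was attempted");
}

// In ClientExceptionsUtil, mark the new type as "special" so it does not clear cached
// region locations (existing special exceptions shown only as examples):
public static boolean isSpecialException(Throwable cur) {
  return cur instanceof OperationTimeoutExceededException // new, hypothetical type
      || cur instanceof CallQueueTooBigException
      || cur instanceof RegionTooBusyException
      || cur instanceof RetryImmediatelyException;
      // ... plus the rest of the existing special exceptions, unchanged
}
{code}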

I think I'd lean towards option 1, because it seems odd to say "don't retry" in this case. In fact, retrying should be more likely to succeed, because the locations will already have been resolved.

Whichever we choose, I think we should additionally check the timeout in groupAndSendMultiAction after resolving each region location. That process should not be allowed to exceed the operation timeout, yet today it can blow far past it before the timeout is finally checked, incidentally, at the end.
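
A rough sketch of what that per-action check could look like inside the resolution loop (hypothetical, not a patch; setError stands in for whatever mechanism we use to fail the remaining actions, and OperationTimeoutExceededException is the hypothetical type from the sketch above):

{code:java}
// Hypothetical sketch: fail fast while grouping instead of discovering the expired
// timeout only later in the callable.
for (Action action : currentActions) {
  if (tracker.getRemainingTime(operationTimeout) <= 0) {
    // Fail the remaining actions with a non cache-clearing timeout exception rather than
    // continuing to resolve locations we can no longer use.
    setError(action, new OperationTimeoutExceededException(
        "Operation timeout exceeded while resolving region locations"));
    continue;
  }
  RegionLocations locs = findAllLocationsOrFail(action, true);
  // ... existing grouping logic unchanged
}
{code}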



--
This message was sent by Atlassian Jira
(v8.20.10#820010)