Posted to dev@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2022/12/07 22:59:00 UTC

[jira] [Created] (HBASE-27521) CallTimeoutException can cause feedback loop with meta clears

Bryan Beaudreault created HBASE-27521:
-----------------------------------------

Summary: CallTimeoutException can cause feedback loop with meta clears
Key: HBASE-27521
URL: https://issues.apache.org/jira/browse/HBASE-27521
Project: HBase
Issue Type: Improvement
Reporter: Bryan Beaudreault

In HBASE-27487 and HBASE-27490 we added safeguards which should reduce feedback loops caused by slow meta. We have continued to chaos test the hbase client and have found another case that needs to be handled.

With those two jiras, we no longer allow multiget to exceed operation timeout when meta is slow, and OperationTimeoutExceededExceptions do not clear meta cache. This allows for quicker recovery in many cases.

However, consider the case where you have a 1s RPC timeout and a 3s operation timeout. Let's say meta is slow and it takes 2.9 seconds to resolve region locations for a batch. When we go to submit the multi actions to the server, we will only have 100ms remaining on our operation timeout. This may not be enough, and it results in a CallTimeoutException, which clears the meta cache and sends the client back to the already slow meta.
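The arithmetic in this example can be sketched as follows. This is only an illustration, not the actual client code: the class and method names (TimeoutBudget, remainingMs, effectiveRpcTimeoutMs) are hypothetical, and it assumes each RPC is capped by whatever is left of the operation timeout.

```java
// Hypothetical helper, not part of the HBase client API.
final class TimeoutBudget {
    private TimeoutBudget() {}

    // Time left on the operation timeout after elapsedMs, never negative.
    static long remainingMs(long operationTimeoutMs, long elapsedMs) {
        return Math.max(0L, operationTimeoutMs - elapsedMs);
    }

    // Assumption: an RPC may use at most the smaller of the configured
    // RPC timeout and the remaining operation budget.
    static long effectiveRpcTimeoutMs(long operationTimeoutMs, long elapsedMs, long rpcTimeoutMs) {
        return Math.min(rpcTimeoutMs, remainingMs(operationTimeoutMs, elapsedMs));
    }

    public static void main(String[] args) {
        // 3s operation timeout, 2.9s spent resolving locations, 1s RPC timeout.
        long effective = effectiveRpcTimeoutMs(3000L, 2900L, 1000L);
        System.out.println("effective RPC timeout: " + effective + " ms");
    }
}
```

With the numbers above, the multi request gets 100ms instead of the configured 1s, so a call that would normally succeed can time out purely because the budget was spent on meta.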

I use slow meta as an example, but it's possible for any slow regionserver to kick off a feedback loop: a sudden surge in CallTimeoutExceptions results in many clients clearing their cache and hitting meta. Even with meta replicas, this just exacerbates any slowness and may become unrecoverable without extreme actions.

I also use multigets as the example here, and I think they are most at risk of this, but this is theoretically possible for all request types. A single Get might retry a few times and the last attempt only has a few milliseconds of remaining time. This could also result in a CallTimeoutException and potentially kick off a feedback loop.

I'm still trying to consider options, and am open to opinions here. This issue affects AsyncTable and Table. Here are some raw options I've been weighing (CTE = CallTimeoutException):
* Treat CTE as special (non-clearing)
  * Problem: what if a failed server continues to time out and we have no idea that regions have moved?
  * Can we differentiate CTE vs some sort of SocketTimeoutException (where server is not serving)?
* Rate limit cache clears from CTE
* Rate limit all cache clears
* Treat CTE as an OperationTimeoutExceededException when remainingTime < rpcTimeout, thus skipping clear in that case
  * This is the most targeted solution, which may leave other edge cases.
  * Problem: what about the case where a server is just running slow, but continually causing cache clears?
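
To make two of these options concrete, here is a rough sketch that combines "skip the clear when remainingTime < rpcTimeout" with "rate limit cache clears from CTE". It is not a proposed patch: CacheClearPolicy and its methods are hypothetical names, the minimum interval is an arbitrary illustrative parameter, and the caller is assumed to pass in the current time in milliseconds.

```java
// Hypothetical policy object, not part of the HBase client API.
final class CacheClearPolicy {
    private final long minIntervalMs; // minimum spacing between CTE-driven clears
    private long lastClearMs;
    private boolean clearedBefore;

    CacheClearPolicy(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    // True when the RPC was given less than its configured timeout because
    // the operation timeout was nearly exhausted.
    static boolean isBudgetTruncated(long remainingMs, long rpcTimeoutMs) {
        return remainingMs < rpcTimeoutMs;
    }

    // Decides whether a call timeout should clear the cached location.
    synchronized boolean shouldClearOnCallTimeout(long remainingMs, long rpcTimeoutMs, long nowMs) {
        if (isBudgetTruncated(remainingMs, rpcTimeoutMs)) {
            // Treated like an operation timeout: do not clear.
            return false;
        }
        if (clearedBefore && nowMs - lastClearMs < minIntervalMs) {
            // Cleared too recently: suppress this clear.
            return false;
        }
        clearedBefore = true;
        lastClearMs = nowMs;
        return true;
    }
}
```

Note that this sketch does not address the open problems listed above, such as a failed server that keeps timing out after its regions have moved.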
