Posted to dev@hbase.apache.org by "张铎(Duo Zhang)" <pa...@gmail.com> on 2022/06/01 00:54:52 UTC

Re: Separately configurable client meta rpc timeout

Scans do not honor the operation timeout configuration, because their logic
is a bit different from normal read/write operations.

For a scan there is usually no simple 'retry' (except for the open scanner
call). If you hit an error, you usually need to restart the scan by making a
new open scanner call rather than retrying the scanner next call.

IIRC we have a special hbase.client.scanner.timeout.period and also a
special hbase.rpc.timeout for meta?

Thanks.

On Wed, Jun 1, 2022 at 00:47, Bryan Beaudreault <bb...@hubspot.com.invalid> wrote:

> Hi all,
>
> We just had a production issue where a user-facing API service had a low
> hbase.rpc.timeout, and this majorly contributed to a meta hotspotting
> issue. The issue is, user requests can only be submitted once the necessary
> RegionLocation is in the MetaCache. But in a meta hotspotting scenario it
> may be impossible to return a RegionLocation for hbase:meta in a timely
> manner. This will trigger the rpc timeout, which may result in a number of
> retries. This retry storm (across many client instances) can further
> exacerbate meta hotspotting issues.
>
> My thought is to decouple meta rpc timeout from user rpc timeouts, because
> generally you would prefer to allow a longer meta request to succeed
> because it may unblock many user requests.
>
> I think our current timeouts for meta scans are a bit confusing. There's
> a hbase.client.meta.operation.timeout, but actually that does not apply to
> meta scans. Instead they are configured via hbase.rpc.timeout
> and hbase.client.scanner.timeout.period.
>
> I was considering special casing meta scans so that they are configured via
> (new) hbase.client.meta.rpc.timeout and (existing)
> hbase.client.meta.operation.timeout. This would be different from typical
> scan requests, but may be more intuitive overall? Does anyone have any
> opinions?
>
> See https://issues.apache.org/jira/browse/HBASE-27078
>
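
As a rough illustration of the retry amplification described in the quoted
message, the sketch below estimates how many meta lookups a fleet of clients
could issue while meta is too slow to answer within the rpc timeout. It is a
simplified model rather than HBase code: it assumes every attempt times out
and ignores retry pauses and retry limits, so it is an upper bound at best.
The class name, method names, and numbers are all hypothetical.

```java
final class RetryStormEstimate {

    /**
     * Attempts a single client can make in the window if every attempt
     * times out and it retries immediately (assumption: no backoff, no cap).
     */
    static long attemptsPerClient(long windowMs, long rpcTimeoutMs) {
        if (rpcTimeoutMs <= 0) {
            throw new IllegalArgumentException("rpcTimeoutMs must be positive");
        }
        // Ceiling division: a partially elapsed attempt still hit the server.
        return (windowMs + rpcTimeoutMs - 1) / rpcTimeoutMs;
    }

    /** Total attempts against hbase:meta across all clients in the window. */
    static long totalMetaAttempts(long clients, long windowMs, long rpcTimeoutMs) {
        return clients * attemptsPerClient(windowMs, rpcTimeoutMs);
    }

    public static void main(String[] args) {
        long clients = 200;       // illustrative fleet size
        long windowMs = 10_000;   // illustrative 10 second slow period

        // A low user-facing rpc timeout multiplies the load on meta.
        System.out.println("100 ms timeout:  "
            + totalMetaAttempts(clients, windowMs, 100) + " attempts");   // 20000
        // A longer, meta-specific timeout lets each request wait instead.
        System.out.println("2000 ms timeout: "
            + totalMetaAttempts(clients, windowMs, 2_000) + " attempts"); // 1000
    }
}
```

Under these assumptions the shorter timeout produces twenty times as many
requests against an already overloaded meta region, which is the motivation
for letting meta requests wait longer than user requests.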

Re: Separately configurable client meta rpc timeout

Posted by Bryan Beaudreault <bb...@hubspot.com.INVALID>.

Thanks again for your inputs here. I have a PR for this here:
https://github.com/apache/hbase/pull/4557

On Mon, Jun 20, 2022 at 5:57 PM Bryan Beaudreault <bb...@hubspot.com>
wrote:

> Actually, it looks like hbase.rpc.timeout currently applies to the
> openScanner call (which is all that's necessary for most meta scans, since
> they are small). So I think we do also need an
> hbase.client.meta.rpc.timeout config after all.

Re: Separately configurable client meta rpc timeout

Posted by Bryan Beaudreault <bb...@hubspot.com.INVALID>.

Actually, it looks like hbase.rpc.timeout currently applies to the
openScanner call (which is all that's necessary for most meta scans, since
they are small). So I think we do also need an
hbase.client.meta.rpc.timeout config after all.
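
One way to picture the proposed config is as a lookup with a fallback. The
sketch below is only an illustration, not the HBase client implementation: it
uses a plain map in place of the real configuration object, and it assumes
that the proposed hbase.client.meta.rpc.timeout would fall back to
hbase.rpc.timeout when unset, which is a guess about how a patch might
behave. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class MetaRpcTimeoutSketch {
    static final String RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
    // Proposed in this thread; treated here as an assumption, not an existing key.
    static final String META_RPC_TIMEOUT_KEY = "hbase.client.meta.rpc.timeout";

    /** Reads an integer setting from the map, or returns the fallback if unset. */
    static int getInt(Map<String, String> conf, String key, int fallback) {
        String value = conf.get(key);
        return value == null ? fallback : Integer.parseInt(value.trim());
    }

    /**
     * Picks the rpc timeout for a call. Meta calls prefer the meta-specific
     * key and fall back to the general one (assumed behavior).
     */
    static int rpcTimeoutMs(Map<String, String> conf, boolean metaTable, int defaultMs) {
        int userTimeout = getInt(conf, RPC_TIMEOUT_KEY, defaultMs);
        return metaTable ? getInt(conf, META_RPC_TIMEOUT_KEY, userTimeout) : userTimeout;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(RPC_TIMEOUT_KEY, "500");         // low timeout for user-facing requests
        conf.put(META_RPC_TIMEOUT_KEY, "10000");  // more patience for meta lookups

        int illustrativeDefaultMs = 60_000;
        System.out.println("user table: " + rpcTimeoutMs(conf, false, illustrativeDefaultMs) + " ms");
        System.out.println("hbase:meta: " + rpcTimeoutMs(conf, true, illustrativeDefaultMs) + " ms");
    }
}
```

With a fallback like this, clients that never set the meta key would keep
their current behavior, while a latency-sensitive service could lower
hbase.rpc.timeout without also shortening its meta lookups.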

On Mon, Jun 20, 2022 at 4:17 PM Bryan Beaudreault <bb...@hubspot.com>
wrote:

> Thank you both for the input. I will get a PR up for that shortly.
>
> Related, I also filed https://issues.apache.org/jira/browse/HBASE-27142
> for branch-2 blocking client -- "Scanner timeout should take precedence
> over rpc timeout". I noticed that you changed this behavior for the async
> client a few years ago, Duo, and I think it makes sense to do the same for
> the blocking client. Otherwise setting a special meta scanner timeout won't
> really take effect unless we also provide a special meta rpc timeout. Per
> Andrew's comment (with which I agree 100%), it seems better to unify the clients
> than to create another new config.

Re: Separately configurable client meta rpc timeout

Posted by Bryan Beaudreault <bb...@hubspot.com.INVALID>.

Thank you both for the input. I will get a PR up for that shortly.

Related, I also filed https://issues.apache.org/jira/browse/HBASE-27142 for
branch-2 blocking client -- "Scanner timeout should take precedence over
rpc timeout". I noticed that you changed this behavior for the async client
a few years ago, Duo, and I think it makes sense to do the same for the
blocking client. Otherwise setting a special meta scanner timeout won't really
take effect unless we also provide a special meta rpc timeout. Per Andrew's
comment (with which I agree 100%), it seems better to unify the clients than to
create another new config.
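
To make the precedence question concrete, here is a toy model of the timeout
applied to a single scan RPC. It is a sketch under stated assumptions, not the
blocking client's actual logic: without precedence it simply uses the rpc
timeout, and with precedence it uses the scanner timeout instead. The class
and method names are hypothetical.

```java
final class ScanTimeoutPrecedenceSketch {

    /**
     * Timeout applied to one scan RPC in this simplified model.
     * scannerTakesPrecedence = false models the rpc timeout governing the call;
     * true models "scanner timeout should take precedence over rpc timeout".
     */
    static int scanCallTimeoutMs(int rpcTimeoutMs, int scannerTimeoutMs,
                                 boolean scannerTakesPrecedence) {
        return scannerTakesPrecedence ? scannerTimeoutMs : rpcTimeoutMs;
    }

    public static void main(String[] args) {
        int rpcTimeoutMs = 500;          // illustrative low user-facing rpc timeout
        int metaScannerTimeoutMs = 30_000; // illustrative meta scanner timeout

        // Without precedence, the longer scanner timeout has no effect on the call.
        System.out.println("rpc timeout governs:     "
            + scanCallTimeoutMs(rpcTimeoutMs, metaScannerTimeoutMs, false) + " ms");
        // With precedence, a single scanner setting is enough.
        System.out.println("scanner timeout governs: "
            + scanCallTimeoutMs(rpcTimeoutMs, metaScannerTimeoutMs, true) + " ms");
    }
}
```

In this model a meta scanner timeout alone only helps once the scanner
timeout wins, which is why unifying the clients can avoid a second new config.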

On Mon, Jun 20, 2022 at 12:46 PM Andrew Purtell <ap...@apache.org> wrote:

> Our default position should be to resist adding new configuration
> variables, but in this case, I think it makes sense.
> +1 for adding a distinct timeout setting for meta. Definitely a valid
> special case.

Re: Separately configurable client meta rpc timeout

Posted by Andrew Purtell <ap...@apache.org>.

Our default position should be to resist adding new configuration
variables, but in this case, I think it makes sense.
+1 for adding a distinct timeout setting for meta. Definitely a valid
special case.

On Mon, Jun 20, 2022 at 9:09 AM 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> See the comments at the top of the method for why we honor neither the rpc
> timeout nor the operation timeout.
>
> So here maybe we should introduce a special scan timeout for the meta
> table?


-- 
Best regards,
Andrew

Unrest, ignorance distilled, nihilistic imbeciles -
    It's what we’ve earned
Welcome, apocalypse, what’s taken you so long?
Bring us the fitting end that we’ve been counting on
   - A23, Welcome, Apocalypse

Re: Separately configurable client meta rpc timeout

Posted by "张铎(Duo Zhang)" <pa...@gmail.com>.

See the comments at the top of the method for why we honor neither the rpc
timeout nor the operation timeout.

So here maybe we should introduce a special scan timeout for the meta table?

On Mon, Jun 20, 2022 at 23:45, Bryan Beaudreault <bb...@hubspot.com.invalid> wrote:

> Hi Duo, just getting back to this. Thanks for your response.
>
> Actually I'm pretty sure there is a simple retry for all scanner next
> calls. In master branch this occurs
> in AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from
> #next(). The stub.scan() call in call() passes a callback onComplete which
> includes an error handling call of onError. In onError, a retry is
> scheduled at the end of the method which calls call() again. See
>
> https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584
> .
> Let me know if I'm missing something. Similar logic in branch-2 blocking
> client.
>
> But anyway, most meta calls are small scans which return their results in
> the openScanner call anyway. So improperly tuned rpc timeouts (too short)
> can cause retries in openScanner, and probably next() as well if
> applicable.
>
> I took another look and we do not have any special
> hbase.client.scanner.timeout.period or hbase.rpc.timeout for meta. Unless I'm
> missing something in the link above, I'm going to move forward adding these
> in the jira.

Re: Separately configurable client meta rpc timeout

Posted by Bryan Beaudreault <bb...@hubspot.com.INVALID>.

Hi Duo, just getting back to this. Thanks for your response.

Actually I'm pretty sure there is a simple retry for all scanner next
calls. In master branch this occurs
in AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from
#next(). The stub.scan() call in call() passes a callback onComplete which
includes an error handling call of onError. In onError, a retry is
scheduled at the end of the method which calls call() again. See
https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584.
Let me know if I'm missing something. Similar logic in branch-2 blocking
client.
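
The retry flow described above (call() issues the RPC, the completion callback
routes failures to onError, and onError schedules another call()) can be
sketched with plain JDK classes. This is an illustrative stand-in and not the
code in AsyncScanSingleRegionRpcRetryingCaller: the class, its fields, the
fixed pause, and the attempt limit are all hypothetical, and a Supplier of
CompletableFuture stands in for stub.scan().

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class CallbackRetrySketch {
    private final ScheduledExecutorService timer;
    private final Supplier<CompletableFuture<String>> rpc; // stand-in for stub.scan()
    private final int maxAttempts;
    private final long pauseMs;
    private final AtomicInteger attempts = new AtomicInteger();
    private final CompletableFuture<String> result = new CompletableFuture<>();

    CallbackRetrySketch(ScheduledExecutorService timer,
                        Supplier<CompletableFuture<String>> rpc,
                        int maxAttempts, long pauseMs) {
        this.timer = timer;
        this.rpc = rpc;
        this.maxAttempts = maxAttempts;
        this.pauseMs = pauseMs;
    }

    /** Starts the first attempt and returns a future for the final outcome. */
    CompletableFuture<String> start() {
        call();
        return result;
    }

    /** Issues one RPC; the completion callback decides between success and onError. */
    private void call() {
        attempts.incrementAndGet();
        rpc.get().whenComplete((response, error) -> {
            if (error != null) {
                onError(error);
            } else {
                result.complete(response);
            }
        });
    }

    /** Gives up once the attempt limit is reached, otherwise schedules call() again. */
    private void onError(Throwable error) {
        if (attempts.get() >= maxAttempts) {
            result.completeExceptionally(error);
            return;
        }
        timer.schedule(this::call, pauseMs, TimeUnit.MILLISECONDS);
    }

    /** Helper that builds an already-failed future. */
    static CompletableFuture<String> failed(Throwable t) {
        CompletableFuture<String> f = new CompletableFuture<>();
        f.completeExceptionally(t);
        return f;
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        // Simulated RPC: the first two calls time out, the third succeeds.
        Supplier<CompletableFuture<String>> flaky = () -> calls.incrementAndGet() < 3
            ? failed(new IOException("simulated rpc timeout"))
            : CompletableFuture.completedFuture("ok");
        try {
            String out = new CallbackRetrySketch(timer, flaky, 5, 10).start().join();
            System.out.println(out + " after " + calls.get() + " calls"); // ok after 3 calls
        } finally {
            timer.shutdown();
        }
    }
}
```

The point of the sketch is that every timed-out attempt turns into another
request against the same server, so a timeout that is too short for a slow
meta region translates directly into extra load.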

But anyway, most meta calls are small scans which return their results in
the openScanner call anyway. So improperly tuned rpc timeouts (too short)
can cause retries in openScanner, and probably next() as well if applicable.

I took another look and we do not have any special
hbase.client.scanner.timeout.period or hbase.rpc.timeout for meta. Unless I'm
missing something in the link above, I'm going to move forward adding these
in the jira.

On Tue, May 31, 2022 at 8:55 PM 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> Scans do not honor the operation timeout configuration, because their logic
> is a bit different from normal read/write operations.
>
> For a scan there is usually no simple 'retry' (except for the open scanner
> call). If you hit an error, you usually need to restart the scan by making a
> new open scanner call rather than retrying the scanner next call.
>
> IIRC we have a special hbase.client.scanner.timeout.period and also a
> special hbase.rpc.timeout for meta?
>
> Thanks.