You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2022/07/14 13:28:00 UTC

[jira] [Comment Edited] (HBASE-27048) Server side scanner time limit should account for time in queue

    [ https://issues.apache.org/jira/browse/HBASE-27048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566841#comment-17566841 ] 

Bryan Beaudreault edited comment on HBASE-27048 at 7/14/22 1:27 PM:
--------------------------------------------------------------------

The problem is that these tests make use of ManualEnvironmentEdge, which defaults to a currentTime of 1. In branch-2 and 2.5, NettyServerRpcConnection properly [passes the EnvironmentEdge's time into the ServerCall|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L122], but in branch-2.4 it [still passes in System.currentTimeMillis()|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L121]. 

In RSRpcServices getRemainingRpcTimeout, we subtract now (which is 1 in this test) from call.getReceiveTime() (which is current time millis), which ends up being a large negative number. We then subtract that value from the timeout, which results in a very large positive value timeout. I didn't consider the case where now would be much smaller than receive time, since time should always be marching forward.

I think there are two potential fixes to make:
 # branch-2.4 should be fixed to pass EnvironmentEdge in NettyServerRpcConnection, which will fix this test
 # It might be good to add a guardrail in getRemainingRpcTimeout to ensure that we only subtract from the timeout if now > receiveTime. But that also seems like an edge case that should really only happen in tests, so not sure if that's worth pushing to all 3 branches.

We definitely need to do 1, but thoughts on 2? I can submit a Jira to fix just the test or add the guardrail too. The latter is just more work submitting to all the branches, etc.


was (Author: bbeaudreault):
The problem is that these tests make use of ManualEnvironmentEdge, which defaults to a currentTime of 1. In branch-2 and 2.5, NettyServerRpcConnection properly [passes the EnvironmentEdge's time into the ServerCall|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L122], but in branch-2.4 it [still passes in System.currentTimeMillis()|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L121]. 

In RSRpcServices getRemainingRpcTimeout, we subtract now (which is 1 in this test) from call.getReceiveTime() (which is current time millis), which ends up being a large negative number. We then subtract that value from the timeout, which results in a very large positive value timeout. I didn't consider the case where now would be much smaller than receive time, since time should always be marching forward.

I think there are two potential fixes to make:
 # branch-2.4 should be fixed to pass EnvironmentEdge in NettyServerRpcConnection, which will fix this test
 # It might be good to add a guardrail in getRemainingRpcTimeout to ensure that we only subtract from the timeout if now > receiveTime. But that also seems like an edge case that should really only happen in tests, so not sure if that's worth pushing to all 3 branches.

We definitely need to do 1, but thoughts on 2? I can submit a Jira to fix just the test or add the guardrail too

> Server side scanner time limit should account for time in queue
> ---------------------------------------------------------------
>
>                 Key: HBASE-27048
>                 URL: https://issues.apache.org/jira/browse/HBASE-27048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> When a scan request comes in with a timeout specified and heartbeats/partials allowed, we calculate a time limit for running the scan to be half of that timeout. The idea is to return before the timeout expires.
> The calculation of that time limit is "now + timeout / 2", where now is the point at which the scan is starting to run. What's missed here is the scan may have spent upwards of a few seconds in the IPC queue before being serviced. In this case, the time limit may extend beyond the timeout of the request and the server will not return in time.
> We should calculate the time limit from ServerCall.getReceiveTime instead to avoid these timeouts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)