Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/04/11 04:52:25 UTC

[jira] [Commented] (KUDU-1409) Make krpc call timeouts more resistant to process pauses

    [ https://issues.apache.org/jira/browse/KUDU-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234436#comment-15234436 ] 

Todd Lipcon commented on KUDU-1409:
-----------------------------------

I'm thinking of the following strategy:

- given a 5 second timeout, we set the libev timer for a slightly shorter value, like 4.8 seconds
- upon that timeout firing, we reset the timer for the remaining 200ms
- only when that second timer fires do we actually consider the call timed out

The idea here is that, if the process gets paused, the first "pre-timeout" timer will be arbitrarily delayed. Then, when we wake up, we'll give the call an extra 200ms to read its response off the wire if it is in fact already waiting. In the case that there was no process pause, we pay the "cost" of an extra libev wakeup, but timeouts are rare so this shouldn't really matter. We might also be giving up a slight amount of accuracy on timeouts, but for long timeouts that shouldn't be important (they're usually chosen rather arbitrarily).
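
To make the reasoning concrete, here is a rough sketch of the two-timer idea as a tiny simulation. It does not use libev or any real krpc code: Scenario, NextRunnable and TimesOut are hypothetical names made up for illustration. It also assumes the worst case that motivates this issue, namely that when the process wakes up with both a due timer and a readable response, the timer callback runs first.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical model of one outbound call; not Kudu's actual krpc types.
struct Scenario {
  Millis timeout;      // timeout requested by the caller, e.g. 5000ms
  Millis grace;        // part held back for the second timer, e.g. 200ms
  Millis response_at;  // when the response bytes reach the socket
  Millis pause_start;  // the process stops running here...
  Millis pause_end;    // ...and resumes here (equal to pause_start: no pause)
};

// Earliest time >= t at which the process is actually running.
Millis NextRunnable(const Scenario& s, Millis t) {
  return (t >= s.pause_start && t < s.pause_end) ? s.pause_end : t;
}

// Returns true if the call would be reported as timed out.
// two_phase == false models a single timer armed for the full timeout.
bool TimesOut(const Scenario& s, bool two_phase) {
  // With two phases the first timer is armed slightly short of the timeout.
  const Millis first_delay = two_phase ? s.timeout - s.grace : s.timeout;

  // A paused process can neither run timer callbacks nor read the socket,
  // so both events are deferred until the pause ends.
  const Millis first_fire = NextRunnable(s, first_delay);
  const Millis read_at = NextRunnable(s, s.response_at);

  if (!two_phase) {
    // Assumption: on a tie the timer callback runs before the read.
    return read_at >= first_fire;
  }

  // The response was read before the pre-timeout fired: call succeeded.
  if (read_at < first_fire) return false;

  // Pre-timeout fired: re-arm for the remaining grace period and only
  // fail the call if the response is still unread when that fires.
  const Millis second_fire = NextRunnable(s, first_fire + s.grace);
  return read_at >= second_fire;
}
```

For example, with a 5000ms timeout, a 200ms grace period, a response that arrives at 1000ms, and a pause from 500ms to 7000ms, the single-timer model reports a timeout even though the response was sitting on the wire, while the two-timer model does not. A call whose response really does arrive late (say at 6000ms with no pause) still times out in both models.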

Any other good ideas here?



> Make krpc call timeouts more resistant to process pauses
> --------------------------------------------------------
>
>                 Key: KUDU-1409
>                 URL: https://issues.apache.org/jira/browse/KUDU-1409
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> In stress testing Impala on Kudu I've seen various RPC timeouts that turn out to be due to pauses on the client side. In particular, scenarios like https://issues.cloudera.org/browse/IMPALA-2800 can cause the memory allocator inside Impala to block for several seconds, and that might cause us to think we missed a timeout.
> We should be more resilient to this sort of "false" timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)