Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/04/06 00:43:25 UTC

[jira] [Commented] (KUDU-1395) Scanner KeepAlive requests can get starved on an overloaded server

    [ https://issues.apache.org/jira/browse/KUDU-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227289#comment-15227289 ] 

Todd Lipcon commented on KUDU-1395:
-----------------------------------

A couple of ideas to solve this:

1) we could make KeepAlive retry with diminishing deadlines, like other RPCs do.
Pro: no server-side changes needed
Con: typically, a client would like a KeepAlive call to be lightweight and fast.

2) we could add a new RPC system feature such that certain RPCs are allowed in a "fast lane"
- fast-lane RPCs would be limited to only those that we know consume very few resources and won't block on locks (e.g. keepalives or liveness heartbeats)
- these RPCs would take higher priority over all other RPCs regardless of deadline.
- we would probably start with a server-side annotation of which RPCs are fast-lane, rather than trusting clients to prioritize.

3) some fancier scheduler which tries to estimate and take into account RPC costs, and not just deadlines
- I'm aware of some research going on around this idea (unfortunately can't reference it yet since it's a pre-print). This can help both with multitenant fairness and better scheduling within a tenant. I'll ping the folks working on this research and see what the plans are for publication of the idea, since it might be a good fit.
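
To make idea (2) a bit more concrete, here is a rough Python sketch of a two-lane queue. It is not Kudu code: the class name, the FAST_LANE_METHODS set and the method names in it are all made up for illustration, and a real implementation would live in the C++ RPC layer. It assumes the server alone decides which methods are fast-lane, and that the normal lane stays earliest-deadline-first.

```python
import heapq
import itertools

# Hypothetical server-side annotation: methods assumed to be cheap and to
# never block on locks. The names here are illustrative only.
FAST_LANE_METHODS = frozenset({"ScannerKeepAlive", "Ping"})


class TwoLaneQueue:
    """Toy scheduler: fast-lane RPCs always run before normal-lane RPCs."""

    def __init__(self):
        self._fast = []    # heap of (deadline, seq, method)
        self._normal = []  # heap of (deadline, seq, method)
        self._seq = itertools.count()  # tie-breaker for equal deadlines

    def put(self, method, deadline):
        # The lane is chosen from the server's own table, not from
        # anything the client claims about the call.
        lane = self._fast if method in FAST_LANE_METHODS else self._normal
        heapq.heappush(lane, (deadline, next(self._seq), method))

    def get(self):
        # Drain the fast lane first, regardless of deadline; otherwise
        # fall back to earliest-deadline-first on the normal lane.
        lane = self._fast or self._normal
        if not lane:
            return None
        return heapq.heappop(lane)[2]


if __name__ == "__main__":
    q = TwoLaneQueue()
    q.put("Write", deadline=1.0)
    q.put("Scan", deadline=2.0)
    q.put("ScannerKeepAlive", deadline=60.0)
    while (method := q.get()) is not None:
        print(method)
```

In this sketch the keepalive is served first even though its deadline is the latest of the three, which is the behavior described above. It does not address what happens if the fast lane itself is flooded, so some cap on fast-lane work would probably still be needed.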


> Scanner KeepAlive requests can get starved on an overloaded server
> ------------------------------------------------------------------
>
>                 Key: KUDU-1395
>                 URL: https://issues.apache.org/jira/browse/KUDU-1395
>             Project: Kudu
>          Issue Type: Bug
>          Components: impala, rpc, tserver
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> As of 0.8.0, the RPC system schedules RPCs on an earliest-deadline-first basis, rejecting those with later deadlines. This works well for RPCs which are retried on SERVER_TOO_BUSY errors, since the retries maintain the original deadline and thus get higher and higher priority as they get closer to timing out.
> We don't, however, do any retries on scanner KeepAlive RPCs. So, if a keepalive RPC arrives at a heavily overloaded tserver, it will likely get rejected, and won't retry. This means that Impala queries or other long scans that rely on KeepAlives will likely fail on overloaded clusters since the KeepAlive never gets through.
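
The starvation described in the issue can be reproduced with a toy model. The Python sketch below is an assumption-laden simplification, not the actual Kudu service queue: it models a bounded queue that serves the earliest deadline first and, when full, rejects whichever RPC has the latest deadline. The class and RPC names are hypothetical.

```python
class EdfQueue:
    """Toy bounded queue: earliest deadline first, reject latest when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (deadline, name)

    def offer(self, deadline, name):
        """Enqueue an RPC. Returns the name of the rejected RPC, or None."""
        self.items.append((deadline, name))
        if len(self.items) <= self.capacity:
            return None
        # Over capacity: the RPC with the latest deadline loses. This may
        # be the one that just arrived (stand-in for SERVER_TOO_BUSY).
        victim = max(self.items)
        self.items.remove(victim)
        return victim[1]

    def take(self):
        """Dequeue the RPC with the earliest deadline."""
        nxt = min(self.items)
        self.items.remove(nxt)
        return nxt[1]


if __name__ == "__main__":
    q = EdfQueue(capacity=2)
    q.offer(10.0, "write-a")
    q.offer(20.0, "write-b")
    # A fresh keepalive has the latest deadline, so it is the one rejected.
    print(q.offer(30.0, "keepalive"))
    # A retried RPC keeps its original (now early) deadline and gets in.
    print(q.offer(5.0, "retried-write"))
```

With the queue already full, the newly arrived keepalive is rejected immediately, while the retried write displaces a queued RPC because it kept its original deadline. Since the keepalive is never retried, it has no way to gain priority in this model.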


