Posted to issues@kudu.apache.org by "David Alves (JIRA)" <ji...@apache.org> on 2017/03/01 03:08:45 UTC

[jira] [Commented] (KUDU-1466) C++ client errors misreported as GetTableLocations timeouts

    [ https://issues.apache.org/jira/browse/KUDU-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889414#comment-15889414 ] 

David Alves commented on KUDU-1466:
-----------------------------------

One more data point for this while running ClientTest.TestWriteWithDeadTabletServer:
{code}
I0228 15:50:00.844074  3159 tablet_server.cc:142] TabletServer shut down complete. Bye!
I0228 15:50:00.844105  3159 tablet_server.cc:133] TabletServer shutting down...
I0228 15:50:00.844163  3159 tablet_server.cc:142] TabletServer shut down complete. Bye!
W0228 15:50:00.859549 22414 meta_cache.cc:198] Tablet b1352be2ba0b4cd284e61c2d718a6fb5: Replica 20fc9932cc6f4ab0b06db208c88ceda8 (127.0.0.1:37026) has failed: Network error: Client connection negotiation failed: client connection to 127.0.0.1:37026: connect: Connection refused (error 111)
W0228 15:50:01.849710 21927 rpcz_store.cc:234] Call kudu.master.MasterService.GetTableLocations from 127.0.0.1:39701 (request call id 59) took 1ms (client timeout 1).
W0228 15:50:01.849917 21927 rpcz_store.cc:238] Trace:
0228 15:50:01.848682 (+     0us) service_pool.cc:143] Inserting onto call queue
0228 15:50:01.849198 (+   516us) service_pool.cc:202] Handling call
0228 15:50:01.849685 (+   487us) inbound_call.cc:130] Queueing success response
Metrics: {}
W0228 15:50:01.852064 22422 meta_cache.cc:765] Timed out: GetTableLocations { table: 'client-testtb', partition-key: (<start>), attempt: 1 } failed: timed out after deadline expired: GetTableLocations RPC to 127.0.0.1:60325 timed out after 0.002s (SENT)
W0228 15:50:01.852180 22422 batcher.cc:325] Timed out: Failed to write batch of 1 ops to tablet b1352be2ba0b4cd284e61c2d718a6fb5 after 39 attempt(s): GetTableLocations { table: 'client-testtb', partition-key: (<start>), attempt: 1 } failed: timed out after deadline expired: GetTableLocations RPC to 127.0.0.1:60325 timed out after 0.002s (SENT)
/data/jenkins-workspace/kudu-workspace/src/kudu/client/client-test.cc:2186: Failure
Failed
Expected to find substring 'Connection refused'. Got: 'Timed out: Failed to write batch of 1 ops to tablet b1352be2ba0b4cd284e61c2d718a6fb5 after 39 attempt(s): GetTableLocations { table: 'client-testtb', partition-key: (<start>), attempt: 1 } failed: timed out after deadline expired: GetTableLocations RPC to 127.0.0.1:60325 timed out after 0.002s (SENT)'
{code}

The weird thing to me is that the master's GetTableLocations call seems to succeed almost exactly 1 second after the client gets "Connection refused" from the tablet server. It is also weird that the call has a 1 ms deadline.

> C++ client errors misreported as GetTableLocations timeouts
> -----------------------------------------------------------
>
>                 Key: KUDU-1466
>                 URL: https://issues.apache.org/jira/browse/KUDU-1466
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Alexey Serbin
>            Priority: Critical
>
> client-test is currently very flaky due to this issue:
> - we are injecting some kind of failure on the tablet server (e.g. DNS resolution failure)
> - when we fail to connect to the TS, we correctly re-trigger a lookup against the master
> - depending on how the backoffs and retries line up, we sometimes end up triggering the lookup retry when the remaining operation budget is very short (e.g. <10ms)
> -- this GetTableLocations RPC times out since the master is unable to respond within the ridiculously short timeout
> During the course of retrying some operation, we should probably not replace the 'last_error' with a master error, so long as we have had at least one successful master lookup (thus indicating that the master is not the problem).
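
The proposal in the description can be sketched roughly as follows. This is only an illustration, not the Kudu client code: `Source`, `Error`, `RetryState` and `AttemptTimeout` are hypothetical names made up for this example, and the assumption that the per-attempt timeout is the smaller of a default RPC timeout and the time left until the operation deadline is mine, not something confirmed by the logs above.

```cpp
#include <algorithm>
#include <chrono>
#include <string>

// Hypothetical, simplified types for illustration; not the Kudu client API.
enum class Source { kTabletServer, kMaster };

struct Error {
  Source source;
  std::string message;
};

struct RetryState {
  bool master_lookup_succeeded = false;  // set once any master lookup has worked
  bool has_error = false;
  Error last_error{Source::kMaster, ""};

  void RecordMasterSuccess() { master_lookup_succeeded = true; }

  // Record a failure. A master error does not overwrite an error that is
  // already recorded if the master has answered successfully at least once,
  // since that suggests the master is not the real problem.
  void RecordError(const Error& e) {
    const bool keep_existing =
        has_error && e.source == Source::kMaster && master_lookup_succeeded;
    if (!keep_existing) {
      last_error = e;
      has_error = true;
    }
  }
};

using Clock = std::chrono::steady_clock;

// Assumed timeout rule: use the default RPC timeout, but never more than the
// time remaining before the overall operation deadline (clamped at zero).
std::chrono::milliseconds AttemptTimeout(
    Clock::time_point now, Clock::time_point deadline,
    std::chrono::milliseconds default_timeout) {
  auto remaining =
      std::chrono::duration_cast<std::chrono::milliseconds>(deadline - now);
  remaining = std::max(remaining, std::chrono::milliseconds(0));
  return std::min(remaining, default_timeout);
}
```

With a rule like `AttemptTimeout`, a lookup retry issued a couple of milliseconds before the operation deadline gets a timeout of only a couple of milliseconds, which would match the short GetTableLocations timeouts in the log. With `RecordError`, the final status would still carry the tablet server's "Connection refused" error rather than the late master timeout.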


