Posted to issues@kudu.apache.org by "Bankim Bhavsar (Jira)" <ji...@apache.org> on 2019/10/28 23:27:00 UTC

[jira] [Commented] (KUDU-2963) Catalog manager never gives up on CreateTablet RPCs

    [ https://issues.apache.org/jira/browse/KUDU-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961550#comment-16961550 ] 

Bankim Bhavsar commented on KUDU-2963:
--------------------------------------

{{AsyncCreateReplica}} is the class that sends the CreateTablet request and retries it on failure.

The constructor of {{AsyncCreateReplica}} sets a default deadline of 30 seconds, which is shorter than the default deadline of 1 hour set by the base class constructor, {{RetryingTSRpcTask}}.
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3390
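
To make the relationship between the two deadlines concrete, here is a minimal, self-contained sketch. The class names {{RetryingTaskSketch}} and {{CreateReplicaTaskSketch}} and the {{set_timeout}} helper are hypothetical stand-ins, not the actual Kudu classes, and the sketch assumes the derived constructor simply replaces the base default.

```cpp
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for the base task: defaults to a one-hour deadline.
class RetryingTaskSketch {
 public:
  RetryingTaskSketch() : timeout_(std::chrono::hours(1)) {}
  virtual ~RetryingTaskSketch() = default;
  Millis timeout() const { return timeout_; }

 protected:
  void set_timeout(Millis t) { timeout_ = t; }

 private:
  Millis timeout_;
};

// Hypothetical stand-in for the create-replica task: its constructor
// replaces the base default with the (shorter) tablet creation timeout.
class CreateReplicaTaskSketch : public RetryingTaskSketch {
 public:
  explicit CreateReplicaTaskSketch(Millis tablet_creation_timeout) {
    set_timeout(tablet_creation_timeout);
  }
};
```

With a 30-second value passed in, the derived task ends up with the shorter deadline, which is the one that bounds its retries.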

{{FLAGS_tablet_creation_timeout_ms}} is used in two places:
 - when setting the deadline for {{AsyncCreateReplica}}
 - when issuing DeleteTablet RPCs for a tablet whose CreateTablet has not succeeded within {{FLAGS_tablet_creation_timeout_ms}}

The deadline is honoured: once it has passed, no additional RPC requests are scheduled.
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3276
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3287
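
As a rough illustration of deadline-bounded retries, here is a self-contained sketch. It is not the catalog manager code: {{RunWithDeadline}}, {{RetryResult}} and the fixed backoff are hypothetical, and time is simulated rather than measured so the example runs instantly.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

using Millis = std::chrono::milliseconds;

// Hypothetical result type for the sketch.
struct RetryResult {
  bool succeeded;
  int attempts;
};

// Retries `send_rpc` until it succeeds or the deadline passes. Time is
// simulated: `now` advances by `backoff` after each failed attempt, and no
// new attempt is made once `now` reaches the deadline.
RetryResult RunWithDeadline(const std::function<bool()>& send_rpc,
                            Millis timeout,
                            Millis backoff) {
  assert(backoff.count() > 0);  // A zero backoff would never reach the deadline.
  Millis now{0};
  const Millis deadline = now + timeout;
  int attempts = 0;
  while (now < deadline) {
    ++attempts;
    if (send_rpc()) {
      return {true, attempts};
    }
    now += backoff;  // Stand-in for waiting before the next attempt.
  }
  return {false, attempts};
}
```

With a 30-second timeout and a 1-second backoff, an RPC that always fails is attempted a bounded number of times and then abandoned, rather than being retried forever.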

I was able to verify this by tweaking {{CreateTableITest::TestCreateWhenMajorityOfReplicasFailCreation}}.
I verified that the CreateTablet RPCs time out and that no additional ones are issued to the tablet servers.

Next, I can work on a unit test that proves that CreateTablet RPCs are not retried indefinitely.

----

On another note, I noticed a bug in the {{CreateTableITest::TestCreateWhenMajorityOfReplicasFailCreation}} test that would have been the cause of the flakiness fixed by this [commit|https://github.com/apache/kudu/commit/d119d529beb691a84134d02e33ecdce6102a7a35].

https://github.com/apache/kudu/blob/master/src/kudu/integration-tests/create-table-itest.cc#L133-L134

> Catalog manager never gives up on CreateTablet RPCs
> ---------------------------------------------------
>
>                 Key: KUDU-2963
>                 URL: https://issues.apache.org/jira/browse/KUDU-2963
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Assignee: Bankim Bhavsar
>            Priority: Major
>              Labels: newbie
>
> This is a problem when there aren't enough live tservers upon which to place a tablet's replicas, or when a chosen tserver doesn't create the replica quickly enough. If the catalog manager decides to replace the tablet, the replaced tablet's CreateTablet RPCs continue to retry ad infinitum. If the previously dead tservers then come back to life, they must needlessly process the CreateTablet RPCs.
> The tablets are eventually deleted, either through explicit DeleteTablet RPCs (triggered by the catalog manager replacement process), or by heartbeating, but it's an unnecessary drain on cluster resources.
> We should probably abort CreateTablet RPCs for tablets that have been removed from their table.
> CreateTableITest_TestCreateWhenMajorityOfReplicasFailCreation demonstrates this acutely.


