Posted to issues@kudu.apache.org by "Yingchun Lai (Jira)" <ji...@apache.org> on 2019/11/04 12:51:00 UTC

[jira] [Updated] (KUDU-2992) Limit concurrent alter request of a table

     [ https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yingchun Lai updated KUDU-2992:
-------------------------------
    Description: 
One of our production clusters had an incident a few days ago. A user has a table whose partition schema looks like:
{code:java}
HASH (uuid) PARTITIONS 80, RANGE (date_hour) (
    PARTITION 2019102900 <= VALUES < 2019102901,
    PARTITION 2019102901 <= VALUES < 2019102902,
    PARTITION 2019102902 <= VALUES < 2019102903,
    PARTITION 2019102903 <= VALUES < 2019102904,
    PARTITION 2019102904 <= VALUES < 2019102905,
...)
{code}
He tried to remove many outdated partitions at once via SparkSQL, but the statement timed out at first; he then retried again and again, and SparkSQL kept failing. The cluster then became unstable, with memory usage and CPU load climbing.
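
For context, this is roughly what such a bulk partition drop looks like through the Kudu C++ client (a minimal sketch only, not the user's actual SparkSQL job; the master address, table name, and the INT32 type of date_hour are assumptions based on the schema above):
{code:java}
// Minimal sketch (assumed table/column names) of dropping a batch of
// outdated hourly range partitions with the Kudu C++ client.
#include <kudu/client/client.h>
#include <kudu/util/monotime.h>

using kudu::KuduPartialRow;
using kudu::MonoDelta;
using kudu::Status;
using kudu::client::KuduClient;
using kudu::client::KuduClientBuilder;
using kudu::client::KuduSchema;
using kudu::client::KuduTable;
using kudu::client::KuduTableAlterer;
using kudu::client::sp::shared_ptr;

int main() {
  shared_ptr<KuduClient> client;
  KUDU_CHECK_OK(KuduClientBuilder()
                    .add_master_server_addr("master-host:7051")  // assumed address
                    .Build(&client));

  shared_ptr<KuduTable> table;
  KUDU_CHECK_OK(client->OpenTable("events_by_hour", &table));    // assumed table name
  const KuduSchema& schema = table->schema();

  // One ALTER that drops a whole day of hourly ranges. Each dropped range
  // covers 80 hash-partitioned tablets, so the master fans this out into
  // many DeleteTablet RPCs.
  KuduTableAlterer* alterer = client->NewTableAlterer("events_by_hour");
  for (int hour = 2019102900; hour < 2019102924; ++hour) {
    KuduPartialRow* lower = schema.NewRow();  // the alterer takes ownership
    KuduPartialRow* upper = schema.NewRow();
    KUDU_CHECK_OK(lower->SetInt32("date_hour", hour));
    KUDU_CHECK_OK(upper->SetInt32("date_hour", hour + 1));
    alterer->DropRangePartition(lower, upper);
  }
  alterer->timeout(MonoDelta::FromSeconds(30));  // same order as the 30 s timeout in the logs
  Status s = alterer->Alter();
  delete alterer;
  return s.ok() ? 0 : 1;
}
{code}
When Alter() times out on the client side, a job like this typically just resubmits the same statement, which is the "try again and again" pattern described above.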

 

I found many log entries like:
{code:java}
W1030 17:29:53.382287  7588 rpcz_store.cc:259] Trace:
1030 17:26:19.714799 (+     0us) service_pool.cc:162] Inserting onto call queue
1030 17:26:19.714808 (+     9us) service_pool.cc:221] Handling call
1030 17:29:53.382204 (+213667396us) ts_tablet_manager.cc:874] Deleting tablet c52c5f43f7884d08b07fd0005e878fed
1030 17:29:53.382205 (+     1us) ts_tablet_manager.cc:794] Acquired tablet manager lock
1030 17:29:53.382208 (+     3us) inbound_call.cc:162] Queueing success response
Metrics: {"tablet-delete.queue_time_us":213667360}
W1030 17:29:53.382300  7586 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820316) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
W1030 17:29:53.382292 10623 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820315) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
W1030 17:29:53.382297 10622 rpcz_store.cc:259] Trace:
1030 17:26:19.714825 (+     0us) service_pool.cc:162] Inserting onto call queue
1030 17:26:19.714833 (+     8us) service_pool.cc:221] Handling call
1030 17:29:53.382239 (+213667406us) ts_tablet_manager.cc:874] Deleting tablet 479f8c592f16408c830637a0129359e1
1030 17:29:53.382241 (+     2us) ts_tablet_manager.cc:794] Acquired tablet manager lock
1030 17:29:53.382244 (+     3us) inbound_call.cc:162] Queueing success response
Metrics: {"tablet-delete.queue_time_us":213667378}
...{code}
That means acquiring the 'tablet manager lock' cost a lot of time, right? The relevant code is:
{code:java}
Status TSTabletManager::BeginReplicaStateTransition(
    const string& tablet_id,
    const string& reason,
    scoped_refptr<TabletReplica>* replica,
    scoped_refptr<TransitionInProgressDeleter>* deleter,
    TabletServerErrorPB::Code* error_code) {
  // Acquire the lock in exclusive mode as we'll add a entry to the
  // transition_in_progress_ map.
  std::lock_guard<RWMutex> lock(lock_);
  TRACE("Acquired tablet manager lock");
  RETURN_NOT_OK(CheckRunningUnlocked(error_code));
  ...
}{code}
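
As a generic illustration only (this is not Kudu code, and it does not claim to model exactly where Kudu spends the time): when many requests queue up behind one lock taken in exclusive mode, a waiter's latency grows roughly linearly with the queue depth, which is the kind of multi-minute wait the trace above shows.
{code:java}
// Toy demonstration of waiters piling up behind a single exclusive lock.
// std::mutex stands in for the exclusive acquisition of TSTabletManager's
// RWMutex; the sleep stands in for whatever work is serialized behind it.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  std::mutex tablet_manager_lock;
  const auto work_per_request = std::chrono::milliseconds(50);
  const int num_requests = 20;  // concurrent DeleteTablet-style requests

  std::vector<std::thread> workers;
  for (int i = 0; i < num_requests; ++i) {
    workers.emplace_back([&, i] {
      const auto start = std::chrono::steady_clock::now();
      std::lock_guard<std::mutex> guard(tablet_manager_lock);
      const auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
          std::chrono::steady_clock::now() - start);
      std::this_thread::sleep_for(work_per_request);  // work done under the lock
      std::cout << "request " << i << " waited " << waited.count() << " ms\n";
    });
  }
  for (auto& t : workers) t.join();
  return 0;
}
{code}
The last request waits roughly num_requests x work_per_request before it even starts; scale the numbers up and you get queue times on the order of minutes, as in the trace.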
But I think the root cause is that the Kudu master sends too many duplicate 'alter table / delete tablet' requests to the tservers. I found more info in the master's log:
{code:java}
$ grep "Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d" kudu_master.zjy-hadoop-prc-ct01.bj.work.log.INFO.20191030-204137.62788 | egrep "attempt = 1\)"
I1030 20:41:42.207222 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 43 ms (attempt = 1)
I1030 20:41:42.207556 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 40 ms (attempt = 1)
I1030 20:41:42.260052 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 31 ms (attempt = 1)
I1030 20:41:42.278609 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 19 ms (attempt = 1)
I1030 20:41:42.312175 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 48 ms (attempt = 1)
I1030 20:41:42.318933 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 62 ms (attempt = 1)
I1030 20:41:42.340060 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 30 ms (attempt = 1)
...{code}
That means the master received too many duplicate 'delete tablet' requests from the client and then dispatched all of them to the tservers. Note that each dropped range partition corresponds to 80 hash-partitioned tablets here, so a batch of drops that is retried over and over fans out into a very large number of DeleteTablet tasks for the same tablets.

I think we should limit the number of concurrent alter requests for a table: while an alter request is still in progress, subsequent alter requests should be rejected. Alternatively, we could limit the size of TableInfo::pending_tasks_.
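
To make the proposal concrete, here is a hypothetical sketch of such a guard. Everything in it (the PendingAlterTracker name, the threshold, the call sites) is invented for illustration and is not existing Kudu code; the real change would live in the catalog manager around TableInfo::pending_tasks_:
{code:java}
// Hypothetical guard: reject a new ALTER for a table while too many
// tablet-level tasks from earlier ALTERs are still pending.
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

struct AlterAdmission {
  bool accepted;
  std::string message;
};

class PendingAlterTracker {
 public:
  explicit PendingAlterTracker(size_t max_pending) : max_pending_(max_pending) {}

  // Called before scheduling the tablet-level tasks (e.g. DeleteTablet RPCs)
  // that an ALTER TABLE request expands into.
  AlterAdmission TryBeginTask(const std::string& table_id) {
    std::lock_guard<std::mutex> guard(lock_);
    size_t& pending = pending_tasks_[table_id];
    if (pending >= max_pending_) {
      return {false, "table " + table_id + " already has " +
                     std::to_string(pending) +
                     " pending alter tasks; rejecting new request"};
    }
    ++pending;
    return {true, ""};
  }

  // Called when a tablet-level task completes (successfully or not).
  void FinishTask(const std::string& table_id) {
    std::lock_guard<std::mutex> guard(lock_);
    auto it = pending_tasks_.find(table_id);
    if (it != pending_tasks_.end() && --it->second == 0) {
      pending_tasks_.erase(it);
    }
  }

 private:
  const size_t max_pending_;
  std::mutex lock_;
  // Per-table count of in-flight tasks, analogous to TableInfo::pending_tasks_.
  std::unordered_map<std::string, size_t> pending_tasks_;
};

int main() {
  PendingAlterTracker tracker(/*max_pending=*/100);
  AlterAdmission a = tracker.TryBeginTask("table-uuid");
  if (a.accepted) {
    // ... send the DeleteTablet / AlterTablet RPC, then on completion:
    tracker.FinishTask("table-uuid");
  }
  return 0;
}
{code}
With something like this in place, the second and later SparkSQL retries would get an immediate, explicit rejection instead of piling hundreds of duplicate DeleteTablet tasks onto the tservers.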

 

> Limit concurrent alter request of a table
> -----------------------------------------
>
>                 Key: KUDU-2992
>                 URL: https://issues.apache.org/jira/browse/KUDU-2992
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>            Reporter: Yingchun Lai
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)