You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "kun'qin (JIRA)" <ji...@apache.org> on 2019/05/10 01:48:00 UTC

[jira] [Commented] (KUDU-2453) kudu should stop creating tablet infinitely

    [ https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836823#comment-16836823 ] 

kun'qin  commented on KUDU-2453:
--------------------------------

你的问题解决了吗?

> kudu should stop creating tablet infinitely
> -------------------------------------------
>
>                 Key: KUDU-2453
>                 URL: https://issues.apache.org/jira/browse/KUDU-2453
>             Project: Kudu
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.4.0, 1.7.2
>            Reporter: HeLifu
>            Priority: Major
>
> I have met this problem again on 2018/10/26. And now the kudu version is 1.7.2.
> -----------------------------------------------------
> We modified the flag 'max_create_tablets_per_ts' (2000) of master.conf, and there are some load on the kudu cluster. Then someone else created a big table which had tens of thousands of tablets from impala-shell (that was a mistake). 
> {code:java}
> CREATE TABLE XXX(
> ...
>    PRIMARY KEY (...)
> )
> PARTITION BY HASH (...) PARTITIONS 100,
> RANGE (...)
> (
>   PARTITION "2018-10-24" <= VALUES < "2018-10-24\000",
>   PARTITION "2018-10-25" <= VALUES < "2018-10-25\000",
>   ...
>   PARTITION "2018-12-07" <= VALUES < "2018-12-07\000"
> )
> STORED AS KUDU
> TBLPROPERTIES ('kudu.master_addresses'= '...');
> {code}
> Here are the logs after creating table (only pick one tablet as example):
> {code:java}
> ------------------------------------------------------Kudu-master log
> ======e884bda6bbd3482f94c07ca0f34f99a4======
> W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:50247 dropped due to backpressure. The service queue is full; it has 512 items.
> I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1)
> ...
> ======Be replaced by 0b144c00f35d48cca4d4981698faef72======
> W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72
> ...
> I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Sending DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4
> ...
> I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST)
> ...
> W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 already in progress: creating tablet
> ...
> I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1)
> ...
> W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:59735 dropped due to backpressure. The service queue is full; it has 512 items.
> I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1)
> ...
> ======Be replaced by c0e0acc448fc42fc9e48f5025b112a75======
> W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75
> ...
> ------------------------------------------------------Kudu-tserver log
> I1024 11:40:52.014571 137358 tablet_service.cc:758] Processing CreateTablet for tablet e884bda6bbd3482f94c07ca0f34f99a4 (table=quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]), partition=HASH (user_id) PARTITION 29, RANGE (dt) PARTITION "2018-11-10" <= VALUES < "2018-11-10\000"
> ...
> I1024 11:40:52.017539 137358 ts_tablet_manager.cc:1080] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Registered tablet (data state: TABLET_DATA_READY)
> ...
> I1024 11:41:22.392292 137355 tablet_service.cc:799] Processing DeleteTablet for tablet e884bda6bbd3482f94c07ca0f34f99a4 with delete_type TABLET_DATA_DELETED (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) from {username='kudu'} at 10.120.219.118:50247
> ...
> I1024 12:03:31.079942 126376 ts_tablet_manager.cc:938] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Bootstrapping tablet
> I1024 12:03:31.079965 126376 tablet_bootstrap.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Bootstrap starting.
> ...
> I1024 12:03:31.080237 126376 tablet_bootstrap.cc:581] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No blocks or log segments found. Creating new log.
> ...
> I1024 12:03:31.080834 126376 tablet_bootstrap.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No bootstrap required, opened a new log
> I1024 12:03:31.080870 126376 ts_tablet_manager.cc:955] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time spent bootstrapping tablet: real 0.001s user 0.004s sys 0.000s
> I1024 12:03:31.080987 137368 tablet_service.cc:799] Processing DeleteTablet for tablet 0ba3a2963c524d859a85ad5fbce5bf96 with delete_type TABLET_DATA_DELETED (Replaced by 539e0ab40ec1442a94b330c18521d27f at 2018-10-24 11:42:23 CST) from {username='kudu'} at 10.120.219.118:50247
> I1024 12:03:31.080996 126376 raft_consensus.cc:304] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Replica starting. Triggering 0 pending transactions. Active config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } }
> I1024 12:03:31.081007 126376 raft_consensus.cc:330] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Consensus starting up: Expiring failure detector timer to make a prompt election more likely
> I1024 12:03:31.081014 126376 raft_consensus.cc:605] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Becoming Follower/Learner. State: Replica: 39f15fcf42ef45bba0c95a3223dc25ee, State: Initialized, Role: FOLLOWER
> I1024 12:03:31.081053 126376 consensus_queue.cc:226] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated index: 0, Majority replicated index: 0, Committed index: 0, Last appended: 0.0, Last appended by leader: 0, Current term: 0, Majority size: -1, State: 0, Mode: NON_LEADER, active raft config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } }
> ...
> I1024 12:03:35.315024 161474 raft_consensus.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
> I1024 12:03:35.315050 161474 raft_consensus.cc:2700] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing failure detection for 2.642s (starting election)
> I1024 12:03:35.315111 161474 raft_consensus.cc:458] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } }
> I1024 12:03:35.315141 137342 tablet_service.cc:799] Processing DeleteTablet for tablet c3f47046905e4c0f9ffe21a9e7a5bc8b with delete_type TABLET_DATA_DELETED (Replaced by 28f11e17ecdc45d2a4d72c74afc2a488 at 2018-10-24 11:45:27 CST) from {username='kudu'} at 10.120.219.118:50247
> I1024 12:03:35.315763 161474 leader_election.cc:231] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer cd75ddaabaa8487b9fe1feab9d2cba83
> I1024 12:03:35.315790 161474 leader_election.cc:231] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer 061221dd1c3e40a3a3338afc74bb66f5
> W1024 12:03:35.315987 137266 leader_election.cc:293] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to peer cd75ddaabaa8487b9fe1feab9d2cba83: Illegal state: must be running to vote when last-logged opid is not known
> W1024 12:03:35.316056 137267 leader_election.cc:293] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to peer 061221dd1c3e40a3a3338afc74bb66f5: Illegal state: must be running to vote when last-logged opid is not known
> I1024 12:03:35.316074 137267 leader_election.cc:258] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Election decided. Result: candidate lost.
> I1024 12:03:35.316102 161474 raft_consensus.cc:2700] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing failure detection for 4.310s (election complete)
> I1024 12:03:35.316115 161474 raft_consensus.cc:2455] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Leader pre-election lost for term 1. Reason: could not achieve majority
> ...
> I1024 12:04:01.303092 126376 ts_tablet_manager.cc:983] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time spent starting tablet: real 30.222s user 0.000s sys 0.000s
> W1024 12:04:01.303108 126376 ts_tablet_manager.cc:1006] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Tablet startup took 30222ms
> ...
> W1024 12:04:01.303225 126376 ts_tablet_manager.cc:1008] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Trace:
> 1024 11:40:51.956652 (+ 0us) service_pool.cc:163] Inserting onto call queue
> 1024 11:40:52.014539 (+ 57887us) service_pool.cc:222] Handling call
> 1024 11:40:52.014585 (+ 46us) ts_tablet_manager.cc:357] Acquired tablet manager lock
> 1024 11:40:52.014588 (+ 3us) ts_tablet_manager.cc:370] Creating new metadata...
> 1024 11:40:52.016532 (+ 1944us) tablet_metadata.cc:556] Metadata flushed
> 1024 11:40:52.017445 (+ 913us) tablet_replica.cc:143] Creating consensus instance
> 1024 11:40:52.017601 (+ 156us) inbound_call.cc:157] Queueing success response
> 1024 12:03:31.079950 (+1359062349us) ts_tablet_manager.cc:939] Bootstrapping tablet
> 1024 12:03:31.080879 (+ 929us) ts_tablet_manager.cc:984] Starting tablet replica
> 1024 12:03:31.080893 (+ 14us) tablet_replica.cc:195] Starting instrumentation
> 1024 12:03:31.080900 (+ 7us) tablet_replica.cc:207] Starting consensus
> Metrics: {"fdatasync":1,"fdatasync_us":238,"mutex_wait_us":30221902,"raft.queue_time_us":6,"raft.run_cpu_time_us":4,"raft.run_wall_time_us":5,"tablet-open.queue_time_us":1359062354}
> {code}
> {color:#ff0000}Tablet startup took 30+ seconds.{color}
> It was a long time for him to wait, so he did "ctrl+c". But we found that the tablets in 'INITIALIZED' status was growing rapidly, -half- an hour later it was 350,000:(
> {color:#ff0000}100(hash) * 45(range) * 3(RF) * (60(minute) * 60(second) / 30(repeat/second)) / 5(tservers) = 324000 (tablets/tserver).{color}
> {color:#333333}{color:#ff0000}(created tablets: {color}{color:#ff0000}60m * 60s / 30+s * 12(threads) = 1440 (tablets per hour)){color}{color}
> We deleted this table by kudu client tool, and found that the number of 'INITIALIZED' tablets was going down slowly. By simple estimating it will take 10+ days to be back to normal.  But luckily, the application system are not affected.
> {code:java}
> ------------------------------------------------------Kudu-master log with another tablet uuid
> I1031 16:21:21.644222 180146 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet d1fd56be8eef44e782d509a0eeae9c15 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST)
> W1031 16:21:21.644421 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet d1fd56be8eef44e782d509a0eeae9c15 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet d1fd56be8eef44e782d509a0eeae9c15 already in progress: creating tablet
> I1031 16:21:21.644436 180146 catalog_manager.cc:2700] Scheduling retry of d1fd56be8eef44e782d509a0eeae9c15 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 553 ms (attempt = 6)
> ------------------------------------------------------Kudu-tserver log
> I1031 16:21:22.197888 137341 tablet_service.cc:799] Processing DeleteTablet for tablet d1fd56be8eef44e782d509a0eeae9c15 with delete_type TABLET_DATA_DELETED (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST) from {username='kudu'} at 10.120.219.118:50247
> I1031 16:21:22.230309 137131 maintenance_manager.cc:492] P 39f15fcf42ef45bba0c95a3223dc25ee: FlushDeltaMemStoresOp(70499bc0f9ac4d8196ae5a0be6ef0b8b) complete. Timing: real 0.416s	user 0.404s	sys 0.008s Metrics: {"fdatasync":3,"fdatasync_us":2583,"lbm_write_time_us":29,"lbm_writes_lt_1ms":4}
> I1031 16:21:22.321700 137341 tablet_service.cc:799] Processing DeleteTablet for tablet 74a30181dea9400a9bcfaeb56f83f379 with delete_type TABLET_DATA_DELETED (Replaced by 31e350fddea443048946f5a20d3171bd at 2018-10-31 16:21:13 CST) from {username='kudu'} at 10.120.219.118:50247
> I1031 16:21:22.350440 137341 tablet_service.cc:799] Processing DeleteTablet for tablet 7c864af01309432c9a2a4d1c88bbe52b with delete_type TABLET_DATA_DELETED (Replaced by ec4b733818d940e0af32c51bda3c7^C
>  {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)