Posted to issues@kudu.apache.org by "ZhangZhen (JIRA)" <ji...@apache.org> on 2017/11/30 07:05:00 UTC

[jira] [Commented] (KUDU-2206) Create table timeout due to too many DRS in one tablet cause lock contention

    [ https://issues.apache.org/jira/browse/KUDU-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272245#comment-16272245 ] 

ZhangZhen commented on KUDU-2206:
---------------------------------

To summarize this issue:
I have a table with about 30K DRSs (DiskRowSets) in one tablet, which causes MaintenanceManager::FindBestOp to take a long time (40s) to do the compaction policy calculation. While it runs, the maintenance manager holds a lock that the CreateTable RPC also needs, and that results in the CreateTable RPC timing out.
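
The contention described above can be illustrated with a minimal Python sketch. It is not Kudu code: the lock, the sleep standing in for the policy calculation, and the name simulate() are all assumptions made for illustration. It only shows how a caller with a deadline fails when another thread holds the shared lock for longer than that deadline.

```python
import threading
import time

def simulate(hold_seconds, rpc_timeout_seconds):
    """Return True if the simulated RPC got the lock before its timeout.

    Hypothetical model: `lock` stands in for the lock shared by the
    maintenance manager and the CreateTable path.
    """
    lock = threading.Lock()
    holding = threading.Event()  # set once the background thread owns the lock

    def long_calculation():
        with lock:
            holding.set()
            time.sleep(hold_seconds)  # stands in for the slow policy calculation

    worker = threading.Thread(target=long_calculation)
    worker.start()
    holding.wait()  # make sure the lock is already taken

    # The "RPC" gives up if it cannot take the lock within its timeout.
    acquired = lock.acquire(timeout=rpc_timeout_seconds)
    if acquired:
        lock.release()
    worker.join()
    return acquired

if __name__ == "__main__":
    print(simulate(hold_seconds=0.2, rpc_timeout_seconds=0.05))
    print(simulate(hold_seconds=0.01, rpc_timeout_seconds=1.0))
```

With the real numbers from this issue the same shape applies: a 40s hold against a 30s client deadline.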

Todd made an improvement that short-circuits the compaction policy calculation as soon as we know the compaction won't be worthwhile. It works for my case because none of the DRSs in my table overlap. Thanks [~tlipcon]. The review is at https://gerrit.cloudera.org/#/c/8444/
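
As a rough illustration of the short-circuit idea (not the actual change in the review above, which may decide "not worthwhile" differently), the sketch below skips an expensive selection step when a cheap check shows that no rowsets overlap. The tuple representation of a rowset's key range and the names any_overlap() and pick_compaction() are assumptions made for this example.

```python
def any_overlap(rowsets):
    """Return True if any two key ranges overlap.

    rowsets: list of (min_key, max_key) tuples with min_key <= max_key.
    Sorting by min_key and tracking the largest max_key seen so far
    finds an overlap in O(n log n) time.
    """
    prev_max = None
    for lo, hi in sorted(rowsets):
        if prev_max is not None and lo <= prev_max:
            return True
        prev_max = hi if prev_max is None else max(prev_max, hi)
    return False

def pick_compaction(rowsets, expensive_policy):
    """Run the expensive policy only when a compaction could help."""
    if not any_overlap(rowsets):
        return []  # short circuit: nothing worthwhile to compact
    return expensive_policy(rowsets)

if __name__ == "__main__":
    # 30K disjoint ranges, similar in shape to the case in this issue.
    disjoint = [(i * 10, i * 10 + 5) for i in range(30000)]

    def policy(rs):
        # Hypothetical stand-in for the slow calculation.
        return list(rs)

    print(pick_compaction(disjoint, policy))
    print(pick_compaction([(0, 5), (3, 8)], policy))
```

The point of the design is that the cheap check bounds the benefit before any costly work starts, so the time spent holding shared resources stays short when there is nothing to gain.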



> Create table timeout due to too many DRS in one tablet cause lock contention
> ----------------------------------------------------------------------------
>
>                 Key: KUDU-2206
>                 URL: https://issues.apache.org/jira/browse/KUDU-2206
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: ZhangZhen
>         Attachments: kudu_master.log, pstack.zip, trace_tserver07_trace.json, tserver07.flags, tserver_01_0f53a0d3.log, tserver_07_23f962e4a1.log, tsever_02_0a8bbcbb.log
>
>
> We encountered an RPC timeout exception when using Spark SQL, which uses the Kudu Java client internally, to create a table on a Kudu cluster. The cluster has 10 tservers and 1 master on 10 machines; the target table has 10 range partitions and 5 hash partitions.
> From the web UI, I found it took about 3 minutes before all the tablets elected a leader, and I can see a lot of delete tablet records in the UI like:
> Delete Tablet	Running	2.13 min	719f0f496bc34a469e4069b2861b4be8 Delete Tablet RPC for TS=044f1da9a27c46acb82b1386f829f4dc
> I also found many retry records in the tserver logs, like:
> W1031 23:04:40.088256  5816 consensus_peers.cc:357] T fcde65c4e4cf4df29b9ef9884ce292b2 P 0f53a0d3ef7e44ebb0365c800752d5bd -> Peer 23f962e4a1744381ad5fa0d2d8b10241 (c3-kudu-tst-st07.bj:18700): Couldn't send request to peer 23f962e4a1744381ad5fa0d2d8b10241 for tablet fcde65c4e4cf4df29b9ef9884ce292b2. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 94 times.
> The attachments contain the master and tserver logs, starting from when the master received the create table request.
> The Kudu version is 1.3.0; the nearest commit is 00813f96b9cb0c9ec57a17e5c85242f7679db0e0
> The exception the client received looks like:
> Error: org.apache.kudu.client.NonRecoverableException: RPC can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=25, DeadlineTracker(timeout=30000, elapsed=28499), Traces: [0ms] sending RPC to server , [0ms] received from server  response OK, [20ms] sending RPC to server , [20ms] received from server  response OK, [40ms] sending RPC to server , [40ms] received from server  response OK, [59ms] sending RPC to server , [60ms] received from server  response OK, [80ms] sending RPC to server , [80ms] received from server  response OK, [100ms] sending RPC to server , [100ms] received from server  response OK, [140ms] sending RPC to server , [141ms] received from server  response OK, [200ms] sending RPC to server , [200ms] received from server  response OK, [319ms] sending RPC to server , [320ms] received from server  response OK, [780ms] sending RPC to server , [780ms] received from server  response OK, [2740ms] sending RPC to server , [2741ms] received from server  response OK, [3580ms] sending RPC to server , [3580ms] received from server  response OK, [4840ms] sending RPC to server , [4840ms] received from server  response OK, [7080ms] sending RPC to server , [7081ms] received from server  response OK, [8320ms] sending RPC to server , [8321ms] received from server  response OK, [11620ms] sending RPC to server , [11621ms] received from server  response OK, [13540ms] sending RPC to server , [13540ms] received from server  response OK, [16819ms] sending RPC to server , [16820ms] received from server  response OK, [19020ms] sending RPC to server , [19020ms] received from server  response OK, [21340ms] sending RPC to server , [21341ms] received from server  response OK, [24660ms] sending RPC to server , [24661ms] received from server  response OK, [26800ms] sending RPC to server , [26800ms] received from server  response OK, [27660ms] sending RPC to server , [27660ms] received from server  response OK, [28480ms] sending RPC to server , 
[28481ms] received from server


