You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "ZhangZhen (JIRA)" <ji...@apache.org> on 2017/11/01 04:35:00 UTC

[jira] [Commented] (KUDU-2206) Kudu client create table timeout

    [ https://issues.apache.org/jira/browse/KUDU-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233637#comment-16233637 ] 

ZhangZhen commented on KUDU-2206:
---------------------------------

Hi Adar, the tserver 07 has 344 tablets running on it and as this is a staging cluster, most of the tablets are empty, only 24 tablet has data and each use about 40M disk size. And yes, this is a release 1.3.0 version of Kudu.
I just reproduce this issue and at the same time run pstack on tserver 07 10 times with 1 second between them, you can find the jstack logs in the attachment.
And sorry, I failed to use /tracing.html tool to get more clues, after choose the maintenance category and click the record button, it seems nothing is recorded. I see nothing after I click the finish. Is there some wiki about using this tool?




> Kudu client create table timeout
> --------------------------------
>
>                 Key: KUDU-2206
>                 URL: https://issues.apache.org/jira/browse/KUDU-2206
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: ZhangZhen
>            Priority: Major
>         Attachments: kudu_master.log, pstack.zip, tserver_01_0f53a0d3.log, tserver_07_23f962e4a1.log, tsever_02_0a8bbcbb.log
>
>
> We encountered rpc timeout exception when we use sparksql, which use Java kudu client innerly, to create table on kudu cluster. The cluster has 10 tserver and 1 master on 10 machines, the target table has 10 range partitions and 5 hash partitions. 
> From the web UI, I found it spent about 3 minutes before all the tablets vote a leader, and I can see a lot delete tablet records in the UI like:
> Delete Tablet	Running	2.13 min	719f0f496bc34a469e4069b2861b4be8 Delete Tablet RPC for TS=044f1da9a27c46acb82b1386f829f4dc
> Also I find many retry records in tserver logs, like:
> W1031 23:04:40.088256  5816 consensus_peers.cc:357] T fcde65c4e4cf4df29b9ef9884ce292b2 P 0f53a0d3ef7e44ebb0365c800752d5bd -> Peer 23f962e4a1744381ad5fa0d2d8b10241 (c3-kudu-tst-st07.bj:18700): Couldn't send request to peer 23f962e4a1744381ad5fa0d2d8b10241 for tablet fcde65c4e4cf4df29b9ef9884ce292b2. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 94 times.
> You can find the logs of master and tserver since master receive the create table request in the attachment.
> The kudu version is 1.3.0, the nearest commit is 00813f96b9cb0c9ec57a17e5c85242f7679db0e0
> The exception that client received is like:
> Error: org.apache.kudu.client.NonRecoverableException: RPC can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=25, DeadlineTracker(timeout=30000, elapsed=28499), Traces: [0ms] sending RPC to server , [0ms] received from server  response OK, [20ms] sending RPC to server , [20ms] received from server  response OK, [40ms] sending RPC to server , [40ms] received from server  response OK, [59ms] sending RPC to server , [60ms] received from server  response OK, [80ms] sending RPC to server , [80ms] received from server  response OK, [100ms] sending RPC to server , [100ms] received from server  response OK, [140ms] sending RPC to server , [141ms] received from server  response OK, [200ms] sending RPC to server , [200ms] received from server  response OK, [319ms] sending RPC to server , [320ms] received from server  response OK, [780ms] sending RPC to server , [780ms] received from server  response OK, [2740ms] sending RPC to server , [2741ms] received from server  response OK, [3580ms] sending RPC to server , [3580ms] received from server  response OK, [4840ms] sending RPC to server , [4840ms] received from server  response OK, [7080ms] sending RPC to server , [7081ms] received from server  response OK, [8320ms] sending RPC to server , [8321ms] received from server  response OK, [11620ms] sending RPC to server , [11621ms] received from server  response OK, [13540ms] sending RPC to server , [13540ms] received from server  response OK, [16819ms] sending RPC to server , [16820ms] received from server  response OK, [19020ms] sending RPC to server , [19020ms] received from server  response OK, [21340ms] sending RPC to server , [21341ms] received from server  response OK, [24660ms] sending RPC to server , [24661ms] received from server  response OK, [26800ms] sending RPC to server , [26800ms] received from server  response OK, [27660ms] sending RPC to server , [27660ms] received from server  response OK, [28480ms] sending RPC to server , [28481ms] received from server



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)