You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/02/10 05:47:00 UTC

[jira] [Commented] (KUDU-3349) Kudu java client failed to demote leader and caused a lot of deleting rows timeout

    [ https://issues.apache.org/jira/browse/KUDU-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489970#comment-17489970 ] 

ASF subversion and git services commented on KUDU-3349:
-------------------------------------------------------

Commit 90895ce76590f10730ad7aac3613b69d89ff5422 in kudu's branch refs/heads/master from Hongjiang Zhang
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=90895ce ]

[java] KUDU-3349 Fix the failure to demote a leader

KuduScanToken gets a wrong tserver's uuid whose format is something
like: '<ByteString@6dffd497 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437">',
the expected uuid should be "fc07f681d3ea4bab9bc5ec8090ab9437".

This issue caused RemoteTablet to fail to demote a leader, and the java
client always sends write ops to the demoted leader. As a result, there
are a lot of "PendingErrors overflowed. Failed to write at least 1000 rows to Kudu".

After this fix, the write ops, especially the deleting, will be faster.

Change-Id: I2974b6ec2cec2f0120b113d1bcf89fe3793a1ec5
Reviewed-on: http://gerrit.cloudera.org:8080/18166
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <as...@cloudera.com>


> Kudu java client failed to demote leader and caused a lot of deleting rows timeout
> ----------------------------------------------------------------------------------
>
>                 Key: KUDU-3349
>                 URL: https://issues.apache.org/jira/browse/KUDU-3349
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Redriver
>            Priority: Major
>
> During deleting rows through Spark, I found a lot of PendingErrors which caused timeout, the deleting takes very long time, and finally failed sometimes.
> {code:java}
> java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout: Batch{operations=100, tablet="b037e2e266b44c4c95da4065a8d5b719" [0x00000006, 0x00000007), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=b037e2e266b44c4c95da4065a8d5b719, attempt=23, TimeoutTracker(timeout=30000, elapsed=26852), Traces: [0ms] sending RPC to server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [2ms] delaying RPC due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [2ms] received response from server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [10ms] sending RPC to server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [12ms] delaying RPC due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [12ms] received response from server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [30ms] sending RPC to server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [32ms] delaying RPC due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [32ms] received response from server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [51ms] sending RPC to server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [52ms] delaying RPC due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [52ms] received response from server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [70ms] sending RPC to server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [72ms] delaying RPC due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [72ms] received response from server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [90ms] sending RPC to server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [92ms] delaying RPC due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [92ms] received response from server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [170ms] sending RPC to server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [172ms] delaying RPC due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [172ms] received response from server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [190ms] sending RPC to server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [192ms] delaying RPC due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (error 0), [192ms] received response from server <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role FOLLOWER (erro
> ...
> {code}
> Anyway, I think I found the RCA.
> After enabling DEBUG level logging, I found something interesting. The UUID is in fact the same, but kudu-client thinks they are not. See the below.
> {code:java}
> 22/01/19 23:34:34,621 DEBUG [kudu-nio-3] client.RemoteTablet:170 : <ByteString@6dffd497 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437"> wasn't the leader for 44fa35c99e7042329bbfa0268c1cd4de, current leader is <ByteString@5aadafd0 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437">
> {code}
> This issue caused the kudu-client to fail to demote the leader.
> {code:java}
>   void demoteLeader(String uuid) {
>     synchronized (tabletServers) {
>       if (leaderUuid == null) {
>         LOG.debug("{} couldn't be demoted as the leader for {}, there is no known leader",
>             uuid, getTabletId());
>         return;
>       }
>       if (leaderUuid.equals(uuid)) {
>         leaderUuid = null;
>         LOG.debug("{} was demoted as the leader for {}", uuid, getTabletId());
>       } else {
>         LOG.debug("{} wasn't the leader for {}, current leader is {}", uuid,
>             getTabletId(), leaderUuid);
>       }
>     }
>   }
> {code}
> I take some time to debug, and finally found the tserver's uuid is generated by the "serverMetadataPB.getUuid().toString()". [https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L246]
> The correct way to get Uuid() is "serverMetadataPB.getUuid().toStringUtf8()"
> After fixing this bug, the deleting becomes faster than before because the client will not send write to the wrong leader.
> I'll submit a patch for this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)