Posted to issues@kudu.apache.org by "zhangsong (JIRA)" <ji...@apache.org> on 2017/01/25 09:38:26 UTC

[jira] [Commented] (KUDU-1847) kudu-tserver should remove itself from raft-peer-config when met tablet corruption

    [ https://issues.apache.org/jira/browse/KUDU-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837447#comment-15837447 ] 

zhangsong commented on KUDU-1847:
---------------------------------

The timeline is:
0 The raft-config is l1-f1-f2.
1 One follower peer (f2) lost its network connection to the other nodes.
2 The leader l1 removed f2 from the raft-config and committed the change.
3 Afterwards, the kudu-tserver nodes serving l1 and f2 both crashed.
4 f2 started successfully after the kudu-tserver restart.
5 l1 failed to start due to "Failed to open rowset RowSet(65780): Not found: Can't find block: 0000000318394411".
6 In the end, only one follower is left for this tablet.

So this should not be a bug; instead, we need a tool to recover/reset a tablet in this scenario, when only one follower is alive.
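
To illustrate why the remaining follower cannot elect itself after step 5, here is a minimal sketch of the majority arithmetic. It is not Kudu code: the names `majority` and `can_elect_leader` are hypothetical, and it assumes the usual Raft rule that a candidate needs votes from more than half of the voters in the committed config.

```python
def majority(num_voters: int) -> int:
    """Smallest number of votes that is more than half of the voters."""
    return num_voters // 2 + 1


def can_elect_leader(voters, healthy) -> bool:
    """True if enough voters in the config are healthy to form a majority.

    Assumption for this sketch: only replicas that are both in the config
    and running can grant a vote.
    """
    available_votes = sum(1 for v in voters if v in healthy)
    return available_votes >= majority(len(voters))


# Config after step 2: f2 has been removed, leaving two voters.
config_after_removal = ["l1", "f1"]

# After step 5: l1 fails to start, f2 is running but is no longer a voter.
healthy_replicas = {"f1", "f2"}

print(majority(len(config_after_removal)))                           # 2
print(can_elect_leader(config_after_removal, healthy_replicas))      # False
print(can_elect_leader(["l1", "f1", "f2"], healthy_replicas))        # True
```

With a two-voter config both voters are required, so a single failed replica blocks every election. With the original three-voter config, f1 and f2 together would still have formed a majority.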

> kudu-tserver should remove itself from raft-peer-config when met tablet corruption
> ----------------------------------------------------------------------------------
>
>                 Key: KUDU-1847
>                 URL: https://issues.apache.org/jira/browse/KUDU-1847
>             Project: Kudu
>          Issue Type: Bug
>          Components: cfile, consensus
>    Affects Versions: 1.0.0
>            Reporter: zhangsong
>            Priority: Critical
>
> Problem found:
> Today one of my tables became unwritable. From kudu-master, I found there is only one "FOLLOWER" left in the raft-config of a tablet.
> After searching kudu-tserver.LOG I found error logs like this:
> "I0124 03:29:16.000665 17144 raft_consensus.cc:380] T 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [term 173317 FOLLOWER]: Starting election with config: opid_index: 572616 local: false peers { permanent_uuid: "69947ffe22e245afb579287073c58dc2" member_type: VOTER last_known_addr { host: "peer_ip" port: 7050 } } peers { permanent_uuid: "1fa77467172b4ed7ba1a0a10e3dd67f8" member_type: VOTER last_known_addr { host: "localhost" port: 7050 } }
> I0124 03:29:16.001211 17144 leader_election.cc:223] T 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [CANDIDATE]: Term 173317 election: Requesting vote from peer 69947ffe22e245afb579287073c58dc2
> W0124 03:29:16.001549 15548 leader_election.cc:281] T 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [CANDIDATE]: Term 173317 election: Tablet error from VoteRequest() call to peer 69947ffe22e245afb579287073c58dc2: Illegal state: Tablet not RUNNING: FAILED: Not found: Can't find block: 0000000318394411
> I0124 03:29:16.001845 15548 leader_election.cc:248] T 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [CANDIDATE]: Term 173317 election: Election decided. Result: candidate lost.
> "
> These logs indicate that the current follower (f1) of the tablet started a leader election (after the election timeout) and found that the tablet on another follower (f2) is not running (corruption), so the election failed.
> In the end only one follower of the tablet is alive.
> I also found that the tablet on f2 has been corrupted for several days.
> Hence I think this is a bug: we lack logic to remove a peer from the RaftConfig when that peer's tablet data is corrupted.


