Posted to issues@kudu.apache.org by "Grant Henke (Jira)" <ji...@apache.org> on 2020/06/03 02:47:00 UTC
[jira] [Resolved] (KUDU-2410) Add auto-repair function to ksck to
repair "stuck tablet" situations common on older versions
[ https://issues.apache.org/jira/browse/KUDU-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke resolved KUDU-2410.
-------------------------------
Fix Version/s: NA
Resolution: Won't Fix
> Add auto-repair function to ksck to repair "stuck tablet" situations common on older versions
> ---------------------------------------------------------------------------------------------
>
> Key: KUDU-2410
> URL: https://issues.apache.org/jira/browse/KUDU-2410
> Project: Kudu
> Issue Type: Improvement
> Components: supportability
> Affects Versions: 1.7.0
> Reporter: William Berkeley
> Assignee: William Berkeley
> Priority: Major
> Fix For: NA
>
>
> There are two common situations where tablets get stuck and can't recover automatically, characterized by the following ksck outputs:
> Tombstone + eviction in-flight:
> {noformat}
> Tablet 796d3d67d6e0429fb5f91c2c7bbd486d of table 'loadgen_auto_802e774c09d74a208330db4c108a7d30' is under-replicated: 1 replica(s) not RUNNING
> 16204380dc404171bebd99af2504cb14 (wdb-k015-2:7050): RUNNING
> 61dec96f5aed4cd2a47814de42d721e6 (wdb-k015-3:7050): RUNNING [LEADER]
> d1689e073948415a901c64a9e9269416 (wdb-k015-1:7050): bad state
> State: NOT_STARTED
> Data state: TABLET_DATA_TOMBSTONED
> Last status: Tablet initializing...
> 2 replicas' active configs differ from the master's.
> All the peers reported by the master and tablet servers are:
> A = 16204380dc404171bebd99af2504cb14
> B = 61dec96f5aed4cd2a47814de42d721e6
> C = d1689e073948415a901c64a9e9269416
> The consensus matrix is:
> Config source | Voters | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
> master | A B* C | | | Yes
> A | A B C | 2 | 305 | Yes
> B | B C | 2 | 307 | No
> C | [config not available] | | |
> {noformat}
> Permanently failed + eviction in-flight:
> {noformat}
> Tablet 796d3d67d6e0429fb5f91c2c7bbd486d of table 'loadgen_auto_802e774c09d74a208330db4c108a7d30' is under-replicated: 1 replica(s) not RUNNING
> 16204380dc404171bebd99af2504cb14 (wdb-k015-2:7050): RUNNING
> 61dec96f5aed4cd2a47814de42d721e6 (wdb-k015-3:7050): RUNNING [LEADER]
> d1689e073948415a901c64a9e9269416 (wdb-k015-1:7050): missing
> 2 replicas' active configs differ from the master's.
> All the peers reported by the master and tablet servers are:
> A = 16204380dc404171bebd99af2504cb14
> B = 61dec96f5aed4cd2a47814de42d721e6
> C = d1689e073948415a901c64a9e9269416
> The consensus matrix is:
> Config source | Voters | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
> master | A B* C | | | Yes
> A | A B C | 2 | 305 | Yes
> B | B C | 2 | 307 | No
> C | [config not available] | | |
> {noformat}
> The former case is resolved by tombstoned voting (KUDU-871), while the latter is made much less likely by 3-4-3 replication (KUDU-1097).
> However, tablets still get stuck on older versions. It shouldn't be too hard to enhance ksck to detect these two situations and fix them automatically: by tablet copying B -> C in the first case, and by aborting the pending config change on B in the second.
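The detection step described above can be sketched roughly as follows. This is only an illustration of the decision logic, not ksck code: the names `Repair`, `PendingConfig`, and `classify` are hypothetical, and the replica states are simplified to the labels "RUNNING", "TOMBSTONED", and "MISSING". It assumes the tablet is stuck when the leader's uncommitted config cannot reach a majority of running voters.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Optional


class Repair(Enum):
    NONE = "no repair suggested"
    TABLET_COPY = "tablet copy from the leader to the tombstoned replica"
    ABORT_CONFIG_CHANGE = "abort the pending config change on the leader"


@dataclass(frozen=True)
class PendingConfig:
    """The leader's uncommitted config (hypothetical, simplified model)."""
    leader: str
    voters: FrozenSet[str]


def classify(states: Dict[str, str], pending: Optional[PendingConfig]) -> Repair:
    """Suggest a repair for a tablet given its replica states and pending config.

    `states` maps a replica label to "RUNNING", "TOMBSTONED", or "MISSING";
    a replica absent from the map is treated as "MISSING".
    """
    if pending is None:
        # No config change in flight, so neither situation applies.
        return Repair.NONE

    down = {v for v in pending.voters if states.get(v, "MISSING") != "RUNNING"}
    majority = len(pending.voters) // 2 + 1
    if len(pending.voters) - len(down) >= majority:
        # The pending config can still commit on its own.
        return Repair.NONE

    down_states = {states.get(v, "MISSING") for v in down}
    if "MISSING" in down_states:
        # Permanently failed + eviction in-flight.
        return Repair.ABORT_CONFIG_CHANGE
    if down_states == {"TOMBSTONED"}:
        # Tombstone + eviction in-flight.
        return Repair.TABLET_COPY
    return Repair.NONE


# The two consensus matrices above: B is the leader with pending config {B, C}.
pending = PendingConfig(leader="B", voters=frozenset({"B", "C"}))
print(classify({"A": "RUNNING", "B": "RUNNING", "C": "TOMBSTONED"}, pending).name)
print(classify({"A": "RUNNING", "B": "RUNNING", "C": "MISSING"}, pending).name)
```

In both matrices the pending config {B, C} needs two votes but only B is running, which is why the sketch keys on the state of the non-running voter to pick between the two repairs.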
--
This message was sent by Atlassian Jira
(v8.3.4#803005)