Posted to issues@kudu.apache.org by "Dinesh Bhat (JIRA)" <ji...@apache.org> on 2016/10/14 18:58:20 UTC
[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

    [ https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576176#comment-15576176 ] 

Dinesh Bhat commented on KUDU-1618:
-----------------------------------

I was trying to reproduce an issue where I could not do a remote tablet copy onto a local replica when the tablet was DELETE_TOMBSTONED (but still had its metadata file present). While reproducing it, I also saw a replica state that was confusing. Here are the steps I executed:
1. Bring up a cluster with 1 master and 3 tablet servers hosting 3 tablets, each tablet with 3 replicas.
2. Add a standby tablet server later.
3. Kill one tserver; after 5 minutes, all replicas on that tserver fail over to the new standby.
4. Use 'local_replica copy_from_remote' to copy one tablet replica onto the killed tserver before bringing it back up. The command fails:
{noformat}
I1013 16:43:41.523896 30948 tablet_copy_service.cc:124] Beginning new tablet copy session on tablet 048c7d202da3469eb1b1973df9510007 from peer bb2517bc5f2b4980bb07c06019b5a8e9 at {real_user=dinesh, eff_user=} at 127.61.33.8:40240: session id = bb2517bc5f2b4980bb07c06019b5a8e9-048c7d202da3469eb1b1973df9510007
I1013 16:43:41.524291 30948 tablet_copy_session.cc:142] T 048c7d202da3469eb1b1973df9510007 P 19acc272821d425582d3dfb9ed2ab7cd: Tablet Copy: opened 0 blocks and 1 log segments
Already present: Tablet already exists: 048c7d202da3469eb1b1973df9510007
{noformat}
5. Remove the metadata file and WAL for that tablet; copy_from_remote then succeeds (as expected).
6. Bring up the killed tserver. All of its replicas are now tombstoned except the one tablet copied with copy_from_remote in step 5. The master, which had been repeatedly trying to tombstone the evicted replicas on the tserver that was down, logs something interesting:
{noformat}
[dinesh@ve0518 debug]$ I1013 16:55:54.551717 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 048c7d202da3469eb1b1973df9510007 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 4)
W1013 16:55:54.552803 26141 catalog_manager.cc:2552] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): delete failed for tablet 048c7d202da3469eb1b1973df9510007 due to a CAS failure. No further retry: Illegal state: Request specified cas_config_opid_index_less_or_equal of -1 but the committed config has opid_index of 5
I1013 16:55:54.884133 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet e9481b695d34483488af07dfb94a8557 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 3)
I1013 16:55:54.885964 26141 catalog_manager.cc:2567] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet e9481b695d34483488af07dfb94a8557 (table test-table [id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted
I1013 16:55:54.915202 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet e3ff6a1529cf46c5b9787fe322a749e6 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 3)
I1013 16:55:54.916774 26141 catalog_manager.cc:2567] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet e3ff6a1529cf46c5b9787fe322a749e6 (table test-table [id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted
{noformat}
7. The restarted tserver now continuously logs messages like this:
{noformat}
[dinesh@ve0518 debug]$ W1013 16:55:36.608486  6519 raft_consensus.cc:461] T 048c7d202da3469eb1b1973df9510007 P bb2517bc5f2b4980bb07c06019b5a8e9 [term 5 NON_PARTICIPANT]: Failed to trigger leader election: Illegal state: Not starting election: Node is currently a non-participant in the raft config: opid_index: 5 OBSOLETE_local: false peers { permanent_uuid: "9acfc108d9b446c1be783b6d6e7b49ef" member_type: VOTER last_known_addr { host: "127.95.58.0" port: 33932 } } peers { permanent_uuid: "b11d2af1457b4542808407b4d4d1bd29" member_type: VOTER last_known_addr { host: "127.95.58.2" port: 40670 } } peers { permanent_uuid: "19acc272821d425582d3dfb9ed2ab7cd" member_type: VOTER last_known_addr { host: "127.61.33.8" port: 63532 } }
{noformat}
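
To make the two messages in steps 6 and 7 easier to reason about, here is a minimal Python sketch of the checks they appear to describe: a compare-and-set guard on the delete request, and a membership test before starting an election. This is only an illustration inferred from the log text, not Kudu's actual code. The names RaftConfig, check_cas_delete and can_start_election are hypothetical; the UUIDs and opid_index values are copied from the logs above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RaftConfig:
    """Hypothetical stand-in for a replica's committed Raft config."""
    opid_index: int
    voter_uuids: List[str]


def check_cas_delete(committed: RaftConfig,
                     cas_index_le: Optional[int]) -> Optional[str]:
    """Return an error string if the delete should be rejected, else None.

    Assumption: the delete is allowed only when the committed config's
    opid_index is less than or equal to the value given in the request.
    """
    if cas_index_le is not None and committed.opid_index > cas_index_le:
        return (
            "Illegal state: Request specified "
            f"cas_config_opid_index_less_or_equal of {cas_index_le} "
            "but the committed config has opid_index of "
            f"{committed.opid_index}"
        )
    return None


def can_start_election(local_uuid: str, committed: RaftConfig) -> bool:
    """Assumption: a node absent from the committed config is a
    non-participant and must not start an election."""
    return local_uuid in committed.voter_uuids


# Committed config as printed in the step 7 log line (opid_index 5).
config = RaftConfig(
    opid_index=5,
    voter_uuids=[
        "9acfc108d9b446c1be783b6d6e7b49ef",
        "b11d2af1457b4542808407b4d4d1bd29",
        "19acc272821d425582d3dfb9ed2ab7cd",
    ],
)
restarted_ts = "bb2517bc5f2b4980bb07c06019b5a8e9"

# Step 6: the master's request carried -1, so the guard rejects the delete.
cas_error = check_cas_delete(config, cas_index_le=-1)
# Step 7: the restarted tserver is not a peer in the config it copied.
may_elect = can_start_election(restarted_ts, config)

print(cas_error)
print(may_elect)
```

Under these assumptions the copied replica ends up holding a config (opid_index 5) that does not list its own tserver, so the tombstone request is rejected by the guard while the election attempts keep failing.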

> Add local_replica tool to delete a replica
> ------------------------------------------
>
>                 Key: KUDU-1618
>                 URL: https://issues.apache.org/jira/browse/KUDU-1618
>             Project: Kudu
>          Issue Type: Improvement
>          Components: ops-tooling
>    Affects Versions: 1.0.0
>            Reporter: Todd Lipcon
>            Assignee: Dinesh Bhat
>
> Occasionally we've hit cases where a tablet is corrupt in such a way that the tserver fails to start or crashes soon after starting. Typically we'd prefer the tablet just get marked FAILED but in the worst case it causes the whole tserver to fail.
> For these cases we should add a 'local_replica' subtool to fully remove a local tablet. Related, it might be useful to have a 'local_replica archive' which would create a tarball from the data in this tablet for later examination by developers.
