You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/10/27 20:06:58 UTC

[jira] [Commented] (KUDU-1735) CHECK failure when aborting an ignored config change operation

    [ https://issues.apache.org/jira/browse/KUDU-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613056#comment-15613056 ] 

Todd Lipcon commented on KUDU-1735:
-----------------------------------

Here's the interesting logs:

When the tablet starts up, we can see that a few CONFIG_CHANGEs are pending
{code}
I1026 22:09:10.486959  9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: ReplayState: Previous OpId: 1.2288, Committed OpId: 1.2287, Pending Replicates: 3, Pending Commits: 1
I1026 22:09:10.487023  9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: Dumping REPLICATES: 
I1026 22:09:10.487025  9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:    type: REPLICATE replicate { id { term: 1 index: 2286 } timestamp: 6052023659982815232 op_type: WRITE_OP write_request { tablet_id: "e58b90e0125e4fbfb967b85066f2bac1" schema { columns { name: "
I1026 22:09:10.492393  9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:    type: REPLICATE replicate { id { term: 1 index: 2287 } timestamp: 6052023660001042432 op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: "e58b90e0125e4fbfb967b85066f2bac1" old_config 
I1026 22:09:10.492408  9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:    type: REPLICATE replicate { id { term: 1 index: 2288 } timestamp: 6052023660156993536 op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: "e58b90e0125e4fbfb967b85066f2bac1" old_config 
I1026 22:09:10.492410  9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: Dumping COMMITS: 
I1026 22:09:10.492413  9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:    type: COMMIT commit { op_type: CHANGE_CONFIG_OP commited_op_id { term: 1 index: 2287 } }
{code}

according to the log, we don't have 2288 committed yet. However, the metadata is actually present on disk so the committed config has index 2288:
{code}
I1026 22:09:10.657603  9692 raft_consensus.cc:286] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1 FOLLOWER]: Replica starting. Triggering 3 pending transactions. Active config: opid_index: 2288 OBSOLETE_local: false peers { permanent_uuid: "3c305734ab9d4e0ebfbd0def74841a5d" member_type: V
{code}

so we ignore the pending config change:
{code}
I1026 22:09:10.658336  9692 raft_consensus.cc:630] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1 FOLLOWER]: Ignoring setting pending config change with OpId 1.2288 because the committed config has OpId index 2288. The config change we are ignoring is: Old config: { opid_index: 2287 OBSOL
{code}

but then when the DeleteTablet comes, we get the crash:

{code}
I1026 22:09:11.081184 14990 tablet_service.cc:660] Processing DeleteTablet for tablet e58b90e0125e4fbfb967b85066f2bac1 with delete_type TABLET_DATA_TOMBSTONED (TS c9633273962a4521a32d5e177a118a84 not found in new config with opid_index 7024) from {real_user=kudu, eff_user=} at 10.17.224.34:58742
I1026 22:09:11.081382 14990 raft_consensus_state.cc:370] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84Trying to abort 1 pending transactions.
I1026 22:09:11.081481 14990 raft_consensus_state.cc:375] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84Aborting transaction as it isn't in flight: id { term: 1 index: 2288 } timestamp: 6052023660156993536 op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: "e58b90e0125e4fbfb967b85066f2bac
F1026 22:09:11.081563 14990 raft_consensus.cc:2055] Check failed: state_->IsConfigChangePendingUnlocked() T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1 FOLLOWER]: Aborting CHANGE_CONFIG_OP but there was no pending config set. Op: id { term: 1 index: 2288 } timestamp: 6052023660156993536 
{code}

> CHECK failure when aborting an ignored config change operation
> --------------------------------------------------------------
>
>                 Key: KUDU-1735
>                 URL: https://issues.apache.org/jira/browse/KUDU-1735
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.0.1
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> The following sequence causes a CHECK failure:
> - a tablet server receives a CONFIG_CHANGE operation
> - the tablet server commits the operation (writing the new consensus config to disk), but crashes before it can write the associated COMMIT message to the log
> - the server is down for long enough that it is removed from the configuration again while it's down
> - when it comes back up, it sees the CONFIG_CHANGE again as a pending replicate. When it's added to PendingRounds, it is ignored as we can see that this configuration is already committed.
> - the tserver gets the request from the master to DeleteTablet because it's no longer part of the configuration
> -- when trying to abort the operation, it fires a CHECK "Aborting CHANGE_CONFIG_OP but there was no pending config set."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)