Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/04/30 19:31:12 UTC

[jira] [Created] (KUDU-1436) Concurrent remote bootstrap calls from same server can crash or result in corrupt replicas

Todd Lipcon created KUDU-1436:
---------------------------------

             Summary: Concurrent remote bootstrap calls from same server can crash or result in corrupt replicas
                 Key: KUDU-1436
                 URL: https://issues.apache.org/jira/browse/KUDU-1436
             Project: Kudu
          Issue Type: Bug
          Components: consensus, tserver
    Affects Versions: 0.8.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Blocker


If a BeginRemoteBootstrapSession call times out, the client may send a second call, and both calls can end up being processed. Call the first call C1 and the second C2. If C2 gets processed first, this triggers the following race:

- C2: initializing remote bootstrap session
- C1: waiting on lock
- C2: finishes initializing, drops lock, and starts to copy the tablet metadata out of the session object (outside of any lock)
- C1: acquires lock and follows the "Re-initializing" code path. This code path calls Clear() on its snapshots of the tablet metadata.
- C2: may crash or end up with an incomplete copy of the metadata (e.g. with missing fields)
- C2: responds to the client
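
As a rough illustration of the sequence above, the following C++ sketch shows one way to close the window in which C2 reads the metadata outside of any lock: take the copy while the session lock is still held. All names here (FakeMetadata, FakeSession, InitAndSnapshot, BothCallsSeeCompleteMetadata) are hypothetical stand-ins, not Kudu's actual classes, and this is not necessarily how the bug was fixed.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the tablet metadata protobuf.
struct FakeMetadata {
  std::string tablet_id;
  std::vector<int> block_ids;
  void Clear() { tablet_id.clear(); block_ids.clear(); }
  // "Complete" here just means every field the sketch sets is populated.
  bool IsComplete() const { return !tablet_id.empty() && block_ids.size() == 3; }
};

// Hypothetical session object shared by concurrent Begin calls.
class FakeSession {
 public:
  // (Re-)initializes the snapshot and returns a copy of it. The copy is
  // constructed before the lock guard is destroyed, so a concurrent
  // re-initialization cannot Clear() the metadata midway through the copy.
  FakeMetadata InitAndSnapshot(const std::string& tablet_id) {
    std::lock_guard<std::mutex> l(lock_);
    metadata_.Clear();
    metadata_.tablet_id = tablet_id;
    metadata_.block_ids = {1, 2, 3};
    return metadata_;
  }

 private:
  std::mutex lock_;
  FakeMetadata metadata_;
};

// Two concurrent "calls" (C1 and C2) against the same session.
bool BothCallsSeeCompleteMetadata() {
  FakeSession session;
  bool ok1 = false, ok2 = false;
  std::thread c1([&] { ok1 = session.InitAndSnapshot("tablet-a").IsComplete(); });
  std::thread c2([&] { ok2 = session.InitAndSnapshot("tablet-a").IsComplete(); });
  c1.join();
  c2.join();
  return ok1 && ok2;
}
```

Copying under the lock only addresses the partially-cleared read. It does not by itself keep the logs and blocks anchored consistently with the snapshot that was returned.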

This can cause a number of issues:
- If C2 ends up getting a partially-initialized metadata protobuf, we can trigger a crash in RPC (we don't handle sending responses that have missing required fields).
- C2 might actually get a fully correct response back to the client. But in the meantime, C1 has managed to un-anchor and re-anchor logs and blocks. This means that C2 will eventually copy log entries which are newer than its metadata snapshot, which can trigger an assertion like the one in KUDU-1046.
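
The second issue comes from re-initialization moving the anchors under a caller that already holds a snapshot. One possible mitigation, sketched below in C++ purely as an assumption and not as the fix that was adopted, is to make initialization idempotent so that a duplicate Begin call reuses the existing anchor instead of re-anchoring. The names FakeAnchoredSession and InitOnce are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical session that anchors the log at a given index.
class FakeAnchoredSession {
 public:
  // The first call records the anchor. Later calls leave it untouched and
  // return the same value, so every caller's snapshot matches the anchor.
  int64_t InitOnce(int64_t current_log_index) {
    std::lock_guard<std::mutex> l(lock_);
    if (!initialized_) {
      anchored_index_ = current_log_index;
      initialized_ = true;
    }
    return anchored_index_;
  }

 private:
  std::mutex lock_;
  bool initialized_ = false;
  int64_t anchored_index_ = -1;
};
```

With this shape, a retried call that arrives after the log has advanced still sees the original anchor, rather than un-anchoring and re-anchoring at a newer position.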

Basically all bets are off when this race is triggered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)