Posted to issues@kudu.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2016/07/06 00:28:11 UTC

[jira] [Created] (KUDU-1512) Remote bootstrap always fails under heavy insert load

Jean-Daniel Cryans created KUDU-1512:
----------------------------------------

             Summary: Remote bootstrap always fails under heavy insert load
                 Key: KUDU-1512
                 URL: https://issues.apache.org/jira/browse/KUDU-1512
             Project: Kudu
          Issue Type: Bug
          Components: consensus
            Reporter: Jean-Daniel Cryans
            Priority: Blocker


On a test cluster I just noticed a case where, after remote bootstrapping a tablet, the leader no longer had the logs needed to start replicating to it. Here's a bit of the log:

{noformat}
I0701 17:07:23.387614 61379 consensus_peers.cc:296] T 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Sending request to remotely bootstrap
...
I0701 17:12:15.867938 65505 log.cc:728] Deleting log segment in path: /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000217 (GCed ops < 256735)
... (TS stopped GC logs while remote bootstrap was finishing)
I0701 17:22:48.354138   413 remote_bootstrap_service.cc:242] Request end of remote bootstrap session d80a7427c7d040ac8d949d0cadb3e7c5-807ff8e42640482d8d947b693d56ce03 received from {real_user=kudu, eff_user=} at 10.20.130.116:48132
I0701 17:22:48.494417 65505 log.cc:728] Deleting log segment in path: /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000218 (GCed ops < 276284)
I0701 17:23:02.591763  7627 consensus_queue.cc:577] T 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b [LEADER]: Connected to new peer: Peer: d80a7427c7d040ac8d949d0cadb3e7c5, Is new: false, Last received: 21.256735, Next index: 256736, Last known committed idx: 256493, Last exchange result: ERROR, Needs remote bootstrap: false
I0701 17:23:02.608044  7627 consensus_peers.cc:181] T 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Could not obtain request from queue for peer: d80a7427c7d040ac8d949d0cadb3e7c5. Status: Not found: Failed to read ops 256736..279156: Segment 218 which contained index 256736 has been GCed
{noformat}

So 9e59a4c24de44e3f9de219df865b4f3b (the leader) was sending data to d80a7427c7d040ac8d949d0cadb3e7c5 for about 16 minutes while receiving inserts. As soon as the new follower was done bootstrapping, we GC'd the logs we were holding for it, including segment 218, which contained the next op the follower needed (256736). The leader then dropped the new node from the config and started remote bootstrap all over again, over and over.
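
To make the sequence above concrete, here is a minimal toy model of the race. It is not Kudu's code: `ToyLog`, `Segment`, `simulate`, and the anchor names are all hypothetical, and the segment boundaries are only inferred from the "GCed ops <" and "Failed to read ops" messages in the log excerpt. The `False` branch shows one conceivable mitigation (keep the ops pinned for the peer until it catches up), offered purely as an illustration and not as the project's actual fix.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    seqno: int        # WAL segment number, e.g. 218
    first_index: int  # first op index stored in the segment
    last_index: int   # last op index stored in the segment


@dataclass
class ToyLog:
    """Hypothetical stand-in for a WAL with GC anchors (not Kudu's API)."""
    segments: list
    # owner name -> earliest op index that must be retained for that owner
    anchors: dict = field(default_factory=dict)

    def gc(self):
        """Delete segments whose ops all fall below the lowest anchored index.

        With no anchors, only the newest segment is retained.
        """
        newest_first = self.segments[-1].first_index
        min_retain = min(self.anchors.values(), default=newest_first)
        deleted = [s.seqno for s in self.segments if s.last_index < min_retain]
        self.segments = [s for s in self.segments if s.last_index >= min_retain]
        return deleted

    def read_from(self, index):
        """Return the segment number holding `index`, or fail if it was GCed."""
        for s in self.segments:
            if s.first_index <= index <= s.last_index:
                return s.seqno
        raise LookupError(f"op {index} has been GCed")


def simulate(release_anchor_at_session_end):
    # Boundaries inferred from the log excerpt; approximate by assumption.
    log = ToyLog([Segment(218, 256735, 276283), Segment(219, 276284, 279156)])

    # While the bootstrap session is open, its anchor blocks GC.
    log.anchors["bootstrap-session"] = 256735
    assert log.gc() == []

    # The session ends. Either drop the anchor outright (the reported
    # behavior) or, hypothetically, hand it over to the new peer first.
    if not release_anchor_at_session_end:
        log.anchors["peer"] = 256736  # the peer's next index from the log
    del log.anchors["bootstrap-session"]
    log.gc()

    # The leader now tries to send the peer its next op.
    try:
        return log.read_from(256736)
    except LookupError as e:
        return str(e)
```

Calling `simulate(True)` reproduces the failure mode in this toy model (the read fails because segment 218 is gone), while `simulate(False)` still finds the op in segment 218.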

Eventually the other follower died for a different reason and we never recovered the tablet.


