You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2018/03/10 00:02:56 UTC

[kudu-CR] WIP: KUDU-2293 fix cleanup after failed copies

Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9452

to look at the new patch set (#2).

Change subject: WIP: KUDU-2293 fix cleanup after failed copies
......................................................................

WIP: KUDU-2293 fix cleanup after failed copies

Before, when a tablet server failed to tablet copy, Kudu would perform a
best-effort cleanup of the partially-copied replica and leave the tablet
tombstoned. If this tombstoning were to fail due to disk issues (e.g.
out of space), Kudu would allow this and the tablet would remain in
TABLET_DATA_COPYING, both in-memory and on-disk. This would lead to a
CHECK failure the if another tablet copy were started for the same
replica, as server would attempt to copy over a replica with data
already marked TABLET_DATA_COPYING.

The above behavior arose from trying to balance two invariants:
- keep on-disk state consistent with in-memory state when possible
- when a tablet copy fails, leave it in as dead of a state as we can
  (i.e. TABLET_DATA_TOMBSTONED with no transitions in progress)

This patch updates the Abort() logic to lean towards the latter
invariant: if a tablet copy fails, at least its in-memory state will be
set as TABLET_DATA_TOMBSTONED. This may not be true on-disk, but that's
okay because either 1) the tablet server will eventually overwrite it
via another tablet copy (at which time its data must _not_ be in the
TABLET_DATA_COPYING state), or 2) the server will be restarted and the
tablet will be tombstoned upon seeing a non-TABLET_DATA_READY state
anyway.

The following semantic changes are made to help clarify cleanup for
tablet copies. Before, the semantics for tablet copy states were:
- kInitialized: Start() has not completed successfully
- kStarted: Started() has completed successfully, Finish() has not
  been called
- kFinished: Finish() has been called, but may not have finished

This patch changes them to be:
- kInitialized: Start() may have been called, but has not attempted to
  update anything on-disk
- kStarted: Start() has been called and has updated on-disk state,
  Finish() has not completed successfully
- kFinished: Finish() has been called and completed successfully

An test is added to tablet_copy-itest that tests failures to copy,
ensuring that the tablet is left in such an state that further copies
are possible without crashing. The test uses EIO injection to fail the
copies, but the logic is the same as if full disks were used instead.

Change-Id: I0459d484a3064aa2de392328e2910c9f6741be04
---
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tablet/tablet_metadata.h
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/ts_tablet_manager.cc
5 files changed, 139 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/9452/2
-- 
To view, visit http://gerrit.cloudera.org:8080/9452
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0459d484a3064aa2de392328e2910c9f6741be04
Gerrit-Change-Number: 9452
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>