You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/09/15 00:32:39 UTC
[kudu-CR] KUDU-2135 (part 2): don't use previously failed disks

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7270

to look at the new patch set (#29).

Change subject: KUDU-2135 (part 2): don't use previously failed disks
......................................................................

KUDU-2135 (part 2): don't use previously failed disks

This patch updates CheckIntegrity() to ensure that disks that have
previously failed are not used and that servers will only open data
where it's previously known to be.

Rather than comparing all instances against an agreed-upon set of UUIDs,
the single path set with the largest timestamp is used as the "main" one
to compare against (if only old path sets are available, the first
healthy one is used). If no instances are healthy, CheckIntegrity() will
fail, as all disks are failed.

Once this main instance is determined, disks deemed FAILED by the main
path set are marked failed in memory and not used. Additionally,
instances that are not in the same path indicated by the main path set
are similarly not used.

Unit testing is done for the new iteration of CheckIntegrity().
A test is also updated in data_dirs-test to ensure that previoulsy
failed disks will be marked failed.

Some notes:
- If there are any unhealthy instances when upgrading the path sets (i.e.
  adding disk states, paths, timestamp), a complete mapping of UUIDs to
  paths will not be available, and CheckIntegrity() will fail.
- The main path set's disk states are updated to reflect the failure of
  the instances. Since data on a failed disk is not used, disks
  that are successfully read from but are already marked FAILED by the
  main path set are not marked HEALTHY.
- In the case of a server restart where all healthy disks fail to start
  up and some known failed disks start working again, the server will
  successfully start up with the bad disks

This patch is a part of a series of patches to handle disk failures. To
see how this fits, see 2.6 in:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ifddf0817fe1a82044077f5544c400c88de20769f
---
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/fs.proto
5 files changed, 326 insertions(+), 115 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/7270/29
-- 
To view, visit http://gerrit.cloudera.org:8080/7270
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifddf0817fe1a82044077f5544c400c88de20769f
Gerrit-PatchSet: 29
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>