You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/08/23 20:53:41 UTC

[kudu-CR] open FS layout in presence of disk failure

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7784

Change subject: open FS layout in presence of disk failure
......................................................................

open FS layout in presence of disk failure

Currently, if a Kudu server starts up with a failed disk, the server
will crash. There are a number of reasons for this, but the pressing
ones are that the path instance files may not be readable, meaning the
directories' UUIDs may not be available.

This patch changes this by introducing an "unhealthy" instance in-memory
for all each instance that failsto load, lock, canonicalize, etc. Such
instances are ignored when it comes to checking the integrity of the FS
layout, and are simply marked failed by the directory manager.

Testing is done in data_dirs-test, log_block_manager-test, and
fs_manager-test to ensure failed directories do not impede the mangers'
startups.

Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/block_manager_util.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/fs_manager-test.cc
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
13 files changed, 607 insertions(+), 162 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] open FS layout in presence of disk failure

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 4: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS4, Line 83:   // Whether or not the instance is healthy. If the instance file lives on a
            :   // disk that has failed, this should return false.
            :   bool healthy() const {
            :     return health_status_.ok();
            :   }
> we usually don't have this extra getter (i.e. consider just doing pimf.heal
I had the same initial reaction but it seems to make the code that uses it a little more reader-friendly.


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7784/1/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

PS1, Line 592: uid_by_root_.swap(uuid_by_root);
             : 
             :   // From this point onwards, the above in-memory maps must be consistent with
             :   // the main path set.
             : 
             :   // Initialize the 'fullness' status of the data directories.
             :   for (const auto& dd : data_dirs_) {
             :     uint16_t uuid_idx;
             :     CHECK(FindUuidIndexByDataDir(dd.get(), &uuid_idx));
             :     if (ContainsKey(failed_data_dirs_, uuid_idx)) {
             :       continue;
             :     }
             :     Status refresh_status = dd->RefreshIsFull(DataDir::RefreshMode::ALWAYS);
             :     if (PREDICT_FALSE(!refresh_status.ok())) {
             :       if (refresh_status.IsDiskFailure()) {
             :         MarkDataDirFailed(uuid_idx, refresh_status.ToString());
             :         continue;
             :       }
             :       return refresh_status;
             :     }
             :   }
re: why is this here and not higher up, yes this is here because MarkDataDirFailed needs the above to work.


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 6:

Failure seems to be a bunch of clock sync errors in DEBUG mode.

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7784

to look at the new patch set (#2).

Change subject: open FS layout in presence of disk failure
......................................................................

open FS layout in presence of disk failure

Currently, if a Kudu server starts up with a failed disk, the server
will crash. There are a number of reasons for this, but the pressing
ones are that the path instance files may not be readable, meaning the
directories' UUIDs may not be available.

This patch changes this by introducing an "unhealthy" instance in-memory
for all each instance that failsto load, lock, canonicalize, etc. Such
instances are ignored when it comes to checking the integrity of the FS
layout, and are simply marked failed by the directory manager.

Testing is done in data_dirs-test, log_block_manager-test, and
fs_manager-test to ensure failed directories do not impede the mangers'
startups.

Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/block_manager_util.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/fs_manager-test.cc
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
13 files changed, 612 insertions(+), 162 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 3:

Just rebased.

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 1:

(1 comment)

Ah, it seems the issue was with some error-handling wiring that I changed. Should be good now.

http://gerrit.cloudera.org:8080/#/c/7784/1/src/kudu/fs/fs_manager.cc
File src/kudu/fs/fs_manager.cc:

Line 35: #include "kudu/fs/error_manager.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged.

Change subject: open FS layout in presence of disk failure
......................................................................


open FS layout in presence of disk failure

Currently, if a Kudu server starts up with a failed disk, the server
will crash. There are a number of reasons for this, but the pressing
one is that the path instance files may not be readable, meaning the
directories' UUIDs may not be available.

This patch changes this by introducing an "unhealthy" instance in-memory
for each instance that fails to lock, load, canonicalize, etc. Such
instances are ignored when it comes to checking the integrity of the FS
layout, and are simply marked failed by the directory manager.

Testing is done in data_dirs-test, log_block_manager-test, and
fs_manager-test to ensure failed directories do not impede the managers'
startups.

Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Reviewed-on: http://gerrit.cloudera.org:8080/7784
Reviewed-by: Alexey Serbin <as...@cloudera.com>
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
Reviewed-by: David Ribeiro Alves <da...@gmail.com>
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/block_manager_util.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/fs_manager-test.cc
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
13 files changed, 645 insertions(+), 173 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Mike Percy: Looks good to me, approved
  Alexey Serbin: Looks good to me, but someone else must approve
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] open FS layout in presence of disk failure

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 4:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.cc
File src/kudu/fs/block_manager_util.cc:

PS4, Line 189: them
nit s/them/it


PS4, Line 189: it's
nit its


http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS4, Line 56:   // On success, either 'metadata_' is overwritten with the contents of the
            :   // file, or, in the case of disk failure, returns OK and sets
            :   // 'health_status_' to a non-OK value.
            :   Status LoadFromDisk();
this is weird. whats the reason for this not to return a non-ok status right away?


PS4, Line 83:   // Whether or not the instance is healthy. If the instance file lives on a
            :   // disk that has failed, this should return false.
            :   bool healthy() const {
            :     return health_status_.ok();
            :   }
we usually don't have this extra getter (i.e. consider just doing pimf.healt_status().ok())


http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

PS4, Line 794:   if (FindUuidByRoot(root, &uuid)) {
             :     return FindUuidIndexByUuid(uuid, uuid_idx);
             :   }
             :   return false;
there are no chances of races here, right?


http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

PS4, Line 366: Finds the UUID for the given canonicalized root directory name.
nit, for consistency append the same suffix as the previous new method (returning false...)


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/block_manager_util.cc
File src/kudu/fs/block_manager_util.cc:

PS5, Line 68: ;
nit: this is not needed, right?  Usually those macros are used as 

RETURN_NOT_OK_FAIL_INSTANCE_PREPEND(...);

so the semicolon is added at the call site.


http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/data_dirs-test.cc
File src/kudu/fs/data_dirs-test.cc:

PS5, Line 244:   Status s = dd_manager_->MarkDataDirFailed(kNumDirs - 1);
             :   ASSERT_STR_CONTAINS(s.ToString(), "All data dirs have failed");
As I see the call to CreateDataDirGroup() is removed.  Is that correct?  I mean, I would expect to see an attempt to do something with the object after marking all directories as failed (that looks like just a preparation for some meaningful action).


PS5, Line 363:   void OpenDataDirManager() {
             :     ASSERT_OK(DataDirManager::OpenExistingForTests(env_, test_roots_,
             :         DataDirManagerOptions(), &dd_manager_));
             :   }
nit: consider making this method returning Status and then using ASSERT_OK() -- that way it would be more easier to find the place where it fails, if it does so.


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 6:

I retriggered the build

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 4:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG
Commit Message:

PS3, Line 20: mangers'
> looks like a typo
It is, indeed, "managers'," referring to the startups of the directory manager, log block manager, and fs manager.


http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.cc
File src/kudu/fs/block_manager_util.cc:

PS4, Line 189: it's
> nit its
Done


PS4, Line 189: them
> nit s/them/it
Done


http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS4, Line 56:   // On success, either 'metadata_' is overwritten with the contents of the
            :   // file, or, in the case of disk failure, returns OK and sets
            :   // 'health_status_' to a non-OK value.
            :   Status LoadFromDisk();
> this is weird. whats the reason for this not to return a non-ok status righ
Disk failure errors here would be ignored anyway. By adding this to the contract, we don't have to sully the call-site with disk-handling code, and we can keep the error handling (i.e. setting health_status_) internal.


PS4, Line 83:   // Whether or not the instance is healthy. If the instance file lives on a
            :   // disk that has failed, this should return false.
            :   bool healthy() const {
            :     return health_status_.ok();
            :   }
> we usually don't have this extra getter (i.e. consider just doing pimf.heal
As per Mike's comment, I think it helps for readability.

Also, I get the feeling that we don't use this elsewhere (at least for tablets/replicas) because other components have multiple states. In this case, it's pretty cut-and-dry: healthy, not healthy.


PS4, Line 83:   // Whether or not the instance is healthy. If the instance file lives on a
            :   // disk that has failed, this should return false.
            :   bool healthy() const {
            :     return health_status_.ok();
            :   }
> I had the same initial reaction but it seems to make the code that uses it 
Done


http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

Line 116: using std::make_pair;
> warning: using decl 'make_pair' is unused [misc-unused-using-decls]
Done


PS4, Line 794:   if (FindUuidByRoot(root, &uuid)) {
             :     return FindUuidIndexByUuid(uuid, uuid_idx);
             :   }
             :   return false;
> there are no chances of races here, right?
Nope, the first call is guarded by a scoped lock, as is the second. These values don't change after they're first set in Open().


http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

PS4, Line 366: Finds the UUID for the given canonicalized root directory name.
> nit, for consistency append the same suffix as the previous new method (ret
Done


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager-test.cc
File src/kudu/fs/fs_manager-test.cc:

Line 279:   FLAGS_env_inject_eio = 0;
> nit: shouldn't need this due to flag saver in KuduTest unless I'm missing s
See Adar's comment.


Line 279:   FLAGS_env_inject_eio = 0;
> Unfortunately, KuduTest doesn't restore the flags until _after_ TearDown() 
Thanks for the explanation.


Line 302:   FLAGS_env_inject_eio = 0;
> same
See above.


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager.cc
File src/kudu/fs/fs_manager.cc:

PS3, Line 180:   set<string> all_roots;
             :   all_roots.insert(wal_fs_root_);
             :   for (const string& data_fs_root : data_fs_roots_) {
             :     all_roots.insert(data_fs_root);
             :   }
> nit: consider
Done


PS3, Line 228:       if (!ContainsKey(unique_roots, root.path)) {
             :         canonicalized_data_fs_roots_.emplace_back(root);
             :         canonicalized_all_fs_roots_.emplace_back(root);
             :         InsertOrDie(&unique_roots, root.path);
             :       }
> Consider using InsertIfNotPresent() instead to avoid double look-ups:
Done


PS3, Line 248: string wal_root
> Is it necessary to make a copy here?  Why not to use const reference here i
Done


PS3, Line 251: string meta_root
> ditto
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS4, Line 56:   // On success, either 'metadata_' is overwritten with the contents of the
            :   // file, or, in the case of disk failure, returns OK and sets
            :   // 'health_status_' to a non-OK value.
            :   Status LoadFromDisk();
> Andrew and I previously discussed this approach. You can see our comments h
Even if internalizing the failure would be ok (which I still think it isn't, though I don't feel super strong about it), the error message that we're ignoring the failure is still wrong, since we arent'


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7784

to look at the new patch set (#6).

Change subject: open FS layout in presence of disk failure
......................................................................

open FS layout in presence of disk failure

Currently, if a Kudu server starts up with a failed disk, the server
will crash. There are a number of reasons for this, but the pressing
one is that the path instance files may not be readable, meaning the
directories' UUIDs may not be available.

This patch changes this by introducing an "unhealthy" instance in-memory
for each instance that fails to lock, load, canonicalize, etc. Such
instances are ignored when it comes to checking the integrity of the FS
layout, and are simply marked failed by the directory manager.

Testing is done in data_dirs-test, log_block_manager-test, and
fs_manager-test to ensure failed directories do not impede the managers'
startups.

Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/block_manager_util.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/fs_manager-test.cc
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
13 files changed, 645 insertions(+), 173 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] open FS layout in presence of disk failure

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS4, Line 56:   // On success, either 'metadata_' is overwritten with the contents of the
            :   // file, or, in the case of disk failure, returns OK and sets
            :   // 'health_status_' to a non-OK value.
            :   Status LoadFromDisk();
> Disk failure errors here would be ignored anyway. By adding this to the con
I'm not sure I fully agree. For one the error message is confusing: "LOG(ERROR) << "Ignoring error with status: " << _s_prepended.ToString(); \" seems like we're ignoring the error when we really aren't. 
Then I'd expect that upon getting an OK here the metadata had been loaded, when it's not the case. Given there is a single call site I think it would make sense to be explicit. That is to return a non-ok status and to special case the isDiskFailure() status along with logging the error message there. Otherwise you're pushing logic from the call site onto this class which is a bit dirtier.


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 1:

looking into the jenkins failure.

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 3:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG
Commit Message:

PS3, Line 15: failsto
> fails to
Done


PS3, Line 15: all each
> just "each"
Done


PS3, Line 15: failsto
> fails to
Done


PS3, Line 15: all
> s/all//
Done


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS3, Line 57: not OK()
> 1. Why not just return an error?
1. Errors here would be ignored anyway. By adding this to the contract, we don't have to dirty the call-site with disk-handling code and we can keep the error handling (i.e. setting health_status_) internal.

2. Done


PS3, Line 88: Status
> nit: const Status& ?
Done


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs-test.cc
File src/kudu/fs/data_dirs-test.cc:

PS3, Line 244:   ASSERT_DEATH({
             :     dd_manager_->MarkDataDirFailed(kNumDirs - 1);
             :   }, ".*All data dirs have failed");
> I still don't like this change; I prefer that "library code" like the Direc
Done


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

Line 385:       return Status::IOError("Could not create directory manager with disks failed");
> Maybe include root.second in the error message?
Done


Line 607:         MarkDataDirFailed(uuid_idx, refresh_status.ToString());
> As future cleanup, it would be nice if MarkDataDirFailed() took a Status ra
Fair point. I'll add a TODO in MarkDataDirFailed.


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

PS3, Line 53: typedef std::pair<std::string, Status> CanonicalizedRoot;
> I think it would be easier to understand the code that uses this if it was 
Done


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS3, Line 302: Returned values
> This comment is a bit confusing. I think it would make more sense to say it
Done


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 1592:     if (dd_manager_->IsDataDirFailed(uuid_idx)) {
> If we stored the status passed to us in MarkDataDirFailed, we could return 
Added TODOs here and in MarkDataDirFailed(). I also wouldn't mind putting them in this patch, but for now, TODOs.


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 6: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 4: Code-Review+1

Don't have anything to add to David/Mike's comments.

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 5: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7784

to look at the new patch set (#5).

Change subject: open FS layout in presence of disk failure
......................................................................

open FS layout in presence of disk failure

Currently, if a Kudu server starts up with a failed disk, the server
will crash. There are a number of reasons for this, but the pressing
one is that the path instance files may not be readable, meaning the
directories' UUIDs may not be available.

This patch changes this by introducing an "unhealthy" instance in-memory
for each instance that fails to lock, load, canonicalize, etc. Such
instances are ignored when it comes to checking the integrity of the FS
layout, and are simply marked failed by the directory manager.

Testing is done in data_dirs-test, log_block_manager-test, and
fs_manager-test to ensure failed directories do not impede the managers'
startups.

Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/block_manager_util.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/fs_manager-test.cc
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
13 files changed, 635 insertions(+), 174 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] open FS layout in presence of disk failure

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 3:

(5 comments)

A quick first pass

http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG
Commit Message:

PS3, Line 20: mangers'
looks like a typo

Is that supposed to be "directory managers'" or "managers'"  ?


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager.cc
File src/kudu/fs/fs_manager.cc:

PS3, Line 180:   set<string> all_roots;
             :   all_roots.insert(wal_fs_root_);
             :   for (const string& data_fs_root : data_fs_roots_) {
             :     all_roots.insert(data_fs_root);
             :   }
nit: consider

set<string> all_roots(data_fs_roots_.begin(), data_fs_roots_.end());
all_roots.insert(wal_fs_root_);


PS3, Line 228:       if (!ContainsKey(unique_roots, root.first)) {
             :         canonicalized_data_fs_roots_.emplace_back(root);
             :         canonicalized_all_fs_roots_.emplace_back(root);
             :         InsertOrDie(&unique_roots, root.first);
             :       }
Consider using InsertIfNotPresent() instead to avoid double look-ups:

if (InsertIfNotPresent(&unique_roots, root.first)) {
  canonicalized_data_fs_roots_.emplace_back(root);
  canonicalized_all_fs_roots_.emplace_back(root);
}


PS3, Line 248: string wal_root
Is it necessary to make a copy here?  Why not to use const reference here instead?


PS3, Line 251: string meta_root
ditto


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 6: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7784

to look at the new patch set (#4).

Change subject: open FS layout in presence of disk failure
......................................................................

open FS layout in presence of disk failure

Currently, if a Kudu server starts up with a failed disk, the server
will crash. There are a number of reasons for this, but the pressing
one is that the path instance files may not be readable, meaning the
directories' UUIDs may not be available.

This patch changes this by introducing an "unhealthy" instance in-memory
for each instance that fails to lock, load, canonicalize, etc. Such
instances are ignored when it comes to checking the integrity of the FS
layout, and are simply marked failed by the directory manager.

Testing is done in data_dirs-test, log_block_manager-test, and
fs_manager-test to ensure failed directories do not impede the mangers'
startups.

Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/block_manager_util.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/fs_manager-test.cc
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
13 files changed, 636 insertions(+), 169 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] open FS layout in presence of disk failure

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 6: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 6:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/block_manager_util.cc
File src/kudu/fs/block_manager_util.cc:

PS5, Line 68: 
> nit: this is not needed, right?  Usually those macros are used as 
Done


http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS4, Line 56:   // On success, either 'metadata_' is overwritten with the contents of the
            :   // file, or, in the case of disk failure, returns OK and sets
            :   // 'health_status_' to a non-OK value.
            :   Status LoadFromDisk();
> I'm not sure I fully agree. For one the error message is confusing: "LOG(ER
Hrm, I'd prefer to keep this internal. If anything of the PIMF fails, it gets set to failed. I'll improve the error message though.


PS4, Line 56:   // On success, either 'metadata_' is overwritten with the contents of the
            :   // file, or, in the case of disk failure, returns OK and sets
            :   // 'health_status_' to a non-OK value.
            :   Status LoadFromDisk();
> Andrew and I previously discussed this approach. You can see our comments h
Thanks for resurfacing this.


http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/data_dirs-test.cc
File src/kudu/fs/data_dirs-test.cc:

PS5, Line 244:   Status s = dd_manager_->MarkDataDirFailed(kNumDirs - 1);
             :   ASSERT_STR_CONTAINS(s.ToString(), "All data dirs have failed");
> As I see the call to CreateDataDirGroup() is removed.  Is that correct?  I 
Ah, good point, I suppose I can add it in again. Done


PS5, Line 363:     // manager.
             :     KuduTest::SetUp();
             :   }
             : 
> nit: consider making this method returning Status and then using ASSERT_OK(
Done


http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

PS5, Line 1315:   ASSERT_NE(data_dir, test_dirs[failed_idx]);
> paranoid nit: does it make sense to check that the failed dir is exactly te
Sure, although there is test coverage for this in data_dirs-test


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

PS5, Line 1315: ASSERT_EQ(1, dd_manager_->GetFailedDataDirs().size());
paranoid nit: does it make sense to check that the failed dir is exactly test_dirs[failed_idx] ?


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS4, Line 56:   // On success, either 'metadata_' is overwritten with the contents of the
            :   // file, or, in the case of disk failure, returns OK and sets
            :   // 'health_status_' to a non-OK value.
            :   Status LoadFromDisk();
> I'm not sure I fully agree. For one the error message is confusing: "LOG(ER
Andrew and I previously discussed this approach. You can see our comments here: https://gerrit.cloudera.org/#/c/7270/6/src/kudu/fs/data_dirs.cc@408. FWIW, I find the idea of "internalizing" the disk failure to PIMF to be reasonable.


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 6: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] open FS layout in presence of disk failure

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 3:

(6 comments)

I'm still looking through this patch but so far I only have nits

http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG
Commit Message:

PS3, Line 15: all
s/all//


PS3, Line 15: failsto
fails to


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/block_manager_util.h
File src/kudu/fs/block_manager_util.h:

PS3, Line 57: not OK()
1. Why not just return an error?
2. If we really want to return OK, be a little more explicit: "... or, in the case of disk failure, returns OK but will set health_status_ to a non-OK value."


PS3, Line 88: Status
nit: const Status& ?


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

PS3, Line 53: typedef std::pair<std::string, Status> CanonicalizedRoot;
I think it would be easier to understand the code that uses this if it was renamed and made a struct:

  struct CanonicalizedRootAndStatus {
    std::string path;
    Status status;
  };


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS3, Line 302: Returned values
This comment is a bit confusing. I think it would make more sense to say it like:

Canonicalized forms of the root directories. Canonicalized during Init() with ordering maintained.


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager-test.cc
File src/kudu/fs/fs_manager-test.cc:

Line 279:   FLAGS_env_inject_eio = 0;
> nit: shouldn't need this due to flag saver in KuduTest unless I'm missing s
Unfortunately, KuduTest doesn't restore the flags until _after_ TearDown() runs, and env_inject_eio!=0 can cause test directory cleanup to fail.

That said, having to do this for every test is annoying, so maybe we should manually destroy the FlagSaver at the beginning of TearDown instead?


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 3:

(3 comments)

couple other minor nits, overall lgtm

http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

PS3, Line 831: dirs_to_update
unused?


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager-test.cc
File src/kudu/fs/fs_manager-test.cc:

Line 279:   FLAGS_env_inject_eio = 0;
nit: shouldn't need this due to flag saver in KuduTest unless I'm missing something


Line 302:   FLAGS_env_inject_eio = 0;
same


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] open FS layout in presence of disk failure

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: open FS layout in presence of disk failure
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG
Commit Message:

PS3, Line 15: all each
just "each"


PS3, Line 15: failsto
fails to


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs-test.cc
File src/kudu/fs/data_dirs-test.cc:

PS3, Line 244:   ASSERT_DEATH({
             :     dd_manager_->MarkDataDirFailed(kNumDirs - 1);
             :   }, ".*All data dirs have failed");
I still don't like this change; I prefer that "library code" like the DirectoryManager pass back an appropriate Status that would let callers decide whether to crash or not.


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

Line 385:       return Status::IOError("Could not create directory manager with disks failed");
Maybe include root.second in the error message?


Line 607:         MarkDataDirFailed(uuid_idx, refresh_status.ToString());
As future cleanup, it would be nice if MarkDataDirFailed() took a Status rather than a string. The former has more information in it.


http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 1592:     if (dd_manager_->IsDataDirFailed(uuid_idx)) {
If we stored the status passed to us in MarkDataDirFailed, we could return it to callers here so we wouldn't need to create artificial Statuses like this.


-- 
To view, visit http://gerrit.cloudera.org:8080/7784
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes