You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/05/31 18:19:15 UTC

[kudu-CR] disk failure: coordinate IO error handling

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7029

Change subject: disk failure: coordinate IO error handling
......................................................................

disk failure: coordinate IO error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs-test-util.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
13 files changed, 233 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] disk failure: coordinate fs error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate fs error handling
......................................................................


Patch Set 3:

Am currently making changes to handle more errors than just EIOs.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#20).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
ErrorManager is added to coordinate error handling.

Callbacks are registered with the ErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the ErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
15 files changed, 188 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/20
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#19).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
ErrorManager is added to coordinate error handling.

Callbacks are registered with the ErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the ErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
12 files changed, 163 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/19
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#21).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 202 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/21
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 141:     fs_manager_->UnsetErrorNotificationCb();
Nit: move this down to just before tablet_manager_->Shutdown() so it's clear that the two are related.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 24:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 272:   // Exposes the FsErrorManager used to fs errors.
missing a word?


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS24, Line 125:   // The input to the callback is expected to be the UUID of a failed DataDir.
I think it's clearer to have this information with the typedef of ErrorNotificationCb instead. Here, it sounds like it's describing some input to this call, not the parameter passed when the callback is invoked.

Maybe: "If a disk error is detected, this callback will be invoked with the relevant DataDir UUID as its parameter." or something?


Line 130:   // This must be called before the callback's callee is destroyed.
nit: can you add whether this is idempotent? safe to call if nothing is set yet?


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS24, Line 169: in-memory
what's this mean? I don't think this line adds much value since the fact that it's a constructor already sort of tells us that it creates an instance


Line 195:   FsErrorManager* error_manager() override { return error_manager_; }
is this used? tried grepping for it and didn't see any call sites, but maybe it's coming in the next patch


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 189:   void FailTabletsInDataDir(const string& uuid);
nit: std::string in header


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 8:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS8, Line 60:     case EIO: // Fallthrough intended
            :     case ENODEV: // Fallthrough intended
            :     case ENOSPC: // Fallthrough intended
            :     case ENXIO: // Fallthrough intended
            :     case EROFS: // Fallthrough intended
hm, I dont think "fallthrough intended" is really necessary here. Usually such a comment would be used when there is actual code inside of a 'case' label and then falls through to the below one, like:

switch (foo) {
  case A:
    DoSomething();
    FALLTHROUGH_INTENDED;
  case B:
    DoSomethingElse();
    break;
}

(see gutil/macros.h for the above macro and example usage -- it actually has some effect on clang warnings as well)


PS8, Line 74:  that blocks known to the
            : // BlockManager can call.
not sure what this means


PS8, Line 80: Callback<void(const std::set<std::string>&)>* cb
maybe use a typedef for this?


Line 81:     shutdown_replicas_cb_ = cb;
std::move(cb)?


Line 125:   Callback<void(const std::set<std::string>&)>* shutdown_replicas_cb_;
nit: I think it's better to keep the specifics of what the callback _does_ out of this class. i.e from this class's perspectives, it's not necessary that it is going to shutdown replicas. Instead maybe just something like notify_cb_?


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS8, Line 121: *
why is this taking a pointer to a callback instead of just taking a callback?


Line 214:   void SetTabletFailedCallback(Callback<void(std::set<std::string>)>* cb);
same


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 684:   if (delete_tablet_data) {
not quite following the purpose of this parameter. In the case that delete_tablet_data is false, then how much of the rest of this function is still relevant? eg opt_last_logged_opid won't be used, the above CAS-related stuff isn't used, right? Maybe refactoring the function so the common bits can be reused, but introducing a new function like MarkFailedAndShutdown(tablet_id) would be clearer?


Line 804:       // Disk failures should have been handled.
is there a DCHECK we could add here that the replica is already marked failed if not?


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 131:                       bool delete_tablet_data = true);
can you update the doc as to why you would want to DeleteTablet without deleting the data?


PS8, Line 190:   // Asynchronously shut down a tablet replica by deleting the tablet.
             :   // TODO(awong): rather than deleting the tablet, don't do anything to the
             :   // tablet metadata and just store disk failure state instead.
             :   void FailTabletReplicas(const std::set<std::string>& tablet_ids);
the docs don't seem to match the name here. Is this shutting it down or trying to delete on-disk data, etc?


Line 294:   Callback<void(const std::set<std::string>&)> shutdown_replicas_cb_;
I think if you made the other classes take the callback by value instead of by pointer, you wouldn't need this member


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 26:

Failure seems unrelated.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#9).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
12 files changed, 269 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#7).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 238 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#5).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 238 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 20:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 297:   uint16_t* GetUuidIdxForUuid(const std::string& uuid) {
I looked through here and I believe this is safe.

That said, this API is a bit scary at first glance. What lifetime guarantees do we make about the pointee here? (To me it looks like the ptr is good as long as the DataDirManager has been opened and was not yet destroyed.) We should at least document them.

Still, I think it would be a little more intuitive if this returned a bool or a Status and took a uint16_t* out-parameter.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 34: typedef Callback<void(const std::string&)> ErrorNotificationCb;
doc callback argument


Line 57:   void RunErrorNotificationCb(const std::string& uuid) const {
nit: arguments need docs


Line 58:     notify_cb_.Run(uuid);
It makes me a little nervous not to submit this to a thread pool. We'd have to be careful to release any locks held which may not be easy to do at a low level if we are calling back into a higher level class. Although I guess if we don't need it then "YAGNI"


Line 61:   void RunErrorNotificationCb(const DataDir* dir) const {
Can we remove this one?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 16:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 109:   fs_manager_->SetErrorNotificationCb(Bind(&TSTabletManager::FailTabletsInDataDir,
Why not do this in TSTabletManager::Init (and unset it in TSTabletManager::Shutdown)? Then FailTabletsInDataDir() could be private.


http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1067:   uint16_t* uuid_idx = fs_manager_->dd_manager()->GetUuidIdxForUuid(uuid);
Maybe pull out fs_manager_->dd_manager() into a local variable so you don't have to repeat it?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#6).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 239 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#8).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 236 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#23).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 194 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/23
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 23
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#11).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 260 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 26: Code-Review+2 Verified+1

Overriding Jenkins, I agree with Andrew that the failure looks unrelated to this change.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#4).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 233 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged.

Change subject: disk failure: coordinate error handling
......................................................................


disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager, is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Reviewed-on: http://gerrit.cloudera.org:8080/7029
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Adar Dembo <ad...@cloudera.com>
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
15 files changed, 184 insertions(+), 28 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved; Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 27
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/22/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 118: using std::set;
> warning: using decl 'set' is unused [misc-unused-using-decls]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 22
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7029/14/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

Line 29: #include "kudu/fs/log_block_manager-test-util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


http://gerrit.cloudera.org:8080/#/c/7029/14/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

Line 28: #include "kudu/fs/log_block_manager-test-util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 25: Code-Review+1

lgtm. I know Adar was looking at this at one point, so not sure if he has any further comments.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#26).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager, is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
15 files changed, 184 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/26
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#17).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
ErrorManager is added to coordinate error handling.

Callbacks are registered with the ErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the ErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 184 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/17
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#22).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 196 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/22
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 22
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#15).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 235 insertions(+), 36 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/15
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#16).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
ErrorManager is added to coordinate error handling.

Callbacks are registered with the ErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the ErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 184 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/16
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate fs error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate fs error handling
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

PS3, Line 26: #include <set>
> include ordering
Done


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS3, Line 27: EioHandleableImpl
> weird to be calling an interface "impl". also not sure about the interface 
Right, should be an Interface. The "contract" here is in order to make use of KUDU_RETURN_OR_HANDLE_EIO, the caller _must_ implement EioHandleableIf.


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS3, Line 685: LOG(WARNING) << Substitute("Block $0 is on a disk that has failed: $1",
             :                                block_id.ToString(),
             :                                dd_manager_.FindDataDirByUuidIndex(uuid_idx)->dir());
> isn't this gonna spam the log if we delete a whole tablet?
Right.. would VLOG(1) or something bet better here? I think it'd be weird to do nothing, but not sure how to accomplish that without being spammy in some cases.


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/file_block_manager.h
File src/kudu/fs/file_block_manager.h:

PS3, Line 77: FsErrorManager* error_manager = nullptr
> why allow this to be null? is that because of tests?
Yeah, i've removed this so it must not be null.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 15:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 199:   // Manager for coordinating error-handling.
Since this is a raw pointer, you should say something about lifespans here. I bet the FsErrorManager has to outlive the BlockManager, right?

Actually, since this isn't optional, I'd add it as a standalone argument to the FBM/LBM constructors instead.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 32: typedef Callback<void(const std::set<std::string>&)> ErrorNotificationCallback;
Hmm, why must this be defined here and in fs_manager.h? Surely we could define it in the "dependent" header and have the "depending" header include it?


Line 36: // Evaluates the expression and handles it if it results in an error.
These macros probably don't belong in this patch; I don't see them used at all.


PS15, Line 62: blocks
What does "blocks" mean here? If this is ReadableBlock/WritableBlock, there's not enough context to understand why they'd need to call the callback.


Line 66:   FsErrorManager() : dd_manager_(nullptr), notify_cb_(Bind(&DoNothingErrorNotification)) {}
Nit: separate on each line.


Line 82:   void FailTabletsInDataDir(DataDir* dir) {
This is a pretty meaty method; why not move it to a .cc file?


Line 100:     LOG(ERROR) << strings::Substitute("Dir $0 not tracked by DataDirManager", dir->dir());
Should this be a CHECK/DCHECK? Under what circumstances would you expect this to fire in production?


PS15, Line 106:   // Callback that fails the TSTabletManager's tablet peers.
              :   // The referenced TSTabletManager may be deleted before the FsErrorManager,
              :   // so this may be null.
This is too much cross-layer mingling. Reword it so it's scoped to only how FsErrorManager interacts with it, and how registrants should behave when it's called.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 388:   FileReadableBlock(FileBlockManager* block_manager, BlockId block_id,
Why did you have to change this? And below?


Line 544:   error_manager_->SetDataDirManager(&dd_manager_);
I don't particularly care for this. Why not hoist the DataDirManager out of the block managers and let it be owned by the FsManager? At that point the FsManager creates the BlockManager, DataDirManager, and FsErrorManager, and it can order construction so that everyone gets the right pointer. Probably DataDirManager first, then FsErrorManager, then BlockManager?


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.h
File src/kudu/fs/file_block_manager.h:

Line 106
FWIW, this is a style that I employ, reminiscent of Javadoc, where the logical sections of the function's description are separated by empty lines to emphasize their differences.


Line 98:   FsErrorManager* error_manager() { return error_manager_; }
Declare this virtual in BlockManager and override it here. Same for log_block_manager.h.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

Line 49: typedef Callback<void(const std::set<std::string>&)> ErrorNotificationCallback;
> did you consider a design where the callback just notifies the failure by U
I've only finished reviewing this patch (and not the rest) so I don't even know what set<string> means (could use a comment!). But I agree with Todd that fewer cross-references between these classes would probably lead to a clearer implementation.


Line 124:   // Registers a callback with the FsErrorManager.
Should also explain what this means. When will this callback be invoked? When invoked, what's the significance of its argument? Is it important to unregister it?


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

Line 365:   FsErrorManager* error_manager_;
Add a little comment here.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 611: Status TSTabletManager::TransitionTabletState(
Did you intend for this decomposition to be in this patch? Since there's still just one call into TransitionTabletState(), it's not clear why bother to decompose here. Todd was sort of getting at the same thing when he asked why this is public; presumably its for a different patch?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 26:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/25/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

Line 241:   unique_ptr<FsErrorManager> test_error_manager_;
> Nit: can drop std:: prefix; we have "using std::unique_ptr" at the top of t
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 16:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1067:   uint16_t* uuid_idx = fs_manager_->dd_manager()->GetUuidIdxForUuid(uuid);
> Maybe pull out fs_manager_->dd_manager() into a local variable so you don't
Did you miss this?


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS20, Line 1074: (const st
Didn't think this case was possible.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 188: 
I still think it'd be OK to make this private and wire/unwire in Init/Shutdown (as opposed to the ctor/dtor). Did Todd object to that specific kind of wiring? I thought the 'weirdness' he felt had to do with side effects in the ctor.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 109:   fs_manager_->SetErrorNotificationCb(Bind(&TSTabletManager::FailTabletsInDataDir,
> The idea here is to keep all the "wiring" at as high a level as possible. T
I can't speak for Todd, but I think his beef was primarily with the ctor/dtor aspect of the wiring. Init/Shutdown within TSTabletManager seems like it respects that, and it has the added bonus that the dependency affects two modules (FsManager and TSTabletManager) rather than three (FsManager, TSTabletManager, and TabletServer).

Your second point is valid, though if that happens we can always refactor.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#10).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
13 files changed, 267 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 20:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1072: void TSTabletManager::FailTabletsInDataDir(const string& uuid) {
Why don't we just have this be a thunk that says "something failed" and then have the TSTabletManager do something like this:

  for (const auto& tablet_id : dd_mgr->GetFailedTablets()) {
    Tablet* t = Lookup(tablet_id);
    t->Shutdown();
  }

Maybe I'm missing something for this part, but we need a lock somewhere to track which data dirs are "broken" so that we can ensure that asynchronous things like tablet copy will also abort if it's in the process of copying a tablet that has blocks now on a failed disk.


Line 1081:     LOG(INFO) << "Shutting down tablet (not implemented in this patch): " << tablet_id;
If this doesn't do anything, why not just have an empty body of the function?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#14).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 233 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/14
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate fs error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#2).

Change subject: disk failure: coordinate fs error handling
......................................................................

disk failure: coordinate fs error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs-test-util.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
13 files changed, 240 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 15:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

Line 49: typedef Callback<void(const std::set<std::string>&)> ErrorNotificationCallback;
did you consider a design where the callback just notifies the failure by UUID, and then the implementer of the callback has to call fs_manager->GetTabletsOnDiskId(uuid) to get the tablet list?

It seems slightly more straight-forward to me and also looser coupling, since it moves more of the behavior of what to do on disk failures out of the fs/ layer and into the tserver/ or tablet/ layers.

I think this could also remove the need for ErrorManager to point back into DataDirManager, eliminating a possible cycle. I haven't drawn out a collaboration diagram, but seems like:

FsManager owns a DataDirManager
FsManager owns an ErrorManager
ErrorManager references a DataDirManager

so we have a sort of triangle here which is a little smelly.

Curious for Adar's opinion on this too


PS15, Line 125: SetNotificationCallback
think this would be better SetErrorNotificationCallback to match the type


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS15, Line 164:   fs_manager_->SetNotificationCallback(Bind(&TSTabletManager::ShutdownTabletReplicasAsync,
              :                                             Unretained(this)));
it seems a bit messy to have this being "wired" from the constructor. usually it's less error-prone to have the components wired by whoever creates them, rather than "wiring themselves" during construction.

Put another way, I wouldn't expect that creating a TSTabletManager on top of a given FSManager has side effects on the FSManager


PS15, Line 175:   // Since the FsManager is an external entity, revoke its access.
              :   fs_manager_->SetNotificationCallback(Callback<void(const set<string>&)>());
same, seems a little not-quite-right here


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS15, Line 132:   // Transitions the tablet state specified by 'tablet_id' with the specified
              :   // 'reason'.
              :   Status TransitionTabletState(const string& tablet_id,
              :                                const string& reason,
              :                                scoped_refptr<tablet::TabletReplica>* replica,
              :                                scoped_refptr<TransitionInProgressDeleter>* deleter,
              :                                boost::optional<TabletServerErrorPB::Code>* error_code);
should this be private?


PS15, Line 197:  a tablet replica
the prototype indicates it operates on a bunch of replicas at once


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 109:   fs_manager_->SetErrorNotificationCb(Bind(&TSTabletManager::FailTabletsInDataDir,
> Why not do this in TSTabletManager::Init (and unset it in TSTabletManager::
The idea here is to keep all the "wiring" at as high a level as possible. Todd pointed out that it seems a bit strange for the initialization of a TSTabletManager to have an effect on the FsManager (these were originally in the c'tor and d'tor of TSTabletManager).

Another thought might be that if the error manager is ever used for other types of errors, setting these callbacks in one place may be favorable vs spreading them across multiple Init()s.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#12).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 231 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/12
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 3:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS8, Line 60:     LOG(WARNING) << strings::Substitute("Dir $0 failed, marking for failure", dir->dir());
            :     uint16_t uuid_idx;
            :     if (dd_manager_->FindUuidIndexByDataDir(dir, &uuid_idx)) {
            :       // If the directory is registered as failed, ignore it.
            :       if (dd_manager_->IsDataDirFailed(
> hm, I dont think "fallthrough intended" is really necessary here. Usually s
Done


PS8, Line 74: 
            :       return;
> not sure what this means
Done


PS8, Line 80: 
> maybe use a typedef for this?
Done


Line 81:   // Must outlive the eio error manager.
> std::move(cb)?
Done


Line 125
> nit: I think it's better to keep the specifics of what the callback _does_ 
Done. I agree the error_manager is meant to handle more than just disk failures.

I was thinking of having something like (failure type => get_observers), (failure type => notify_cb), (failure type => do other stuff) might be warranted later on. Or just leave it completely open-ended (failure type => handle error). The question is: design that now or when we have more handleable errors?


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS8, Line 121: *
> why is this taking a pointer to a callback instead of just taking a callbac
Done


Line 214:   void SetTabletFailedCallback(Callback<void(std::set<std::string>)>* cb);
> same
Done


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 684:     LOG(WARNING) << s.ToString();
> not quite following the purpose of this parameter. In the case that delete_
Done. Split out the above code to transition tablet state. Only that and replica->Shutdown() are needed from what I can tell.


Line 804:   MonoTime start(MonoTime::Now());
> is there a DCHECK we could add here that the replica is already marked fail
Moving this logic out of this patch, but done.


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 131: 
> can you update the doc as to why you would want to DeleteTablet without del
Took your other suggestion of bring out the needed logic out into its own function.


PS8, Line 190:   // TODO(awong): rather than deleting the tablet, don't do anything to the
             :   // tablet metadata and just store disk failure state instead.
             :   void FailTabletReplicas(const std::set<std::string>& tablet_ids);
             : 
> the docs don't seem to match the name here. Is this shutting it down or try
Done


Line 294: 
> I think if you made the other classes take the callback by value instead of
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate fs error handling

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: disk failure: coordinate fs error handling
......................................................................


Patch Set 3:

(4 comments)

think that this patch (no tests) and https://gerrit.cloudera.org/#/c/7030/3 should be merged and have simple unit tests (you can add your large itest separately)

I think the interface to handle EIOs is weird. If I understood correctly the purpose is to make sure that the method exists for the EIO handling macro, but the macro doesn't need it to be an interface, i.e. you can keep the method and not have an interface and the macro will work all the same.

so basically you have three levels of entities, the "block" files (which can get EIOs from Env) the block managers (and additionally the block containers for the log block manager) and the fs manager.

Here's a rough sketch of an alternative idea:
Have the multiple file*block impls have two methods (that don't need to be in an interface since they are not used externally)

DataDir* get_data_dir();
void HandleEIO(DataDir* data_dir);

The macro will call the second method with the result of the the first as argument.

Then, instead of all being aware the error_managed, each passes the error to the layer above, i.e. the file*block* passes the error to the block_manager which passes the error to the fs_manager, which finally passes the error to the error manager.

I might be missing some details that make this a non-starter, please let me know.

It would also be good to pass most (if not all stuff) at ctor time and not have branching testing whether certain things are set.

http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

PS3, Line 26: #include <set>
include ordering


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS3, Line 27: EioHandleableImpl
weird to be calling an interface "impl". also not sure about the interface itself, who uses this "contract" (and what is the contract, contracts are supposed to be externally facing things whereas this is used internally to where the EIOs happen)?


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS3, Line 685: LOG(WARNING) << Substitute("Block $0 is on a disk that has failed: $1",
             :                                block_id.ToString(),
             :                                dd_manager_.FindDataDirByUuidIndex(uuid_idx)->dir());
isn't this gonna spam the log if we delete a whole tablet?


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/file_block_manager.h
File src/kudu/fs/file_block_manager.h:

PS3, Line 77: FsErrorManager* error_manager = nullptr
why allow this to be null? is that because of tests?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 25:

(1 comment)

Just one nit.

http://gerrit.cloudera.org:8080/#/c/7029/25/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

Line 241:   std::unique_ptr<FsErrorManager> test_error_manager_;
Nit: can drop std:: prefix; we have "using std::unique_ptr" at the top of the file.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 25:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 272: 
> missing a word?
You're right that this belongs in the next patch.


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS24, Line 125:   // relevant DataDir's UUID as its input parameter.
> I think it's clearer to have this information with the typedef of ErrorNoti
Done


Line 130:   // This must be called before the callback's callee is destroyed. Calls to
> nit: can you add whether this is idempotent? safe to call if nothing is set
Done


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS24, Line 169: d 'error_
> what's this mean? I don't think this line adds much value since the fact th
Fair point, had it for symmetry with file_block_manager. I'll just remove these lines and leave the below as notes.


Line 195:   FRIEND_TEST(LogBlockManagerTest, TestMetadataTruncation);
> is this used? tried grepping for it and didn't see any call sites, but mayb
Yeah, used in a future patch. Moved.


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 189:   void FailTabletsInDataDir(const std::string& uuid);
> nit: std::string in header
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#25).

Change subject: disk failure: coordinate error handling
......................................................................

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
15 files changed, 184 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/25
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 20:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 297:   uint16_t* GetUuidIdxForUuid(const std::string& uuid) {
> I looked through here and I believe this is safe.
Hmm, for consistency with FindUuidIndexByDataDIr, sure.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 34: typedef Callback<void(const std::string&)> ErrorNotificationCb;
> doc callback argument
Done


Line 57:   void RunErrorNotificationCb(const std::string& uuid) const {
> nit: arguments need docs
Done


Line 58:     notify_cb_.Run(uuid);
> It makes me a little nervous not to submit this to a thread pool. We'd have
That's exactly the reason I even considered a thread pool. Since error handling at a low level may have locks held, we need to asynchronously run this at a higher level when those locks aren't held.

e.g. if we hit a block error while we FlushMRS, we can't shut down the tablet because we'll be caught waiting for the rowsets_flush_sem lock.


Line 61:   void RunErrorNotificationCb(const DataDir* dir) const {
> Can we remove this one?
Ah, yeah that is meant to be in a different patch. Thanks for pointing that out.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 141:     fs_manager_->UnsetErrorNotificationCb();
> Nit: move this down to just before tablet_manager_->Shutdown() so it's clea
Done


http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1067:   }
> Maybe pull out fs_manager_->dd_manager() into a local variable so you don't
Done


Line 1067:   }
> Did you miss this?
Arg I did, thanks for pointing that out.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 117: using rpc::ResultTracker;
> warning: using decl 'ResultTracker' is unused [misc-unused-using-decls]
Done


Line 121: using std::unique_ptr;
> warning: using decl 'unique_ptr' is unused [misc-unused-using-decls]
Done


Line 1072: void TSTabletManager::FailTabletsInDataDir(const string& uuid) {
> Why don't we just have this be a thunk that says "something failed" and the
I think the point of the thunk would be to let handling happen without knowing about which directory has failed, right? While shutting down tablets should be idempotent anyway, we should be able to limit the number of calls to Shutdown().

Also the FailTabletsInDataDir() is the one who's marking the dir as failed.

Your point about tablet copies is definitely valid. I think right now if we were tablet copying while something else triggers a disk failure, the tablet copy won't be able to tell that there's a disk failure. That said, coverage for that should be in a different patch.


PS20, Line 1074: !uuid_idx
> Didn't think this case was possible.
Hrm, considering how it should be used, it's not. I'll make that a bit more strict.


Line 1081:     LOG(INFO) << "Shutting down tablet (not implemented in this patch): " << tablet_id;
> If this doesn't do anything, why not just have an empty body of the functio
The way I've split these patches up, I'm sort of building this function up with each patch (and only using this function in the final patch, once everything else is complete).

The hope was that it would help you see where the changes are being used without having to review everything, since they'll only be functional with all the patches.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 188:   // Forces shutdown of the tablet replicas in the data dir corresponding to 'uuid'.
> I still think it'd be OK to make this private and wire/unwire in Init/Shutd
As per our hipchat discussion, keeping it here


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 4:

For now I'm keeping usage and testing separate since placement and lifespan are discussion points that can be localized to a single patch.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] disk failure: coordinate IO error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate IO error handling
......................................................................


Patch Set 1:

Definitely open to feedback on the observer pattern-ish design

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] disk failure: coordinate error handling

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 16:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 199: 
> Since this is a raw pointer, you should say something about lifespans here.
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 32: typedef Callback<void(const std::string&)> ErrorNotificationCb;
> Hmm, why must this be defined here and in fs_manager.h? Surely we could def
Done, keeping it here and putting error_manager.h in fs_manager.h.


Line 36: // multiple layers, many of which we prefer to keep separate. The FsErrorManager
> These macros probably don't belong in this patch; I don't see them used at 
Done


PS15, Line 62: 
> What does "blocks" mean here? If this is ReadableBlock/WritableBlock, there
Done


Line 66: }  // namespace kudu
> Nit: separate on each line.
Done


Line 82
> This is a pretty meaty method; why not move it to a .cc file?
Moved all the logic out of the error manager.


Line 100
> Should this be a CHECK/DCHECK? Under what circumstances would you expect th
You're right that this should never happen. Done


PS15, Line 106: 
              : 
              : 
> This is too much cross-layer mingling. Reword it so it's scoped to only how
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 388:   FileReadableBlock(const FileBlockManager* block_manager, BlockId block_id,
> Why did you have to change this? And below?
For a future patch, it was needed to set the DataDirManager in the FsErrorManager. This isn't a requirement anymore so I'll revert it.


Line 544:                                            "file_block_manager",
> I don't particularly care for this. Why not hoist the DataDirManager out of
I've restructured this a bit so the error manager doesn't need to know about any external state other than the callbacks.

I do intend on moving the DataDirManager out of the BlockManager (and also likely renaming it to DirectoryManager or something), but in another patch.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.h
File src/kudu/fs/file_block_manager.h:

Line 106
> FWIW, this is a style that I employ, reminiscent of Javadoc, where the logi
I see, I've seen this style around, just not with such short sections. Reverting


Line 98: 
> Declare this virtual in BlockManager and override it here. Same for log_blo
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

Line 49: 
> I've only finished reviewing this patch (and not the rest) so I don't even 
The key, I think, was pushing all of the logic into the clearly-defined layers (e.g. fail the tablets in the TSTabletManager instead of here).

Doing so also opens it up for a bit more flexibility (using a string instead of set<string> as an input to this cb).


Line 49: 
> did you consider a design where the callback just notifies the failure by U
Done


Line 124:   // This callback will be run when block operations run into disk failures.
> Should also explain what this means. When will this callback be invoked? Wh
Done


PS15, Line 125: e input to the callback
> think this would be better SetErrorNotificationCallback to match the type
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

Line 365: 
> Add a little comment here.
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS15, Line 164:   // Search for tablets in the metadata dir.
              :   vector<string> tablet_ids;
> it seems a bit messy to have this being "wired" from the constructor. usual
Done, moved to the wiring to TabletServer Start()/Shutdown()


PS15, Line 175:   int loaded_count = 0;
              :   for (const string& tablet_id : tablet_ids) {
> same, seems a little not-quite-right here
Done


Line 611:   TabletDataState data_state = replica->tablet_metadata()->tablet_data_state();
> Did you intend for this decomposition to be in this patch? Since there's st
Hrm, yeah originally there was a usage that was very specific to disk failure handling that's in a separate patch. Will revert.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS15, Line 132:                     scoped_refptr<tablet::TabletReplica>* replica) const;
              : 
              :   // Same as LookupTablet but doesn't acquired the shared lock.
              :   bool LookupTabletUnlocked(const std::string& tablet_id,
              :                             scoped_refptr<tablet::TabletReplica>* replica) const;
              : 
              :   virtual Status GetTabletReplica(const std::string& tablet_id,
> should this be private?
Done


PS15, Line 197: 
> the prototype indicates it operates on a bunch of replicas at once
Restructured the calls.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes