You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/05/01 18:29:17 UTC

[kudu-CR] WIP Don't suicide on EIO

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6773

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_ERROR. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_ERROR.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/cfile/cfile_writer.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs-test-util.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/tablet.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env_posix.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
19 files changed, 359 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#4).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env.h
M src/kudu/util/env_posix.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
24 files changed, 463 insertions(+), 50 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] WIP Don't suicide on EIO

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: WIP Don't suicide on EIO
......................................................................


Patch Set 11:

(2 comments)

had some old comments, feel free to ignore if no longer relevant.
did the contents of this patch end up somewhere else?

http://gerrit.cloudera.org:8080/#/c/6773/3/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

PS3, Line 48: FsErrorMan
maybe EioHandler or EioHandlerManager would be a better name>


Line 135:   virtual Status FlushDataAsync() = 0;
spurious change


-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has abandoned this change.

Change subject: WIP Don't suicide on EIO
......................................................................


Abandoned

The contents of this patch have been split across gerrit/c/[7028, 7031]

-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#10).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

Also included is a new tablet data state called TABLET_DATA_CORRUPTED
in which tablet data will not be deleted until explicitly requested
or during a tablet copy.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_metadata.h
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/status.h
22 files changed, 398 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/10
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#7).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

Also included is a new tablet data state called TABLET_DATA_CORRUPTED
in which tablet data will not be deleted until explicitly requested
or during a tablet copy.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_metadata.h
A src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env.h
M src/kudu/util/env_posix.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
27 files changed, 756 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#2).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/tablet.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env.h
M src/kudu/util/env_posix.cc
M src/kudu/util/env_util.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
19 files changed, 350 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#8).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

Also included is a new tablet data state called TABLET_DATA_CORRUPTED
in which tablet data will not be deleted until explicitly requested
or during a tablet copy.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_metadata.h
A src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env.h
M src/kudu/util/env_posix.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
27 files changed, 756 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#11).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

Also included is a new tablet data state called TABLET_DATA_CORRUPTED
in which tablet data will not be deleted until explicitly requested
or during a tablet copy.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_metadata.h
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/status.h
22 files changed, 354 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/11
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#9).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

Also included is a new tablet data state called TABLET_DATA_CORRUPTED
in which tablet data will not be deleted until explicitly requested
or during a tablet copy.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_metadata.h
A src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env.h
M src/kudu/util/env_posix.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
27 files changed, 752 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#6).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

Also included is a new tablet data state called TABLET_DATA_CORRUPTED
in which tablet data will not be deleted until explicitly requested
or during a tablet copy.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_metadata.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env.h
M src/kudu/util/env_posix.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
27 files changed, 516 insertions(+), 60 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#5).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

Also included is a new tablet data state called TABLET_DATA_CORRUPTED
in which tablet data will not be deleted until explicitly requested
or during a tablet copy.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_metadata.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env.h
M src/kudu/util/env_posix.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
28 files changed, 519 insertions(+), 60 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6773

to look at the new patch set (#3).

Change subject: WIP Don't suicide on EIO
......................................................................

WIP Don't suicide on EIO

Rather than suiciding when reaching an EIO, this patch adds a
mechanism that triggers error-handling in the form of a callback.
This handler is attached to the lowest non-env operations that may
result in EIO.
E.g. PosixRWFile::Write() is an env operation that may result in an
EIO. LogBlockContainer::WriteData()'s call to it must be wrapped in
the new error-handling macro KUDU_RETURN_OR_HANDLE_EIO. Thus, all
direct readers/writers of files must now implement EIO-handling code
and wrap disk IO in KUDU_RETURN_OR_HANDLE_EIO.

This patch is marked WIP, as the correct behavior after a disk fails
is not yet included. For now, upon failure, the block managers will
keep track of the dead disks and will mark the appropriate tablet
peers as failed. Additionally, the error handling has been moved
between layers offline, so more cleanup is being done to align with
the above description.

Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
---
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/tablet.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/util/env.h
M src/kudu/util/env_posix.cc
M src/kudu/util/env_util.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
M src/kudu/util/status.h
20 files changed, 426 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/6773/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] WIP Don't suicide on EIO

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: WIP Don't suicide on EIO
......................................................................


Patch Set 11:

(17 comments)

This patch has been split into a few smaller ones! Will abandon this one.

http://gerrit.cloudera.org:8080/#/c/6773/10/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 24: #include <set>
> warning: #includes are not sorted properly [llvm-include-order]
Done


http://gerrit.cloudera.org:8080/#/c/6773/10/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

Line 310:   RETURN_NOT_OK_PREPEND(tablet_replica_->Init(tablet,
> warning: missing username/bug in TODO [google-readability-todo]
This TODO isn't actionable. Removed.


http://gerrit.cloudera.org:8080/#/c/6773/10/src/kudu/tablet/tablet_metadata.h
File src/kudu/tablet/tablet_metadata.h:

Line 172:   void AddOrphanedBlocks(const std::vector<BlockId>& blocks);
> warning: function 'kudu::tablet::TabletMetadata::AddOrphanedBlocks' has a d
Changed to 'blocks' to align with other functions.


http://gerrit.cloudera.org:8080/#/c/6773/10/src/kudu/tserver/tablet_server-test.cc
File src/kudu/tserver/tablet_server-test.cc:

Line 2490
> error: use of undeclared identifier 'set' [clang-diagnostic-error]
Done


Line 2490
> error: use of undeclared identifier 'failed_data_dirs' [clang-diagnostic-er
Done


Line 2490
> error: unexpected type name 'uint16_t': expected expression [clang-diagnost
Done


Line 2491
> error: use of undeclared identifier 'failed_data_dirs' [clang-diagnostic-er
Done


Line 2492
> error: use of undeclared identifier 'failed_data_dirs' [clang-diagnostic-er
Done


Line 2505
> error: use of undeclared identifier 'failed_data_dirs' [clang-diagnostic-er
Done


Line 2506
> error: use of undeclared identifier 'failed_data_dirs' [clang-diagnostic-er
Done


http://gerrit.cloudera.org:8080/#/c/6773/10/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1085: 
> error: unknown type name 'set'; did you mean 'std::set'? [clang-diagnostic-
Done


Line 1085: 
> warning: all parameters should be named in a function [readability-named-pa
Done


Line 1085: 
> error: expected ')' [clang-diagnostic-error]
Done


Line 1088:   // middle of a bootstrap, tablet copy, etc.
> error: reference to type 'const string' (aka 'const basic_string<char>') co
Done


Line 1088:   // middle of a bootstrap, tablet copy, etc.
> error: use of undeclared identifier 'tablet_ids'; did you mean 'tablet_id'?
Done


http://gerrit.cloudera.org:8080/#/c/6773/10/src/kudu/util/status.h
File src/kudu/util/status.h:

Line 100:   if (s.posix_code() == (err_number)) (err_handler); \
> warning: macro argument should be enclosed in parentheses [misc-macro-paren
Done


Line 108:   if (s.posix_code() == (err_number)) (err_handler); \
> warning: macro argument should be enclosed in parentheses [misc-macro-paren
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6773
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c221a5ea97c7b7ec5f8d395b5527dc16e63f828
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes