You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/07/15 00:25:26 UTC

[kudu-CR] disk failure: don't fail CHECKs for disk failures

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7442

Change subject: disk failure: don't fail CHECKs for disk failures
......................................................................

disk failure: don't fail CHECKs for disk failures

Disk failures are a special case of errors that will be handled. Certain
code paths pass along disk failure Statuses until they eventually hit a
CHECK and crash Kudu.

With this patch, these code paths will instead allow Kudu to continue
running, under the assumption that the errors are handled.

A small test is added to tablet-test to insert some data and fail.
Rather than crashing, the end-state of the tablet must indicated a disk
failure.

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/local_tablet_writer.h
M src/kudu/tablet/tablet-test-base.h
M src/kudu/tablet/tablet-test.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 130 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 14: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 14
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 23 Nov 2017 04:13:51 +0000
Gerrit-HasComments: No

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, leaving it for eviction.

NOTE: failures of metadata file and the WAL directory are fatal. Code
paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Reviewed-on: http://gerrit.cloudera.org:8080/7442
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 152 insertions(+), 42 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Andrew Wong: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 15
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: shutdown tablets on disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#4).

Change subject: disk failure: shutdown tablets on disk failure
......................................................................

disk failure: shutdown tablets on disk failure

Disk failures are a special case of errors that will be handled. Certain
code paths pass along disk failure Statuses until they eventually hit a
check failure and crash the server. These fatal errors were "safe"
before as they would ensure no additional changes were made durable to
each tablet.

These failures are not permitted provided the following have occurred:
- tell the tablet's MvccManager that it's shutting down
- tell the replica that it's shutting down
- submit a request to the threadpool that the tablet is shutting down
- the data directory is marked failed to prevent further IO

Additionally, scan paths that previously never returned due to the
fatality of disk failures now return with a TABLET_FAILED response.

Testing is done in separate patches.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 2.4 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 157 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#10).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, eventually marking it for eviction.

Beyond the above functionality, to cancel replica maintenance ops along
with the rest of the error handling, I updated the locking behavior of
TabletReplica so access to its maintenance ops can be done in a
thread-safe way (by guarding the list of ops with the replica's lock).

NOTE: failures of the metadata directory and the WAL directory are
fatal. Code paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 166 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1367
PS11, Line 1367: Tablet* tablet = replica->tablet()
prefer: shared_ptr<Tablet> tablet = replica->shared_tablet();

because the TabletReplica is allowed to delete the Tablet when it shuts down.


http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1368
PS11, Line 1368: (!tablet || tablet->HasBeenStopped())
hmm. this doesn't really make sense to me. Don't we want to shut down the replica regardless of this stuff?



-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 11
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:39:20 +0000
Gerrit-HasComments: Yes

[kudu-CR] disk failure: don't fail CHECKs for disk failures

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#2).

Change subject: disk failure: don't fail CHECKs for disk failures
......................................................................

disk failure: don't fail CHECKs for disk failures

Disk failures are a special case of errors that will be handled. Certain
code paths pass along disk failure Statuses until they eventually hit a
CHECK and crash Kudu.

With this patch, these code paths will instead allow Kudu to continue
running, under the assumption that the errors are handled.

A small test is added to tablet-test to insert some data and fail.
Rather than crashing, the end-state of the tablet must indicated a disk
failure.

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/local_tablet_writer.h
M src/kudu/tablet/tablet-test-base.h
M src/kudu/tablet/tablet-test.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 130 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#7).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, eventually marking it for eviction.

NOTE: failures of the metadata directory and the WAL directory are
fatal. Code paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 168 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#11).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, eventually marking it for eviction.

Beyond the above functionality, to cancel replica maintenance ops along
with the rest of the error handling, I updated the locking behavior of
TabletReplica so access to its maintenance ops can be done in a
thread-safe way (by guarding the list of ops with the replica's lock).

NOTE: failures of the metadata directory and the WAL directory are
fatal. Code paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 162 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 11
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 14:

Carrying forward Mike's +2


-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 14
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 23 Nov 2017 04:14:07 +0000
Gerrit-HasComments: No

[kudu-CR] disk failure: don't fail CHECKs for disk failures

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: disk failure: don't fail CHECKs for disk failures
......................................................................


Patch Set 2:

(2 comments)

Like I alluded in some comments, I think this patch needs to be split out into one patch per change. Each removal of a CHECK should include tests for all of the new cases that used to be impossible to hit without a crash.

I also think it would be good to have an overall error handling design documented, either in a doc or a header file or something, so that we have a base layer of error handling set up and then systematically implement the new error handling in each crash case until we don't crash anymore and we know we safely handle each former crash case.

http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet_replica_mm_ops.cc
File src/kudu/tablet/tablet_replica_mm_ops.cc:

Line 32: DECLARE_bool(suicide_on_eio);
unused?


Line 137:     LOG(ERROR) << Substitute("FlushMRS failed on $0", tablet_replica_->tablet_id());
Hrm. Is it safe to change a CHECK() to a LOG(ERROR) here? I would like to see tests for all the new possible error cases we are unwrapping here.

Diving into FlushUnlocked(), it looks like we create a new MRS and then we call FlushInternal(). Can we fail to create a new MRS? In FlushInternal(), there is a CHECK based on state_ and I am not sure how that is enforced.

I think this change by itself deserved its own patch.


-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#13).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, leaving it for eviction.

NOTE: failures of metadata file and the WAL directory are fatal. Code
paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 155 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/13
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 13
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 6:

(3 comments)

I'm starting to understand the flow of this. I was wondering, is there somewhere that we discuss, at the class design or implementation level, how this error handling is going to work? I know there is a high level design document but I don't think it covers this level of detail and I think that error handling is complex enough that it warrants at least a couple of paragraphs somewhere about how it's intended to work in terms of the control flow.

For example, IO error detected in some low-level routine, calls back into error manager, which calls back into tablet manager, which transitions to stopping state and triggers an async task that shuts down each component. And a little bit of discussion of the edge cases. What do you think? It might help crystallize our thinking on this topic, including lifecycle states required and new transitions that were previously impossible.

http://gerrit.cloudera.org:8080/#/c/7442/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/6/src/kudu/tserver/ts_tablet_manager.cc@742
PS6, Line 742: TransitionTabletState
how about StartTabletStateTransition() since this is effectively a locked version of StartTabletStateTransitionUnlocked(), plus a lookup


http://gerrit.cloudera.org:8080/#/c/7442/6/src/kudu/tserver/ts_tablet_manager.cc@1357
PS6, Line 1357:   while (true) {
hrm. This could run for quite a while, unless we can interrupt tablet copy...


http://gerrit.cloudera.org:8080/#/c/7442/6/src/kudu/tserver/ts_tablet_manager.cc@1366
PS6, Line 1366: replica->tablet()->Stopped()
maybe we want to DCHECK(Stopped()) ?



-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 Nov 2017 01:47:50 +0000
Gerrit-HasComments: Yes

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#12).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, leaving it for eviction.

NOTE: failures of the metadata directory and the WAL directory are
fatal. Code paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/fs/data_dirs.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
9 files changed, 156 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/12
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 12
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#14).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, leaving it for eviction.

NOTE: failures of metadata file and the WAL directory are fatal. Code
paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 152 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/14
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 14
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#5).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
* The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
* The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
* A request is submitted to a threadpool to eventually completely shut
  down the replica, eventually marking it for eviction.

Note: failures of the metadata directory and the WAL directory are
fatal.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 2.4 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 155 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 12: Code-Review+2

I've convinced myself this should work.


-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 12
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 23 Nov 2017 02:13:49 +0000
Gerrit-HasComments: No

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 10:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet.h
File src/kudu/tablet/tablet.h:

http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet.h@157
PS2, Line 157: 
> Seems like this has a new (sort of awkward) contract now since you have to 
Hrm, the contract was meant to freeze any further or currently-running calls to these functions. Good point that we might just be able to get by by returning a status.


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.h
File src/kudu/tablet/tablet_replica.h:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.h@268
PS10, Line 268:   void RegisterMaintenanceOps(MaintenanceManager* maintenance_manager);
> warning: function 'kudu::tablet::TabletReplica::RegisterMaintenanceOps' has
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@733
PS10, Line 733: void TabletReplica::MakeUnavailable(const string& reason) {
> My suggestion for this function:
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@739
PS10, Line 739:       tablet_->Stop();
> is there a reason we need to call tablet_->Stop() while holding TabletRepli
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@740
PS10, Line 740:       tablet_->CancelMaintenanceOps();
> this isn't needed because Stop() does it automatically
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@749
PS10, Line 749: Status::IOError(reason, "", EIO)
> This seems a bit unexpected. If we need a Status why not pass it into MakeU
Done


http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet_replica_mm_ops.cc
File src/kudu/tablet/tablet_replica_mm_ops.cc:

http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet_replica_mm_ops.cc@137
PS2, Line 137:   CHECK(!tablet->rowsets_flush_sem_.try_lock());
> Hrm. Is it safe to change a CHECK() to a LOG(ERROR) here? I would like to s
Duly noted. I'm still thinking about how to best test this (other than the integration tests for disk failure as a whole), which I could move into this patch as well.

Also note that this is being replaced with a LOG(ERROR) and then another CHECK below. If there is a disk failure, it _shouldn't_ matter what state the MRSs are in, since the tablet shouldn't be used anyway. In the error-handling code, it's set to FAILED, so scans won't be able to touch it, and further transactions and maintenance ops should no longer be run.


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@125
PS10, Line 125:   // Transitions the tablet state specified by 'tablet_id' with the specified
> How about using this for the comment:
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@127
PS10, Line 127:   Status StartTransitionTabletState(const std::string& tablet_id,
> 1. why is this public?
1. Done
2. Done
3. I think we do need to transition to avoid bootstrap/copy overlapping with shutdown


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@212
PS10, Line 212: FailTablet
> Perhaps a better name for this method would be FailTabletAndScheduleShutdow
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@309
PS10, Line 309: FailTabletAsync
> The Async suffix makes me think that this schedules work on a background th
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1316
PS10, Line 1316:     if (!s.IsDiskFailure()) {
               :       return s;
               :     }
               :     LOG(FATAL) << Substitute("Tablet $0's consensus metadata is in a failed disk", tablet_id);
> 1) s/in a failed disk/on a failed disk/
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1351
PS10, Line 1351:   LOG(INFO) << Substitute("Forcing failure of tablet $0", tablet_id);
> use LogPrefix(tablet_id) here
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1372
PS10, Line 1372:     s = StartTransitionTabletState(tablet_id, "failing tablet",
> What is the purpose of all this transitioning stuff in here?
Logically, that's what it's doing, but I think shutdown, bootstrapping, and copying are designed to be synchronized in this way (they shouldn't overlap)


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1384
PS10, Line 1384: CHECK(!replica->error().ok());
> At what point did we set this already?
This happens at the synchronous part of the callback (MakeUnavailable()), although you're right that it's not entirely necessary. Will remove.



-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 22 Nov 2017 03:47:44 +0000
Gerrit-HasComments: Yes

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#9).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, eventually marking it for eviction.

NOTE: failures of the metadata directory and the WAL directory are
fatal. Code paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 168 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 13: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 13
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 23 Nov 2017 02:34:52 +0000
Gerrit-HasComments: No

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 10:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@733
PS10, Line 733: void TabletReplica::MakeUnavailable(const string& reason) {
My suggestion for this function:

  void TabletReplica::MakeUnavailable(const Status& reason) {
    scoped_refptr<Tablet> tablet;
    {
      lock_guard<simple_spinlock> l(lock_);
      SetError(reason);
      tablet = tablet_;
      for (MaintenanceOp* op : maintenance_ops_) {
        op->CancelAndDisable();
      }
    }
    if (tablet) tablet->Stop();
  }


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@739
PS10, Line 739:       tablet_->Stop();
is there a reason we need to call tablet_->Stop() while holding TabletReplica::lock_ ?


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@740
PS10, Line 740:       tablet_->CancelMaintenanceOps();
this isn't needed because Stop() does it automatically


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@749
PS10, Line 749: Status::IOError(reason, "", EIO)
This seems a bit unexpected. If we need a Status why not pass it into MakeUnavailable() ?


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@125
PS10, Line 125:   // Transitions the tablet state specified by 'tablet_id' with the specified
How about using this for the comment:

  // Marks the replica indicated by 'tablet_id' as being in a transitional state. Returns an error status if the replica is already in a transitional state.


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@127
PS10, Line 127:   Status StartTransitionTabletState(const std::string& tablet_id,
1. why is this public?
2. How about naming this: BeginReplicaStateTransition()
3. I'm not sure we need this helper...


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@212
PS10, Line 212: FailTablet
Perhaps a better name for this method would be FailTabletAndScheduleShutdown() ... I know it's long but at least it's very descriptive.


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@309
PS10, Line 309: FailTabletAsync
The Async suffix makes me think that this schedules work on a background thread. All this seems to do is shut down the tablet, not sure what it has to do with failing it, so perhaps it should be called ShutdownTablet() or something... or maybe we don't even need a special function for this


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1316
PS10, Line 1316:     if (!s.IsDiskFailure()) {
               :       return s;
               :     }
               :     LOG(FATAL) << Substitute("Tablet $0's consensus metadata is in a failed disk", tablet_id);
1) s/in a failed disk/on a failed disk/
2) It would be better to include the server UUID in a fatal message, i.e. use LogPrefix(tablet_id)
3) nit: I think here and below it would be slightly easier to read if you don't negate the condition:

    if (s.IsDiskFailure()) {
      LOG(FATAL) << LogPrefix(tablet_id) << "consensus metadata is on a failed disk";
    }
    return s;


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1351
PS10, Line 1351:   LOG(INFO) << Substitute("Forcing failure of tablet $0", tablet_id);
use LogPrefix(tablet_id) here


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1372
PS10, Line 1372:     s = StartTransitionTabletState(tablet_id, "failing tablet",
What is the purpose of all this transitioning stuff in here?

Correct me if I'm wrong, but I think this function could be reduced to:

  void TSTabletManager::FailTabletAsync(scoped_refptr<TabletReplica>& replica, Status reason) {
    replica->SetError(reason); // Transition to FAILED state.
    replica->Shutdown();
  }


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1384
PS10, Line 1384: CHECK(!replica->error().ok());
At what point did we set this already?



-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 21 Nov 2017 08:06:55 +0000
Gerrit-HasComments: Yes

[kudu-CR] disk failure: don't fail CHECKs for disk failures

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#3).

Change subject: disk failure: don't fail CHECKs for disk failures
......................................................................

disk failure: don't fail CHECKs for disk failures

Disk failures are a special case of errors that will be handled. Certain
code paths pass along disk failure Statuses until they eventually hit a
CHECK and crash Kudu.

With this patch, these code paths will instead allow Kudu to continue
running, under the assumption that the errors are handled.

A small test is added to tablet-test to insert some data and fail.
Rather than crashing, the end-state of the tablet must indicated a disk
failure.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 2.4 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/local_tablet_writer.h
M src/kudu/tablet/tablet-test-base.h
M src/kudu/tablet/tablet-test.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 130 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1367
PS11, Line 1367:     }
> prefer: shared_ptr<Tablet> tablet = replica->shared_tablet();
Done


http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1368
PS11, Line 1368: he tablet is healthy, or is already e
> hmm. this doesn't really make sense to me. Don't we want to shut down the r
Fair point, I think. This might be confusing because it is its own separate function. I'll put it in FailTabletAndScheduleShutdown(). That way, it's easier to see what/where things are set.

Re: passing the replica, I still think that's tricky because we're going to have to transition externally (e.g. what if we ended up calling shutdown in the middle of a copy?)



-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 12
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 23 Nov 2017 00:31:55 +0000
Gerrit-HasComments: Yes

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#6).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
* The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
* The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
* A request is submitted to a threadpool to eventually completely shut
  down the replica, eventually marking it for eviction.

Note: failures of the metadata directory and the WAL directory are
fatal.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 2.4 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 155 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: don't fail CHECKs for disk failures

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: disk failure: don't fail CHECKs for disk failures
......................................................................


Patch Set 2:

(4 comments)

I didn't finish getting through this whole patch yet but here are some high level comments.

I feel like I'm little out of the loop. Is there a design doc or a header comment or something to summarize how disk error detection and handling works?

Currently, there are cases that we didn't handle to be "safe" without certain durability guarantees. So it needs to be a little clearer how we are handling those, and I guess it either needs to be explicit error handling in each case (rollback, etc) or a way to mark the tablet as permanently failed / dead without affecting the rest of the server.

http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/delta_tracker.cc
File src/kudu/tablet/delta_tracker.cc:

Line 696:   // TODO(unknown): need to figure out what to do with error handling if this
This should be TODO(todd).

Keeping this TODO makes me nervous. Is this handled or not? :)


http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/local_tablet_writer.h
File src/kudu/tablet/local_tablet_writer.h:

Line 104:     if (tablet_->IsDataInFailedDir()) {
This sort of feels out of place. See other comments


http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

Line 803: void Tablet::ApplyRowOperations(WriteTransactionState* tx_state) {
Seems like this should return Status now?


http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet.h
File src/kudu/tablet/tablet.h:

Line 157:   void ApplyRowOperations(WriteTransactionState* tx_state);
Seems like this has a new (sort of awkward) contract now since you have to check some flags after it returns. That should at least be documented but more likely I think this should return a Status on failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1368
PS11, Line 1368: he tablet is healthy, or is already e
> Fair point, I think. This might be confusing because it is its own separate
we can still lock the thing and then use the current replica. Replicas can be replaced but their lifecycles are a DAG



-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 12
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 23 Nov 2017 00:41:15 +0000
Gerrit-HasComments: Yes