Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/07/15 00:25:26 UTC

[kudu-CR] disk failure: don't fail CHECKs for disk failures

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7442

Change subject: disk failure: don't fail CHECKs for disk failures
......................................................................

disk failure: don't fail CHECKs for disk failures

Disk failures are a special case of errors that will be handled. Certain
code paths pass along disk failure Statuses until they eventually hit a
CHECK and crash Kudu.

With this patch, these code paths will instead allow Kudu to continue
running, under the assumption that the errors are handled.

A small test is added to tablet-test to insert some data and fail.
Rather than crashing, the end-state of the tablet must indicate a disk
failure.

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/local_tablet_writer.h
M src/kudu/tablet/tablet-test-base.h
M src/kudu/tablet/tablet-test.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 130 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 14: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 14
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 23 Nov 2017 04:13:51 +0000
Gerrit-HasComments: No

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Previously, various code paths passed along disk failure Statuses until
they eventually hit a CHECK failure and crashed the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and to prevent I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, leaving it for eviction.
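
A minimal sketch of the three steps listed above, for illustration only.
Every name here (DirTracker, WriteGate, Replica, TaskQueue,
HandleDiskFailure) is a hypothetical stand-in and not Kudu's actual code.
WriteGate plays the role of the MvccManager, and TaskQueue is a
single-threaded deferred queue standing in for the threadpool.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Kudu's classes.

// Tracks which data directories have been marked failed.
class DirTracker {
 public:
  void MarkFailed(const std::string& dir) {
    std::lock_guard<std::mutex> l(lock_);
    failed_.insert(dir);
  }
  bool IsFailed(const std::string& dir) const {
    std::lock_guard<std::mutex> l(lock_);
    return failed_.count(dir) > 0;
  }
 private:
  mutable std::mutex lock_;
  std::set<std::string> failed_;
};

// Gate that a write must pass before it can become durable.
class WriteGate {
 public:
  void Shutdown() { open_.store(false); }
  bool TryStartWrite() const { return open_.load(); }
 private:
  std::atomic<bool> open_{true};
};

struct Replica {
  explicit Replica(std::string d) : dir(std::move(d)) {}
  std::string dir;        // data directory this replica lives in
  WriteGate gate;         // assumed analogue of the MvccManager
  bool shut_down = false; // set once the full shutdown has run
};

// Deferred task queue standing in for a threadpool; tasks run on RunAll().
class TaskQueue {
 public:
  void Submit(std::function<void()> task) { tasks_.push_back(std::move(task)); }
  void RunAll() {
    std::vector<std::function<void()>> pending;
    pending.swap(tasks_);
    for (auto& t : pending) t();
  }
 private:
  std::vector<std::function<void()>> tasks_;
};

// Applies the three steps to every replica in the failed directory.
void HandleDiskFailure(const std::string& dir, DirTracker* dirs,
                       const std::vector<Replica*>& replicas, TaskQueue* pool) {
  dirs->MarkFailed(dir);                        // 1. mark the directory failed
  for (Replica* r : replicas) {
    if (r->dir != dir) continue;
    r->gate.Shutdown();                         // 2. block further writes now
    pool->Submit([r] { r->shut_down = true; }); // 3. full shutdown happens later
  }
}
```

In this sketch the write gate closes synchronously, so no new write can
start after HandleDiskFailure returns, while the heavier replica shutdown
is deferred until the queue runs.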

NOTE: failures of the metadata file and the WAL directory are fatal. Code
paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Reviewed-on: http://gerrit.cloudera.org:8080/7442
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 152 insertions(+), 42 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Andrew Wong: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 15
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: shutdown tablets on disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#4).

Change subject: disk failure: shutdown tablets on disk failure
......................................................................

disk failure: shutdown tablets on disk failure

Disk failures are a special case of errors that will be handled. Certain
code paths pass along disk failure Statuses until they eventually hit a
check failure and crash the server. These fatal errors were "safe"
before as they would ensure no additional changes were made durable to
each tablet.

These failures are permitted provided the following have occurred:
- the tablet's MvccManager is told that it's shutting down
- the replica is told that it's shutting down
- a request is submitted to the threadpool to shut down the tablet
- the data directory is marked failed to prevent further IO

Additionally, scan paths that previously never returned due to the
fatality of disk failures now return with a TABLET_FAILED response.

Testing is done in separate patches.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 2.4 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 157 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/4

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#10).

Change subject: shutdown tablets on disk failure at runtime
......................................................................

shutdown tablets on disk failure at runtime

Previously, various code paths passed along disk failure Statuses until
they eventually hit a CHECK failure and crashed the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and to prevent I/O to the tablet.
- A request is submitted to a threadpool to completely shut down the
  replica, eventually marking it for eviction.

Beyond the above functionality, to cancel replica maintenance ops along
with the rest of the error handling, I updated the locking behavior of
TabletReplica so access to its maintenance ops can be done in a
thread-safe way (by guarding the list of ops with the replica's lock).
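
As a rough illustration of guarding a list of ops with a single lock, here
is a small sketch. ReplicaOps and MaintenanceOp are hypothetical names and
not the TabletReplica code itself; refusing registration after
cancellation is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a maintenance op; not Kudu's class.
struct MaintenanceOp {
  explicit MaintenanceOp(std::string n) : name(std::move(n)) {}
  std::string name;
  bool cancelled = false;
};

// Every access to ops_ goes through lock_, so registration and
// cancellation can safely race from different threads.
class ReplicaOps {
 public:
  // Returns false if ops were already cancelled (assumed behavior).
  bool Register(std::string name) {
    std::lock_guard<std::mutex> l(lock_);
    if (cancelled_) return false;
    ops_.emplace_back(std::move(name));
    return true;
  }

  // Cancels all registered ops; returns how many were newly cancelled.
  std::size_t CancelAll() {
    std::lock_guard<std::mutex> l(lock_);
    cancelled_ = true;
    std::size_t n = 0;
    for (MaintenanceOp& op : ops_) {
      if (!op.cancelled) {
        op.cancelled = true;
        ++n;
      }
    }
    return n;
  }

  // Number of ops that have not been cancelled.
  std::size_t NumActive() const {
    std::lock_guard<std::mutex> l(lock_);
    std::size_t n = 0;
    for (const MaintenanceOp& op : ops_) {
      if (!op.cancelled) ++n;
    }
    return n;
  }

 private:
  mutable std::mutex lock_;
  bool cancelled_ = false;
  std::vector<MaintenanceOp> ops_;
};
```

Because the cancelled flag and the list share one lock, an op cannot be
registered between the cancellation decision and the walk over the list.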

NOTE: failures of the metadata directory and the WAL directory are
fatal. Code paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 166 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/10

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] shutdown tablets on disk failure at runtime

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.

Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
......................................................................


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1367
PS11, Line 1367: Tablet* tablet = replica->tablet()
prefer: shared_ptr<Tablet> tablet = replica->shared_tablet();

because the TabletReplica is allowed to delete the Tablet when it shuts down.
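
A small sketch of the lifetime concern behind this suggestion. The classes
below are simplified stand-ins that only borrow the method names from the
comment (tablet(), shared_tablet(), HasBeenStopped()); they are not Kudu's
implementation, and the Shutdown() behavior is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Simplified stand-in for a tablet; not Kudu's class.
struct Tablet {
  bool stopped = false;
  bool HasBeenStopped() const { return stopped; }
};

// Simplified stand-in for a replica that owns its tablet.
class Replica {
 public:
  Replica() : tablet_(std::make_shared<Tablet>()) {}

  // Raw pointer: dangles if Shutdown() releases the last reference.
  Tablet* tablet() const {
    std::lock_guard<std::mutex> l(lock_);
    return tablet_.get();
  }

  // Shared pointer: the caller's copy keeps the Tablet alive.
  std::shared_ptr<Tablet> shared_tablet() const {
    std::lock_guard<std::mutex> l(lock_);
    return tablet_;
  }

  // Assumed behavior: stop the tablet, then drop the replica's reference.
  void Shutdown() {
    std::lock_guard<std::mutex> l(lock_);
    if (tablet_) tablet_->stopped = true;
    tablet_.reset();
  }

 private:
  mutable std::mutex lock_;
  std::shared_ptr<Tablet> tablet_;
};
```

A caller that took shared_tablet() before Shutdown() can still call
HasBeenStopped() safely afterwards; a caller holding only the raw pointer
from tablet() would be reading freed memory once the replica drops its
reference.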


http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1368
PS11, Line 1368: (!tablet || tablet->HasBeenStopped())
hmm. this doesn't really make sense to me. Don't we want to shut down the replica regardless of this stuff?




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 11
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:39:20 +0000
Gerrit-HasComments: Yes