You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2017/07/01 00:35:19 UTC

[kudu-CR] WIP disk failure: coordinate disk failure handling

Adar Dembo has posted comments on this change.

Change subject: WIP disk failure: coordinate disk failure handling
......................................................................


Patch Set 9:

(23 comments)

I focused on the block manager and some MM ops. You should definitely get a review from Todd/David/Mike on the transaction changes, the tablet read-path changes, and the consensus changes.

http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 141:   virtual void HandleError(const Status& s, DataDir* dir) const = 0;
Why does HandleError() need to be part of the external API to WritableBlock and ReadableBlock? IIUC the macros correctly, it should be sufficient to expose them in LogReadableBlock/LogWritableBlock as non-virtual methods.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 45:   RETURN_NOT_OK(s_); \
Wrong patch, but shouldn't this be just 'return s'? We've already established that s_ isn't OK.


Line 56:   RETURN_NOT_OK(s_); \
Same.


Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \
Since this is so similar to RETURN_NOT_OK_HANDLE would prefer if the macro names were a little more similar. Maybe RETURN_NOT_OK_HANDLE_DISK_FAILURE and just HANDLE_DISK_FAILURE?


Line 64:       (err_handler); \
Nit: indentation


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 370:     if (PREDICT_FALSE(IsDiskFailure(sync))) {
Shouldn't these be HANDLE_DISK_FAILURE calls instead?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 399:   void HandleError(const Status& s, DataDir* dir) const;
This isn't an accessor; bring it up into the other section and doc it.


Line 402:   friend class LogWritableBlock;
Don't need? HandleError() is public after all.


Line 498: Status LogBlockContainer::Create(LogBlockManager* block_manager,
What about disk failures during container creation?


Line 604:   // Open the existing metadata and data files for writing.
What about failures here?


Line 648:   RETURN_NOT_OK(block_manager()->env()->NewRandomAccessFile(metadata_path, &metadata_reader));
What about failures here?


Line 656:     read_status = pb_reader.ReadNextPB(&record);
And here?


Line 858:   return metadata_file_->Append(pb);
Failures here?


Line 881:     return metadata_file_->Sync();
What about this one?


Line 886: Status LogBlockContainer::ReopenMetadataWriter() {
And failures here?


Line 1090:   DataDir* mutable_data_dir() {
Where is this used?


Line 1305:   DataDir* mutable_data_dir() {
And this?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/delta_tracker.cc
File src/kudu/tablet/delta_tracker.cc:

Line 696:   Status s = FlushDMS(old_dms.get(), &dfr, flush_type);
Shouldn't we just RETURN_NOT_OK here and drop the CHECK? Isn't the expectation that the MM op will figure out whether the failure is fatal or not? Why do we also have to consider it here?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/metadata.proto
File src/kudu/tablet/metadata.proto:

Line 160:   // The Tablet has been stopped after a failure. In this state, like FAILED,
Why do we need a new state for this? Why isn't FAILED sufficient?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/tablet_replica_mm_ops.cc
File src/kudu/tablet/tablet_replica_mm_ops.cc:

Line 138:     tablet_replica_->tablet()->rowsets_flush_sem_.unlock();
Can you use an RAII-style guard for the acquisition forof rowsets_flush_sem_?


Line 196:                             max_idx_to_replay_size);
Nit: indentation.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tserver/tserver.proto
File src/kudu/tserver/tserver.proto:

Line 97:     // The tablet is in a failed state.
TABLET_NOT_RUNNING is insufficiently descriptive?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/util/status.h
File src/kudu/util/status.h:

Line 433: inline bool IsDiskFailure(const Status& s) {
Would be nice to add a comment to each explaining why it's included.


-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes