You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/05/31 18:19:15 UTC

[kudu-CR] disk failure: basic test coverage for disk failure

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7031

Change subject: disk failure: basic test coverage for disk failure
......................................................................

disk failure: basic test coverage for disk failure

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets.

With improper disk-failure-handling, this scenario alone would have been
enough to leave the server with only a single copy of data, as the two
servers with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvagable,
and data would be replicated on the remaining disks.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/multidir-cluster-itest.cc
M src/kudu/integration-tests/ts_itest-base.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
3 files changed, 137 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] disk failure: basic test coverage for disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#2).

Change subject: disk failure: basic test coverage for disk failure
......................................................................

disk failure: basic test coverage for disk failure

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets.

With improper disk-failure-handling, this scenario alone would have been
enough to leave the server with only a single copy of data, as the two
servers with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvagable,
and data would be replicated on the remaining disks.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/integration-tests/ts_itest-base.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
4 files changed, 235 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: basic test coverage for disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#3).

Change subject: disk failure: basic test coverage for disk failure
......................................................................

disk failure: basic test coverage for disk failure

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets.

With improper disk-failure-handling, this scenario alone would have been
enough to leave the server with only a single copy of data, as the two
servers with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvagable,
and data would be replicated on the remaining disks.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/integration-tests/ts_itest-base.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
4 files changed, 242 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: test coverage for disk failure recovery

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.

Adar Dembo has posted comments on this change.

Change subject: disk failure: test coverage for disk failure recovery
......................................................................


Patch Set 7:

(16 comments)

I didn't look at the test failures. Looks good, though I'd recommend you go over each "compound" test function (i.e. a function that does a ton of work) and add NO_FATALS and/or ASSERT_OK as needed.

http://gerrit.cloudera.org:8080/#/c/7031/7/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS7, Line 51: uint32_t
Simple int would be fine here, I think.


Line 67:   if (FLAGS_block_manager == "file") {
Would it be useful to parameterize these tests and run them on both block managers?


PS7, Line 68: **
Isn't this middle entry also '??'? I thought block paths were data/ab/cd/abcdefghijklmnop.


Line 91:     vector<TServerDetails*> tservers;
Maybe omit this and iterate on the tablet_servers_ map directly?


Line 150:           if (counts_after - counts_before > 0) {
Would if counts_after > counts_before be simpler?


Line 176:     workload->StopAndJoin();
IIUC, we're writing just enough to figure out who is writing where, and then we stop?


Line 184:       ext_tserver->GetInt64Metric(&METRIC_ENTITY_server, nullptr, &METRIC_data_dirs_failed,
Need ASSERT_OK here too?


Line 200: 	  ASSERT_OK(row->SetInt32(0, i + start_row));
Got some tabs here.


Line 204: 	  ignore_result(session->Flush());
Flushing one row at a time? You sure you don't want to flush the whole thing together?


Line 229:   vector<ExternalTabletServer*> ext_tservers = SetupDefaultCluster();
Need NO_FATALS here?


Line 233:   workload_a.set_write_pattern(TestWorkload::WritePattern::INSERT_SEQUENTIAL_ROWS);
Why is sequential important?


PS7, Line 238: this avoid
this to avoid


Line 243:   GetDataDirsWrittenToByWorkload(ext_tservers, &workload_a, 3, &dirs_with_A_data,
Need NO_FATALS on these calls, right?


PS7, Line 256: MonoDelta::FromSeconds(120)
Maybe define this delta once and reuse it?


Line 259:   SetServerSurvivalFlags(ext_tservers);
NO_FATALS here.


Line 271:   WaitForDiskFailures(ext_tservers[0]);
NO_FATALS on these too.


-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, David Ribeiro Alves, Kudu Jenkins, Adar Dembo, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#13).

Change subject: WIP disk failure: tests for disk failure recovery
......................................................................

WIP disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state; another is added
to test disk failures during Scans.  All of these tests are
parameterized to run with both the LBM and FBM.

WIP: until other patches are merged, I'm expecting this to fail or be
extremely flaky.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/disk_failure-itest.cc
1 file changed, 351 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/13
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-Change-Number: 7031
Gerrit-PatchSet: 13
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#11).

Change subject: disk failure: tests for disk failure recovery
......................................................................

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state. All of these tests
are parameterized to run with both the LBM and FBM.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 407 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: tests for disk failure recovery

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.

Adar Dembo has posted comments on this change.

Change subject: disk failure: tests for disk failure recovery
......................................................................


Patch Set 10:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7031/7/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS7, Line 68: 
> Done
Not done? Also apply this change to your other testing change.


Line 233: // servers with EIOs would have been shut down entirely.
> Applying sequentially means there are only MRS flushes.
OK, could you add a comment explaining that?


http://gerrit.cloudera.org:8080/#/c/7031/10/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS10, Line 58: const vector<string> kDefaultFlags = {
             :   // Flush frequently to trigger writes.
             :   "--flush_threshold_mb=1",
             :   "--flush_threshold_secs=1",
             : 
             :   // Ensure a tablet will only store data on a single disk.
             :   "--fs_target_data_dirs_per_tablet=1",
             : };
Nit: move these to GetDefaultFlags() (i.e. closer to where they're actually used).


Line 73:   void SetupDefaultTables() {
It's creating just one table though. So SetupDefaultTable() maybe?


Line 93:     if (GetParam() == FBM) {
How about place the string representation of the gflag directly in the enum, then you won't need this conversion.


Line 107:     CHECK_EQ(3, tablet_servers_.size());
Nit: compare with FLAGS_num_tablet_servers, so that if we want to update that value, there are two places in the code to update instead of one.


Line 332:   SetServerSurvivalFlags(ext_tservers);
NO_FATALS.


Line 350:   write_workload.set_write_pattern(TestWorkload::WritePattern::INSERT_SEQUENTIAL_ROWS);
Add comment justifying SEQUENTIAL.


-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: test coverage for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#7).

Change subject: disk failure: test coverage for disk failure recovery
......................................................................

disk failure: test coverage for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and scans,
ensuring the servers return to a normal state.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 363 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has abandoned this change. ( http://gerrit.cloudera.org:8080/7031 )

Change subject: WIP disk failure: tests for disk failure recovery
......................................................................


Abandoned

Other tests subsumed this sort of testing for disk failures
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-Change-Number: 7031
Gerrit-PatchSet: 14
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change.

Change subject: disk failure: tests for disk failure recovery
......................................................................


Patch Set 7:

(34 comments)

http://gerrit.cloudera.org:8080/#/c/7031/3//COMMIT_MSG
Commit Message:

PS3, Line 16: exercises 
> typo
Done


http://gerrit.cloudera.org:8080/#/c/7031/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 26: #include "kudu/gutil/map-util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


Line 38: 
> warning: using decl 'map' is unused [misc-unused-using-decls]
Done


PS3, Line 61: 
> nit: indent
These have been removed to use TestWorkload instead


PS3, Line 62: 
> nit: using...
same


Line 81:       NO_FATALS(CreateTable(table_id));
> warning: redundant 'test_info_' declaration [readability-redundant-declarat
seems harmless?


PS3, Line 82:       WaitForTSAndReplicas(table_id);
            :     }
            :   }
            : 
            :   // Sets up a cluster with three servers with two disks each.
            :   vector<ExternalTabletServer*> SetupDefaultCluster() {
            :     FLAGS_num_tablet_servers = 3;
            :     CreateCluster("survivable_cluster", kDefaultFlags, {}, /* num_data_dirs */ 2);
            :     CreateClient(&client_);
            :     vector<TServerDetails*> tservers;
            :     AppendValuesFromMap(tablet_servers_, &tservers);
            :     CHECK_EQ(3, tservers.size());
            :     vector<ExternalTabletServer*> ext_tservers;
            :     for (auto* details : tservers) {
            :       ext_tservers.push_back(cluster_->tablet_server_by_uuid(details->uuid()));
            :     }
            :     return ext_tservers;
> nit move this comment to before the test declaration, or at least give an i
Done


PS3, Line 150:   if (counts_after - counts_before > 0) {
             :             dirs_written->push_back(e.first);
             :           } else {
             :             dirs_not_written->push_back(e.first);
             :           }
             :         }
             :       }
> this is a bit funky (depending on the number of files) any way we can be mo
Done. I've changed it to count the files before and compare it with the counts after running whatever function.


PS3, Line 214: d on tw
> tablet servers
Done


http://gerrit.cloudera.org:8080/#/c/7031/6/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 28: #include "kudu/integration-tests/test_workload.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


Line 66: string GlobForBlockFileInDataDir(const string& data_dir) {
> warning: invalid case style for global constant 'ts_flags' [readability-ide
Done


Line 101:   void SetServerSurvivalFlags(vector<ExternalTabletServer*>& ext_tservers) {
> warning: non-const reference parameter 'ext_tservers', make it const or use
Done


Line 126:                                       const std::function<void(void)>& f,
> warning: the parameter 'f' is copied for each invocation but only used as a
Done


Line 165:   void GetDataDirsWrittenToByWorkload(const vector<ExternalTabletServer*> ext_tservers,
> warning: the const qualified parameter 'ext_tservers' is copied for each in
Done


Line 166:                                       TestWorkload* workload,
> warning: non-const reference parameter 'workload', make it const or use a p
Done


http://gerrit.cloudera.org:8080/#/c/7031/7/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 30: #include "kudu/util/path_util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


PS7, Line 51: uint32_t
> Simple int would be fine here, I think.
Done


Line 67:   if (FLAGS_block_manager == "file") {
> Would it be useful to parameterize these tests and run them on both block m
Done


PS7, Line 68: **
> Isn't this middle entry also '??'? I thought block paths were data/ab/cd/ab
Done


Line 91:     vector<TServerDetails*> tservers;
> Maybe omit this and iterate on the tablet_servers_ map directly?
Done


Line 101:   void SetServerSurvivalFlags(vector<ExternalTabletServer*>& ext_tservers) {
> warning: non-const reference parameter 'ext_tservers', make it const or use
Done


Line 150:           if (counts_after - counts_before > 0) {
> Would if counts_after > counts_before be simpler?
Done


Line 165:   void GetDataDirsWrittenToByWorkload(const vector<ExternalTabletServer*> ext_tservers,
> warning: the const qualified parameter 'ext_tservers' is copied for each in
Done


Line 176:     workload->StopAndJoin();
> IIUC, we're writing just enough to figure out who is writing where, and the
Yes, exactly


Line 184:       ext_tserver->GetInt64Metric(&METRIC_ENTITY_server, nullptr, &METRIC_data_dirs_failed,
> Need ASSERT_OK here too?
Done


Line 200: 	  ASSERT_OK(row->SetInt32(0, i + start_row));
> Got some tabs here.
Done


Line 204: 	  ignore_result(session->Flush());
> Flushing one row at a time? You sure you don't want to flush the whole thin
Done


Line 229:   vector<ExternalTabletServer*> ext_tservers = SetupDefaultCluster();
> Need NO_FATALS here?
Done in SetupDefaultCluster()


Line 233:   workload_a.set_write_pattern(TestWorkload::WritePattern::INSERT_SEQUENTIAL_ROWS);
> Why is sequential important?
Applying sequentially means there are only MRS flushes.


PS7, Line 238: this avoid
> this to avoid
Done


Line 243:   GetDataDirsWrittenToByWorkload(ext_tservers, &workload_a, 3, &dirs_with_A_data,
> Need NO_FATALS on these calls, right?
Done


PS7, Line 256: MonoDelta::FromSeconds(120)
> Maybe define this delta once and reuse it?
Done


Line 259:   SetServerSurvivalFlags(ext_tservers);
> NO_FATALS here.
Done


Line 271:   WaitForDiskFailures(ext_tservers[0]);
> NO_FATALS on these too.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] disk failure: test coverage for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#6).

Change subject: disk failure: test coverage for disk failure recovery
......................................................................

disk failure: test coverage for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and scans,
ensuring the servers return to a normal state.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 363 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#12).

Change subject: disk failure: tests for disk failure recovery
......................................................................

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state. All of these tests
are parameterized to run with both the LBM and FBM.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 407 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/12
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#8).

Change subject: disk failure: tests for disk failure recovery
......................................................................

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and scans,
ensuring the servers return to a normal state.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 361 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#10).

Change subject: disk failure: tests for disk failure recovery
......................................................................

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state. All of these tests
are parameterized to run with both the LBM and FBM.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 382 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: basic test coverage for disk failure

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.

David Ribeiro Alves has posted comments on this change.

Change subject: disk failure: basic test coverage for disk failure
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7031/3//COMMIT_MSG
Commit Message:

PS3, Line 16: salvagable
typo


http://gerrit.cloudera.org:8080/#/c/7031/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS3, Line 61: int num_rows, int payload_size) {
nit: indent


PS3, Line 62: client::sp::shared_ptr
nit: using...


PS3, Line 82:   // This test is set up so each tablet occupies its own disk on each tserver.
            :   // The '|' represents a separation of disks.
            :   //    ts-0      ts-1      ts-2
            :   // [ a | b ] [ b | a ] [ a | b ]
            :   //
            :   // EIOs are triggered on two disks.
            :   //    ts-0      ts-1      ts-2
            :   // [ X | b ] [ X | a ] [ a | b ]
            :   //
            :   // With improper disk-failure handling, this scenario alone would have been
            :   // enough to leave the server with only a single copy of data, as the two
            :   // servers with EIOs would have been shut down entirely.
            :   //
            :   // With proper disk-failure handling, this scenario would be salvagable, and
            :   // data would be replicated on the remaining disks.
            :   //    ts-0       ts-1      ts-2
            :   // [ X | ba ] [ X | ab ] [ a | b ]
nit move this comment to before the test declaration, or at least give an intro there


PS3, Line 150: if (files.size() == 1) {
             :           // "Empty" directories will only have the instance file.
             :           path_without_data = data_path;
             :         } else {
             :           path_with_data = data_path;
             :           num_servers_with_data++;
             :         }
this is a bit funky (depending on the number of files) any way we can be more precise? also does this work for both block managers?


PS3, Line 214: tablets
tablet servers


-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, David Ribeiro Alves, Kudu Jenkins, Adar Dembo, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#14).

Change subject: WIP disk failure: tests for disk failure recovery
......................................................................

WIP disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state; another is added
to test disk failures during Scans.  All of these tests are
parameterized to run with both the LBM and FBM.

WIP: until other patches are merged, I'm expecting this to fail or be
extremely flaky.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/disk_failure-itest.cc
1 file changed, 351 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/14
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-Change-Number: 7031
Gerrit-PatchSet: 14
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: tests for disk failure recovery

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#9).

Change subject: disk failure: tests for disk failure recovery
......................................................................

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state. All of these tests
are parameterized to run with both the LBM and FBM.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 380 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] disk failure: basic test coverage for disk failure

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#4).

Change subject: disk failure: basic test coverage for disk failure
......................................................................

disk failure: basic test coverage for disk failure

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and scans.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 361 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>