You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org> on 2019/11/05 22:28:39 UTC

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Bankim Bhavsar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/14642


Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................

[master] KUDU-2904 Crash master on disk error

Register disk failure and cfile corruption FS error handlers in the master.
Since system catalog is the only tablet on the master server
crash the master on hitting these failures.

Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
---
M src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/master/master.cc
M src/kudu/master/master.h
3 files changed, 117 insertions(+), 37 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/14642/1
-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................


Patch Set 3: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 3
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 05 Nov 2019 23:26:01 +0000
Gerrit-HasComments: No

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Andrew Wong, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14642

to look at the new patch set (#2).

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................

[master] KUDU-2904 Crash master on disk error

Register disk failure and cfile corruption FS error handlers in the master.
Since system catalog is the only tablet on the master server
crash the master on hitting these failures.

Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
---
M src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/master/master.cc
M src/kudu/master/master.h
3 files changed, 117 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/14642/2
-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 2
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14642/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14642/3/src/kudu/integration-tests/disk_failure-itest.cc@363
PS3, Line 363:   ASSERT_OK(leader_master->WaitForFatal(MonoDelta::FromSeconds(4)));
> This timeout value seems oddly specific, and potentially too low in a TSAN 
I'm unable to get TSAN builds working; both on my local Ubuntu 18.04 machine and ve0518 CentOS 6.6 machine.

To be safe bumped up wait time to 20 secs.


http://gerrit.cloudera.org:8080/#/c/14642/3/src/kudu/master/master.cc
File src/kudu/master/master.cc:

http://gerrit.cloudera.org:8080/#/c/14642/3/src/kudu/master/master.cc@302
PS3, Line 302: std::
> Nit: remove prefix
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 3
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Nov 2019 19:20:02 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14642/1/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14642/1/src/kudu/integration-tests/disk_failure-itest.cc@328
PS1, Line 328:   constexpr int kNumReplicas = 3;
> nit: this can actually be 1, right? We don't really care about actually cre
Done


http://gerrit.cloudera.org:8080/#/c/14642/1/src/kudu/integration-tests/disk_failure-itest.cc@342
PS1, Line 342:  std::unique_ptr<client::KuduTableCreator> table_creator(client_->NewTableCreator());
             :   auto client_schema = client::KuduSchema::FromSchema(GetSimpleTestSchema());
             :   for (int table_suffix = 0; table_suffix < 10; table_suffix++) {
             :     string table_name = Substitute("test-$0", table_suffix);
             :     LOG(INFO) << "Creating table " << table_name;
             :     ASSERT_OK(table_creator->table_name(table_name)
             :                   .schema(&client_schema)
             :                   .set_range_partition_columns({ "key" })
             :                   .num_replicas(kNumReplicas)
             :                   .wait(true)
             :                   .Create());
> Should we inject some sleeps in here so we're guaranteed to flush between t
That's a good point. I ran with dist-test 1000 times and observed 2 failures.



-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 2
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 05 Nov 2019 23:03:50 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Nov 2019 21:28:30 +0000
Gerrit-HasComments: No

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Andrew Wong, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14642

to look at the new patch set (#4).

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................

[master] KUDU-2904 Crash master on disk error

Register disk failure and cfile corruption FS error handlers in the master.
Since system catalog is the only tablet on the master server
crash the master on hitting these failures.

Tests:
- Ran the newly added test 1000 times using dist-test on debug and
  release builds and all passed.

Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
---
M src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/master/master.cc
M src/kudu/master/master.h
3 files changed, 118 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/14642/4
-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................


Patch Set 1:

(2 comments)

Test looks good overall! I'd try looping the new test in release mode to see about potential flakiness.

http://gerrit.cloudera.org:8080/#/c/14642/1/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14642/1/src/kudu/integration-tests/disk_failure-itest.cc@328
PS1, Line 328:   constexpr int kNumReplicas = 3;
nit: this can actually be 1, right? We don't really care about actually creating tablet replicas, so the less state we create the better.


http://gerrit.cloudera.org:8080/#/c/14642/1/src/kudu/integration-tests/disk_failure-itest.cc@342
PS1, Line 342:  for (int table_suffix = 0; table_suffix < 10; table_suffix++) {
             :     string table_name = Substitute("test-$0", table_suffix);
             :     LOG(INFO) << "Creating table " << table_name;
             :     std::unique_ptr<client::KuduTableCreator> table_creator(client_->NewTableCreator());
             :     auto client_schema = client::KuduSchema::FromSchema(GetSimpleTestSchema());
             :     ASSERT_OK(table_creator->table_name(table_name)
             :                   .schema(&client_schema)
             :                   .set_range_partition_columns({ "key" })
             :                   .num_replicas(kNumReplicas)
             :                   .wait(true)
             :                   .Create());
Should we inject some sleeps in here so we're guaranteed to flush between table creations? I can imagine if the tables are created quickly and we might wind up with a single MRS, and then no compactions are possible.



-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 05 Nov 2019 22:59:51 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................


Patch Set 4:

> Patch Set 4:
> 
> Will wait to push in case you find anything in your TSAN looping

I was able to build tsan successfully and verified with dist-test, no failures reported.


-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Nov 2019 22:05:15 +0000
Gerrit-HasComments: No

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................


Patch Set 3: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14642/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14642/3/src/kudu/integration-tests/disk_failure-itest.cc@363
PS3, Line 363:   ASSERT_OK(leader_master->WaitForFatal(MonoDelta::FromSeconds(4)));
This timeout value seems oddly specific, and potentially too low in a TSAN environment. Have you tried looping the test in TSAN mode, possibly with some additional stress threads (--stress_cpu_threads)?

I'm guessing we'll want to raise it to something like 20 or 30.


http://gerrit.cloudera.org:8080/#/c/14642/3/src/kudu/master/master.cc
File src/kudu/master/master.cc:

http://gerrit.cloudera.org:8080/#/c/14642/3/src/kudu/master/master.cc@302
PS3, Line 302: std::
Nit: remove prefix



-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 3
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Nov 2019 08:15:29 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................

[master] KUDU-2904 Crash master on disk error

Register disk failure and cfile corruption FS error handlers in the master.
Since system catalog is the only tablet on the master server
crash the master on hitting these failures.

Tests:
- Ran the newly added test 1000 times using dist-test on debug and
  release builds and all passed.

Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Reviewed-on: http://gerrit.cloudera.org:8080/14642
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
M src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/master/master.cc
M src/kudu/master/master.h
3 files changed, 118 insertions(+), 37 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Andrew Wong: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Andrew Wong, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14642

to look at the new patch set (#3).

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................

[master] KUDU-2904 Crash master on disk error

Register disk failure and cfile corruption FS error handlers in the master.
Since system catalog is the only tablet on the master server
crash the master on hitting these failures.

Tests:
- Ran the newly added test 1000 times using dist-test and all passed.

Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
---
M src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/master/master.cc
M src/kudu/master/master.h
3 files changed, 118 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/14642/3
-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 3
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [master] KUDU-2904 Crash master on disk error

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14642 )

Change subject: [master] KUDU-2904 Crash master on disk error
......................................................................


Patch Set 4:

Will wait to push in case you find anything in your TSAN looping


-- 
To view, visit http://gerrit.cloudera.org:8080/14642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I693eb7092c0b5feb530fb011937e636b40534495
Gerrit-Change-Number: 14642
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Nov 2019 21:29:04 +0000
Gerrit-HasComments: No