You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2019/01/23 01:17:24 UTC

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Hello Mike Percy, helifu,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/12254

to review the following change.


Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................

KUDU-2665: deflake block_manager-stress-test

After commit 0c501979b was merged, this test became really flaky (like 50%
flaky in some environments). I think it's due to the new nature of the log
block manager, which may now delete dead containers in the background.

Specifically, if two transactions delete the last blocks from a full
container, it's possible for one to get scheduled for an (asynchronous) hole
punch, and for the other to set the container as dead. Later, when the
hole punch runs, the container's last ref will be dropped, causing the dead
container to be deleted.

While perhaps surprising, this new behavior is desirable, and it's now
incorrect to assume that a cessation in user threads implies an end to LBM
activity. block_manager-stress-test makes this assumption by using the
LBMCorruptor to inject inconsistencies after test threads have been joined.
To fix, we must explicitly quiesce the LBM; destroying it will do the trick.

What is surprising is that, for the life of me, I can't reproduce the
failure. Not locally, not on a CentOS 6.6 machine, not looped in dist-test
with stress threads, not ever. I even tried adding some "creative" sleep
calls in a few places to tickle the race, to no avail.

Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
---
M src/kudu/fs/block_manager-stress-test.cc
1 file changed, 5 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/12254/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "helifu (Code Review)" <ge...@cloudera.org>.
helifu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................


Patch Set 1:

my preference 2) is not safe because we can't be sure when the threads in 'dd_manager_' will be over. ~~><~~


-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 03:33:28 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................


Patch Set 1: Code-Review+2

Given that block_manager-stress-test is failing so much according to the flaky test dashboard, I'm going to treat He Lifu's +1 as a +2 and merge this.


-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 07:00:34 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................


Patch Set 1:

> Reread the ThreadPool code, it seems that 'dd_manager_->WaitOnClosures()' in the destructor of LBM ensures that all tasks can be completed.

Exactly; that's why destroying the LBM has the effect of waiting for any outstanding dead container deletion to finish.


-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 06:57:37 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "helifu (Code Review)" <ge...@cloudera.org>.
helifu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................


Patch Set 1: Code-Review+1

Reread the ThreadPool code, it seems that 'dd_manager_->WaitOnClosures()' in the destructor of LBM ensures that all tasks can be completed.


-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 06:39:32 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................


Patch Set 1: Verified+1

Overriding Jenkins; the Python3 build breakages are unrelated.


-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 06:57:57 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................

KUDU-2665: deflake block_manager-stress-test

After commit 0c501979b was merged, this test became really flaky (like 50%
flaky in some environments). I think it's due to the new nature of the log
block manager, which may now delete dead containers in the background.

Specifically, if two transactions delete the last blocks from a full
container, it's possible for one to get scheduled for an (asynchronous) hole
punch, and for the other to set the container as dead. Later, when the
hole punch runs, the container's last ref will be dropped, causing the dead
container to be deleted.

While perhaps surprising, this new behavior is desirable, and it's now
incorrect to assume that a cessation in user threads implies an end to LBM
activity. block_manager-stress-test makes this assumption by using the
LBMCorruptor to inject inconsistencies after test threads have been joined.
To fix, we must explicitly quiesce the LBM; destroying it will do the trick.

What is surprising is that, for the life of me, I can't reproduce the
failure. Not locally, not on a CentOS 6.6 machine, not looped in dist-test
with stress threads, not ever. I even tried adding some "creative" sleep
calls in a few places to tickle the race, to no avail.

Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Reviewed-on: http://gerrit.cloudera.org:8080/12254
Reviewed-by: helifu <hz...@corp.netease.com>
Tested-by: Adar Dembo <ad...@cloudera.com>
Reviewed-by: Adar Dembo <ad...@cloudera.com>
---
M src/kudu/fs/block_manager-stress-test.cc
1 file changed, 5 insertions(+), 0 deletions(-)

Approvals:
  helifu: Looks good to me, but someone else must approve
  Adar Dembo: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "helifu (Code Review)" <ge...@cloudera.org>.
helifu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12254/1/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/12254/1/src/kudu/fs/block_manager-stress-test.cc@550
PS1, Line 550:  // Quiesce the block manager before injecting inconsistencies so that the two
             :   // don't interfere with one another.
             :   this->bm_.reset();
I have just read through the code of 'block_manager-stress-test'. I agree with adar's analysis that it's now incorrect to assume that a cessation in user threads implies an end to LBM
activity. Please take a look at the code of log_block_manager-test-util.cc:Line70/Line95, it is not safe while the threads in 'dd_manager_' are deleting dead container via punch hole and drop container's last ref. And i am sure Percy's issue is from Line95. Maybe we can have two ways: 1)destroy the 'dd_manager_' before injecting inconsistencies, 2)check data file existence at Line95.
Sorry for leaving this issue:(



-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 03:11:47 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2665: deflake block manager-stress-test

Posted by "helifu (Code Review)" <ge...@cloudera.org>.
helifu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12254 )

Change subject: KUDU-2665: deflake block_manager-stress-test
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12254/1/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/12254/1/src/kudu/fs/block_manager-stress-test.cc@550
PS1, Line 550:  // Quiesce the block manager before injecting inconsistencies so that the two
             :   // don't interfere with one another.
             :   this->bm_.reset();
> I have just read through the code of 'block_manager-stress-test'. I agree w
BTW, i prefer 2), and let 'dd_manager_' continue to run.



-- 
To view, visit http://gerrit.cloudera.org:8080/12254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0be328f740056cd6b64c9881759225c8b961a935
Gerrit-Change-Number: 12254
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: helifu <hz...@corp.netease.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 03:20:32 +0000
Gerrit-HasComments: Yes