You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2018/02/23 19:23:33 UTC

[kudu-CR] KUDU-2325. Fix CHECK failure in RevokeSigData()

Hello Mike Percy,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/9430

to review the following change.


Change subject: KUDU-2325. Fix CHECK failure in RevokeSigData()
......................................................................

KUDU-2325. Fix CHECK failure in RevokeSigData()

This fixes a race in the stack trace signaling mechanism:

Target thread in HandleStackTraceSignal():
-----------------------------------------
- stack->Collect()
- queued_to_tid = kNotInUse
  <context switched out>

                Tracer thread in RevokeSigData:
                -------------------------------
                - see that the flag is not complete
                - exchange out queued_to_tid
                - assertions checked that queued_to_tid was either
                  equal to the target tid or kDumpInProgress (because
                  it was "not complete yet" it didn't expect
                  kNotInUse)

- mark complete

This was fairly rare since RevokeSigData had to run in between
the setting of 'queued_to_tid = kNotInUse' and the signalling
of the completion flag. A new test added in this commit managed
to trigger it pretty reliably by measuring the usual time taken for
stack trace collection and setting timeouts to very close to that value.

The fix is to simplify the state machine a bit so that the target thread
doesn't ever transition back to 'kNotInUse'. I added a more detailed
diagram of the state machine to clarify this.

Change-Id: I419ffa462821c1e6d8f9f158f028b86d60f4fd54
---
M src/kudu/util/debug-util-test.cc
M src/kudu/util/debug-util.cc
2 files changed, 111 insertions(+), 38 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/9430/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9430
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I419ffa462821c1e6d8f9f158f028b86d60f4fd54
Gerrit-Change-Number: 9430
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2325. Fix CHECK failure in RevokeSigData()

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9430 )

Change subject: KUDU-2325. Fix CHECK failure in RevokeSigData()
......................................................................

KUDU-2325. Fix CHECK failure in RevokeSigData()

This fixes a race in the stack trace signaling mechanism:

Target thread in HandleStackTraceSignal():
-----------------------------------------
- stack->Collect()
- queued_to_tid = kNotInUse
  <context switched out>

                Tracer thread in RevokeSigData:
                -------------------------------
                - see that the flag is not complete
                - exchange out queued_to_tid
                - assertions checked that queued_to_tid was either
                  equal to the target tid or kDumpInProgress (because
                  it was "not complete yet" it didn't expect
                  kNotInUse)

- mark complete

This was fairly rare since RevokeSigData had to run in between
the setting of 'queued_to_tid = kNotInUse' and the signalling
of the completion flag. A new test added in this commit managed
to trigger it pretty reliably by measuring the usual time taken for
stack trace collection and setting timeouts to very close to that value.

The fix is to simplify the state machine a bit so that the target thread
doesn't ever transition back to 'kNotInUse'. I added a more detailed
diagram of the state machine to clarify this.

Change-Id: I419ffa462821c1e6d8f9f158f028b86d60f4fd54
Reviewed-on: http://gerrit.cloudera.org:8080/9430
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/util/debug-util-test.cc
M src/kudu/util/debug-util.cc
2 files changed, 112 insertions(+), 38 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Todd Lipcon: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/9430
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I419ffa462821c1e6d8f9f158f028b86d60f4fd54
Gerrit-Change-Number: 9430
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2325. Fix CHECK failure in RevokeSigData()

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9430 )

Change subject: KUDU-2325. Fix CHECK failure in RevokeSigData()
......................................................................


Patch Set 1: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/9430
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I419ffa462821c1e6d8f9f158f028b86d60f4fd54
Gerrit-Change-Number: 9430
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sun, 25 Feb 2018 19:43:35 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2325. Fix CHECK failure in RevokeSigData()

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9430 )

Change subject: KUDU-2325. Fix CHECK failure in RevokeSigData()
......................................................................


Patch Set 2: Code-Review+2

Trivial rebase


-- 
To view, visit http://gerrit.cloudera.org:8080/9430
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I419ffa462821c1e6d8f9f158f028b86d60f4fd54
Gerrit-Change-Number: 9430
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 26 Feb 2018 07:11:55 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2325. Fix CHECK failure in RevokeSigData()

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Alexey Serbin, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9430

to look at the new patch set (#2).

Change subject: KUDU-2325. Fix CHECK failure in RevokeSigData()
......................................................................

KUDU-2325. Fix CHECK failure in RevokeSigData()

This fixes a race in the stack trace signaling mechanism:

Target thread in HandleStackTraceSignal():
-----------------------------------------
- stack->Collect()
- queued_to_tid = kNotInUse
  <context switched out>

                Tracer thread in RevokeSigData:
                -------------------------------
                - see that the flag is not complete
                - exchange out queued_to_tid
                - assertions checked that queued_to_tid was either
                  equal to the target tid or kDumpInProgress (because
                  it was "not complete yet" it didn't expect
                  kNotInUse)

- mark complete

This was fairly rare since RevokeSigData had to run in between
the setting of 'queued_to_tid = kNotInUse' and the signalling
of the completion flag. A new test added in this commit managed
to trigger it pretty reliably by measuring the usual time taken for
stack trace collection and setting timeouts to very close to that value.

The fix is to simplify the state machine a bit so that the target thread
doesn't ever transition back to 'kNotInUse'. I added a more detailed
diagram of the state machine to clarify this.

Change-Id: I419ffa462821c1e6d8f9f158f028b86d60f4fd54
---
M src/kudu/util/debug-util-test.cc
M src/kudu/util/debug-util.cc
2 files changed, 112 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/9430/2
-- 
To view, visit http://gerrit.cloudera.org:8080/9430
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I419ffa462821c1e6d8f9f158f028b86d60f4fd54
Gerrit-Change-Number: 9430
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>