Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2016/10/05 02:16:38 UTC

[kudu-CR] kernel stack watchdog: avoid blocking threads starting

Hello Adar Dembo,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/4626

to review the following change.

Change subject: kernel_stack_watchdog: avoid blocking threads starting
......................................................................

kernel_stack_watchdog: avoid blocking threads starting

I've noticed recently that threads start particularly slowly in TSAN.
One culprit which seems to exacerbate this issue is the following:

- TSAN defers signal-handling in many cases, which causes the stack
  watchdog to be slow at collecting stacks.
- The stack watchdog was holding a lock while collecting stacks from
  stuck threads.
- This lock blocked other threads from starting, since every new thread
  needs to register itself with the watchdog.

The fix here is to make the synchronization more fine-grained: we only
hold this lock long enough to make a copy of the current map of
registered threads. However, it's still important to prevent these
threads from _exiting_ while we are looking at their TLS. So, this patch
adds a new 'unregister_lock_' which is used to prevent such exits.

Since 'lock_' is now held for only short periods of time, I switched it
out for a spinlock instead of a mutex.
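
As a rough illustration of the locking scheme described above (not the
actual patch), the sketch below snapshots the registration map under a
short-lived lock and holds a separate lock across the slow scan so that
threads cannot unregister in the middle of it. WatchdogSketch, ThreadState,
ScanOnce and the use of std::mutex in place of a spinlock are assumptions
made for this example; only the names 'lock_' and 'unregister_lock_' come
from the commit message.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for the per-thread (TLS) state the watchdog reads.
struct ThreadState {
  int depth = 0;
};

class WatchdogSketch {
 public:
  // Called by every new thread. Takes only the short-lived map lock, so it
  // never waits behind a slow stack collection.
  void Register(int64_t tid, ThreadState* state) {
    std::lock_guard<std::mutex> l(lock_);
    threads_[tid] = state;
  }

  // Called on thread exit. Waits for any in-progress scan to finish, so the
  // scanner never dereferences state that belongs to an exited thread.
  void Unregister(int64_t tid) {
    std::lock_guard<std::mutex> u(unregister_lock_);
    std::lock_guard<std::mutex> l(lock_);
    threads_.erase(tid);
  }

  // One pass of the watchdog loop. Returns the number of threads examined.
  int ScanOnce() {
    // Held for the whole scan: blocks exits, but not registrations.
    std::lock_guard<std::mutex> u(unregister_lock_);
    std::unordered_map<int64_t, ThreadState*> snapshot;
    {
      // Held only long enough to copy the map.
      std::lock_guard<std::mutex> l(lock_);
      snapshot = threads_;
    }
    int examined = 0;
    for (const auto& entry : snapshot) {
      (void)entry.second->depth;  // Placeholder for the slow stack collection.
      ++examined;
    }
    return examined;
  }

 private:
  std::mutex lock_;             // Protects threads_; a spinlock in the patch.
  std::mutex unregister_lock_;  // Prevents exits while a scan is in progress.
  std::unordered_map<int64_t, ThreadState*> threads_;
};
```

Both Unregister() and ScanOnce() acquire 'unregister_lock_' before 'lock_',
and Register() takes only 'lock_', so the sketch has a single lock order.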

No new tests are included, but the watchdog is already covered and runs
as part of nearly every test.

Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
---
M src/kudu/util/kernel_stack_watchdog.cc
M src/kudu/util/kernel_stack_watchdog.h
2 files changed, 51 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/4626/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>

[kudu-CR] kernel stack watchdog: avoid blocking threads starting

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: kernel_stack_watchdog: avoid blocking threads starting
......................................................................


kernel_stack_watchdog: avoid blocking threads starting

I've noticed recently that threads start particularly slowly in TSAN.
One culprit which seems to exacerbate this issue is the following:

- TSAN defers signal-handling in many cases, which causes the stack
  watchdog to be slow at collecting stacks.
- The stack watchdog was holding a lock while collecting stacks from
  stuck threads.
- This lock blocked other threads from starting, since every new thread
  needs to register itself with the watchdog.

The fix here is to make the synchronization more fine-grained: we only
hold this lock long enough to make a copy of the current map of
registered threads. However, it's still important to prevent these
threads from _exiting_ while we are looking at their TLS. So, this patch
adds a new 'unregister_lock_' which is used to prevent such exits.

Since 'lock_' is now held for only short periods of time, I switched it
out for a spinlock instead of a mutex.

Additionally, the lock protecting the log collector was also separated
out.

No new tests are included, but the watchdog is already covered and runs
as part of nearly every test.

Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Reviewed-on: http://gerrit.cloudera.org:8080/4626
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/util/kernel_stack_watchdog.cc
M src/kudu/util/kernel_stack_watchdog.h
2 files changed, 60 insertions(+), 32 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Todd Lipcon: Verified




Gerrit-MessageType: merged
Gerrit-Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] kernel stack watchdog: avoid blocking threads starting

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: kernel_stack_watchdog: avoid blocking threads starting
......................................................................


Patch Set 2: Code-Review+2

(2 comments)

Looks good, though it looks like you still have a broken test.

Between this patch and the other one you submitted to speed up thread creation in TSAN: what techniques did you use to find these root causes?

http://gerrit.cloudera.org:8080/#/c/4626/2//COMMIT_MSG
Commit Message:

PS2, Line 12: TSAN defers signal-handling
Just so I understand, what you mean is that TSAN handles the signal but takes its time before forwarding it to the process? DumpThreadStack() waits up to a second, so I presume you're still seeing stack traces but delayed in the high hundreds of ms or something like that?


PS2, Line 21: However, it's still important to prevent these
            : threads from _exiting_ while we are looking at their TLS
But presumably delaying Thread.Join() has little to no effect on test flakiness the way delaying Thread.Create() may?



Gerrit-MessageType: comment
Gerrit-Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] kernel stack watchdog: avoid blocking threads starting

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4626

to look at the new patch set (#2).

Change subject: kernel_stack_watchdog: avoid blocking threads starting
......................................................................

kernel_stack_watchdog: avoid blocking threads starting

I've noticed recently that threads start particularly slowly in TSAN.
One culprit which seems to exacerbate this issue is the following:

- TSAN defers signal-handling in many cases, which causes the stack
  watchdog to be slow at collecting stacks.
- The stack watchdog was holding a lock while collecting stacks from
  stuck threads.
- This lock blocked other threads from starting, since every new thread
  needs to register itself with the watchdog.

The fix here is to make the synchronization more fine-grained: we only
hold this lock long enough to make a copy of the current map of
registered threads. However, it's still important to prevent these
threads from _exiting_ while we are looking at their TLS. So, this patch
adds a new 'unregister_lock_' which is used to prevent such exits.

Since 'lock_' is now held for only short periods of time, I switched it
out for a spinlock instead of a mutex.

No new tests are included, but the watchdog is already covered and runs
as part of nearly every test.

Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
---
M src/kudu/util/kernel_stack_watchdog.cc
M src/kudu/util/kernel_stack_watchdog.h
2 files changed, 55 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/4626/2

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] kernel stack watchdog: avoid blocking threads starting

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: kernel_stack_watchdog: avoid blocking threads starting
......................................................................


Patch Set 2:

(2 comments)

To find the root causes I was basically just looking at gstacks and adding LOG_IF_SLOW calls in various places, nothing too fancy.

http://gerrit.cloudera.org:8080/#/c/4626/2//COMMIT_MSG
Commit Message:

PS2, Line 12: TSAN defers signal-handling
> Just so I understand, what you mean is that TSAN handles the signal but tak
Yeah, I added some "LOG_IF_SLOW" calls around the Register(TLS) function and found that it was sometimes blocked for 100+ ms, usually at the same time as the watchdog was attempting to dump a stack.


PS2, Line 21: However, it's still important to prevent these
            : threads from _exiting_ while we are looking at their TLS
> But presumably delaying Thread.Join() has little to no effect on test flaki
Yeah, there's a comment in the code to that effect. Thread _exits_ are basically never on a critical path, whereas thread creation often is (e.g. starting threadpool workers).



Gerrit-MessageType: comment
Gerrit-Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] kernel stack watchdog: avoid blocking threads starting

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: kernel_stack_watchdog: avoid blocking threads starting
......................................................................


Patch Set 3: Verified+1

The remaining failures are other known flaky tests.


Gerrit-MessageType: comment
Gerrit-Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] kernel stack watchdog: avoid blocking threads starting

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4626

to look at the new patch set (#3).

Change subject: kernel_stack_watchdog: avoid blocking threads starting
......................................................................

kernel_stack_watchdog: avoid blocking threads starting

I've noticed recently that threads start particularly slowly in TSAN.
One culprit which seems to exacerbate this issue is the following:

- TSAN defers signal-handling in many cases, which causes the stack
  watchdog to be slow at collecting stacks.
- The stack watchdog was holding a lock while collecting stacks from
  stuck threads.
- This lock blocked other threads from starting, since every new thread
  needs to register itself with the watchdog.

The fix here is to make the synchronization more fine-grained: we only
hold this lock long enough to make a copy of the current map of
registered threads. However, it's still important to prevent these
threads from _exiting_ while we are looking at their TLS. So, this patch
adds a new 'unregister_lock_' which is used to prevent such exits.

Since 'lock_' is now held for only short periods of time, I switched it
out for a spinlock instead of a mutex.

Additionally, the lock protecting the log collector was also separated
out.

No new tests are included, but the watchdog is already covered and runs
as part of nearly every test.

Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
---
M src/kudu/util/kernel_stack_watchdog.cc
M src/kudu/util/kernel_stack_watchdog.h
2 files changed, 60 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/4626/3

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] kernel stack watchdog: avoid blocking threads starting

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: kernel_stack_watchdog: avoid blocking threads starting
......................................................................


Patch Set 3: Code-Review+2


Gerrit-MessageType: comment
Gerrit-Change-Id: I7af85ade6ec9050843ec5b146d26c2549c503d8f
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No