You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/05/03 21:46:02 UTC

[kudu-CR] WIP: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

Hello Adar Dembo,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/6791

to review the following change.

Change subject: WIP: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode
......................................................................

WIP: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

KUDU-1992 describes a bug in which the destructor of one threadlocal
object ended up referring to another threadlocal object. Specifically, a
threadlocal cached block destructor ended up incrementing a metric which
was implemented by a LongAdder (inheriting from Striped64). The
LongAdder uses a threadlocal hashcode to avoid contention.

In the crash described in the bug, the striped64's TLS hashcode was
destructed before the other destructor ran, causing a
heap-use-after-free.

This patch should fix the issue by making the threadlocal hashcode be a
POD threadlocal rather than a pointer to a heap-allocated object. Thus
there is no destructor to run, and no chance that the destruction order
can cause this crash.

This fix seemed simpler than trying to ensure some specific ordering of
threadlocal destructors (eg by assigning destructor priorities or adding
some kind of dependency graph).

WIP because I would like to run a dist-test loop to ensure it's really
fixed.

Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
---
M src/kudu/util/striped64.cc
M src/kudu/util/striped64.h
2 files changed, 19 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/6791/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6791
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>

[kudu-CR] WIP: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: WIP: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6791/1//COMMIT_MSG
Commit Message:

PS1, Line 24: This fix seemed simpler than trying to ensure some specific ordering of
            : threadlocal destructors (eg by assigning destructor priorities or adding
            : some kind of dependency graph).
> This is reasonable.
thanks for investigating. I was taking a more empirical approach: we haven't seen any ASAN crashes related to this until this bug came along, so I'm assuming this is the only one :)


http://gerrit.cloudera.org:8080/#/c/6791/1/src/kudu/util/striped64.cc
File src/kudu/util/striped64.cc:

Line 20: #include "kudu/util/striped64.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


PS1, Line 57: // Avoid zero to allow xorShift rehash
> Now we're also avoiding zero because it's a magic value that indicates that
Done


http://gerrit.cloudera.org:8080/#/c/6791/1/src/kudu/util/striped64.h
File src/kudu/util/striped64.h:

Line 115:   void RetryUpdate(int64_t x, Rehash to_rehash);
> warning: function 'kudu::Striped64::RetryUpdate' has a definition with diff
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6791
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: WIP: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6791/1//COMMIT_MSG
Commit Message:

PS1, Line 24: This fix seemed simpler than trying to ensure some specific ordering of
            : threadlocal destructors (eg by assigning destructor priorities or adding
            : some kind of dependency graph).
This is reasonable.

I went looking for other instances of class-scoped static thread locals (i.e. the macros from threadlocal.h). The only one I found was the KernelStackWatchdog's state. AFAICT, the only way it could trigger a similar bug is if an item cached in the ThreadLocalCache invoked SCOPED_WATCH_STACK() when being destructed. At the moment:
1. The only ThreadLocalCache use is for BloomCacheItem, which doesn't do that.
2. I'm hard-pressed to think of a case where such a thing would _ever_ happen. It would mean that the cached item's destructor call graph somehow performed an operation that was expected to block on I/O for some time. I suppose if the call graph included a deref of a shared object whose destructor did IO, such as LBM's LogBlock, but even in that case the IO is thunked to a different thread.


http://gerrit.cloudera.org:8080/#/c/6791/1/src/kudu/util/striped64.cc
File src/kudu/util/striped64.cc:

PS1, Line 57: // Avoid zero to allow xorShift rehash
Now we're also avoiding zero because it's a magic value that indicates that we need to generate the hash code.


-- 
To view, visit http://gerrit.cloudera.org:8080/6791
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6791

to look at the new patch set (#2).

Change subject: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode
......................................................................

KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

KUDU-1992 describes a bug in which the destructor of one threadlocal
object ended up referring to another threadlocal object. Specifically, a
threadlocal cached block destructor ended up incrementing a metric which
was implemented by a LongAdder (inheriting from Striped64). The
LongAdder uses a threadlocal hashcode to avoid contention.

In the crash described in the bug, the striped64's TLS hashcode was
destructed before the other destructor ran, causing a
heap-use-after-free.

This patch should fix the issue by making the threadlocal hashcode be a
POD threadlocal rather than a pointer to a heap-allocated object. Thus
there is no destructor to run, and no chance that the destruction order
can cause this crash.

This fix seemed simpler than trying to ensure some specific ordering of
threadlocal destructors (eg by assigning destructor priorities or adding
some kind of dependency graph).

Before the patch, I was able to reproduce the crash 2/1000 times running
RaftConsensusITest.TestAutoCreateReplica. After the patch, I ran it 5000
times with no failures.

Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
---
M src/kudu/util/striped64.cc
M src/kudu/util/striped64.h
2 files changed, 26 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/6791/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6791
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode
......................................................................


KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

KUDU-1992 describes a bug in which the destructor of one threadlocal
object ended up referring to another threadlocal object. Specifically, a
threadlocal cached block destructor ended up incrementing a metric which
was implemented by a LongAdder (inheriting from Striped64). The
LongAdder uses a threadlocal hashcode to avoid contention.

In the crash described in the bug, the striped64's TLS hashcode was
destructed before the other destructor ran, causing a
heap-use-after-free.

This patch should fix the issue by making the threadlocal hashcode be a
POD threadlocal rather than a pointer to a heap-allocated object. Thus
there is no destructor to run, and no chance that the destruction order
can cause this crash.

This fix seemed simpler than trying to ensure some specific ordering of
threadlocal destructors (eg by assigning destructor priorities or adding
some kind of dependency graph).

Before the patch, I was able to reproduce the crash 2/1000 times running
RaftConsensusITest.TestAutoCreateReplica. After the patch, I ran it 5000
times with no failures.

Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
Reviewed-on: http://gerrit.cloudera.org:8080/6791
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Kudu Jenkins
---
M src/kudu/util/striped64.cc
M src/kudu/util/striped64.h
2 files changed, 26 insertions(+), 28 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6791
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1992. striped64: don't heap-allocate the threadlocal hashcode
......................................................................


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6791
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2d4e5264134c70ced434c3ae8da5866a151464d5
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No