Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2019/01/12 00:51:13 UTC

[kudu-CR] client: avoid accessing state after scheduling retry

Hello Alexey Serbin, Kudu Jenkins, Adar Dembo, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12170

to look at the new patch set (#3).

Change subject: client: avoid accessing state after scheduling retry
......................................................................

client: avoid accessing state after scheduling retry

I saw a failure of MasterMigrationTest.TestEndToEndMigration in which
the stack watchdog logged the following stacktrace:

User stack:
    @     0x7fa3e575c330  (unknown) at ??:0
    @           0x52dc09  __sanitizer::internal_read() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_syscall_linux_x86_64.inc:46 (discriminator 7)
    @           0x52f79f  __sanitizer::ReadFromFile() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_posix.cc:176
    @           0x53acb9  __sanitizer::SymbolizerProcess::ReadFromSymbolizer() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:468
    @           0x53ba76  __sanitizer::SymbolizerProcess::SendCommand() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:445
    @           0x53c3c5  __sanitizer::Symbolizer::SymbolizePC() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:356
    @           0x539673  __sanitizer::StackTrace::Print() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_stacktrace_libcdep.cc:36
    @           0x541c44  MaybePrintStackTrace() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_diag.cc:48
    @           0x5422ee  __ubsan::ScopedReport::~ScopedReport() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_diag.cc:73
    @           0x549608  HandleDynamicTypeCacheMiss() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_handlers_cxx.cc:81
    @           0x549a43  __ubsan_handle_dynamic_type_cache_miss_abort at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_handlers_cxx.cc:93
    @     0x7fa3f0086643  _ZZN4kudu6client8internal20AsyncLeaderMasterRpcINS_6master23GetTableSchemaRequestPBENS3_24GetTableSchemaResponsePBEE27RetryOrReconnectIfNecessaryEPNS_6StatusEENKUlvE_clEv at ??:0
    @     0x7fa3f0059b8c  _ZN4kudu13ScopedCleanupIZNS_6client8internal20AsyncLeaderMasterRpcINS_6master23GetTableSchemaRequestPBENS4_24GetTableSchemaResponsePBEE27RetryOrReconnectIfNecessaryEPNS_6StatusEEUlvE_ED2Ev at ??:0
    @     0x7fa3f005945c  kudu::client::internal::AsyncLeaderMasterRpc<>::RetryOrReconnectIfNecessary() at ??:0
    @     0x7fa3f0057d0a  kudu::client::internal::AsyncLeaderMasterRpc<>::SendRpcCb() at ??:0
    @     0x7fa3f0085a85  _ZNSt5_BindIFSt7_Mem_fnIMN4kudu6client8internal20AsyncLeaderMasterRpcINS1_6master23GetTableSchemaRequestPBENS5_24GetTableSchemaResponsePBEEEFvRKNS1_6StatusEEEPS8_S9_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE at ??:0
    #19 0x7fa3ddc84ffc in clone sysdeps/unix/sysv/linux/x86_64/clone.S:111

After looking into this and similar failures, it seems the following
sequence of events is possible (a code sketch follows the list):

T1: allocates an AsyncLeaderMasterRpc on the stack with
    Synchronizer.Wait() as the user-specified callback
T2: the RPC call results in an error that makes it retry via
    RetryOrReconnectIfNecessary(); a retry is scheduled
T3: the retry completes and calls the user-specified callback to begin
    execution on T1
T1: with the RPC call completed, destroys the RPC object
T2: the KLOG messages attempt to access state from the destroyed RPC
    object and hit a crash, race, undefined behavior, etc.
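
To make the interleaving concrete, here is a minimal, self-contained
sketch of the hazard. The names (FakeRpc, table_name, user_cb) are
hypothetical stand-ins for AsyncLeaderMasterRpc and its members, and
plain std::thread stands in for the reactor and retry machinery; this
illustrates the pattern, it is not the actual Kudu code:

    #include <functional>
    #include <future>
    #include <iostream>
    #include <string>
    #include <thread>

    // Hypothetical stand-in for AsyncLeaderMasterRpc; the real class
    // lives in src/kudu/client/master_proxy_rpc.cc.
    struct FakeRpc {
      std::string table_name;         // state owned by the stack-allocated RPC
      std::function<void()> user_cb;  // e.g. wakes the Synchronizer T1 waits on

      void RetryOrReconnectIfNecessary() {
        // T2 schedules the retry; in the bad interleaving the retry (T3)
        // completes and invokes user_cb before T2 reaches the logging below.
        std::thread([cb = user_cb] { cb(); }).detach();

        // BUG: once user_cb has run, T1 may have destroyed *this, so this
        // read of table_name races with the destruction -- the kind of
        // access flagged in the stack trace above.
        std::cout << "retrying RPC for table " << table_name << "\n";
      }
    };

    int main() {
      std::promise<void> done;
      std::thread reactor;  // plays the role of T2
      {
        // T1: the RPC lives on the stack and is destroyed as soon as the
        // user callback has fired and this scope exits.
        FakeRpc rpc{"my_table", [&done] { done.set_value(); }};
        reactor = std::thread([&rpc] { rpc.RetryOrReconnectIfNecessary(); });
        done.get_future().wait();
      }  // rpc destroyed here, possibly while T2 is still logging: the race
      reactor.join();
      return 0;
    }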

The fix is to make a copy of the RPC's state that was to be logged so
there's no chance that T1's destruction of the RPC will affect T2
(sketched below). I tested this by looping a test[1] that repeatedly
calls Client::IsCreateTableInProgress() on a multi-master cluster
configured to have election storms, and observing that 5/1000 runs
yielded TSAN errors around the logging calls in
RetryOrReconnectIfNecessary().

[1] https://gist.github.com/andrwng/5d552f75a2e0d671b7ed54dd01892c66
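
In terms of the sketch above, the shape of the fix is roughly the
following (again with hypothetical names; the actual change is in
src/kudu/client/master_proxy_rpc.cc): copy whatever will be logged
into locals before the retry can complete, and log only from the
copies.

    // Revised body of the hypothetical FakeRpc::RetryOrReconnectIfNecessary()
    // from the sketch above: no member of the RPC is touched after the
    // retry is scheduled.
    void FakeRpc::RetryOrReconnectIfNecessary() {
      // Copy the state that will be logged while 'this' is still guaranteed
      // to be alive (nothing can destroy the RPC before the retry runs).
      const std::string table_name_copy = table_name;

      // Scheduling the retry is the last access to 'this' on this thread;
      // the init capture copies user_cb before the new thread starts.
      std::thread([cb = user_cb] { cb(); }).detach();

      // Safe: only the local copy is read, even if T1 has already destroyed
      // the RPC object.
      std::cout << "retrying RPC for table " << table_name_copy << "\n";
    }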

Change-Id: I8cefd9613018247a1a25d17adedc021e8be166f6
---
M src/kudu/client/master_proxy_rpc.cc
1 file changed, 9 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/12170/3
-- 
To view, visit http://gerrit.cloudera.org:8080/12170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8cefd9613018247a1a25d17adedc021e8be166f6
Gerrit-Change-Number: 12170
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>