Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2019/01/12 00:51:13 UTC
[kudu-CR] client: avoid accessing state after scheduling retry
Hello Alexey Serbin, Kudu Jenkins, Adar Dembo, Todd Lipcon,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/12170
to look at the new patch set (#3).
Change subject: client: avoid accessing state after scheduling retry
......................................................................
client: avoid accessing state after scheduling retry
I saw a failure of MasterMigrationTest.TestEndToEndMigration in which
the stack watchdog logged the following stacktrace:
User stack:
@ 0x7fa3e575c330 (unknown) at ??:0
@ 0x52dc09 __sanitizer::internal_read() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_syscall_linux_x86_64.inc:46 (discriminator 7)
@ 0x52f79f __sanitizer::ReadFromFile() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_posix.cc:176
@ 0x53acb9 __sanitizer::SymbolizerProcess::ReadFromSymbolizer() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:468
@ 0x53ba76 __sanitizer::SymbolizerProcess::SendCommand() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:445
@ 0x53c3c5 __sanitizer::Symbolizer::SymbolizePC() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:356
@ 0x539673 __sanitizer::StackTrace::Print() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_stacktrace_libcdep.cc:36
@ 0x541c44 MaybePrintStackTrace() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_diag.cc:48
@ 0x5422ee __ubsan::ScopedReport::~ScopedReport() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_diag.cc:73
@ 0x549608 HandleDynamicTypeCacheMiss() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_handlers_cxx.cc:81
@ 0x549a43 __ubsan_handle_dynamic_type_cache_miss_abort at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_handlers_cxx.cc:93
@ 0x7fa3f0086643 _ZZN4kudu6client8internal20AsyncLeaderMasterRpcINS_6master23GetTableSchemaRequestPBENS3_24GetTableSchemaResponsePBEE27RetryOrReconnectIfNecessaryEPNS_6StatusEENKUlvE_clEv at ??:0
@ 0x7fa3f0059b8c _ZN4kudu13ScopedCleanupIZNS_6client8internal20AsyncLeaderMasterRpcINS_6master23GetTableSchemaRequestPBENS4_24GetTableSchemaResponsePBEE27RetryOrReconnectIfNecessaryEPNS_6StatusEEUlvE_ED2Ev at ??:0
@ 0x7fa3f005945c kudu::client::internal::AsyncLeaderMasterRpc<>::RetryOrReconnectIfNecessary() at ??:0
@ 0x7fa3f0057d0a kudu::client::internal::AsyncLeaderMasterRpc<>::SendRpcCb() at ??:0
@ 0x7fa3f0085a85 _ZNSt5_BindIFSt7_Mem_fnIMN4kudu6client8internal20AsyncLeaderMasterRpcINS1_6master23GetTableSchemaRequestPBENS5_24GetTableSchemaResponsePBEEEFvRKNS1_6StatusEEEPS8_S9_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE at ??:0
#19 0x7fa3ddc84ffc in clone sysdeps/unix/sysv/linux/x86_64/clone.S:111
Upon looking into this and similar failures, the following sequence of
events seems possible:
T1: allocates an AsyncLeaderMasterRpc on the stack with
    Synchronizer::Wait() as the user-specified callback
T2: the RPC call results in an error that makes it retry via
    RetryOrReconnectIfNecessary(); a retry is scheduled
T3: the retry completes and calls the user-specified callback,
    unblocking T1
T1: with the RPC call completed, destroys the RPC object
T2: the KLOG messages attempt to access state from the destroyed RPC
    object and hit a crash, data race, undefined behavior, etc.
The fix is to copy the RPC state to be logged before scheduling the
retry, so that T1's destruction of the RPC cannot affect T2. I tested
this by looping a test[1] that repeatedly called
Client::IsCreateTableInProgress() on a multi-master cluster configured
to induce election storms; 5 out of 1000 runs yielded TSAN errors
around the logging calls in RetryOrReconnectIfNecessary.
[1] https://gist.github.com/andrwng/5d552f75a2e0d671b7ed54dd01892c66
Change-Id: I8cefd9613018247a1a25d17adedc021e8be166f6
---
M src/kudu/client/master_proxy_rpc.cc
1 file changed, 9 insertions(+), 7 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/12170/3
--
To view, visit http://gerrit.cloudera.org:8080/12170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8cefd9613018247a1a25d17adedc021e8be166f6
Gerrit-Change-Number: 12170
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>