You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Dan Hecht (JIRA)" <ji...@apache.org> on 2018/09/14 17:55:00 UTC

[jira] [Resolved] (IMPALA-7464) DCHECK(!released_exec_resources_) is triggered when ExecFInstance() RPC times out

     [ https://issues.apache.org/jira/browse/IMPALA-7464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Hecht resolved IMPALA-7464.
-------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 3.1.0

> DCHECK(!released_exec_resources_) is triggered when ExecFInstance() RPC times out
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7464
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7464
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.1.0
>            Reporter: Michael Ho
>            Assignee: Dan Hecht
>            Priority: Critical
>             Fix For: Impala 3.1.0
>
>
> Hit the following when running custom_cluster_test/test_rpc_timeout.py
> {noformat}
> F0817 20:04:25.451669   311 query-state.cc:469] Check failed: !released_exec_resources_
> void QueryState::AcquireExecResourceRefcount() {
>   DCHECK(!released_exec_resources_); <<----
>   exec_resource_refcnt_.Add(1);
> }
> #0  0x00007ff07d59a1f7 in raise () from /lib64/libc.so.6
> #1  0x00007ff07d59b8e8 in abort () from /lib64/libc.so.6
> #2  0x00000000043dd1b4 in google::DumpStackTraceAndExit() ()
> #3  0x00000000043d3c0d in google::LogMessage::Fail() ()
> #4  0x00000000043d54b2 in google::LogMessage::SendToLog() ()
> #5  0x00000000043d35e7 in google::LogMessage::Flush() ()
> #6  0x00000000043d6bae in google::LogMessageFatal::~LogMessageFatal() ()
> #7  0x0000000001e2f623 in impala::QueryState::AcquireExecResourceRefcount (this=0xc5c2800) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-state.cc:469
> #8  0x0000000001e2b734 in impala::QueryState::Init (this=0xc5c2800, rpc_params=...) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-state.cc:125
> #9  0x0000000001e516e1 in impala::QueryExecMgr::StartQuery (this=0xc7c55c0, params=...) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-exec-mgr.cc:52
> #10 0x0000000001f6bd12 in impala::ImpalaInternalService::ExecQueryFInstances (this=0xd4421e0, return_val=..., params=...) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-internal-service.cc:51
> #11 0x000000000308615c in impala::ImpalaInternalServiceProcessor::process_ExecQueryFInstances (this=0xcc38580, seqid=0, iprot=0xd3a4900, oprot=0xd3a4980, callContext=0xc2475c0) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/generated-sources/gen-cpp/ImpalaInternalService.cpp:1278
> #12 0x0000000003085eaa in impala::ImpalaInternalServiceProcessor::dispatchCall (this=0xcc38580, iprot=0xd3a4900, oprot=0xd3a4980, fname=..., seqid=0, callContext=0xc2475c0) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/generated-sources/gen-cpp/ImpalaInternalService.cpp:1251
> #13 0x0000000001913fd6 in apache::thrift::TDispatchProcessor::process (this=0xcc38580, in=..., out=..., connectionContext=0xc2475c0) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/thrift-0.9.3-p4/include/thrift/TDispatchProcessor.h:121
> #14 0x0000000001d2fbe9 in apache::thrift::server::TAcceptQueueServer::Task::run (this=0xce010c0) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/rpc/TAcceptQueueServer.cpp:68
> #15 0x0000000001d27ca5 in impala::ThriftThread::RunRunnable (this=0xc246500, runnable=..., promise=0x7fefee7b1260) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/rpc/thrift-thread.cc:74
> #16 0x0000000001d293cb in boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long, (impala::PromiseMode)0>*>::operator() (this=0xcdf1650, p=0xc246500, a1=..., a2=0x7fefee7b1260) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/mem_fn_template.hpp:280
> #17 0x0000000001d29261 in boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> >::operator()<boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long, (impala::PromiseMode)0>*>, boost::_bi::list0> (this=0xcdf1660, f=..., a=...) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind.hpp:392
> #18 0x0000000001d28fad in boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long, (impala::PromiseMode)0>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> > >::operator() (this=0xcdf1650) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind_template.hpp:20
> #19 0x0000000001d28ec0 in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long, (impala::PromiseMode)0>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> > >, void>::invoke (function_obj_ptr=...) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:153
> #20 0x0000000001c4cbd0 in boost::function0<void>::operator() (this=0x7fefe719cba0) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:767
> #21 0x00000000020754db in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*) (name=..., category=..., functor=..., parent_thread_info=0x7fefee7b1850, thread_started=0x7fefee7b0f30) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread.cc:359
> #22 0x000000000207d7fb in boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list0&, int) (this=0xd016fc0, f=@0xd016fb8: 0x2075174 <impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*)>, a=...) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind.hpp:525
> #23 0x000000000207d71f in boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > >::operator()() (this=0xd016fb8) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind_template.hpp:20
> #24 0x000000000207d6e2 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > > >::run() (this=0xd016e00) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/thread/detail/thread.hpp:116
> #25 0x00000000032ff25a in thread_proxy ()
> #26 0x00007ff07d92fe25 in start_thread () from /lib64/libpthread.so.0
> #27 0x00007ff07d65d34d in clone () from /lib64/libc.so.6
> {noformat}
> It appears that there is a race on the coordinator which occurs only when {{ExecFInstance()}} RPC times out due to really slow processing of the RPC on the coordinator itself. The assumption of the existing code is that by the time {{exec_rpcs_complete_barrier_}} is unblocked, all executors should have processed their RPCs and called QueryState::Init(). This assumption is violated when the executor on coordinator is slow enough to cause {{ExecFInstance()}} RPC to time out.
> In which case, the coordinator proceeds to mark the query as failed in {{Coordinator::FinishBackendStartup()}} which calls {{UpdateExecState()}} which eventually calls {{ReleaseExecResources()}} in {{HandleExecStateTransition()}}. That's why the DCHECK above is triggered.
> {noformat}
> I0817 20:04:23.451918   309 coordinator.cc:508] ExecState: query id=d740459b753125a4:f74e0fd700000000 finstance=N/A on host=impala-ec2-centos74-m5-4xlarge-ondemand-1ebb.vpc.cloudera.com (EXECUTING -> ERROR) status=ExecQueryFInstances rpc query_id=d740459b753125a4:f74e0fd700000000 failed: RPC recv timed out: dest address: impala-ec2-centos74-m5-4xlarge-ondemand-1ebb.vpc.cloudera.com:22000, rpc: N6impala26TExecQueryFInstancesResultE
> {noformat}
> In release builds, the behavior would be that the executor on coordinator will move ahead to allocate a bunch of resources after {{ReleaseExecResources()}} has been called on the coordinator. It will also start a bunch of fragment instances only to realize latter from {{ReportExecStatus()}} RPC that the query should be cancelled, in which case, it calls {{ReleaseExecResource()}} again after the ref count reaches 0 in {{ReleaseExecResourceRefCount()}} so there should be no resource leak. That said, if any logic relies on the flag {{released_exec_resources_}}, it may get confused.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org