Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/08/28 16:55:00 UTC

[jira] [Commented] (IMPALA-6788) Abort ExecFInstance() RPC loop early after query failure

    [ https://issues.apache.org/jira/browse/IMPALA-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186681#comment-17186681 ] 

ASF subversion and git services commented on IMPALA-6788:
---------------------------------------------------------

Commit 3733c4cc2cfb78d7f13463fb1ee9e1c4560d4a3d in impala's branch refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3733c4c ]

IMPALA-10050: Fixed DCHECK error for backend in terminal state.

A recent patch for IMPALA-6788 makes the coordinator cancel in-flight
query fragment instances when it receives a failure report from one
backend. BackendState::Cancel() may therefore be called for a
fragment instance before the first execution status report from its
backend has been received and processed by the coordinator.
Because the status of BackendState is set to Cancelled once Cancel()
is called, execution of the fragment instance is treated as Done in
that case, so the status report is NOT processed. As a result, the
backend receives an OK response from the coordinator even though it
sent a report with an execution error. This makes the backend hit a
DCHECK error when it is in the terminal state with an error.
This patch fixes the issue by making the coordinator send CANCELLED
status in the response to the status report if the backend status is
not ok and the execution status report is not applied.
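
For illustration only, the response rule described above can be sketched
as follows. This is a minimal sketch, not the actual Impala code: the
StatusCode enum, the simplified BackendState struct and the
RespondToStatusReport() function are hypothetical stand-ins, and it
assumes "backend status" means the status the coordinator has recorded
for that backend.

```cpp
// Hypothetical, simplified stand-ins; these are NOT Impala's real types.
enum class StatusCode { OK, CANCELLED, ERROR };

struct BackendState {
  StatusCode status = StatusCode::OK;  // status recorded by the coordinator
  bool done = false;                   // true once in a terminal state

  // Cancelling marks the backend as Cancelled and terminal.
  void Cancel() {
    if (done) return;
    status = StatusCode::CANCELLED;
    done = true;
  }

  // Returns true if the report was applied, false if it was ignored
  // because the backend is already in a terminal state.
  bool ApplyExecStatusReport(StatusCode reported) {
    if (done) return false;
    if (reported != StatusCode::OK) {
      status = reported;
      done = true;
    }
    return true;
  }
};

// Status the coordinator puts in its response to a status report.
// Assumption: CANCELLED is returned when the report was not applied and
// the recorded backend status is not OK; otherwise OK is returned.
StatusCode RespondToStatusReport(BackendState& state, StatusCode reported) {
  const bool applied = state.ApplyExecStatusReport(reported);
  if (!applied && state.status != StatusCode::OK) {
    return StatusCode::CANCELLED;
  }
  return StatusCode::OK;
}
```

Under these assumptions, a backend that was cancelled before its first
report arrives gets CANCELLED back instead of OK, so it no longer ends
up in a terminal error state while being told that everything is fine.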

Testing:
 - The issue could be reproduced by running test_failpoints for about
   20 iterations. Verified the fix by running test_failpoints for over
   200 iterations without a DCHECK failure.
 - Passed TestProcessFailures::test_kill_coordinator.
 - Passed TestRPCException::test_state_report_error.
 - Passed exhaustive tests.

Change-Id: Iba6a72f98c0f9299c22c58830ec5a643335b966a
Reviewed-on: http://gerrit.cloudera.org:8080/16303
Reviewed-by: Thomas Tauber-Marshall <tm...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Abort ExecFInstance() RPC loop early after query failure
> --------------------------------------------------------
>
>                 Key: IMPALA-6788
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6788
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>    Affects Versions: Impala 2.12.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Wenzhe Zhou
>            Priority: Major
>              Labels: krpc, rpc
>             Fix For: Impala 4.0
>
>         Attachments: connect_thread_busy_queries_failing.txt, impalad.va1007.foo.com.impala.log.INFO.20180401-200453.1800807.zip
>
>
> Logs from a large cluster show that query startup can take a long time; once startup completes, the query is cancelled because one of the intermediate RPCs failed.
> It is not clear what the right answer is, as fragments are started asynchronously. Possibly a timeout?
> {code}
> I0401 21:25:30.776803 1830900 coordinator.cc:99] Exec() query_id=334cc7dd9758c36c:ec38aeb400000000 stmt=with customer_total_return as
> I0401 21:25:30.813993 1830900 coordinator.cc:357] starting execution on 644 backends for query_id=334cc7dd9758c36c:ec38aeb400000000
> I0401 21:29:58.406466 1830900 coordinator.cc:370] started execution on 644 backends for query_id=334cc7dd9758c36c:ec38aeb400000000
> I0401 21:29:58.412132 1830900 coordinator.cc:896] Cancel() query_id=334cc7dd9758c36c:ec38aeb400000000
> I0401 21:29:59.188817 1830900 coordinator.cc:906] CancelBackends() query_id=334cc7dd9758c36c:ec38aeb400000000, tried to cancel 643 backends
> I0401 21:29:59.189177 1830900 coordinator.cc:1092] Release admission control resources for query_id=334cc7dd9758c36c:ec38aeb400000000
> {code}
> {code}
> I0401 21:23:48.218379 1830386 coordinator.cc:99] Exec() query_id=e44d553b04d47cfb:28f06bb800000000 stmt=with customer_total_return as
> I0401 21:23:48.270226 1830386 coordinator.cc:357] starting execution on 640 backends for query_id=e44d553b04d47cfb:28f06bb800000000
> I0401 21:29:58.402195 1830386 coordinator.cc:370] started execution on 640 backends for query_id=e44d553b04d47cfb:28f06bb800000000
> I0401 21:29:58.403818 1830386 coordinator.cc:896] Cancel() query_id=e44d553b04d47cfb:28f06bb800000000
> I0401 21:29:59.255903 1830386 coordinator.cc:906] CancelBackends() query_id=e44d553b04d47cfb:28f06bb800000000, tried to cancel 639 backends
> I0401 21:29:59.256251 1830386 coordinator.cc:1092] Release admission control resources for query_id=e44d553b04d47cfb:28f06bb800000000
> {code}
> Checked the coordinator, and threads appear to be spending a lot of time waiting on exec_complete_barrier_:
> {code}
> #0  0x00007fd928c816d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x0000000001222944 in impala::Promise<bool>::Get() ()
> #2  0x0000000001220d7b in impala::Coordinator::StartBackendExec() ()
> #3  0x0000000001221c87 in impala::Coordinator::Exec() ()
> #4  0x0000000000c3a925 in impala::ClientRequestState::ExecQueryOrDmlRequest(impala::TQueryExecRequest const&) ()
> #5  0x0000000000c41f7e in impala::ClientRequestState::Exec(impala::TExecRequest*) ()
> #6  0x0000000000bff597 in impala::ImpalaServer::ExecuteInternal(impala::TQueryCtx const&, std::shared_ptr<impala::ImpalaServer::SessionState>, bool*, std::shared_ptr<impala::ClientRequestState>*) ()
> #7  0x0000000000c061d9 in impala::ImpalaServer::Execute(impala::TQueryCtx*, std::shared_ptr<impala::ImpalaServer::SessionState>, std::shared_ptr<impala::ClientRequestState>*) ()
> #8  0x0000000000c561c5 in impala::ImpalaServer::query(beeswax::QueryHandle&, beeswax::Query const&) ()
> /StartBackendExec
> #11 0x0000000000d60c9a in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
> {code}
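
The early abort requested in the quoted issue could look roughly like
the sketch below. It is an illustration under stated assumptions, not
Impala's implementation: StartBackendsUntilFailure() and its parameters
are hypothetical, and the loop is sequential for simplicity, whereas the
issue notes that fragments are started asynchronously. The idea is only
that a shared failure flag is checked before each startup RPC is issued,
so the remaining RPCs are skipped once any of them fails.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Hypothetical helper, not part of Impala. Issues one startup RPC per
// backend and stops as soon as `query_failed` is set, either by a failed
// RPC in this loop or by another thread. `send_rpc(i)` returns true on
// success. Returns the number of RPCs that were actually issued.
std::size_t StartBackendsUntilFailure(
    std::size_t num_backends,
    std::atomic<bool>& query_failed,
    const std::function<bool(std::size_t)>& send_rpc) {
  std::size_t issued = 0;
  for (std::size_t i = 0; i < num_backends; ++i) {
    // Abort early: do not start more backends for a failed query.
    if (query_failed.load()) break;
    ++issued;
    if (!send_rpc(i)) query_failed.store(true);
  }
  return issued;
}
```

With 644 backends, as in the first log excerpt, and a failure on the
third RPC, this sketch would issue only three RPCs rather than waiting
for all 644 before cancelling.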



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org