Posted to issues-all@impala.apache.org by "Wenzhe Zhou (Jira)" <ji...@apache.org> on 2022/04/21 22:16:00 UTC

[jira] [Commented] (IMPALA-11263) Coordinator hang when cancelling a query

    [ https://issues.apache.org/jira/browse/IMPALA-11263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526111#comment-17526111 ] 

Wenzhe Zhou commented on IMPALA-11263:
--------------------------------------

Log messages for the failed query dc48f399b63dd87c:43f31fd900000000.
It was the retried query for an earlier query, 6d49b7f0544e71cc:37d2486300000000.
{code:java}
2022-03-24-13-50.log:    I0324 13:52:43.108821  9754 query-driver.cc:253] Retrying query 6d49b7f0544e71cc:37d2486300000000 with new query id dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-50.log:    I0324 13:52:43.108845  9754 impala-server.cc:1352] Registered query query_id=dc48f399b63dd87c:43f31fd900000000 session_id=274cff1ef21d8454:80abbc8346fdc7b1
2022-03-24-13-50.log:    I0324 13:52:43.109871  9754 impala-server.cc:1392] Query dc48f399b63dd87c:43f31fd900000000 has idle timeout of 10m
2022-03-24-13-50.log:    I0324 13:52:43.109925  9754 query-driver.cc:360] Retried query 6d49b7f0544e71cc:37d2486300000000 with new query id dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-50.log:    I0324 13:52:43.138975  9756 admission-controller.cc:1879] Trying to admit id=dc48f399b63dd87c:43f31fd900000000 in pool_name=root.default executor_group_name=root.default-group-000 per_host_mem_estimate=85.28 GB dedicated_coord_mem_estimate=376.55 MB max_requests=-1 max_queued=200 max_mem=48828.12 GB
{code}

The query was admitted:
{code:java}
2022-03-24-13-55.log:    I0324 13:58:30.230749  9756 admission-controller.cc:1378] Admitted queued query id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.243191  9756 coordinator.cc:150] Exec() query_id=dc48f399b63dd87c:43f31fd900000000 stmt=with data_set as (    
{code}

The query started executing on 5 backends:
{code:java}
2022-03-24-13-55.log:    I0324 13:58:30.252866  9756 coordinator.cc:475] starting execution on 5 backends for query_id=dc48f399b63dd87c:43f31fd900000000
{code}

The Exec() RPC failed for one backend due to a network issue:


{code:java}
2022-03-24-13-55.log:    I0324 13:58:30.253540   147 control-service.cc:148] dc48f399b63dd87c:43f31fd900000000] ExecQueryFInstances(): query_id=dc48f399b63dd87c:43f31fd900000000 coord=coordinator-0:27000 #instances=1
2022-03-24-13-55.log:    I0324 13:58:30.254931 10438 query-state.cc:948] dc48f399b63dd87c:43f31fd900000000] Executing instance. instance_id=dc48f399b63dd87c:43f31fd900000000 fragment_idx=0 per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=5W0324 13:58:30.256196 10435 executor-group.cc:164] Executor group root.default-group-002 is unhealthy: 3 out of 4 are available.
2022-03-24-13-55.log:    I0324 13:58:30.274003   131 kudu-status-util.h:55] Exec() rpc failed: Network error: Client connection negotiation failed: client connection to 10.143.1.130:27010: connect: Connection refused (error 111)E0324 13:58:30.274055   131 coordinator-backend-state.cc:190] ExecQueryFInstances rpc query_id=dc48f399b63dd87c:43f31fd900000000 failed: Exec() rpc failed: Network error: Client connection negotiation failed: client connection to 10.143.1.130:27010: connect: Connection refused (error 111)
{code}

Since one of the backends failed to start up, the coordinator cancelled execution on the other backends by calling Coordinator::CancelBackends().

Coordinator::CancelBackends() should call Coordinator::BackendState::Cancel() for all 5 backends and then write the log message "CancelBackends() query_id=dc48f399b63dd87c:43f31fd900000000, tried to cancel 5 backends". However, that message does not appear in the log file, and only 4 log messages written by Coordinator::BackendState::Cancel() could be found. It looks like the thread hung in Coordinator::BackendState::Cancel() for the 4th backend (a conceptual sketch of this cancellation loop follows the log excerpt below).


{code:java}
2022-03-24-13-55.log:    I0324 13:58:30.281944  9756 coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 target backend=10.143.10.147:27000: Sending CancelQueryFInstances rpc
2022-03-24-13-55.log:    I0324 13:58:30.282232   147 control-service.cc:219] CancelQueryFInstances(): query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.282265   147 query-exec-mgr.cc:126] QueryState: query_id=dc48f399b63dd87c:43f31fd900000000 refcnt=4
2022-03-24-13-55.log:    I0324 13:58:30.282274   147 query-state.cc:974] Cancel: query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.282284   147 krpc-data-stream-mgr.cc:339] cancelling active streams for query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.282481  9756 coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 target backend=10.143.0.249:27010: Sending CancelQueryFInstances rpc
2022-03-24-13-55.log:    I0324 13:58:30.296829   534 admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, agg_mem_reserved=433.15 GB,  local_host(local_mem_admitted=481.94 GB, num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, average_per_query=2.35 MB)
2022-03-24-13-55.log:    I0324 13:58:30.297036   534 admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, agg_mem_reserved=433.15 GB,  local_host(local_mem_admitted=481.94 GB, num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, average_per_query=2.35 MB)
2022-03-24-13-55.log:    I0324 13:58:30.297185   534 admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, agg_mem_reserved=433.15 GB,  local_host(local_mem_admitted=481.94 GB, num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, average_per_query=2.35 MB)
2022-03-24-13-55.log:    I0324 13:58:30.303017 10438 krpc-data-stream-mgr.cc:308] dc48f399b63dd87c:43f31fd900000000] DeregisterRecvr(): fragment_instance_id=dc48f399b63dd87c:43f31fd900000000, node=74
2022-03-24-13-55.log:    I0324 13:58:30.303082 10438 query-state.cc:957] dc48f399b63dd87c:43f31fd900000000] Instance completed. instance_id=dc48f399b63dd87c:43f31fd900000000 #in-flight=4 status=CANCELLED: Cancelled
2022-03-24-13-55.log:    I0324 13:58:30.303110 10437 query-state.cc:468] dc48f399b63dd87c:43f31fd900000000] UpdateBackendExecState(): last report for dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.304203 10437 query-state.cc:738] dc48f399b63dd87c:43f31fd900000000] Cancelling fragment instances as directed by the coordinator. Returned status: Cancelled
2022-03-24-13-55.log:    I0324 13:58:30.304234 10437 query-state.cc:974] dc48f399b63dd87c:43f31fd900000000] Cancel: query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:32.535372  9756 coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 target backend=10.143.1.130:27010: Not cancelling because the backend is already done: ExecQueryFInstances rpc query_id=dc48f399b63dd87c:43f31fd900000000 failed: Exec() rpc failed: Network error: Client connection negotiation failed: client connection to 10.143.1.130:27010: connect: Connection refused (error 111)
2022-03-24-13-55.log:    I0324 13:58:32.535439  9756 coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 target backend=10.143.1.137:27010: Attempting to cancel Exec() rpc
{code}
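
For reference, the cancellation path described above is conceptually a loop like the following. This is a simplified, hypothetical sketch (the stand-in BackendState type and the bool return from Cancel() are assumptions, not the actual Impala source in coordinator-backend-state.cc); it only illustrates why a single Cancel() call that never returns prevents the final "tried to cancel 5 backends" message from being logged.

{code:cpp}
#include <iostream>
#include <vector>

// Hypothetical stand-in for Coordinator::BackendState; the real class is far
// more involved and talks to the backend over KRPC.
struct BackendState {
  bool done = false;
  // Returns true if a cancellation RPC was attempted. In the real code this
  // first waits for any in-flight Exec() RPC to finish; if that wait never
  // returns, the caller is stuck inside this call.
  bool Cancel() {
    if (done) return false;  // backend already finished, nothing to cancel
    // ... wait for the Exec() RPC to complete, then send CancelQueryFInstances ...
    return true;
  }
};

// Sketch of the cancellation loop: cancel every backend, then log a summary.
// If one Cancel() call blocks forever, the summary line is never printed,
// which matches the missing "tried to cancel 5 backends" message above.
void CancelBackends(std::vector<BackendState>& backend_states) {
  int num_cancelled = 0;
  for (BackendState& bs : backend_states) {
    if (bs.Cancel()) ++num_cancelled;
  }
  std::cout << "CancelBackends(): tried to cancel " << num_cancelled
            << " backends" << std::endl;
}

int main() {
  std::vector<BackendState> backends(5);  // 5 backends, as in the query above
  CancelBackends(backends);
  return 0;
}
{code}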

The Impala server tried to cancel query dc48f399b63dd87c:43f31fd900000000 because of the failed backends:
{code:java}
2022-03-24-13-55.log:    I0324 13:58:39.793668   554 impala-server.cc:2274] Backends failed for query dc48f399b63dd87c:43f31fd900000000, adding to queue to check for cancellation: impala-executor-002-2:27010, impala-executor-002-3:27010
{code}

Since dc48f399b63dd87c:43f31fd900000000 was itself a retried query, the retry was skipped:

{code:java}
2022-03-24-13-55.log:    I0324 13:58:39.793761   538 query-driver.cc:150] Skipping retry of query_id=dc48f399b63dd87c:43f31fd900000000 because it has already been retried
{code}

But the cancellation failed due to an invalid query handle:

{code:java}

2022-03-24-13-55.log:    I0324 13:58:39.793794   538 impala-server.cc:1909] CancelFromThreadPool(): cancelling query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:59:11.101034   554 impala-server.cc:2274] Backends failed for query dc48f399b63dd87c:43f31fd900000000, adding to queue to check for cancellation: impala-executor-002-1:27010
2022-03-24-13-55.log:    I0324 13:59:11.101111   539 impala-server.cc:1599] Invalid or unknown query handle: dc48f399b63dd87c:43f31fd900000000.
2022-03-24-13-55.log:    I0324 13:59:11.101135   539 impala-server.cc:1855] CancelFromThreadPool(): query dc48f399b63dd87c:43f31fd900000000 already unregistered.
2022-03-24-14-00.log:    I0324 14:04:18.176677 11006 impala-server.cc:1436] UnregisterQuery(): query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-14-00.log:    I0324 14:04:18.176694 11006 impala-server.cc:1586] Invalid or unknown query handle: dc48f399b63dd87c:43f31fd900000000.
{code}

For this query failure, the query-retry thread hung in Coordinator::BackendState::Cancel() for the 4th backend while the coordinator tried to cancel the retried query. The last message written by the thread is "Attempting to cancel Exec() rpc". This indicates that WaitOnExecLocked() left the thread waiting indefinitely. We can add a timeout to WaitOnExecLocked() to avoid this hanging issue.
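
A minimal sketch of that idea, assuming WaitOnExecLocked() is built on a condition variable that is signalled by the Exec() completion callback (the ExecRpcState type, the exec_done flag and the 2-second timeout are illustrative assumptions, not the real Impala code): replacing the unbounded wait with wait_for() lets the cancelling thread give up after a bounded time instead of blocking forever when ExecCompleteCb() never fires.

{code:cpp}
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

// Hypothetical stand-in for the state guarded by the backend state's lock.
struct ExecRpcState {
  std::mutex lock;
  std::condition_variable exec_done_cv;
  bool exec_done = false;  // set by the Exec() RPC completion callback
};

// Sketch of a timed WaitOnExecLocked(): instead of waiting indefinitely for
// the Exec() RPC completion callback, give up after 'timeout' and let the
// caller decide how to proceed (e.g. log a warning and skip this backend).
bool WaitOnExecLockedWithTimeout(ExecRpcState& state,
                                 std::chrono::seconds timeout) {
  std::unique_lock<std::mutex> l(state.lock);
  // wait_for() returns false if the timeout expired before exec_done became true.
  return state.exec_done_cv.wait_for(l, timeout, [&] { return state.exec_done; });
}

int main() {
  ExecRpcState state;
  // The completion callback never fires in this demo, so the wait times out.
  if (!WaitOnExecLockedWithTimeout(state, std::chrono::seconds(2))) {
    std::cout << "Timed out waiting for Exec() RPC to complete; "
                 "skipping cancellation for this backend" << std::endl;
  }
  return 0;
}
{code}

With such a timeout, Cancel() could log a warning for the unresponsive backend and move on to the remaining backends, so CancelBackends() would still reach its "tried to cancel N backends" summary line instead of hanging.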




> Coordinator hang when cancelling a query
> ----------------------------------------
>
>                 Key: IMPALA-11263
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11263
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> In a rare case, the callback function Coordinator::BackendState::ExecCompleteCb() was somehow not called for the corresponding ExecQueryFInstances RPC. This caused the coordinator to wait indefinitely when cancelling the query.


