Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/06/24 05:58:00 UTC

[jira] [Commented] (IMPALA-10895) TestQueryRetries.test_retrying_query_cancel is flaky

    [ https://issues.apache.org/jira/browse/IMPALA-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558336#comment-17558336 ] 

ASF subversion and git services commented on IMPALA-10895:
----------------------------------------------------------

Commit ae00781983e94bfe7e41f65f50fde502d2039f7d in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ae0078198 ]

IMPALA-10895: Fix flakiness of test_retrying_query_cancel

test_retrying_query_cancel tests canceling a query while it is in the
RETRYING state. The test first runs the query and waits for its state to
become RETRYING. A debug action keeps the query in the RETRYING state
for more than 1s, which should be long enough for the test. However, the
test polls for the RETRYING state at a 0.5s interval, which wastes most
of that time. In ASAN builds, the remaining time is not enough for the
following steps, so the query state becomes RETRIED and the test
fails.
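
To make the timing argument concrete, here is a rough back-of-the-envelope sketch. It assumes the RETRYING window is about 1s (from the SLEEP@1000 debug action) and that, in the worst case, the state changes just after a poll, so a full poll interval passes before the test notices. The function name is hypothetical and the model ignores request latency and ASAN slowdown, which shrink the budget further.

```python
def worst_case_time_left_s(retrying_window_s, poll_interval_s):
    """Time left in the RETRYING window after the test notices the state,
    assuming the state flipped right after the previous poll (worst case)."""
    return max(0.0, retrying_window_s - poll_interval_s)


if __name__ == "__main__":
    # Compare the old 0.5s interval with the new 0.1s interval.
    for interval_s in (0.5, 0.1):
        left_s = worst_case_time_left_s(1.0, interval_s)
        print("interval=%.1fs -> worst-case time left=%.1fs" % (interval_s, left_s))
```

Under these assumptions, the shorter interval leaves most of the window for the cancel and the state check.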

This patch reduces the wait interval to 0.1s. It also adds some logging
and changes the code to get the state after the wait instead of before
it.
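
The polling pattern described above could look roughly like the sketch below. This is not the code from the patch: wait_for_state, get_state, and the injectable sleep/clock parameters are hypothetical names chosen for illustration. The point is only that the state is read after each short wait, so the value returned to the caller is fresh.

```python
import time


def wait_for_state(get_state, expected, timeout_s=5.0, interval_s=0.1,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns `expected` or the timeout expires.

    Returns the last observed state. The state is re-read after every
    sleep, so the caller never acts on a value captured before the wait.
    """
    deadline = clock() + timeout_s
    state = get_state()
    while state != expected and clock() < deadline:
        sleep(interval_s)
        state = get_state()  # read after the wait, not before it
    return state
```

A test would then assert that the returned value is 'RETRYING' before canceling the query. Passing sleep and clock as parameters is only a convenience so the helper can be exercised with a fake clock.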

Tests:
- Ran the test more than 1000 times in an ASAN build. Before this patch,
  it failed in around 30 runs.

Change-Id: Id069091c94160d09868fcdc36ac7195b1deb337a
Reviewed-on: http://gerrit.cloudera.org:8080/18659
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> TestQueryRetries.test_retrying_query_cancel is flaky
> ----------------------------------------------------
>
>                 Key: IMPALA-10895
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10895
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>              Labels: broken-build
>         Attachments: catalogd.INFO.gz, impalad.INFO.gz, impalad_node1.INFO.gz, impalad_node2.INFO.gz, statestored.INFO.gz
>
>
> Saw this fail in an ASAN build:
> {code}
> custom_cluster.test_query_retries.TestQueryRetries.test_retrying_query_cancel
> {code}
> Stacktrace
> {code}
> custom_cluster/test_query_retries.py:742: in test_retrying_query_cancel
>     assert retry_status.group(1) == 'RETRYING'
> E   assert 'RETRIED' == 'RETRYING'
> E     - RETRIED
> E     + RETRYING
> {code}
> Standard Error
> {code}
> -- 2021-08-29 08:29:29,112 INFO     MainThread: Starting cluster with command: /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/bin/start-impala-cluster.py '--state_store_args=--statestore_update_frequency_ms=50     --statestore_priority_update_frequency_ms=50     --statestore_heartbeat_frequency_ms=50' --cluster_size=3 --num_coordinators=3 --log_dir=/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/logs/custom_cluster_tests --log_level=1 '--impalad_args=--debug_actions=RETRY_DELAY_CHECKING_ORIGINAL_DRIVER:SLEEP@1000 ' '--state_store_args=--statestore_heartbeat_frequency_ms=60000 ' --impalad_args=--default_query_options=
> 08:29:29 MainThread: Found 0 impalad/0 statestored/0 catalogd process(es)
> 08:29:29 MainThread: Starting State Store logging to /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/logs/custom_cluster_tests/statestored.INFO
> 08:29:29 MainThread: Starting Catalog Service logging to /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/logs/custom_cluster_tests/catalogd.INFO
> 08:29:29 MainThread: Starting Impala Daemon logging to /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/logs/custom_cluster_tests/impalad.INFO
> 08:29:29 MainThread: Starting Impala Daemon logging to /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/logs/custom_cluster_tests/impalad_node1.INFO
> 08:29:29 MainThread: Starting Impala Daemon logging to /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/logs/custom_cluster_tests/impalad_node2.INFO
> 08:29:32 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
> 08:29:32 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
> 08:29:32 MainThread: Getting num_known_live_backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25000
> 08:29:32 MainThread: Debug webpage not yet available: HTTPConnectionPool(host='impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com', port=25000): Max retries exceeded with url: /backends?json (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f10c1d5d150>: Failed to establish a new connection: [Errno 111] Connection refused',))
> 08:29:34 MainThread: Debug webpage did not become available in expected time.
> 08:29:34 MainThread: Waiting for num_known_live_backends=3. Current value: None
> 08:29:35 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
> 08:29:35 MainThread: Getting num_known_live_backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25000
> 08:29:35 MainThread: Waiting for num_known_live_backends=3. Current value: 0
> 08:29:36 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
> 08:29:36 MainThread: Getting num_known_live_backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25000
> 08:29:36 MainThread: num_known_live_backends has reached value: 3
> 08:29:37 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
> 08:29:37 MainThread: Getting num_known_live_backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25001
> 08:29:37 MainThread: num_known_live_backends has reached value: 3
> 08:29:38 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
> 08:29:38 MainThread: Getting num_known_live_backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25002
> 08:29:38 MainThread: num_known_live_backends has reached value: 3
> 08:29:38 MainThread: Impala Cluster Running with 3 nodes (3 coordinators, 3 executors).
> -- 2021-08-29 08:29:38,626 DEBUG    MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
> -- 2021-08-29 08:29:38,626 INFO     MainThread: Getting metric: statestore.live-backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25010
> -- 2021-08-29 08:29:38,629 INFO     MainThread: Metric 'statestore.live-backends' has reached desired value: 4
> -- 2021-08-29 08:29:38,629 DEBUG    MainThread: Getting num_known_live_backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25000
> -- 2021-08-29 08:29:38,631 INFO     MainThread: num_known_live_backends has reached value: 3
> -- 2021-08-29 08:29:38,631 DEBUG    MainThread: Getting num_known_live_backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25001
> -- 2021-08-29 08:29:38,633 INFO     MainThread: num_known_live_backends has reached value: 3
> -- 2021-08-29 08:29:38,633 DEBUG    MainThread: Getting num_known_live_backends from impala-ec2-centos74-m5-4xlarge-ondemand-061e.vpc.cloudera.com:25002
> -- 2021-08-29 08:29:38,635 INFO     MainThread: num_known_live_backends has reached value: 3
> SET client_identifier=custom_cluster/test_query_retries.py::TestQueryRetries::()::test_retrying_query_cancel;
> -- connecting to: localhost:21000
> -- connecting to localhost:21050 with impyla
> -- 2021-08-29 08:29:38,801 INFO     MainThread: Closing active operation
> -- connecting to localhost:28000 with impyla
> -- 2021-08-29 08:29:38,822 INFO     MainThread: Closing active operation
> -- 2021-08-29 08:29:38,853 INFO     MainThread: Killing <ImpaladProcess PID: 6501 (/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/build/latest/service/impalad -disconnected_session_timeout 21600 -kudu_client_rpc_timeout_ms 60000 -kudu_master_hosts localhost -mem_limit=12884901888 -logbufsecs=5 -v=1 -max_log_files=0 -log_filename=impalad_node1 -log_dir=/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/logs/custom_cluster_tests -beeswax_port=21001 -hs2_port=21051 -hs2_http_port=28001 -krpc_port=27001 -state_store_subscriber_port=23001 -webserver_port=25001 --debug_actions=RETRY_DELAY_CHECKING_ORIGINAL_DRIVER:SLEEP@1000 --default_query_options=)> with signal 9
> SET client_identifier=custom_cluster/test_query_retries.py::TestQueryRetries::()::test_retrying_query_cancel;
> SET retry_failed_queries=true;
> -- executing async: localhost:21000
> select count(*) from tpch_parquet.lineitem;
> -- 2021-08-29 08:29:39,647 INFO     MainThread: Started query de46b6f56f935fa5:f4c701b200000000
> -- canceling operation: <tests.common.impala_connection.OperationHandle object at 0x7fb2a8026250>
> {code}
> CC [~xqhe]


