You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Michael Smith (Jira)" <ji...@apache.org> on 2024/04/11 19:03:00 UTC

[jira] [Updated] (IMPALA-12997) test_query_log tests get stuck trying to write to the log

     [ https://issues.apache.org/jira/browse/IMPALA-12997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Smith updated IMPALA-12997:
-----------------------------------
    Description: 
In some test runs, most tests under test_query_log will start to fail on various conditions like
{code}
custom_cluster/test_query_log.py:452: in test_query_log_table_query_select_mt_dop
    "impala-server.completed-queries.written", 1, 60)
common/impala_service.py:144: in wait_for_metric_value
    self.__metric_timeout_assert(metric_name, expected_value, timeout)
common/impala_service.py:213: in __metric_timeout_assert
    assert 0, assert_string
E   AssertionError: Metric impala-server.completed-queries.written did not reach value 1 in 60s.
E   Dumping debug webpages in JSON format...
E   Dumped memz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json
E   Dumped metrics JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json
E   Dumped queries JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json
E   Dumped sessions JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json
E   Dumped threadz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json
E   Dumped rpcz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json
E   Dumping minidumps for impalads/catalogds...
E   Dumped minidump for Impalad PID 3680802
E   Dumped minidump for Impalad PID 3680805
E   Dumped minidump for Impalad PID 3680809
E   Dumped minidump for Catalogd PID 3680732
{code}
or
{code}
custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls
    assert len(sql_results.data) == 1, "query not found in completed queries table"
E   AssertionError: query not found in completed queries table
E   assert 0 == 1
E    +  where 0 = len([])
E    +    where [] = <tests.common.impala_connection.ImpylaHS2ResultSet object at 0xffffa00cc350>.data
{code}

One symptom that seems related to this is INSERT operations into sys.impala_query_log that start "UnregisterQuery()" but never finish (with "Query successfully unregistered").

We can identify cases like that with
{code}
for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo $log; for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; done; done
{code}
A similar case may occur with creating the table too
{code}
for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*); do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query successfully unregistered: query_id=$QID" $log; done
{code}
although these frequently fail because the test completes and shuts down Impala before the CREATE TABLE query completes.

Tracking one of those cases led to catalogd errors that repeated for 1m27s before the test suite restarted catalogd:
{code}
W0410 12:48:05.051760 3681790 Tasks.java:456] 6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for lock on table sys.impala_query_log
Java exception follows:
org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for lock on table sys.impala_query_log
	at org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217)
	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
	at org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209)
	at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146)
	at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194)
	at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
	at org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417)
	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
	at org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413)
	at org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308)
	at org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220)
	at org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939)
	at org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504)
	at org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
	at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
	at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
	at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100)
	at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245)
	at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259)
	at org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503)
{code}
This seems to correspond to the first test failure and persisted through all tests afterwards. So this lock was likely acquired from HMS, then never released.

> test_query_log tests get stuck trying to write to the log
> ---------------------------------------------------------
>
>                 Key: IMPALA-12997
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12997
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 4.4.0
>            Reporter: Michael Smith
>            Priority: Major
>
> In some test runs, most tests under test_query_log will start to fail on various conditions like
> {code}
> custom_cluster/test_query_log.py:452: in test_query_log_table_query_select_mt_dop
>     "impala-server.completed-queries.written", 1, 60)
> common/impala_service.py:144: in wait_for_metric_value
>     self.__metric_timeout_assert(metric_name, expected_value, timeout)
> common/impala_service.py:213: in __metric_timeout_assert
>     assert 0, assert_string
> E   AssertionError: Metric impala-server.completed-queries.written did not reach value 1 in 60s.
> E   Dumping debug webpages in JSON format...
> E   Dumped memz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json
> E   Dumped metrics JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json
> E   Dumped queries JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json
> E   Dumped sessions JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json
> E   Dumped threadz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json
> E   Dumped rpcz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json
> E   Dumping minidumps for impalads/catalogds...
> E   Dumped minidump for Impalad PID 3680802
> E   Dumped minidump for Impalad PID 3680805
> E   Dumped minidump for Impalad PID 3680809
> E   Dumped minidump for Catalogd PID 3680732
> {code}
> or
> {code}
> custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls
>     assert len(sql_results.data) == 1, "query not found in completed queries table"
> E   AssertionError: query not found in completed queries table
> E   assert 0 == 1
> E    +  where 0 = len([])
> E    +    where [] = <tests.common.impala_connection.ImpylaHS2ResultSet object at 0xffffa00cc350>.data
> {code}
> One symptom that seems related to this is INSERT operations into sys.impala_query_log that start "UnregisterQuery()" but never finish (with "Query successfully unregistered").
> We can identify cases like that with
> {code}
> for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo $log; for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; done; done
> {code}
> A similar case may occur with creating the table too
> {code}
> for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*); do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query successfully unregistered: query_id=$QID" $log; done
> {code}
> although these frequently fail because the test completes and shuts down Impala before the CREATE TABLE query completes.
> Tracking one of those cases led to catalogd errors that repeated for 1m27s before the test suite restarted catalogd:
> {code}
> W0410 12:48:05.051760 3681790 Tasks.java:456] 6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for lock on table sys.impala_query_log
> Java exception follows:
> org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for lock on table sys.impala_query_log
> 	at org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217)
> 	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
> 	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
> 	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
> 	at org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209)
> 	at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146)
> 	at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194)
> 	at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
> 	at org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417)
> 	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
> 	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
> 	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
> 	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
> 	at org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413)
> 	at org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308)
> 	at org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220)
> 	at org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939)
> 	at org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504)
> 	at org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
> 	at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
> 	at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
> 	at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100)
> 	at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245)
> 	at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259)
> 	at org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503)
> {code}
> This seems to correspond to the first test failure and persisted through all tests afterwards. So this lock was likely acquired from HMS, then never released.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org