Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/11/04 03:42:00 UTC

[jira] [Commented] (IMPALA-9842) TestValidateMetrics.test_metrics_are_zero fails with num-fragments-in-flight not reaching zero

    [ https://issues.apache.org/jira/browse/IMPALA-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225823#comment-17225823 ] 

ASF subversion and git services commented on IMPALA-9842:
---------------------------------------------------------

Commit 278d3d22729f6ca8aecab2edfc95aa996e8b32ee in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=278d3d2 ]

IMPALA-9864: Produce a minidump when TestValidateMetrics fails

After running end-to-end tests, run-tests.py runs verifiers to
check that a set of metrics has returned to zero. When this fails,
it can indicate a hung query fragment or other resource leak (see
IMPALA-9842 for an example). To track this down, it is useful to
have a minidump, so this adds a step that has every Impalad/Catalogd
generate a minidump (by sending SIGUSR1) when we hit the timeout.
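
A rough sketch of what that step does (illustrative only, not the
actual impala_service.py code; discovering PIDs via pgrep is an
assumption made for this sketch):

    # Sketch: ask every running impalad/catalogd to write a minidump by
    # sending SIGUSR1, which Impala's breakpad handler turns into a dump.
    import os
    import signal
    import subprocess

    def dump_minidumps(daemon_names=("impalad", "catalogd")):
        for name in daemon_names:
            # pgrep prints one matching PID per line; empty output means no match.
            pids = subprocess.run(["pgrep", name], capture_output=True,
                                  text=True).stdout.split()
            for pid in pids:
                os.kill(int(pid), signal.SIGUSR1)
                print("Dumped minidump for %s PID %s" % (name.capitalize(), pid))

    if __name__ == "__main__":
        dump_minidumps()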

Also, the current error message dumps a bunch of unformatted
JSON from our Web UI, which is hard to read and painful to
cut/paste. This change instead dumps that JSON to files in a
diagnostic directory under the logs directory, formatted in a
readable way. These files are preserved along with the rest of
the logs directory for automated runs.

The new error message looks like this:
E   AssertionError: Metric impala-server.num-queries-registered did not reach value 0 in 60s.
E   Dumping debug webpages in JSON format...
E   Dumped memz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20201103_13:51:02/json/memz.json
E   Dumped metrics JSON to $IMPALA_HOME/logs/metric_timeout_diags_20201103_13:51:02/json/metrics.json
E   Dumped queries JSON to $IMPALA_HOME/logs/metric_timeout_diags_20201103_13:51:02/json/queries.json
E   Dumped sessions JSON to $IMPALA_HOME/logs/metric_timeout_diags_20201103_13:51:02/json/sessions.json
E   Dumped threadz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20201103_13:51:02/json/threadz.json
E   Dumped rpcz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20201103_13:51:02/json/rpcz.json
E   Dumping minidumps for impalads/catalogds...
E   Dumped minidump for Impalad PID 2709
E   Dumped minidump for Impalad PID 2714
E   Dumped minidump for Impalad PID 2721
E   Dumped minidump for Catalogd PID 2627
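
The dumping itself is conceptually simple; an illustrative sketch (not
the actual run-tests.py/impala_service.py code; the webserver port and
the "?json" parameter are assumptions made for this sketch):

    # Sketch: fetch each debug webpage as JSON and write it pretty-printed
    # into a timestamped diagnostics directory under the logs directory.
    import json
    import os
    import time
    import urllib.request

    PAGES = ["memz", "metrics", "queries", "sessions", "threadz", "rpcz"]

    def dump_debug_pages(base_url="http://localhost:25000",
                         log_dir=os.path.join(os.environ.get("IMPALA_HOME", "."), "logs")):
        diag_dir = os.path.join(
            log_dir, "metric_timeout_diags_" + time.strftime("%Y%m%d_%H:%M:%S"), "json")
        os.makedirs(diag_dir, exist_ok=True)
        for page in PAGES:
            # Assumption: the debug webserver returns JSON when "?json" is appended.
            with urllib.request.urlopen("%s/%s?json" % (base_url, page)) as resp:
                data = json.loads(resp.read())
            path = os.path.join(diag_dir, page + ".json")
            with open(path, "w") as out:
                json.dump(data, out, indent=2, sort_keys=True)
            print("Dumped %s JSON to %s" % (page, path))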

This also fixes various flake8 errors (unnecessary imports, etc), so
now impala_service.py is flake8 clean.

Testing:
 - Tried out the dump function on my developer machine
 - Verified the minidumps exist
 - Verified the JSON is readable

Change-Id: I16d26052d0664ee0b115e3611cd96047d8ada19d
Reviewed-on: http://gerrit.cloudera.org:8080/16690
Reviewed-by: Qifan Chen <qc...@cloudera.com>
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> TestValidateMetrics.test_metrics_are_zero fails with num-fragments-in-flight not reaching zero
> ----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-9842
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9842
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>              Labels: broken-build, flaky
>             Fix For: Impala 4.0
>
>
> A couple of test runs have failed with a TestValidateMetrics.test_metrics_are_zero failure: 
> {noformat}
> AssertionError: Metric value impala-server.num-fragments-in-flight did not reach value 0 in 60s Dumping impalad debug pages: ...{noformat}
> The memz shows: 
> {noformat}
> Process: Limit=12.00 GB Total=2.93 GB Peak=138.77 GB
>   JVM: max heap size: Total=1.78 GB
>   JVM: non-heap committed: Total=211.75 MB
>   Buffer Pool: Free Buffers: Total=1.90 MB
>   Buffer Pool: Clean Pages: Total=0  
>   Buffer Pool: Unused Reservation: Total=-2.00 MB
>   Control Service Queue: Limit=50.00 MB Total=0 Peak=16.47 MB
>   Data Stream Service Queue: Limit=614.40 MB Total=0 Peak=636.91 MB
>   Data Stream Manager Early RPCs: Total=0 Peak=684.53 KB
>   TCMalloc Overhead: Total=347.15 MB
>   Spill-to-disk temporary compression buffers: Limit=512.00 MB Total=0 Peak=110.43 MB
>   RequestPool=root.default: Total=4.02 MB Peak=136.23 GB
>     Query(074a86ef88e16768:1a7bf9f400000000): Limit=300.00 MB Reservation=4.00 MB Reserva$
>       Fragment 074a86ef88e16768:1a7bf9f400000000: Reservation=4.00 MB OtherMemory=16.00 K$
>         HDFS_SCAN_NODE (id=0): Reservation=0 OtherMemory=0 Total=0 Peak=830.00 KB
>         PLAN_ROOT_SINK: Reservation=4.00 MB ReservationLimit=100.00 MB OtherMemory=8.00 K$
>   RequestPool=fe-eval-exprs: Total=0 Peak=256.00 MB
>   RequestPool=root.no-limits: Total=0 Peak=188.88 MB
>   Untracked Memory: Total=616.14 MB
> {noformat}
> Query 074a86ef88e16768:1a7bf9f4 is: 
> {noformat}
> I0608 19:09:08.410117 25628 Frontend.java:1498] 074a86ef88e16768:1a7bf9f400000000] Analyzing query: select * from functional.alltypes limit 10 db: default{noformat}
> The interesting query options are debug_action and fetch_rows_timeout_ms. These narrow it down to hs2/test_fetch_timeout.py ([https://github.com/apache/impala/blob/master/tests/hs2/test_fetch_timeout.py#L207-L215]):
> {noformat}
>   11: debug_action (string) = "CRS_BEFORE_COORD_STARTS:SLEEP@5000",
> ...
>   94: fetch_rows_timeout_ms (i64) = 0,{noformat}
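> A minimal, hypothetical reproduction sketch (not the actual test code in test_fetch_timeout.py; the impyla client and HS2 port 21050 are assumptions), setting the same two options and running the same statement:
> {noformat}
> # Hypothetical sketch using impyla; the real test drives HS2 directly.
> from impala.dbapi import connect
> conn = connect(host="localhost", port=21050)
> cur = conn.cursor()
> # Delay coordinator startup by 5s and disable the fetch-rows timeout,
> # mirroring the query options above.
> cur.execute("SET debug_action='CRS_BEFORE_COORD_STARTS:SLEEP@5000'")
> cur.execute("SET fetch_rows_timeout_ms=0")
> cur.execute("select * from functional.alltypes limit 10")
> print(cur.fetchmany(10))
> {noformat}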
>  It is executing only at the coordinator: 
> {noformat}
> I0608 19:09:08.430388 25637 scheduler.cc:549] 074a86ef88e16768:1a7bf9f400000000] Exec at coord is true
> I0608 19:09:08.430755 25637 admission-controller.cc:1296] 074a86ef88e16768:1a7bf9f400000000] Trying to admit id=074a86ef88e16768:1a7bf9f400000000 in pool_name=root.default executor_group_name=default per_host_mem_estimate=20.00 MB dedicated_coord_mem_estimate=120.00 MB max_requests=-1 (configured statically) max_queued=200 (configured statically) max_mem=29.30 GB (configured statically)
> I0608 19:09:08.430793 25637 admission-controller.cc:1308] 074a86ef88e16768:1a7bf9f400000000] Stats: agg_num_running=10, agg_num_queued=0, agg_mem_reserved=3.79 GB,  local_host(local_mem_admitted=4.67 GB, num_admitted_running=10, num_queued=0, backend_mem_reserved=2.24 GB)
> I0608 19:09:08.430811 25637 admission-controller.cc:894] 074a86ef88e16768:1a7bf9f400000000] Admitting query id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:08.430881 25637 debug-util.cc:427] 074a86ef88e16768:1a7bf9f400000000] Debug Action: CRS_BEFORE_COORD_STARTS:SLEEP@5000 sleeping for 5000 ms{noformat}
>  The query starts execution: 
> {noformat}
> I0608 19:09:13.431097 25637 coordinator.cc:143] 074a86ef88e16768:1a7bf9f400000000] Exec() query_id=074a86ef88e16768:1a7bf9f400000000 stmt=select * from functional.alltypes limit 10
> I0608 19:09:13.431520 25637 coordinator.cc:463] 074a86ef88e16768:1a7bf9f400000000] starting execution on 1 backends for query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.432674   114 control-service.cc:153] 074a86ef88e16768:1a7bf9f400000000] ExecQueryFInstances(): query_id=074a86ef88e16768:1a7bf9f400000000 coord=f0873f7637c8:22000 #instances=1
> ...
> I0608 19:09:13.433480 25637 coordinator.cc:520] 074a86ef88e16768:1a7bf9f400000000] started execution on 1 backends for query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.434790 26254 query-state.cc:756] 074a86ef88e16768:1a7bf9f400000000] Executing instance. instance_id=074a86ef88e16768:1a7bf9f400000000 fragment_idx=0 per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=11{noformat}
> The query is unregistered: 
> {noformat}
> I0608 19:09:13.448494 25628 impala-server.cc:1173] UnregisterQuery(): query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.448557 25628 coordinator.cc:644] ExecState: query id=074a86ef88e16768:1a7bf9f400000000 execution cancelled
> I0608 19:09:13.448580 25628 coordinator-backend-state.cc:943] query_id=074a86ef88e16768:1a7bf9f400000000 target backend=172.18.0.4:27000: Sending CancelQueryFInstances rpc
> I0608 19:09:13.448971   114 control-service.cc:223] CancelQueryFInstances(): query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.448995   114 query-exec-mgr.cc:105] QueryState: query_id=074a86ef88e16768:1a7bf9f400000000 refcnt=4
> I0608 19:09:13.449007   114 query-state.cc:782] Cancel: query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.449023   114 krpc-data-stream-mgr.cc:337] cancelling active streams for fragment_instance_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.449226 25628 coordinator.cc:889] CancelBackends() query_id=074a86ef88e16768:1a7bf9f400000000, tried to cancel 1 backends
> I0608 19:09:13.449291 25628 coordinator.cc:1225] Release admission control resources for query_id=074a86ef88e16768:1a7bf9f400000000{noformat}
> After that, periodic reports keep coming in for 074a86ef88e16768:1a7bf9f400000000, and they all return that it is an invalid query. 
> The original query starts at 19:09:08, while the TestValidateMetrics.test_metrics_are_zero check for remaining fragments happens over an hour later, at 20:14:08.
> Interestingly, this has happened on different runs *with the exact same query*.


