You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2023/04/18 08:17:00 UTC
[jira] [Assigned] (IMPALA-6828) Expose more detailed info in profile for REFRESH
[ https://issues.apache.org/jira/browse/IMPALA-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Quanlong Huang reassigned IMPALA-6828:
--------------------------------------
Assignee: Quanlong Huang
> Expose more detailed info in profile for REFRESH
> ------------------------------------------------
>
> Key: IMPALA-6828
> URL: https://issues.apache.org/jira/browse/IMPALA-6828
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: John Russell
> Assignee: Quanlong Huang
> Priority: Major
> Labels: supportability
>
> Sometimes REFRESH is an expensive operation, and and I've heard anecdotally that it is sometimes run more often than necessary. I.e. a developer has no real way to know when it's worth doing a REFRESH, so they do it before every query, or in a loop for every table, or every N minutes.
> I suggest beefing up the info displayed in the profile after a REFRESH is done. The idea being that if there was quantifiable info about how expensive the operation was, and whether a particular REFRESH was needed or was a no-op, that information could be used by cluster management software - e.g. the management software could have charts showing how much namenode load, network traffic, etc. was due to REFRESH; show what percentage of REFRESH was wasted effort because there was no new data; display a dashboard of tables showing which ones needed to be REFRESHed and letting that be handled by an administrator rather than hardcoded into SQL scripts or Impala applications.
> Here's what I get if I side-load some HDFS files into a table, REFRESH, and look at the profile:
> {code}
> [localhost:21000] > select count(*) from t1;
> +----------+
> | count(*) |
> +----------+
> | 0 |
> +----------+
> [localhost:21000] > refresh t1;
> [localhost:21000] > summary;
> Summary not available
> [localhost:21000] > profile;
> Query Runtime Profile:
> Query (id=c0451cde0d827b96:e43381400000000):
> Summary:
> Session ID: 4f498d223ade18b1:893649710434fdb0
> Session Type: BEESWAX
> Start Time: 2018-04-09 15:17:48.116499000
> End Time: 2018-04-09 15:17:48.209089000
> Query Type: DDL
> Query State: FINISHED
> Query Status: OK
> Impala Version: impalad version 2.7.0-cdh5.9.0 RELEASE (build 4b4cf1936bd6cdf34fda5e2f32827e7d60c07a9c)
> User: jrussell
> Connected User: jrussell
> Delegated User:
> Network Address: ::1:44170
> Default Db: scratch
> Sql Statement: refresh t1
> Coordinator: mumblemumble:22000
> Query Options (non default):
> DDL Type: RESET_METADATA
> : 0.000ns
> Query Timeline: 94.081ms
> - Start execution: 278.360us (278.360us)
> - Planning finished: 1.209ms (931.528us)
> - Request finished: 90.333ms (89.123ms)
> - First row fetched: 91.762ms (1.429ms)
> - Unregister query: 92.510ms (747.664us)
> ImpalaServer:
> - CatalogOpExecTimer: 79.113ms
> - ClientFetchWaitTimer: 2.125ms
> - RowMaterializationTimer: 0.000ns
> [localhost:21000] > select count(*) from t1;
> +----------+
> | count(*) |
> +----------+
> | 16777216 |
> +----------+
> {code}
> Instead of just the times for phases of execution, I suggest:
> * How much memory used on each host.
> * How many requests sent to namenode and/or Impala executors.
> * Overall network traffic for the whole operation.
> * Any other statistics about significant resources involved in doing REFRESH.
> * Number of new, changed, or removed files detected. For the whole table, and per partition.
> * Some obvious indication if the operation was a complete no-op, or if it was a no-op for N out of M partitions. I.e. to provide feedback that the operation wasn't needed at all, or could have been targeted better at only changed partitions.
> cc: [~alex.behm], [~dtsirogiannis]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org