You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2023/04/18 08:17:00 UTC

[jira] [Assigned] (IMPALA-6828) Expose more detailed info in profile for REFRESH

     [ https://issues.apache.org/jira/browse/IMPALA-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Quanlong Huang reassigned IMPALA-6828:
--------------------------------------

    Assignee: Quanlong Huang

> Expose more detailed info in profile for REFRESH
> ------------------------------------------------
>
>                 Key: IMPALA-6828
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6828
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: John Russell
>            Assignee: Quanlong Huang
>            Priority: Major
>              Labels: supportability
>
> Sometimes REFRESH is an expensive operation, and and I've heard anecdotally that it is sometimes run more often than necessary. I.e. a developer has no real way to know when it's worth doing a REFRESH, so they do it before every query, or in a loop for every table, or every N minutes.
> I suggest beefing up the info displayed in the profile after a REFRESH is done. The idea being that if there was quantifiable info about how expensive the operation was, and whether a particular REFRESH was needed or was a no-op, that information could be used by cluster management software - e.g. the management software could have charts showing how much namenode load, network traffic, etc. was due to REFRESH; show what percentage of REFRESH was wasted effort because there was no new data; display a dashboard of tables showing which ones needed to be REFRESHed and letting that be handled by an administrator rather than hardcoded into SQL scripts or Impala applications.
> Here's what I get if I side-load some HDFS files into a table, REFRESH, and look at the profile:
> {code}
> [localhost:21000] > select count(*) from t1;
> +----------+
> | count(*) |
> +----------+
> | 0        |
> +----------+
> [localhost:21000] > refresh t1;
> [localhost:21000] > summary;
> Summary not available
> [localhost:21000] > profile;
> Query Runtime Profile:
> Query (id=c0451cde0d827b96:e43381400000000):
>   Summary:
>     Session ID: 4f498d223ade18b1:893649710434fdb0
>     Session Type: BEESWAX
>     Start Time: 2018-04-09 15:17:48.116499000
>     End Time: 2018-04-09 15:17:48.209089000
>     Query Type: DDL
>     Query State: FINISHED
>     Query Status: OK
>     Impala Version: impalad version 2.7.0-cdh5.9.0 RELEASE (build 4b4cf1936bd6cdf34fda5e2f32827e7d60c07a9c)
>     User: jrussell
>     Connected User: jrussell
>     Delegated User: 
>     Network Address: ::1:44170
>     Default Db: scratch
>     Sql Statement: refresh t1
>     Coordinator: mumblemumble:22000
>     Query Options (non default): 
>     DDL Type: RESET_METADATA
>     : 0.000ns
>     Query Timeline: 94.081ms
>        - Start execution: 278.360us (278.360us)
>        - Planning finished: 1.209ms (931.528us)
>        - Request finished: 90.333ms (89.123ms)
>        - First row fetched: 91.762ms (1.429ms)
>        - Unregister query: 92.510ms (747.664us)
>   ImpalaServer:
>      - CatalogOpExecTimer: 79.113ms
>      - ClientFetchWaitTimer: 2.125ms
>      - RowMaterializationTimer: 0.000ns
> [localhost:21000] > select count(*) from t1;
> +----------+
> | count(*) |
> +----------+
> | 16777216 |
> +----------+
> {code}
> Instead of just the times for phases of execution, I suggest:
> * How much memory used on each host.
> * How many requests sent to namenode and/or Impala executors.
> * Overall network traffic for the whole operation.
> * Any other statistics about significant resources involved in doing REFRESH.
> * Number of new, changed, or removed files detected. For the whole table, and per partition.
> * Some obvious indication if the operation was a complete no-op, or if it was a no-op for N out of M partitions. I.e. to provide feedback that the operation wasn't needed at all, or could have been targeted better at only changed partitions.
> cc: [~alex.behm], [~dtsirogiannis]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org