Posted to issues@spark.apache.org by "Yuanjian Li (JIRA)" <ji...@apache.org> on 2018/12/05 16:40:00 UTC

[jira] [Commented] (SPARK-26222) Scan: track file listing time

    [ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710303#comment-16710303 ] 

Yuanjian Li commented on SPARK-26222:
-------------------------------------

Some thoughts for further discussion:
 * File listing duration is currently tracked in one place, `FileSourceScanExec`, under the metric name `metadataTime` (arguably an inaccurate name; it should be renamed to file listing time). We should add phase tracking there.
 * We should also add duration and phase tracking in these 2 places:
 ** Schema inference in HiveMetastoreCatalog.
 ** replaceTableScanWithPartitionMetadata in the OptimizeMetadataOnlyQuery rule.
 * IIUC, the phase tracking can use `QueryPlanningTracker` directly, because it is thread-local and is passed through every `RuleExecution`.
 * As for the meaning of listing time, maybe we can define it as covering only listings that do not come from the cache, because loading from the cache is not the 'heavy' operation we want to track and it also takes less time. Listing time would then include not only the first call to `listFiles`, but also every listing done after the cache is refreshed.
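
To make the last two points concrete, here is a minimal, language-neutral sketch in Python. It is not Spark code: `PhaseTracker`, `current_tracker`, `CachedFileIndex`, and the phase name `fileListing` are all hypothetical names made up for illustration. It only shows the idea of a per-thread tracker that records a start time and a duration for each phase, and a file index that records a listing phase only when the listing does not come from the cache.

```python
import threading
import time
from contextlib import contextmanager

# Hypothetical per-thread storage; stands in for a thread-local tracker.
_local = threading.local()


class PhaseTracker:
    """Records (name, wall-clock start in seconds, duration in seconds)."""

    def __init__(self):
        self.phases = []

    @contextmanager
    def measure(self, name):
        start_wall = time.time()      # wall-clock start, usable for a timeline
        start = time.perf_counter()   # monotonic clock, used for the duration
        try:
            yield
        finally:
            self.phases.append((name, start_wall, time.perf_counter() - start))


def current_tracker():
    """Return the tracker bound to the calling thread, creating it if needed."""
    tracker = getattr(_local, "tracker", None)
    if tracker is None:
        tracker = _local.tracker = PhaseTracker()
    return tracker


class CachedFileIndex:
    """Hypothetical file index that caches the result of a listing function."""

    def __init__(self, list_fn):
        self._list_fn = list_fn
        self._cache = None

    def refresh(self):
        # Drop the cache so that the next list_files() call lists again.
        self._cache = None

    def list_files(self):
        if self._cache is None:
            # Only a real listing (first call, or first call after a refresh)
            # is recorded as listing time; cache hits are not.
            with current_tracker().measure("fileListing"):
                self._cache = list(self._list_fn())
        return self._cache
```

With this definition, calling `list_files()` twice in a row records one phase, and calling `refresh()` followed by `list_files()` records another one. The end time of a phase is the recorded start plus the duration, which is enough to build a timeline.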

> Scan: track file listing time
> -----------------------------
>
>                 Key: SPARK-26222
>                 URL: https://issues.apache.org/jira/browse/SPARK-26222
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Reynold Xin
>            Priority: Major
>
> We should track file listing time and add it to the scan node's SQL metric, so we have visibility into how much time is spent in file listing. It'd be useful to track not just the duration, but also the start and end times, so we can construct a timeline.
> This requires a little bit of design to define what file listing time means when we are reading from the cache vs. not from the cache.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
