Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2018/12/01 04:32:00 UTC
[jira] [Commented] (SPARK-26221) Improve Spark SQL instrumentation and metrics
[ https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705626#comment-16705626 ]
Apache Spark commented on SPARK-26221:
--------------------------------------
User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23192
> Improve Spark SQL instrumentation and metrics
> ---------------------------------------------
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics and instrumentation. Some thoughts:
>
> Differentiate query plan that’s writing data out, vs returning data to the driver
> * I.e. ETL & report generation vs interactive analysis
> * This is related to the data sink item below. We need to be able to tell from the query plan what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
> * Write time
> * Number of records written
> * Size of output written
> * Number of partitions modified
> * Metastore update time
> * Also track number of records for collect / limit
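The data sink metrics above could be collected by a per-operator accumulator. A minimal Python sketch (names like `WriteMetrics` and `record_write` are purely illustrative and not Spark's actual SQLMetrics API):

```python
class WriteMetrics:
    """Illustrative accumulator for the proposed data-sink metrics:
    write time, records written, bytes written, and partitions modified."""

    def __init__(self):
        self.write_time_ns = 0
        self.records_written = 0
        self.bytes_written = 0
        self.partitions_modified = set()
        self.metastore_update_time_ns = 0

    def record_write(self, partition, record_bytes, elapsed_ns):
        # Called once per record written by the hypothetical sink operator.
        self.write_time_ns += elapsed_ns
        self.records_written += 1
        self.bytes_written += record_bytes
        self.partitions_modified.add(partition)
```

In a real operator these counters would be task-local and merged on the driver, the way Spark aggregates per-task metrics today.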
> Scan
> * Track file listing time (start and end so we can construct timeline, not just duration)
> * Track metastore operation time
> * Track IO decoding time for row-based input sources; Need to make sure overhead is low
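The point about recording start and end timestamps (rather than just a duration) matters because durations alone cannot be placed on a timeline. A sketch of what capturing both would look like, where `list_fn` stands in for the actual file-listing call:

```python
import time

def timed_listing(list_fn):
    """Run a file-listing operation and capture absolute start/end
    timestamps so a timeline can be reconstructed afterwards."""
    start = time.time()
    result = list_fn()
    end = time.time()
    # Both endpoints are kept; (end - start) alone would lose ordering
    # information when multiple listings overlap.
    return result, {"listing_start": start, "listing_end": end}
```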
> Shuffle
> * Track read time and write time
> * Decide if we can measure serialization and deserialization
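On measuring serialization and deserialization: one way to decide feasibility is to wrap the serializer and accumulate elapsed time per call. A hypothetical sketch (it wraps `pickle` only for illustration; a real shuffle serializer would likely need to sample rather than time every record to keep overhead low):

```python
import pickle
import time

class TimedSerializer:
    """Wraps a serializer and accumulates time spent serializing
    and deserializing, as a feasibility sketch for shuffle metrics."""

    def __init__(self):
        self.serialize_ns = 0
        self.deserialize_ns = 0

    def dumps(self, obj):
        t0 = time.perf_counter_ns()
        data = pickle.dumps(obj)
        self.serialize_ns += time.perf_counter_ns() - t0
        return data

    def loads(self, data):
        t0 = time.perf_counter_ns()
        obj = pickle.loads(data)
        self.deserialize_ns += time.perf_counter_ns() - t0
        return obj
```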
> Client fetch time
> * Sometimes a query takes long to run because it is blocked on the client fetching results (e.g. through a result iterator). Record the time blocked on the client so we can exclude it when measuring query execution time.
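The client-blocked time described above is the gap between handing a row back and the client's next request. A sketch of an iterator wrapper that accumulates that gap (class and field names are hypothetical):

```python
import time

class ClientBlockedIterator:
    """Wraps a result iterator and accumulates the time spent waiting
    for the client between next() calls. Subtracting blocked_ns from
    wall-clock time approximates actual execution time."""

    def __init__(self, rows):
        self._it = iter(rows)
        self._returned_at = None  # when we last handed a row to the client
        self.blocked_ns = 0

    def __iter__(self):
        return self

    def __next__(self):
        now = time.perf_counter_ns()
        if self._returned_at is not None:
            # Time since the previous row was returned = client think time.
            self.blocked_ns += now - self._returned_at
        row = next(self._it)  # propagates StopIteration at the end
        self._returned_at = time.perf_counter_ns()
        return row
```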
> Make it easy to correlate a query with the jobs, stages, and tasks it spawns, e.g. dump the execution id in task logs?
> Better logging:
> * Enable logging the query execution id and TID in executor logs, and query execution id in driver logs.
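Tagging every log line with the execution id could work like an MDC on the JVM. A Python sketch of the same idea using a logging filter (the `execution_id` field name is illustrative):

```python
import logging

class ExecutionIdFilter(logging.Filter):
    """Injects a query execution id into every log record so driver
    and executor log lines can be correlated back to one query."""

    def __init__(self, execution_id):
        super().__init__()
        self.execution_id = execution_id

    def filter(self, record):
        record.execution_id = self.execution_id
        return True
```

Usage: attach the filter to a logger and reference `%(execution_id)s` in the handler's format string, so every line from that logger carries the id without each call site having to pass it.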
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org