Posted to issues@spark.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/12/11 08:53:00 UTC

[jira] [Commented] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set

    [ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716580#comment-16716580 ] 

ASF GitHub Bot commented on SPARK-26327:
----------------------------------------

xuanyuanking commented on a change in pull request #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly
URL: https://github.com/apache/spark/pull/23277#discussion_r240517252
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
 ##########
 @@ -167,19 +167,14 @@ case class FileSourceScanExec(
       partitionSchema = relation.partitionSchema,
       relation.sparkSession.sessionState.conf)
 
+  private var fileListingTime = 0L
+
   @transient private lazy val selectedPartitions: Seq[PartitionDirectory] = {
     val optimizerMetadataTimeNs = relation.location.metadataOpsTimeNs.getOrElse(0L)
     val startTime = System.nanoTime()
     val ret = relation.location.listFiles(partitionFilters, dataFilters)
     val timeTakenMs = ((System.nanoTime() - startTime) + optimizerMetadataTimeNs) / 1000 / 1000
-
-    metrics("numFiles").add(ret.map(_.files.size.toLong).sum)
-    metrics("metadataTime").add(timeTakenMs)
-
-    val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
-    SQLMetrics.postDriverMetricUpdates(sparkContext, executionId,
-      metrics("numFiles") :: metrics("metadataTime") :: Nil)
-
 
 Review comment:
   All versions after 2.2.0 have this bug; this PR fixes the corner case (which in fact covers every case that goes through SQL) in the commit. If we test with:
   ```
   spark.read.parquet($filepath)
   ```
   the metrics show correctly, because `relation.partitionSchemaOption` is None and the initialization of `selectedPartitions` is not triggered in the wrong place. But if we test as in the added UT:
   ```
   sql("select * from table partition=xx")
   ```
   the metrics go wrong.
   I updated the JIRA description and title to be more detailed; please help me check whether the explanation is clear. Thanks :)
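   
   To make the two paths concrete, here is a minimal, hedged sketch (the path, table name, and schema are placeholders, not taken from the PR) contrasting the path that reports the metrics correctly with the one that drops them:
   ```
   // Hypothetical names; illustrative only, assumes a spark-shell session.
   // Path 1: non-partitioned read. relation.partitionSchemaOption is None,
   // so rendering the plan's metadata does not force selectedPartitions,
   // and numFiles/metadataTime are posted while the execution is live.
   Seq(1, 2, 3).toDF("v").write.parquet("/tmp/demo_parquet")
   spark.read.parquet("/tmp/demo_parquet").collect()

   // Path 2: partitioned table read through SQL, as in the added UT.
   // Rendering queryExecution.toString forces selectedPartitions before
   // SQLAppStatusListener has a live execution, so the driver-side
   // metric update is dropped and numFiles/metadataTime stay at 0.
   spark.sql("CREATE TABLE part_t (v INT, p INT) USING parquet PARTITIONED BY (p)")
   spark.sql("INSERT INTO part_t VALUES (42, 1)")
   spark.sql("SELECT * FROM part_t WHERE p = 1").collect()
   ```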



> Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-26327
>                 URL: https://issues.apache.org/jira/browse/SPARK-26327
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Yuanjian Li
>            Priority: Major
>
> In the current approach in `FileSourceScanExec`, the metrics "numFiles" and "metadataTime" (file listing time) are updated when the lazy val `selectedPartitions` is initialized, in the scenario where relation.partitionSchema is set. But `selectedPartitions` is first initialized via `metadata`, which is called by `queryExecution.toString` inside `SQLExecution.withNewExecutionId`, before the execution is registered with the listener. So when `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding entry in SQLAppStatusListener's liveExecutions, and the metrics update does not take effect.
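>
> The failure mode is ordinary Scala lazy-val initialization ordering. The self-contained sketch below is a hypothetical illustration in plain Scala (LazyMetricsDemo and its members are invented stand-ins, not Spark's real classes): rendering the plan string forces the lazy val, firing its metric side effect before any execution is registered.
> ```
> object LazyMetricsDemo {
>   // Stand-in for SQLAppStatusListener's liveExecutions state.
>   var liveExecutions = Set.empty[String]
>
>   def post(metric: String): Unit =
>     if (liveExecutions.isEmpty) println(s"dropped $metric: no live execution")
>     else println(s"recorded $metric")
>
>   class Scan {
>     lazy val selectedPartitions: Seq[String] = {
>       post("numFiles") // side effect runs only on first access
>       Seq("p=1")
>     }
>     def metadata: String = s"partitions=${selectedPartitions.size}"
>     override def toString: String = metadata // toString forces the lazy val
>   }
>
>   def main(args: Array[String]): Unit = {
>     val scan = new Scan
>     println(scan.toString)      // plan string rendered for the start event: metric is dropped
>     liveExecutions += "exec-1"  // the execution only becomes live afterwards
>     scan.selectedPartitions     // already initialized, so nothing is re-posted
>   }
> }
> ```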


