Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2018/10/05 02:04:39 UTC

[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of by...

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22594#discussion_r222875551
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -104,12 +104,14 @@ class FileScanRDD(
             val nextElement = currentIterator.next()
             // TODO: we should have a better separation of row based and batch based scan, so that we
             // don't need to run this `if` for every record.
    +        val preNumRecordsRead = inputMetrics.recordsRead
             if (nextElement.isInstanceOf[ColumnarBatch]) {
               inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
             } else {
               inputMetrics.incRecordsRead(1)
             }
    -        if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
    --- End diff --
    
    The original goal here is to avoid updating the metric for every record, because doing so is too expensive. I am not sure what the goal of your change is. Could you try writing a test case in SQLMetricsSuite?
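    
    (A minimal illustrative sketch, not code from the PR: it shows why an exact-modulo check can be skipped once records are counted per ColumnarBatch, and how comparing the counter before and after the increment detects the interval crossing instead. The object and method names below are hypothetical, and the constant merely stands in for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS.)
    
        object MetricsIntervalSketch {
          // Stand-in for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
          val UpdateInterval = 1000L
    
          // Exact-modulo check: with a batch of, say, 4096 rows the counter can jump
          // from 999 to 5095 without ever equalling a multiple of 1000, so no update fires.
          def shouldUpdateExact(recordsRead: Long): Boolean =
            recordsRead % UpdateInterval == 0
    
          // Boundary-crossing check: fires whenever the increment (one row or a whole
          // batch) moves the counter past one or more interval boundaries.
          def shouldUpdateCrossing(before: Long, after: Long): Boolean =
            before / UpdateInterval != after / UpdateInterval
    
          def main(args: Array[String]): Unit = {
            println(shouldUpdateExact(5095L))           // false: the boundary was skipped
            println(shouldUpdateCrossing(999L, 5095L))  // true: boundaries 1000..5000 were crossed
          }
        }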


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org