Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/01 19:47:44 UTC

[GitHub] [spark] karuppayya commented on pull request #28686: [SPARK-31877][SQL]Avoid stats computation for Hive table

karuppayya commented on pull request #28686:
URL: https://github.com/apache/spark/pull/28686#issuecomment-652613180


   @viirya @maropu 
   I took another look at the code: the stats from the Hive table relation are propagated only for [partitioned tables](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L200). Also, in `DetermineTableStats` we compute the stats only for [non-partitioned tables](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L121). I think for a partitioned table the size will therefore also be `spark.sql.defaultSizeInBytes`.
   In the case of a non-partitioned table, the `HadoopFsRelation` that gets created uses `InMemoryFileIndex`, which does not use the computed stats and does a separate file listing to figure out the stats.
   Let me know if I am missing something here.
   
   To show how this change is useful, I took the TPC-DS query q17.sql as an example, at scale 1000 on non-partitioned data.
   Without this change, these are the metrics for the query planning phase:
   ```
   scala> val df = sql(query)
   scala> df.queryExecution.tracker.topRulesByTime(2).foreach(println)
   (org.apache.spark.sql.hive.DetermineTableStats,RuleSummary(55677175448, 3, 3))
   (org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions,RuleSummary(37485411305, 6, 0))
   ```
   The time is also pretty high due to SPARK-31850.
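
   For scale, here is a minimal sketch that converts the totals above into seconds. It assumes the first field of `RuleSummary` is the total rule time in nanoseconds; the `nsToSeconds` helper and the shortened rule names are only for illustration and are not part of Spark.

   ```scala
   // Totals copied from the topRulesByTime output above.
   // Assumption: the first RuleSummary field is total time in nanoseconds.
   val totals: Seq[(String, Long)] = Seq(
     "DetermineTableStats" -> 55677175448L,
     "ResolveAggregateFunctions" -> 37485411305L
   )

   // Hypothetical helper: nanoseconds to seconds.
   def nsToSeconds(ns: Long): Double = ns / 1e9

   // Print one line per rule with the time rounded to two decimals.
   totals.foreach { case (rule, ns) =>
     println(f"$rule%-28s ${nsToSeconds(ns)}%.2f s")
   }
   ```

   Under that assumption, `DetermineTableStats` alone accounts for roughly 55.7 seconds of planning time.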
   
   The computed stats are not used, so the computation can be avoided completely.
   Let me know your thoughts.
   

