Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/23 15:47:58 UTC

[GitHub] [spark] wangyum opened a new pull request #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation

URL: https://github.com/apache/spark/pull/24715
 
 
   ## What changes were proposed in this pull request?
   
   This PR updates the documentation of `spark.sql.statistics.fallBackToHdfs`:
   1. This flag is effective only for Hive tables.
   2. For a non-partitioned data source table, the size is automatically recalculated if table statistics are not available.
   3. For a partitioned data source table, the size is `spark.sql.defaultSizeInBytes` if table statistics are not available.
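
The three cases above can be summarized as a small decision function. This is a simplified model in Python, not Spark's actual code: the function name, its parameters, and the constant are hypothetical, the default value of `spark.sql.defaultSizeInBytes` is assumed to be `Long.MaxValue`, and a Hive table with the flag disabled is assumed to use that default size.

```python
from typing import Optional

# Assumed default of spark.sql.defaultSizeInBytes (Long.MaxValue on the JVM).
DEFAULT_SIZE_IN_BYTES = 2**63 - 1


def estimated_size(
    is_hive_table: bool,
    is_partitioned: bool,
    catalog_size: Optional[int],
    size_on_disk: int,
    fall_back_to_hdfs: bool,
    default_size: int = DEFAULT_SIZE_IN_BYTES,
) -> int:
    """Hypothetical model of which size the planner ends up using."""
    # Table statistics, when available, win in every case.
    if catalog_size is not None:
        return catalog_size
    # Case 1: the flag only matters for Hive tables.
    if is_hive_table:
        return size_on_disk if fall_back_to_hdfs else default_size
    # Case 3: partitioned data source tables use the default size.
    if is_partitioned:
        return default_size
    # Case 2: non-partitioned data source tables are recalculated from files.
    return size_on_disk
```

In this model, toggling `fall_back_to_hdfs` changes the result only when `is_hive_table` is true, which is the point the updated documentation makes.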
   
   Related code:
   - Non-partitioned data source table:
   [SizeInBytesOnlyStatsPlanVisitor.default()](https://github.com/apache/spark/blob/98be8953c75c026c1cb432cc8f66dd312feed0c6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L57) -> [LogicalRelation.computeStats()](https://github.com/apache/spark/blob/a1c1dd3484a4dcd7c38fe256e69dbaaaf10d1a92/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L42-L46) -> [HadoopFsRelation.sizeInBytes()](https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala#L72-L75) -> [PartitioningAwareFileIndex.sizeInBytes()](https://github.com/apache/spark/blob/b276788d57b270d455ef6a7c5ed6cf8a74885dde/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L103)
   `PartitioningAwareFileIndex.sizeInBytes()` is computed as [`allFiles().map(_.getLen).sum`](https://github.com/apache/spark/blob/b276788d57b270d455ef6a7c5ed6cf8a74885dde/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L103) if table statistics are not available.
   
   - Partitioned data source table:
   [SizeInBytesOnlyStatsPlanVisitor.default()](https://github.com/apache/spark/blob/98be8953c75c026c1cb432cc8f66dd312feed0c6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L57) -> [LogicalRelation.computeStats()](https://github.com/apache/spark/blob/a1c1dd3484a4dcd7c38fe256e69dbaaaf10d1a92/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L42-L46) -> [CatalogFileIndex.sizeInBytes](https://github.com/apache/spark/blob/5d672b7f3e07cfd7710df319fc6c7d2b9056a068/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala#L41)
   `CatalogFileIndex.sizeInBytes` is [spark.sql.defaultSizeInBytes](https://github.com/apache/spark/blob/c30b5297bc607ae33cc2fcf624b127942154e559/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L387) if table statistics are not available.
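
For the non-partitioned path, `allFiles().map(_.getLen).sum` amounts to adding up the length of every file under the table location. A rough local-filesystem analogue in Python is sketched below; `total_file_size` is a hypothetical helper, it walks a local directory rather than HDFS, and it does not reproduce any file filtering Spark may apply when listing files.

```python
import os


def total_file_size(root: str) -> int:
    """Sum the sizes, in bytes, of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Counterpart of FileStatus.getLen for a single file.
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

The cost of this estimate grows with the number of files, since every file has to be listed once.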
   
   ## How was this patch tested?
   
   N/A
