You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/31 04:23:11 UTC

[GitHub] [spark] liupc opened a new pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

liupc opened a new pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055
 
 
   
   ### What changes were proposed in this pull request?
   This PR will skip DeterminTableStats rule when hive table can be converted to datasource table, thus can avoid useless stats collection for these table when `spark.sql.statistics.fallBackToHdfs` is true.
   
   ### Why are the changes needed?
   This PR can improve performance in some cases.
   
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   UT & test on real clusters
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#discussion_r374296512
 
 

 ##########
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
 ##########
 @@ -139,13 +139,15 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
 
   override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
     case relation: HiveTableRelation
-      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
+      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty &&
+        !RelationConversions.isConvertible(relation) =>
 
 Review comment:
   I think we just need this change that skips `DetermineTableStats when Hive will be converted later.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581275046
 
 
   ok to test

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581275585
 
 
   **[Test build #117764 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117764/testReport)** for PR 27055 at commit [`c5297d4`](https://github.com/apache/spark/commit/c5297d48fce32b584718c04b85dd06a293bc2c2b).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581285877
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581285833
 
 
   **[Test build #117764 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117764/testReport)** for PR 27055 at commit [`c5297d4`](https://github.com/apache/spark/commit/c5297d48fce32b584718c04b85dd06a293bc2c2b).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-569861748
 
 
   Can one of the admins verify this patch?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-569861845
 
 
   Can one of the admins verify this patch?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581276023
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#discussion_r373953950
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala
 ##########
 @@ -33,12 +33,10 @@ import org.apache.spark.sql.types.StructType
  *
  * @param sparkSession a [[SparkSession]]
  * @param table the metadata of the table
- * @param sizeInBytes the table's data size in bytes
  */
 class CatalogFileIndex(
     sparkSession: SparkSession,
-    val table: CatalogTable,
-    override val sizeInBytes: Long) extends FileIndex {
 
 Review comment:
   This change as @cloud-fan said, is expensive. And it doesn't follow up the defined behavior for partitioned data source and Hive table regrading statistics calculation.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-569861845
 
 
   Can one of the admins verify this patch?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581285877
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liupc commented on a change in pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
liupc commented on a change in pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#discussion_r375872617
 
 

 ##########
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
 ##########
 @@ -139,13 +139,15 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
 
   override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
     case relation: HiveTableRelation
-      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
+      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty &&
+        !RelationConversions.isConvertible(relation) =>
 
 Review comment:
   @viirya I agree that doing size estimation on demand and disregarding catalog statistics is expensive, what I actually want to do is skip the `fallBackToHdfs` code path for datasource tables and do size estimation in `HadoopFsRelation`. 
   Maybe we should also add a config similar to `fallBackToHdfs` in `HadoopFsRelation`? we just do real scans when the config is true. Otherwise, we just use the stats in CatalogTable to compute the `sizeInBytes`.
   Why I think we should move this size estimation to `HadoopFsRelation` for datasource table is that we can do better estimation for datasource table like parquet. see  [SPARK-30712](https://issues.apache.org/jira/browse/SPARK-30712)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581275585
 
 
   **[Test build #117764 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117764/testReport)** for PR 27055 at commit [`c5297d4`](https://github.com/apache/spark/commit/c5297d48fce32b584718c04b85dd06a293bc2c2b).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581285890
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117764/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581276027
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22527/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#discussion_r373952868
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala
 ##########
 @@ -61,6 +59,12 @@ class CatalogFileIndex(
 
   override def refresh(): Unit = fileStatusCache.invalidateAll()
 
+  override def sizeInBytes: Long = {
+    table.stats.map(_.sizeInBytes.toLong).getOrElse{
+      filterPartitions(Nil).sizeInBytes
 
 Review comment:
   This is super expensive. Are you sure we always need to do it?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581276023
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-569861748
 
 
   Can one of the admins verify this patch?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liupc commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
liupc commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-582919934
 
 
   > > 1. It solves the overflows issue.
   > 
   > I think the overflow issue is not limited to Hive case here. It sounds like a separate issue.
   
   Yes,I thinks it's a common issue,  we can solve it in a separate PR. I will create another PR for this later

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581276027
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22527/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27055: [SPARK-30394]Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#issuecomment-581285890
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117764/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org