Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/04 14:52:23 UTC

[GitHub] [spark] attilapiros opened a new pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

attilapiros opened a new pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453
 
 
   ### What changes were proposed in this pull request?
   
   This PR introduces a new statistic called `deserFactor`, which can be set manually as the table property `spark.deserFactor`. It is intended for columnar file formats and expresses the ratio of the actual (raw) data size to the file size. The file size is scaled up by this factor to improve the estimate of the in-memory data size and so enable better query optimization (for example, the join strategy decision).
   
   ### Why are the changes needed?
   
   Before this change, Spark estimated the table size as the sum of all file sizes. This estimate can be far too low for columnar file formats, where a large amount of data can fit into a very small file because of serialization (such as dictionary encoding) and compression.
   With `deserFactor`, the OOM error caused by a wrongly chosen broadcast join strategy can be avoided.
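
The effect of the factor on the join decision can be illustrated with a minimal, self-contained sketch. It is not the code of this PR: the object and method names are hypothetical, the factor is assumed to be an integer, and the threshold is assumed to be the 10 MB default of `spark.sql.autoBroadcastJoinThreshold`.

```scala
// Hypothetical sketch: these names are illustrative and are not Spark's internal API.
object DeserFactorSketch {
  // Assumed to be the default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
  val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

  /** Estimate the in-memory size by scaling the total on-disk file size.
    * Treating deserFactor as an Int is an assumption of this sketch. */
  def estimatedSizeInBytes(totalFileSize: Long, deserFactor: Int): BigInt =
    BigInt(totalFileSize) * deserFactor

  /** A relation is a broadcast candidate when its estimated size fits under the threshold. */
  def wouldBroadcast(sizeInBytes: BigInt): Boolean =
    sizeInBytes >= 0 && sizeInBytes <= autoBroadcastJoinThreshold
}
```

For a table stored in 4 MB of files, the unscaled estimate stays under the threshold, so a broadcast join would be considered. With an assumed factor of 5 the estimate becomes 20 MB, and the table no longer qualifies for broadcasting.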
   
   ### Does this PR introduce any user-facing change?
   
   No
   
   ### How was this patch tested?
   
   The StatisticsSuite is extended with a new test: `SPARK-24914 - test deserialization factor`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581949637
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22603/
   Test PASSed.



[GitHub] [spark] attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581991031
 
 
   jenkins retest this please
   
   



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581948889
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117841/
   Test FAILed.



[GitHub] [spark] attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-584052222
 
 
   cc @squito 



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583609316
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581960849
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22604/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583482635
 
 
   **[Test build #118041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118041/testReport)** for PR 27453 at commit [`e78b4b0`](https://github.com/apache/spark/commit/e78b4b07e566a02580502242df13de6a9c0d2a60).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581949637
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22603/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581960838
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581957343
 
 
   jenkins retest this please



[GitHub] [spark] attilapiros commented on a change in pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#discussion_r376481081
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionTestBase.scala
 ##########
 @@ -374,4 +376,14 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils
       assert(relation.stats.attributeStats.isEmpty)
     }
   }
+
+  def checkNumBroadcastHashJoins(df: DataFrame, expectedNumBhj: Int, clue: String): Unit = {
+    val plan = EnsureRequirements(spark.sessionState.conf).apply(df.queryExecution.sparkPlan)
 
 Review comment:
   Thanks! I am correcting this in my next commit.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581960849
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22604/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581948877
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581960150
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583483254
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22806/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581949624
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] attilapiros removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
attilapiros removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581957343
 
 
   jenkins retest this please



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581949624
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581960164
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117842/
   Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583609327
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118041/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581948889
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117841/
   Test FAILed.



[GitHub] [spark] peter-toth commented on a change in pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#discussion_r376436911
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionTestBase.scala
 ##########
 @@ -374,4 +376,14 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils
       assert(relation.stats.attributeStats.isEmpty)
     }
   }
+
+  def checkNumBroadcastHashJoins(df: DataFrame, expectedNumBhj: Int, clue: String): Unit = {
+    val plan = EnsureRequirements(spark.sessionState.conf).apply(df.queryExecution.sparkPlan)
 
 Review comment:
   You don't need `EnsureRequirements` to count the number of joins.



[GitHub] [spark] SparkQA commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583482635
 
 
   **[Test build #118041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118041/testReport)** for PR 27453 at commit [`e78b4b0`](https://github.com/apache/spark/commit/e78b4b07e566a02580502242df13de6a9c0d2a60).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581960150
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] dbtsai commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
dbtsai commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-593584540
 
 
   cc @dongjoon-hyun for reviewing.



[GitHub] [spark] attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581972318
 
 
   jenkins retest this please
   
   



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583609316
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
attilapiros commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-582008801
 
 
   cc @bersprockets 



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583483240
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581960838
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581960164
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117842/
   Test FAILed.



[GitHub] [spark] peter-toth commented on a change in pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#discussion_r376437205
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ##########
 @@ -422,14 +425,18 @@ object CatalogTable {
  */
 case class CatalogStatistics(
     sizeInBytes: BigInt,
+    deserFactor: Option[Int] = None,
 
 Review comment:
   Why not `Double`?



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583483240
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583483254
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22806/
   Test PASSed.



[GitHub] [spark] attilapiros commented on a change in pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#discussion_r376480769
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ##########
 @@ -422,14 +425,18 @@ object CatalogTable {
  */
 case class CatalogStatistics(
     sizeInBytes: BigInt,
+    deserFactor: Option[Int] = None,
 
 Review comment:
   It is expected to be a bigger number because of the columnar nature of the underlying file. Moreover, it is better to take the ceiling of a floating-point value here, as being on the edge risks an OOM.
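
A minimal sketch of the rounding argument above, not code from the PR: the object name, both helper functions, and the byte counts are hypothetical and chosen only for illustration. It shows why rounding a measured ratio up to an `Int` keeps the size estimate on the safe side of a broadcast threshold, whereas truncating it could under-estimate.

```scala
// Illustrative only: names and numbers are assumptions, not taken from the PR.
object DeserFactorSketch {

  /** Round the raw-size / file-size ratio up to the next integer factor. */
  def toDeserFactor(rawSizeInBytes: BigInt, fileSizeInBytes: BigInt): Int = {
    require(fileSizeInBytes > 0, "file size must be positive")
    // Integer ceiling division: (a + b - 1) / b
    ((rawSizeInBytes + fileSizeInBytes - 1) / fileSizeInBytes).toInt
  }

  /** Scale the on-disk size by the factor; fall back to the plain file size. */
  def estimatedSize(fileSizeInBytes: BigInt, deserFactor: Option[Int]): BigInt =
    fileSizeInBytes * deserFactor.getOrElse(1)

  def main(args: Array[String]): Unit = {
    val fileSize = BigInt(100)   // hypothetical on-disk size
    val rawSize = BigInt(1050)   // hypothetical in-memory size (ratio 10.5)
    val threshold = BigInt(1024) // hypothetical broadcast threshold

    val ceilFactor = toDeserFactor(rawSize, fileSize)
    val floorFactor = (rawSize / fileSize).toInt

    // The rounded-up estimate is never below the real in-memory size.
    println(s"ceil factor  = $ceilFactor, estimate = ${estimatedSize(fileSize, Some(ceilFactor))}")
    // The truncated estimate falls under the threshold although the data does not fit.
    println(s"floor factor = $floorFactor, estimate = ${estimatedSize(fileSize, Some(floorFactor))}")
    println(s"real size exceeds threshold: ${rawSize > threshold}")
  }
}
```

With these numbers the truncated factor would put the estimate below the threshold even though the real data is above it, which is the edge case the comment warns about; a `Double` factor would avoid the extra over-estimate at the cost of leaving the rounding decision to whoever sets the value.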



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583609327
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118041/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-581948877
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] SparkQA commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27453: [SPARK-24914][SQL] Introduce new statistic to improve data size estimate for columnar storage formats (part 1)
URL: https://github.com/apache/spark/pull/27453#issuecomment-583608428
 
 
   **[Test build #118041 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118041/testReport)** for PR 27453 at commit [`e78b4b0`](https://github.com/apache/spark/commit/e78b4b07e566a02580502242df13de6a9c0d2a60).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.
