You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by CodingCat <gi...@git.apache.org> on 2017/12/25 03:14:33 UTC

[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

GitHub user CodingCat opened a pull request:

    https://github.com/apache/spark/pull/20072

    [SPARK-22790][SQL] add a configurable factor to describe HadoopFsRelation's size

    ## What changes were proposed in this pull request?
    
    as per discussion in https://github.com/apache/spark/pull/19864#discussion_r156847927
    
    the current HadoopFsRelation is purely based on the underlying file size which is not accurate and makes the execution vulnerable to errors like OOM
    
    Users can enable CBO with the functionalities in https://github.com/apache/spark/pull/19864 to avoid this issue
    
    This JIRA proposes to add a configurable factor to sizeInBytes method in HadoopFsRelation class so that users can mitigate this problem without CBO
    
    ## How was this patch tested?
    
    Existing tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/spark SPARK-22790

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20072
    
----
commit b02d857f20f594d87c8c48991bfbbe95a71b364a
Author: CodingCat <zh...@...>
Date:   2016-03-07T14:37:37Z

    improve the doc for "spark.memory.offHeap.size"

commit e09d60f3dd212dc0ce9687b112970c7cf1e4c83b
Author: CodingCat <zh...@...>
Date:   2016-03-07T14:37:37Z

    improve the doc for "spark.memory.offHeap.size"

commit 2ebc6caab7c540b43a50b7e0f27b8f4c278e5611
Author: CodingCat <zh...@...>
Date:   2016-03-07T19:00:16Z

    fix

commit 9b87ba8830d102c2568e338787e8b49b284dd8b1
Author: CodingCat <zh...@...>
Date:   2016-03-07T19:00:16Z

    fix

commit e6065c75015b8a2c0eff9f3c6e1ebfe148b28e65
Author: CodingCat <zh...@...>
Date:   2017-12-25T03:21:02Z

    add a configurable factor to describe HadoopFsRelation's size

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Thanks! Merged to master/2.3


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159106914
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated" +
    +      " to serve as the stats to choose the best execution plan. In the case where the " +
    +      " the in-disk and in-memory size of data is significantly different, users can adjust this" +
    --- End diff --
    
    and double `the`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159477276
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +82,11 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val sizeFactor = sqlContext.conf.sizeToMemorySizeFactor
    +    (location.sizeInBytes * sizeFactor).toLong
    --- End diff --
    
    ah good to know it


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159574899
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.fileDataSizeFactor")
    --- End diff --
    
    ah `compressionFactor` sounds better.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    cc @CodingCat 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160076965
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated " +
    +      "to serve as the stats to choose the best execution plan. In the case where the " +
    +      "in-disk and in-memory size of data is significantly different, users can adjust this " +
    +      "factor for a better choice of the execution plan. The default value is 1.0.")
    +    .doubleConf
    +    .checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
    --- End diff --
    
    BTW `fileDataSizeFactor` -> `compressionFactor`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85363 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85363/testReport)** for PR 20072 at commit [`e6065c7`](https://github.com/apache/spark/commit/e6065c75015b8a2c0eff9f3c6e1ebfe148b28e65).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160076855
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    --- End diff --
    
    merge this with the previous line


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160077189
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated " +
    +      "to serve as the stats to choose the best execution plan. In the case where the " +
    --- End diff --
    
    `When estimating the output data size of a table scan, multiple the file size with this factor as the estimated data size, in case the data is compressed in the file and lead to a heavily underestimated result.`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159133859
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated" +
    +      " to serve as the stats to choose the best execution plan. In the case where the " +
    +      " the in-disk and in-memory size of data is significantly different, users can adjust this" +
    --- End diff --
    
    done, thanks


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159477801
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.fileDataSizeFactor")
    --- End diff --
    
    similar to `spark.sql.sources.parallelPartitionDiscovery.parallelism`, how about `spark.sql.sources. fileDataSizeFactor`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85586/testReport)** for PR 20072 at commit [`e97f419`](https://github.com/apache/spark/commit/e97f419a5c3347242832287a9e5b0e5662f9e6bb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85383/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85757/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85548/testReport)** for PR 20072 at commit [`2a33b88`](https://github.com/apache/spark/commit/2a33b88695ae686762c7aadaef00b467f7a5187e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85988/testReport)** for PR 20072 at commit [`c584c61`](https://github.com/apache/spark/commit/c584c61fb90851fd5cb74869c62f0935c93fed14).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160077024
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    --- End diff --
    
    Rename this too, `FILE_COMRESSION_FACTOR`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86045/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160077002
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    --- End diff --
    
    BTW, how about `fileCompressionFactor`? Since it works for only file-based data sources.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    @cloud-fan @rxin @wzhfy @felixcheung @gatorsmile thanks the review, the new name of the parameter and test are added


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85548/testReport)** for PR 20072 at commit [`2a33b88`](https://github.com/apache/spark/commit/2a33b88695ae686762c7aadaef00b467f7a5187e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85988/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159130617
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +84,15 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val size = location.sizeInBytes * hadoopFSSizeFactor
    --- End diff --
    
    nvm. hadoopFSSizeFactor is a double


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85383 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85383/testReport)** for PR 20072 at commit [`ec275a8`](https://github.com/apache/spark/commit/ec275a841a7bb4c23b277f915debeed54e6cf7ea).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85363 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85363/testReport)** for PR 20072 at commit [`e6065c7`](https://github.com/apache/spark/commit/e6065c75015b8a2c0eff9f3c6e1ebfe148b28e65).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159474598
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    --- End diff --
    
    done


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85988/testReport)** for PR 20072 at commit [`c584c61`](https://github.com/apache/spark/commit/c584c61fb90851fd5cb74869c62f0935c93fed14).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85970/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85949 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85949/testReport)** for PR 20072 at commit [`5230081`](https://github.com/apache/spark/commit/523008192cb1476a68cf3808f46574598fcc6d2d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85758 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85758/testReport)** for PR 20072 at commit [`670a6c0`](https://github.com/apache/spark/commit/670a6c062a10ff775d1e3d6533702e2cc3cb34da).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85760/testReport)** for PR 20072 at commit [`291ce3a`](https://github.com/apache/spark/commit/291ce3a70d903c6d4608c0a1b6f0a27f5526a79f).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159135272
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +84,15 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val size = location.sizeInBytes * hadoopFSSizeFactor
    +    if (size > Long.MaxValue) {
    --- End diff --
    
    I think this branch can be removed? `Long.MaxValue` is returned when converting a double value larger than `Long.MaxValue`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159135036
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -60,6 +60,8 @@ case class HadoopFsRelation(
         }
       }
     
    +  private val hadoopFSSizeFactor = sqlContext.conf.hadoopFSSizeFactor
    --- End diff --
    
    shall we move it into the method `sizeInBytes` since it's only used there?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    cc @juliuszsompolski @cloud-fan @wzhfy 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85638/testReport)** for PR 20072 at commit [`a0f3462`](https://github.com/apache/spark/commit/a0f3462ce285c47d0bc97c408e14ecb5e85e50c6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85762 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85762/testReport)** for PR 20072 at commit [`291ce3a`](https://github.com/apache/spark/commit/291ce3a70d903c6d4608c0a1b6f0a27f5526a79f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85760/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159573530
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.fileDataSizeFactor")
    --- End diff --
    
    shouldn't we call this something like compressionFactor?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    @gatorsmile more comments?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159175078
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    --- End diff --
    
    `...sizeFactor` is too vague, how about `fileDataSizeFactor`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20072


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85638/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159134987
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
    --- End diff --
    
    How about `DISK_TO_MEMORY_SIZE_FACTOR`? IMHO the current name doesn't describe the purpose clearly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85548/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85758/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85970/testReport)** for PR 20072 at commit [`6fe8589`](https://github.com/apache/spark/commit/6fe85892ca6c53b3aa965951dc7ef51df9d068e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #86045 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86045/testReport)** for PR 20072 at commit [`c584c61`](https://github.com/apache/spark/commit/c584c61fb90851fd5cb74869c62f0935c93fed14).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159106860
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated" +
    +      " to serve as the stats to choose the best execution plan. In the case where the " +
    +      " the in-disk and in-memory size of data is significantly different, users can adjust this" +
    --- End diff --
    
    nit: always put space at the end of the line for readability / consistency.
    we have two spaces here
    ```
    "...In the case where the " +
    " the in-disk and in-..."
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r158632898
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +84,15 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val size = location.sizeInBytes * hadoopFSSizeFactor
    --- End diff --
    
    Overflow?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159175087
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +82,11 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val sizeFactor = sqlContext.conf.sizeToMemorySizeFactor
    +    (location.sizeInBytes * sizeFactor).toLong
    --- End diff --
    
    we should add a safe check for overflow.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    LGTM, we should also add a test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160076999
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated " +
    +      "to serve as the stats to choose the best execution plan. In the case where the " +
    +      "in-disk and in-memory size of data is significantly different, users can adjust this " +
    +      "factor for a better choice of the execution plan. The default value is 1.0.")
    +    .doubleConf
    +    .checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
    --- End diff --
    
    it's not necessary to be that parquet is always smaller than memory size...e.g. in some simple dataset (like the one used in the test), parquet's overhead makes the overall size larger than in-memory size....
    
    but with TPCDS dataset, I observed that parquet size is much smaller than in-memory size


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85383/testReport)** for PR 20072 at commit [`ec275a8`](https://github.com/apache/spark/commit/ec275a841a7bb4c23b277f915debeed54e6cf7ea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85586/testReport)** for PR 20072 at commit [`e97f419`](https://github.com/apache/spark/commit/e97f419a5c3347242832287a9e5b0e5662f9e6bb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r158651308
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +84,15 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val size = location.sizeInBytes * hadoopFSSizeFactor
    --- End diff --
    
    this should be handled by `size` > Long.MaxValue, the double value is overflowed only when the result value is Double.PositivyInfinity which would be capped as Long.MaxValue


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85760/testReport)** for PR 20072 at commit [`291ce3a`](https://github.com/apache/spark/commit/291ce3a70d903c6d4608c0a1b6f0a27f5526a79f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85757/testReport)** for PR 20072 at commit [`2f6e3c9`](https://github.com/apache/spark/commit/2f6e3c9da482e85f6ee8046c7e7e2f9f6194ece2).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159171970
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    --- End diff --
    
    this is only for HadoopFSRelation


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    @wzhfy thanks for the review, please take a look 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85970/testReport)** for PR 20072 at commit [`6fe8589`](https://github.com/apache/spark/commit/6fe85892ca6c53b3aa965951dc7ef51df9d068e2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85638/testReport)** for PR 20072 at commit [`a0f3462`](https://github.com/apache/spark/commit/a0f3462ce285c47d0bc97c408e14ecb5e85e50c6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85758/testReport)** for PR 20072 at commit [`670a6c0`](https://github.com/apache/spark/commit/670a6c062a10ff775d1e3d6533702e2cc3cb34da).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85363/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    @gatorsmile thanks for the review, Happy Christmas!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159474580
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +82,11 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val sizeFactor = sqlContext.conf.sizeToMemorySizeFactor
    +    (location.sizeInBytes * sizeFactor).toLong
    --- End diff --
    
    before the latest commit, there is a safe check https://github.com/apache/spark/pull/20072/commits/e6065c75015b8a2c0eff9f3c6e1ebfe148b28e65#diff-fcb68cd3c7630f337ce9a3b479b6d0c4R88
    
    However, since sizeFactor is a double, any overflow with positive double numbers would be capped as Double.PositiveInfinity, and as @wzhfy indicated, any double number which is larger than Long.MaxValue would return Long.MaxValue in its toLong method
    
    so it should be safe here  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85762 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85762/testReport)** for PR 20072 at commit [`291ce3a`](https://github.com/apache/spark/commit/291ce3a70d903c6d4608c0a1b6f0a27f5526a79f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85586/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r158632706
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,16 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated" +
    +      " to serve as the stats to choose the best execution plan. In the case where the " +
    +      " the in-disk and in-memory size of data is significantly different, users can adjust this" +
    +      " factor for a better choice of the execution plan. The default value is 1.0.")
    +    .doubleConf
    +    .createWithDefault(1.0)
    --- End diff --
    
    checkValues > 0.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159106981
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated" +
    +      " to serve as the stats to choose the best execution plan. In the case where the " +
    +      " the in-disk and in-memory size of data is significantly different, users can adjust this" +
    --- End diff --
    
    actually https://github.com/apache/spark/pull/20072/files/ec275a841a7bb4c23b277f915debeed54e6cf7ea#diff-9a6b543db706f1a90f790783d6930a13R250 is missing a space at the end - could you also fix that


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159130601
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +84,15 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val size = location.sizeInBytes * hadoopFSSizeFactor
    --- End diff --
    
    ```
    scala> val size: Long = 0x7fffffffffffffffL
    size: Long = 9223372036854775807
    
    scala> val res = size * 10
    res: Long = -10
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r159135028
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -261,6 +261,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
    +    "org.apache.spark.sql.execution.datasources.sizeFactor")
    --- End diff --
    
    Is this config for all data sources or only hadoopFS-related data sources?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160076809
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated " +
    +      "to serve as the stats to choose the best execution plan. In the case where the " +
    +      "in-disk and in-memory size of data is significantly different, users can adjust this " +
    +      "factor for a better choice of the execution plan. The default value is 1.0.")
    +    .doubleConf
    +    .checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
    --- End diff --
    
    maybe `>= 1.0`? it's weird to see a compression factor less than 1.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #86045 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86045/testReport)** for PR 20072 at commit [`c584c61`](https://github.com/apache/spark/commit/c584c61fb90851fd5cb74869c62f0935c93fed14).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160077194
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -82,7 +82,11 @@ case class HadoopFsRelation(
         }
       }
     
    -  override def sizeInBytes: Long = location.sizeInBytes
    +  override def sizeInBytes: Long = {
    +    val sizeFactor = sqlContext.conf.diskToMemorySizeFactor
    --- End diff --
    
    compressionFactor


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85949/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85762/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85949/testReport)** for PR 20072 at commit [`5230081`](https://github.com/apache/spark/commit/523008192cb1476a68cf3808f46574598fcc6d2d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20072
  
    **[Test build #85757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85757/testReport)** for PR 20072 at commit [`2f6e3c9`](https://github.com/apache/spark/commit/2f6e3c9da482e85f6ee8046c7e7e2f9f6194ece2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org