You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by bersprockets <gi...@git.apache.org> on 2018/08/01 22:31:32 UTC

[GitHub] spark pull request #21950: [SPARK-24912][SQL][WIP] Add configuration to avoi...

GitHub user bersprockets opened a pull request:

    https://github.com/apache/spark/pull/21950

    [SPARK-24912][SQL][WIP] Add configuration to avoid OOM during broadcast join (and other negative side effects of incorrect table sizing)

    ## What changes were proposed in this pull request?
    
    Added configuration settings to help avoid OOM errors during broadcast joins.
    
    - deser multiplication factor: Tell Spark to multiply totalSize times a specified factor for tables with encoded files (i.e., parquet or orc files). Spark will do this when calculating a table's sizeInBytes. This is modelled after Hive's hive.stats.deserialization.factor configuration setting.
    - ignore rawDataSize: Due to HIVE-20079, rawDataSize is broken. This settings tells Spark to ignore rawDataSize when calculating the table's sizeInBytes.
    
    One can partially simulate the deser multiplication factor without this change by decreasing the value in spark.sql.autoBroadcastJoinThreshold. However, that will affect all tables, not just the ones that are encoded.
    
    There is some awkwardness in that the check for file type (parquet or orc) uses Hive deser names, but the checks for partitioned tables need to be made outside of the Hive submodule. Still working that out.
    
    ## How was this patch tested?
    
    Added unit tests.
    
    Also, checked that I can avoid broadcast join OOM errors when using the deser multiplication factor on both my laptop and a cluster. Also checked that I can avoid OOM errors using the ignore rawDataSize flag on my laptop.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark SPARK-24914

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21950.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21950
    
----
commit aa2a957751a906fe538822cace019014e763a8c3
Author: Bruce Robbins <be...@...>
Date:   2018-07-26T00:36:17Z

    WIP version

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94890/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94067/testReport)** for PR 21950 at commit [`aa2a957`](https://github.com/apache/spark/commit/aa2a957751a906fe538822cace019014e763a8c3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94869/testReport)** for PR 21950 at commit [`3a65edf`](https://github.com/apache/spark/commit/3a65edf0e07f3beb6d6dd4dcb16e76ea7210c5e9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    I broke this :). Don't ask for a redo.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24912][SQL][WIP] Add configuration to avoid OOM d...

Posted by holdensmagicalunicorn <gi...@git.apache.org>.
Github user holdensmagicalunicorn commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    @bersprockets, thanks! I am a bot who has found some folks who might be able to help with the review:@cloud-fan, @gatorsmile and @rxin


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL] Add configuration to avoid OOM during...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #98137 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98137/testReport)** for PR 21950 at commit [`ddfe945`](https://github.com/apache/spark/commit/ddfe945ef161e59fc2bbc1a12bf40563d2bdd400).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL] Add configuration to avoid OOM during...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94853/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94869/testReport)** for PR 21950 at commit [`3a65edf`](https://github.com/apache/spark/commit/3a65edf0e07f3beb6d6dd4dcb16e76ea7210c5e9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94017/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21950: [SPARK-24914][SQL][WIP] Add configuration to avoi...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21950#discussion_r212719073
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala ---
    @@ -76,4 +78,16 @@ private[sql] object PruneFileSourcePartitions extends Rule[LogicalPlan] {
             op
           }
       }
    +
    +  private def calcPartSize(catalogTable: Option[CatalogTable], sizeInBytes: Long): Long = {
    +    val conf: SQLConf = SQLConf.get
    +    val factor = conf.sizeDeserializationFactor
    +    if (catalogTable.isDefined && factor != 1.0 &&
    +      // TODO: The serde check should be in a utility function, since it is also checked elsewhere
    +      catalogTable.get.storage.serde.exists(s => s.contains("Parquet") || s.contains("Orc"))) {
    --- End diff --
    
    @mgaido91 Good point. Also, I notice that even when the table's files are not compressed (say, a table backed by CSV files), the LongToUnsafeRowMap or BytesToBytesMap that backs the relation is roughly 3 times larger than the total file size. So even under the best of circumstances (i.e., the table's files are not compressed), Spark will get it wrong by several multiples.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94869/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96007/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #96007 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96007/testReport)** for PR 21950 at commit [`21f8ba3`](https://github.com/apache/spark/commit/21f8ba31562ad519266e11ab9ddf82d8c1c5805d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96157/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #96190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96190/testReport)** for PR 21950 at commit [`e776ef2`](https://github.com/apache/spark/commit/e776ef288bf92099e6c67d9c2a1b0fd30027e9f3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21950: [SPARK-24914][SQL][WIP] Add configuration to avoi...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21950#discussion_r217216975
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala ---
    @@ -91,4 +91,28 @@ class PruneFileSourcePartitionsSuite extends QueryTest with SQLTestUtils with Te
           assert(size2 < tableStats.get.sizeInBytes)
         }
       }
    +
    +  test("Test deserialization factor against partition") {
    +    val factor = 10
    +    withTable("tbl") {
    +      spark.range(10).selectExpr("id", "id % 3 as p").write.format("parquet")
    +        .partitionBy("p").saveAsTable("tbl")
    +      sql(s"ANALYZE TABLE tbl COMPUTE STATISTICS")
    +
    +      val df1 = sql("SELECT * FROM tbl WHERE p = 1")
    +      val sizes1 = df1.queryExecution.optimizedPlan.collect {
    +        case relation: LogicalRelation => relation.catalogTable.get.stats.get.sizeInBytes
    +      }
    +      assert(sizes1 != 0)
    --- End diff --
    
    Oops. Should be <code>assert(sizes1(0) != 0)</code>. I will fix.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #96459 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96459/testReport)** for PR 21950 at commit [`251af58`](https://github.com/apache/spark/commit/251af58e98cb2645cd7c61be9a8586d5c02cf414).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94853/testReport)** for PR 21950 at commit [`aa2a957`](https://github.com/apache/spark/commit/aa2a957751a906fe538822cace019014e763a8c3).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94874/testReport)** for PR 21950 at commit [`3a65edf`](https://github.com/apache/spark/commit/3a65edf0e07f3beb6d6dd4dcb16e76ea7210c5e9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94890/testReport)** for PR 21950 at commit [`3a65edf`](https://github.com/apache/spark/commit/3a65edf0e07f3beb6d6dd4dcb16e76ea7210c5e9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    restest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94017/testReport)** for PR 21950 at commit [`aa2a957`](https://github.com/apache/spark/commit/aa2a957751a906fe538822cace019014e763a8c3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24912][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #93912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93912/testReport)** for PR 21950 at commit [`aa2a957`](https://github.com/apache/spark/commit/aa2a957751a906fe538822cace019014e763a8c3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #96157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96157/testReport)** for PR 21950 at commit [`1efa747`](https://github.com/apache/spark/commit/1efa74785ed25a613c5035f8accaf89ca0105bc3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94874/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24912][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL] Add configuration to avoid OOM during...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98137/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94067/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21950: [SPARK-24914][SQL][WIP] Add configuration to avoi...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21950#discussion_r210841402
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala ---
    @@ -76,4 +78,16 @@ private[sql] object PruneFileSourcePartitions extends Rule[LogicalPlan] {
             op
           }
       }
    +
    +  private def calcPartSize(catalogTable: Option[CatalogTable], sizeInBytes: Long): Long = {
    +    val conf: SQLConf = SQLConf.get
    +    val factor = conf.sizeDeserializationFactor
    +    if (catalogTable.isDefined && factor != 1.0 &&
    +      // TODO: The serde check should be in a utility function, since it is also checked elsewhere
    +      catalogTable.get.storage.serde.exists(s => s.contains("Parquet") || s.contains("Orc"))) {
    --- End diff --
    
    I am not sure about this. We are saying that only Parquet/ORC files can be compressed?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #93912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93912/testReport)** for PR 21950 at commit [`aa2a957`](https://github.com/apache/spark/commit/aa2a957751a906fe538822cace019014e763a8c3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #96007 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96007/testReport)** for PR 21950 at commit [`21f8ba3`](https://github.com/apache/spark/commit/21f8ba31562ad519266e11ab9ddf82d8c1c5805d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96190/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93912/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94067/testReport)** for PR 21950 at commit [`aa2a957`](https://github.com/apache/spark/commit/aa2a957751a906fe538822cace019014e763a8c3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94853/testReport)** for PR 21950 at commit [`aa2a957`](https://github.com/apache/spark/commit/aa2a957751a906fe538822cace019014e763a8c3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94017/testReport)** for PR 21950 at commit [`aa2a957`](https://github.com/apache/spark/commit/aa2a957751a906fe538822cace019014e763a8c3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #96459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96459/testReport)** for PR 21950 at commit [`251af58`](https://github.com/apache/spark/commit/251af58e98cb2645cd7c61be9a8586d5c02cf414).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL] Add configuration to avoid OOM during...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #98137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98137/testReport)** for PR 21950 at commit [`ddfe945`](https://github.com/apache/spark/commit/ddfe945ef161e59fc2bbc1a12bf40563d2bdd400).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21950: [SPARK-24914][SQL][WIP] Add configuration to avoi...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21950#discussion_r218608537
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -1051,11 +1052,27 @@ private[hive] object HiveClientImpl {
         // When table is external, `totalSize` is always zero, which will influence join strategy.
         // So when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
         // return None.
    +    // If a table has a deserialization factor, the table owner expects the in-memory
    +    // representation of the table to be larger than the table's totalSize value. In that case,
    +    // multiply totalSize by the deserialization factor and use that number instead.
    +    // If the user has set spark.sql.statistics.ignoreRawDataSize to true (because of HIVE-20079,
    +    // for example), don't use rawDataSize.
         // In Hive, when statistics gathering is disabled, `rawDataSize` and `numRows` is always
         // zero after INSERT command. So they are used here only if they are larger than zero.
    -    if (totalSize.isDefined && totalSize.get > 0L) {
    -      Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
    -    } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
    +    val factor = try {
    +        properties.get("deserFactor").getOrElse("1.0").toDouble
    --- End diff --
    
    I need to eliminate this duplication: There's a similar lookup and calculation done in PruneFileSourcePartitionsSuite. Also, I should check if a Long value, used as an intermediate value, is acceptable to hold file sizes (possibly, since a Long can represent 8 exabytes)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #96190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96190/testReport)** for PR 21950 at commit [`e776ef2`](https://github.com/apache/spark/commit/e776ef288bf92099e6c67d9c2a1b0fd30027e9f3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #96157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96157/testReport)** for PR 21950 at commit [`1efa747`](https://github.com/apache/spark/commit/1efa74785ed25a613c5035f8accaf89ca0105bc3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96459/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94890/testReport)** for PR 21950 at commit [`3a65edf`](https://github.com/apache/spark/commit/3a65edf0e07f3beb6d6dd4dcb16e76ea7210c5e9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21950
  
    **[Test build #94874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94874/testReport)** for PR 21950 at commit [`3a65edf`](https://github.com/apache/spark/commit/3a65edf0e07f3beb6d6dd4dcb16e76ea7210c5e9).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org