You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jinxing64 <gi...@git.apache.org> on 2018/05/10 08:02:07 UTC

[GitHub] spark pull request #21289: [SPARK-24240] Add a config to control whether InM...

GitHub user jinxing64 opened a pull request:

    https://github.com/apache/spark/pull/21289

    [SPARK-24240] Add a config to control whether InMemoryFileIndex should update cache when refresh.

    ## What changes were proposed in this pull request?
    In current code(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L172), after data is inserted, spark will always refresh file index and update the cache. If the target table has tons of files, job will suffer time and OOM issue. Could we add a config to control whether `InMemoryFileIndex` should update cache when refresh.
    
    ## How was this patch tested?
    
    To be added


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-24240

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21289
    
----
commit b49665d16be723a7abe9fdfa9ea600bd7be349df
Author: jinxing <ji...@...>
Date:   2018-05-10T08:00:24Z

    [SPARK-24240] Add a config to control whether InMemoryFileIndex should update cache when refresh.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by adrian-ionescu <gi...@git.apache.org>.
Github user adrian-ionescu commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Fair enough. Actually I think that's how `CatalogFileIndex` works. So, for consistency, we might even want to consider making this the default.
    
    Please add a test, though, and improve the PR description.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    **[Test build #90637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90637/testReport)** for PR 21289 at commit [`5cabfd6`](https://github.com/apache/spark/commit/5cabfd6d9d5189b3669cf25f0ade5e8df9029d9d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by adrian-ionescu <gi...@git.apache.org>.
Github user adrian-ionescu commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    I just took a quick look... Do I understand correctly that your flag basically controls whether the refresh is eager (`true`, current behavior) vs lazy (`false`)? If so, let's rename the flag accordingly.
    
    Also, with lazy mode we just delay file listing from immediately after an append operation until the next Spark job involving that table, right? It doesn't seem like we gain much overall..
    
    What's the general use-case that motivated you to make this change? Please explain that more clearly in the PR description. Also, let's add a test that verifies the intended behavior.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21289: [SPARK-24240] Add a config to control whether InM...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21289#discussion_r187372024
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala ---
    @@ -75,16 +77,31 @@ class InMemoryFileIndex(
       }
     
       override protected def leafFiles: mutable.LinkedHashMap[Path, FileStatus] = {
    +    refreshIfCacheOutdated()
    --- End diff --
    
    what's the gain here? seems you only delay the refresh, not eliminate it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    cc @cloud-fan @gengliangwang @adrian-ionescu 
    Do you have any thoughts? 
    Please take a look when you have time, thanks !


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    @cloud-fan @adrian-ionescu
    Thanks for looking into this !
    The issue is that when I append into a table with sql below:
    ```
    insert into X select ....
    ```
    It always refresh the file index of table X in memory, though I don't query X after this job. From this point, the refresh is not necessary and costs much time and memory.
    In this pr, I proposes to mark the file index as 'outdated', and postpone the refresh until next query of the file index.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90637/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90447/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    **[Test build #90447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90447/testReport)** for PR 21289 at commit [`b49665d`](https://github.com/apache/spark/commit/b49665d16be723a7abe9fdfa9ea600bd7be349df).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    **[Test build #90637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90637/testReport)** for PR 21289 at commit [`5cabfd6`](https://github.com/apache/spark/commit/5cabfd6d9d5189b3669cf25f0ade5e8df9029d9d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5886/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    **[Test build #90447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90447/testReport)** for PR 21289 at commit [`b49665d`](https://github.com/apache/spark/commit/b49665d16be723a7abe9fdfa9ea600bd7be349df).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3228/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    @cloud-fan  @adrian-ionescu 
    I added a test, please check when you have time.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    I mean the memory refresh always happen after writing data, which I think is not necessary ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21289
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3099/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org