You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/05/23 14:11:06 UTC

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/18073

    [SPARK-20848][SQL] Shutdown the pool after reading parquet files

    ## What changes were proposed in this pull request?
    
    From JIRA: On each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state, and never stopped, which leads to unbounded growth in number of threads.
    
    We should shutdown the pool after reading parquet files.
    
    ## How was this patch tested?
    
    Added a test to ParquetFileFormatSuite.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-20848

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18073.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18073
    
----
commit e4940b9e511a314144db042dfaa6b0f2f4a4a45d
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-05-23T12:54:21Z

    Shutdown the pool after reading parquet files.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118356094
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -495,6 +496,8 @@ object ParquetFileFormat extends Logging {
             } else {
               throw new IOException(s"Could not read footer for file: $currentFile", e)
             }
    +      } finally {
    +        pool.shutdown()
    --- End diff --
    
    I would expect this will fail some test, but it didn't...
    
    When you fix this error, could you call `ThreadUtils.newForkJoinPool` instead to set a better thread name?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    **[Test build #77254 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77254/testReport)** for PR 18073 at commit [`e4940b9`](https://github.com/apache/spark/commit/e4940b9e511a314144db042dfaa6b0f2f4a4a45d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    **[Test build #77296 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77296/testReport)** for PR 18073 at commit [`7e57595`](https://github.com/apache/spark/commit/7e57595d2750b65002cda6a5ee94dbcf3f5f2f64).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118329515
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -495,6 +496,8 @@ object ParquetFileFormat extends Logging {
             } else {
               throw new IOException(s"Could not read footer for file: $currentFile", e)
             }
    +      } finally {
    +        pool.shutdown()
    --- End diff --
    
    Why we terminate `pool` inside `flatMap`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77254/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118152566
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormatSuite.scala ---
    @@ -26,6 +26,22 @@ import org.apache.spark.sql.test.SharedSQLContext
     
     class ParquetFileFormatSuite extends QueryTest with ParquetTest with SharedSQLContext {
     
    +  test("Number of threads doesn't grow extremely after parquet file reading") {
    +    withTempDir { dir =>
    +      val file = dir.toString + "/file"
    +      spark.range(1).toDF("a").coalesce(1).write.parquet(file)
    +      spark.read.parquet(file)
    +      val numThreadBefore = Thread.activeCount
    +      (1 to 100).map { _ =>
    +        spark.read.parquet(file)
    +      }
    +      val numThreadAfter = Thread.activeCount
    +      // Hard to test a correct thread number,
    +      // but it shouldn't increase more than a reasonable number.
    +      assert(numThreadAfter - numThreadBefore < 20)
    --- End diff --
    
    after waiting for enough time, can we expect this to be 0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118196532
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormatSuite.scala ---
    @@ -26,6 +26,22 @@ import org.apache.spark.sql.test.SharedSQLContext
     
     class ParquetFileFormatSuite extends QueryTest with ParquetTest with SharedSQLContext {
     
    +  test("Number of threads doesn't grow extremely after parquet file reading") {
    +    withTempDir { dir =>
    +      val file = dir.toString + "/file"
    +      spark.range(1).toDF("a").coalesce(1).write.parquet(file)
    +      spark.read.parquet(file)
    +      val numThreadBefore = Thread.activeCount
    +      (1 to 100).map { _ =>
    +        spark.read.parquet(file)
    +      }
    +      val numThreadAfter = Thread.activeCount
    +      // Hard to test a correct thread number,
    +      // but it shouldn't increase more than a reasonable number.
    +      assert(numThreadAfter - numThreadBefore < 20)
    --- End diff --
    
    OK. Let's remove the test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    @viirya  can you test it manually with JVisualVM or other tools and attach the screen shot in this PR? thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    **[Test build #77254 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77254/testReport)** for PR 18073 at commit [`e4940b9`](https://github.com/apache/spark/commit/e4940b9e511a314144db042dfaa6b0f2f4a4a45d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118356467
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -495,6 +496,8 @@ object ParquetFileFormat extends Logging {
             } else {
               throw new IOException(s"Could not read footer for file: $currentFile", e)
             }
    +      } finally {
    +        pool.shutdown()
    --- End diff --
    
    > Why not doing it outside? For example?
    
    Just realized the `toSeq` is lazy. But shutting down in flatMap is also not correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    **[Test build #77296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77296/testReport)** for PR 18073 at commit [`7e57595`](https://github.com/apache/spark/commit/7e57595d2750b65002cda6a5ee94dbcf3f5f2f64).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77279/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118193514
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -479,7 +479,8 @@ object ParquetFileFormat extends Logging {
           partFiles: Seq[FileStatus],
           ignoreCorruptFiles: Boolean): Seq[Footer] = {
         val parFiles = partFiles.par
    -    parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    +    val pool = new ForkJoinPool(8)
    --- End diff --
    
    will it be better to share one global thread pool? Creating a lot of thread pools may not increase the concurrency


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118004547
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -479,8 +479,9 @@ object ParquetFileFormat extends Logging {
           partFiles: Seq[FileStatus],
           ignoreCorruptFiles: Boolean): Seq[Footer] = {
         val parFiles = partFiles.par
    -    parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    -    parFiles.flatMap { currentFile =>
    +    val readParquetTaskSupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    +    parFiles.tasksupport = readParquetTaskSupport
    +    val footers = parFiles.flatMap { currentFile =>
    --- End diff --
    
    You could probably put the shutdown in a finally block to avoid adding this new reference. I think you can rearrange to use the existing one below. It does help shut it down in case of an error too. Also I think you can hold a reference just to the ForkJoinPool in order to shut it down rather than a ref to ForkJoinTaskSupport. No big deal


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118382694
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -495,6 +496,8 @@ object ParquetFileFormat extends Logging {
             } else {
               throw new IOException(s"Could not read footer for file: $currentFile", e)
             }
    +      } finally {
    +        pool.shutdown()
    --- End diff --
    
    @zsxwing @gatorsmile 
    
    I was shutdowning it outside at the beginning of this PR. I changed to current way after @srowen's suggestion.
    
    I was thinking it can be wrong initially. But seems it is fine and I think the tasks are all invoked at the beginning and no more tasks are submitted later, so to shutdown inside is ok.
    
    I can go to submit a follow-up if you still think we need to change it. Thank you.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118195655
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -479,7 +479,8 @@ object ParquetFileFormat extends Logging {
           partFiles: Seq[FileStatus],
           ignoreCorruptFiles: Boolean): Seq[Footer] = {
         val parFiles = partFiles.par
    -    parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    +    val pool = new ForkJoinPool(8)
    --- End diff --
    
    The main concern is that if we share a thread pool for parquet reading, we may limit the concurrency as @srowen pointed out in the JIRA.
    
    If we have multiple parquet reading in parallel, they will share the pool. Currently they own their pool.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18073


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118384461
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -495,6 +496,8 @@ object ParquetFileFormat extends Logging {
             } else {
               throw new IOException(s"Could not read footer for file: $currentFile", e)
             }
    +      } finally {
    +        pool.shutdown()
    --- End diff --
    
    I don't check the details. But I guess the implementation will submit tasks one by one. Then it's possible that when the first task is shutting down the pool, some tasks has not yet submitted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    **[Test build #77279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77279/testReport)** for PR 18073 at commit [`14e09aa`](https://github.com/apache/spark/commit/14e09aafb8ee0cc9f5f1e95bc41b19eeab6fc7d4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77296/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118384970
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -495,6 +496,8 @@ object ParquetFileFormat extends Logging {
             } else {
               throw new IOException(s"Could not read footer for file: $currentFile", e)
             }
    +      } finally {
    +        pool.shutdown()
    --- End diff --
    
    Ok. We should take a safer approach. Let me submit a follow-up for this. Thanks @zsxwing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118199779
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -479,7 +479,8 @@ object ParquetFileFormat extends Logging {
           partFiles: Seq[FileStatus],
           ignoreCorruptFiles: Boolean): Seq[Footer] = {
         val parFiles = partFiles.par
    -    parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    +    val pool = new ForkJoinPool(8)
    --- End diff --
    
    ok let's keep the previous behavior


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    **[Test build #77279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77279/testReport)** for PR 18073 at commit [`14e09aa`](https://github.com/apache/spark/commit/14e09aafb8ee0cc9f5f1e95bc41b19eeab6fc7d4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118331567
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -495,6 +496,8 @@ object ParquetFileFormat extends Logging {
             } else {
               throw new IOException(s"Could not read footer for file: $currentFile", e)
             }
    +      } finally {
    +        pool.shutdown()
    --- End diff --
    
    Why not doing it outside? For example?
    
    ```Scala
        val parFiles = partFiles.par
        val pool = new ForkJoinPool(8)
        parFiles.tasksupport = new ForkJoinTaskSupport(pool)
        try {
          parFiles.flatMap { currentFile =>
            ...
          }.seq
        } finally {
          pool.shutdown()
        }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118158352
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormatSuite.scala ---
    @@ -26,6 +26,22 @@ import org.apache.spark.sql.test.SharedSQLContext
     
     class ParquetFileFormatSuite extends QueryTest with ParquetTest with SharedSQLContext {
     
    +  test("Number of threads doesn't grow extremely after parquet file reading") {
    +    withTempDir { dir =>
    +      val file = dir.toString + "/file"
    +      spark.range(1).toDF("a").coalesce(1).write.parquet(file)
    +      spark.read.parquet(file)
    +      val numThreadBefore = Thread.activeCount
    +      (1 to 100).map { _ =>
    +        spark.read.parquet(file)
    +      }
    +      val numThreadAfter = Thread.activeCount
    +      // Hard to test a correct thread number,
    +      // but it shouldn't increase more than a reasonable number.
    +      assert(numThreadAfter - numThreadBefore < 20)
    --- End diff --
    
    It reduces to few after waiting an enough time. The number returned by Thread.activeCount is only an estimate. So we may not expect this to be 0.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    cc @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118195891
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -479,7 +479,8 @@ object ParquetFileFormat extends Logging {
           partFiles: Seq[FileStatus],
           ignoreCorruptFiles: Boolean): Seq[Footer] = {
         val parFiles = partFiles.par
    -    parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    +    val pool = new ForkJoinPool(8)
    --- End diff --
    
    Not sure if using a shared one will change current behavior.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    cc @cloud-fan @gatorsmile This is a regression in 2.1. I think we may want to include this in 2.2. Please take a look. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18073: [SPARK-20848][SQL] Shutdown the pool after readin...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18073#discussion_r118193060
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormatSuite.scala ---
    @@ -26,6 +26,22 @@ import org.apache.spark.sql.test.SharedSQLContext
     
     class ParquetFileFormatSuite extends QueryTest with ParquetTest with SharedSQLContext {
     
    +  test("Number of threads doesn't grow extremely after parquet file reading") {
    +    withTempDir { dir =>
    +      val file = dir.toString + "/file"
    +      spark.range(1).toDF("a").coalesce(1).write.parquet(file)
    +      spark.read.parquet(file)
    +      val numThreadBefore = Thread.activeCount
    +      (1 to 100).map { _ =>
    +        spark.read.parquet(file)
    +      }
    +      val numThreadAfter = Thread.activeCount
    +      // Hard to test a correct thread number,
    +      // but it shouldn't increase more than a reasonable number.
    +      assert(numThreadAfter - numThreadBefore < 20)
    --- End diff --
    
    this looks hacky, can we think of a better way to test it? If not, I suggest to remove this test, as the fix is straightforward and we can verify it manually by some profile tools.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    thanks, merging to master/2.2/2.1!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18073: [SPARK-20848][SQL] Shutdown the pool after reading parqu...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18073
  
    @cloud-fan My dev environment is not convenient to run GUI-based tools like jconsole. I use a command-line tool [jvmtop](https://github.com/patric-r/jvmtop).
    
    Screen shots (the column "#T" is the number of threads):
    
    Before:
    <img width="914" alt="screen shot 2017-05-24 at 9 43 11 pm" src="https://cloud.githubusercontent.com/assets/68855/26407021/7d5cdf82-40cc-11e7-880e-80675e181ab0.png">
    
    After:
    <img width="918" alt="screen shot 2017-05-24 at 9 38 15 pm" src="https://cloud.githubusercontent.com/assets/68855/26406991/6c730282-40cc-11e7-9349-525b3624e5f9.png">
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org