Posted to reviews@spark.apache.org by darionyaphet <gi...@git.apache.org> on 2017/03/02 09:51:27 UTC

[GitHub] spark pull request #17135: SPARK-19794 Release HDFS Client after read/write ...

GitHub user darionyaphet opened a pull request:

    https://github.com/apache/spark/pull/17135

    SPARK-19794 Release HDFS Client after read/write checkpoint

    ## What changes were proposed in this pull request?
    
    Close the HDFS client and streams after reading from and writing to HDFS.
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/darionyaphet/spark SPARK-19794

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17135
    
----
commit 60754bd65a389181f055882daa390e8aea4e3989
Author: darionyaphet <da...@gmail.com>
Date:   2017-03-02T09:48:49Z

    SPARK-19794 Release HDFS Client after read/write checkpoint

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17135: SPARK-19794 Release HDFS Client after read/write ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17135#discussion_r103907312
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala ---
    @@ -181,21 +181,26 @@ private[spark] object ReliableCheckpointRDD extends Logging {
           serializeStream.writeAll(iterator)
         } {
           serializeStream.close()
    +      fileOutputStream.close()
         }
     
    -    if (!fs.rename(tempOutputPath, finalOutputPath)) {
    -      if (!fs.exists(finalOutputPath)) {
    -        logInfo(s"Deleting tempOutputPath $tempOutputPath")
    -        fs.delete(tempOutputPath, false)
    -        throw new IOException("Checkpoint failed: failed to save output of task: " +
    -          s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
    -      } else {
    -        // Some other copy of this task must've finished before us and renamed it
    -        logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
    -        if (!fs.delete(tempOutputPath, false)) {
    -          logWarning(s"Error deleting ${tempOutputPath}")
    +    try {
    --- End diff --
    
    Given that this doesn't encompass the span of usage for `fs` -- better to just call `fs.close()` at the end and not worry about manually closing in an error case? or expand the try-finally?
    
    Actually, I am not sure we are supposed to call `FileSystem.close()` because they are shared instances, cached and reused across the whole application.
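The caution above is the crux of the review: Hadoop's `FileSystem.get` returns instances from an application-wide cache, so a `close()` through one reference tears the handle down for every other holder. A minimal Java sketch of that hazard, using a toy cache in place of Hadoop's internal one (all names here are hypothetical, not Hadoop API):

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for Hadoop's FileSystem cache: get() hands every caller the
// SAME instance for a given URI, so close() through one reference
// invalidates the handle for all other holders.
public class SharedHandleCache {
    static class Handle implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
        String read() {
            if (closed) throw new IllegalStateException("handle is closed");
            return "data";
        }
    }

    private static final Map<String, Handle> CACHE = new HashMap<>();

    static synchronized Handle get(String uri) {
        return CACHE.computeIfAbsent(uri, u -> new Handle());
    }

    public static void main(String[] args) {
        Handle a = get("hdfs://ns1");
        Handle b = get("hdfs://ns1");   // same cached instance as a
        a.close();                      // "I'm done with my copy"...
        try {
            b.read();                   // ...but b is now unusable too
            throw new AssertionError("expected read() to fail");
        } catch (IllegalStateException expected) {
            System.out.println("closing a shared cached handle broke the other user");
        }
    }
}
```

This is why the reviewers converge on not closing `fs` at all: the cached instance is designed to be shared for the life of the application.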




[GitHub] spark issue #17135: SPARK-19794 Release HDFS Client after read/write checkpo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17135
  
    Can one of the admins verify this patch?




[GitHub] spark pull request #17135: SPARK-19794 Release HDFS Client after read/write ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17135#discussion_r103907445
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala ---
    @@ -279,8 +287,13 @@ private[spark] object ReliableCheckpointRDD extends Logging {
     
         // Register an on-task-completion callback to close the input stream.
         context.addTaskCompletionListener(context => deserializeStream.close())
    -
    -    deserializeStream.asIterator.asInstanceOf[Iterator[T]]
    +    Utils.tryWithSafeFinally {
    --- End diff --
    
    I don't think you can close it here, right? You're returning an iterator over the stream.




[GitHub] spark issue #17135: SPARK-19794 Release HDFS Client after read/write checkpo...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/17135
  
    It's not just a matter of a performance regression: it will break any other code that holds a reference to the file system being closed. -1.




[GitHub] spark pull request #17135: SPARK-19794 Release HDFS Client after read/write ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17135#discussion_r104077712
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala ---
    @@ -279,8 +287,13 @@ private[spark] object ReliableCheckpointRDD extends Logging {
     
         // Register an on-task-completion callback to close the input stream.
         context.addTaskCompletionListener(context => deserializeStream.close())
    -
    -    deserializeStream.asIterator.asInstanceOf[Iterator[T]]
    +    Utils.tryWithSafeFinally {
    --- End diff --
    
    This change introduces an issue: `deserializeStream` must stay open until the task has finished, but the code here closes the stream prematurely.
    
    Also, please look at L289: it already takes care of `close` after the task is finished.
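The deferred-close pattern this comment defends can be sketched with toy Java types (all names hypothetical): the stream backing the returned iterator must outlive the call, so cleanup is registered as a completion callback, analogous to `context.addTaskCompletionListener`, rather than closed in a try-finally around the return.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of returning an iterator over an open stream while deferring
// close() to a task-completion callback (analogous to
// context.addTaskCompletionListener(context => deserializeStream.close())).
public class DeferredClose {
    static class Stream {
        boolean closed = false;
        private final int[] data = {10, 20, 30};
        private int pos = 0;
        boolean hasMore() { return pos < data.length; }
        int read() {
            if (closed) throw new IllegalStateException("stream closed");
            return data[pos++];
        }
        void close() { closed = true; }
    }

    // Closing in a try-finally here would be premature: the caller has not
    // consumed the iterator yet. Instead, register the close as a callback.
    static Iterator<Integer> readCheckpoint(Stream in, List<Runnable> onTaskCompletion) {
        onTaskCompletion.add(in::close);
        return new Iterator<Integer>() {
            @Override public boolean hasNext() { return in.hasMore(); }
            @Override public Integer next() { return in.read(); }
        };
    }

    public static void main(String[] args) {
        Stream in = new Stream();
        List<Runnable> onTaskCompletion = new ArrayList<>();
        Iterator<Integer> it = readCheckpoint(in, onTaskCompletion);
        while (it.hasNext()) System.out.println(it.next()); // stream still open
        onTaskCompletion.forEach(Runnable::run);            // task done: now close
        System.out.println("closed = " + in.closed);
    }
}
```

Wrapping the return value in `Utils.tryWithSafeFinally` inverts this ordering, which is exactly the premature close the reviewers object to.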




[GitHub] spark pull request #17135: SPARK-19794 Release HDFS Client after read/write ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17135#discussion_r103907416
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala ---
    @@ -216,6 +221,8 @@ private[spark] object ReliableCheckpointRDD extends Logging {
             serializeStream.writeObject(partitioner)
           } {
             serializeStream.close()
    +        fileOutputStream.close()
    --- End diff --
    
    Ditto: this is OK if `serializeStream.close()` doesn't already close the underlying stream (?), but I'm not sure about the next line.
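The double-close concern here can be illustrated with plain `java.io` (a sketch by analogy, not Spark's actual `SerializationStream`): wrapper streams close the stream they wrap, so an extra explicit `close()` on the underlying stream is at best redundant.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// FilterOutputStream.close() closes the stream it wraps, so an additional
// explicit close() on the underlying stream results in a double close.
public class DoubleClose {
    static class CountingStream extends OutputStream {
        int closeCalls = 0;
        @Override public void write(int b) {}
        @Override public void close() { closeCalls++; }
    }

    static int closeTwice() {
        CountingStream underlying = new CountingStream();
        try (OutputStream wrapper = new FilterOutputStream(underlying)) {
            // try-with-resources closes wrapper, which closes underlying
        } catch (IOException ignored) {}
        underlying.close(); // the patch's extra close: a second close
        return underlying.closeCalls;
    }

    public static void main(String[] args) {
        System.out.println(closeTwice()); // prints 2
    }
}
```

Double-closing a `ByteArrayOutputStream` is harmless, but on a real HDFS output stream the second close can throw or mask an earlier error, which is the reviewer's worry.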




[GitHub] spark pull request #17135: SPARK-19794 Release HDFS Client after read/write ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17135




[GitHub] spark pull request #17135: SPARK-19794 Release HDFS Client after read/write ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17135#discussion_r104077402
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala ---
    @@ -181,21 +181,26 @@ private[spark] object ReliableCheckpointRDD extends Logging {
           serializeStream.writeAll(iterator)
         } {
           serializeStream.close()
    +      fileOutputStream.close()
         }
     
    -    if (!fs.rename(tempOutputPath, finalOutputPath)) {
    -      if (!fs.exists(finalOutputPath)) {
    -        logInfo(s"Deleting tempOutputPath $tempOutputPath")
    -        fs.delete(tempOutputPath, false)
    -        throw new IOException("Checkpoint failed: failed to save output of task: " +
    -          s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
    -      } else {
    -        // Some other copy of this task must've finished before us and renamed it
    -        logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
    -        if (!fs.delete(tempOutputPath, false)) {
    -          logWarning(s"Error deleting ${tempOutputPath}")
    +    try {
    --- End diff --
    
    Agreed with @srowen: `FileSystem` is a cached object, and closing it removes it from the cache. I don't think we need to call this explicitly, because by default it is designed to be shared.




[GitHub] spark issue #17135: SPARK-19794 Release HDFS Client after read/write checkpo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17135
  
    Yes, this is substantially not something we can merge, so let's close it.




[GitHub] spark issue #17135: SPARK-19794 Release HDFS Client after read/write checkpo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17135
  
    **[Test build #3592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3592/testReport)** for PR 17135 at commit [`60754bd`](https://github.com/apache/spark/commit/60754bd65a389181f055882daa390e8aea4e3989).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17135: SPARK-19794 Release HDFS Client after read/write checkpo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17135
  
    I get the idea, but I'm not sure any of these are valid




[GitHub] spark issue #17135: SPARK-19794 Release HDFS Client after read/write checkpo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17135
  
    **[Test build #3592 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3592/testReport)** for PR 17135 at commit [`60754bd`](https://github.com/apache/spark/commit/60754bd65a389181f055882daa390e8aea4e3989).




[GitHub] spark issue #17135: SPARK-19794 Release HDFS Client after read/write checkpo...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/17135
  
    I recall that `FileSystem` instances are cached internally by default, so closing one will probably introduce a performance regression.

