You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by hvanhovell <gi...@git.apache.org> on 2015/11/02 16:49:33 UTC

[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/9417

    [SPARK-11449][Core] PortableDataStream should be a factory

    ```PortableDataStream``` maintains some internal state. This makes it tricky to reuse a stream (one needs to call ```close``` on both the ```PortableDataStream``` and the ```InputStream``` it produces).
    
    This PR removes all state from ```PortableDataStream``` and effectively turns it into an ```InputStream```/```Array[Byte]``` factory. This makes the user responsible for managing the ```InputStream``` it returns.
    
    cc @srowen 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-11449

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9417.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9417
    
----
commit 91b8c6cc86c83df161e176c7a4efbc3dd439d037
Author: Herman van Hovell <hv...@questtec.nl>
Date:   2015-11-02T15:42:44Z

    Removed state from PortableDataStream

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-153095133
  
    **[Test build #1967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1967/consoleFull)** for PR 9417 at commit [`91b8c6c`](https://github.com/apache/spark/commit/91b8c6cc86c83df161e176c7a4efbc3dd439d037).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-153696906
  
    LGTM. Let me leave it open a bit for comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-154001929
  
    Merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-153133888
  
    **[Test build #1968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1968/consoleFull)** for PR 9417 at commit [`91b8c6c`](https://github.com/apache/spark/commit/91b8c6cc86c83df161e176c7a4efbc3dd439d037).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Corr(`\n  * `case class Corr(left: Expression, right: Expression)`\n  * `case class RepartitionByExpression(`\n  * `        logInfo(s\"Hive class not found $e\")`\n  * `        logDebug(\"Hive class not found\", e)`\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-153059875
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9417#discussion_r43870851
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -177,39 +170,24 @@ class PortableDataStream(
       }
     
       /**
    -   * Create a new DataInputStream from the split and context
    +   * Create a new DataInputStream from the split and context. The user of this method is responsible
    +   * for closing the stream after usage.
        */
       def open(): DataInputStream = {
    -    if (!isOpen) {
    -      val pathp = split.getPath(index)
    -      val fs = pathp.getFileSystem(conf)
    -      fileIn = fs.open(pathp)
    -      isOpen = true
    -    }
    -    fileIn
    +    val pathp = split.getPath(index)
    +    val fs = pathp.getFileSystem(conf)
    +    fs.open(pathp)
       }
     
       /**
        * Read the file as a byte array
        */
       def toArray(): Array[Byte] = {
    -    open()
    -    val innerBuffer = ByteStreams.toByteArray(fileIn)
    -    close()
    -    innerBuffer
    -  }
    -
    -  /**
    -   * Close the file (if it is currently open)
    -   */
    -  def close(): Unit = {
    --- End diff --
    
    Added a deprecated ```close()``` method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-153059923
  
    @kmader what do you think of this one?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by kmader <gi...@git.apache.org>.
Github user kmader commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-153550148
  
    @srowen @hvanhovell this is a nice improvement and more elegant than the original approach. 
    
    As a side node, In our code base (which uses PortableDataStream quite a bit), these changes didn't break anything so since it passes tests as well, I'd say it's good to go.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9417#discussion_r43869486
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -177,39 +170,24 @@ class PortableDataStream(
       }
     
       /**
    -   * Create a new DataInputStream from the split and context
    +   * Create a new DataInputStream from the split and context. The user of this method is responsible
    +   * for closing the stream after usage.
        */
       def open(): DataInputStream = {
    -    if (!isOpen) {
    -      val pathp = split.getPath(index)
    -      val fs = pathp.getFileSystem(conf)
    -      fileIn = fs.open(pathp)
    -      isOpen = true
    -    }
    -    fileIn
    +    val pathp = split.getPath(index)
    +    val fs = pathp.getFileSystem(conf)
    +    fs.open(pathp)
       }
     
       /**
        * Read the file as a byte array
        */
       def toArray(): Array[Byte] = {
    -    open()
    -    val innerBuffer = ByteStreams.toByteArray(fileIn)
    -    close()
    -    innerBuffer
    -  }
    -
    -  /**
    -   * Close the file (if it is currently open)
    -   */
    -  def close(): Unit = {
    --- End diff --
    
    My only last concern here is removing this method. It's `@Experimental` though it has been around a while. We could just keep the method as a no-op, in case some caller out there is calling it. Maybe deprecate it, and log a warning too. I'm not so fussed about the behavior change since it's again `@Experimental` and I think callers should have been closing the stream directly anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-153095896
  
    **[Test build #1968 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1968/consoleFull)** for PR 9417 at commit [`91b8c6c`](https://github.com/apache/spark/commit/91b8c6cc86c83df161e176c7a4efbc3dd439d037).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9417


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9417#issuecomment-153060623
  
    **[Test build #1967 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1967/consoleFull)** for PR 9417 at commit [`91b8c6c`](https://github.com/apache/spark/commit/91b8c6cc86c83df161e176c7a4efbc3dd439d037).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org