You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by zsxwing <gi...@git.apache.org> on 2014/12/11 04:02:55 UTC

[GitHub] spark pull request: [SPARK-4824][Core] Add 'iterator' to reduce me...

GitHub user zsxwing opened a pull request:

    https://github.com/apache/spark/pull/3671

    [SPARK-4824][Core] Add 'iterator' to reduce memory consumed by join

    In Scala, `map` and `flatMap` of `Iterable` will copy the contents of `Iterable` to a new `Seq`. Such as,
    ```Scala
      val iterable = Seq(1, 2, 3).map(v => {
        println(v)
        v
      })
      println("Iterable map done")
    
      val iterator = Seq(1, 2, 3).iterator.map(v => {
        println(v)
        v
      })
      println("Iterator map done")
    ```
    outputed
    ```
    1
    2
    3
    Iterable map done
    Iterator map done
    ```
    So we should use 'iterator' to reduce memory consumed by join.
    
    Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark SPARK-4824

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3671.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3671
    
----
commit 95d59d697a7f2fabd877dac864e9dff8d1011d5a
Author: zsxwing <zs...@gmail.com>
Date:   2014-12-11T02:42:02Z

    Add 'iterator' to reduce memory consumed by join

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by scwf <gi...@git.apache.org>.
Github user scwf commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66772429
  
    Very interesting, any test for the effect on memory or performance?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66576578
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24351/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3671#discussion_r21743352
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -493,9 +493,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = {
         this.cogroup(other, partitioner).flatMapValues { pair =>
           if (pair._2.isEmpty) {
    -        pair._1.map(v => (v, None))
    +        pair._1.iterator.map(v => (v, None): (V, Option[W]))
    --- End diff --
    
    You're right. My IDE had some problem. After I rebuilt the project, the errors gone.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3671#discussion_r21668908
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -493,9 +493,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = {
         this.cogroup(other, partitioner).flatMapValues { pair =>
           if (pair._2.isEmpty) {
    -        pair._1.map(v => (v, None))
    +        pair._1.iterator.map(v => (v, None): (V, Option[W]))
    --- End diff --
    
    Interesting, are these types required? or can it be limited to just changing `None` to `None: Option[W]`?
    Not that it hurts to spell out the types.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66781010
  
      [Test build #24407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24407/consoleFull) for   PR 3671 at commit [`48ee7b9`](https://github.com/apache/spark/commit/48ee7b9c9aee147f0dc5bfec85a7ce4ea1319c78).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66771834
  
      [Test build #24407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24407/consoleFull) for   PR 3671 at commit [`48ee7b9`](https://github.com/apache/spark/commit/48ee7b9c9aee147f0dc5bfec85a7ce4ea1319c78).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by johannessimon <gi...@git.apache.org>.
Github user johannessimon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3671#discussion_r21739399
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -493,9 +493,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = {
         this.cogroup(other, partitioner).flatMapValues { pair =>
           if (pair._2.isEmpty) {
    -        pair._1.map(v => (v, None))
    +        pair._1.iterator.map(v => (v, None): (V, Option[W]))
    --- End diff --
    
    @zsxwing First of all thanks for the patch! What kind of error are you getting without these explicit types? Compile time or runtime? I don't get any compile error/warning without these types.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66781024
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24407/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66566629
  
      [Test build #24348 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24348/consoleFull) for   PR 3671 at commit [`95d59d6`](https://github.com/apache/spark/commit/95d59d697a7f2fabd877dac864e9dff8d1011d5a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66567599
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3671#discussion_r21670105
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -493,9 +493,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = {
         this.cogroup(other, partitioner).flatMapValues { pair =>
           if (pair._2.isEmpty) {
    -        pair._1.map(v => (v, None))
    +        pair._1.iterator.map(v => (v, None): (V, Option[W]))
    --- End diff --
    
    > None to None: Option[W]
    
    Have tried. But not work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66568014
  
      [Test build #24351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24351/consoleFull) for   PR 3671 at commit [`95d59d6`](https://github.com/apache/spark/commit/95d59d697a7f2fabd877dac864e9dff8d1011d5a).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-67898424
  
    This looks good to me.  This is a small fix but one which could significantly improve memory usage during joins, so I'm going to pull this into `master` (1.3.0), `branch-1.2` (1.2.1), and `branch-1.1` (1.1.2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66576573
  
      [Test build #24351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24351/consoleFull) for   PR 3671 at commit [`95d59d6`](https://github.com/apache/spark/commit/95d59d697a7f2fabd877dac864e9dff8d1011d5a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3671


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4824][Core] Add 'iterator' to reduce me...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66562876
  
      [Test build #24348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24348/consoleFull) for   PR 3671 at commit [`95d59d6`](https://github.com/apache/spark/commit/95d59d697a7f2fabd877dac864e9dff8d1011d5a).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66791824
  
    > Very interesting, any test for the effect on memory or performance?
    
    No. But I expect the memory will descrease from O(m * n) to O(m + n).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-66566634
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24348/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4818][Core] Add 'iterator' to reduce me...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3671#issuecomment-67122065
  
    @pwendell Is it OK to put this patch into branch 1.2, or it's too late?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org