You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2014/03/27 00:38:23 UTC

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/242

    [WIP] Spark 1271 (1320) cogroup and groupby should pass iterator[x]

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark spark-1320-cogroupandgroupshouldpassiterator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/242.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #242
    
----
commit 4cd116f08d6c759e1a94ccac549012e42ea2e2cf
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-25T22:32:55Z

    This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures

commit 5ccb7ac8346c08a2bfdbf096ff8d423672959c2c
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-25T22:40:31Z

    Calling size on an iterator is not so good if we want to use it after

commit 44b7259eeb78211e5dfb452b4e94be3e4e3c4239
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-25T22:40:37Z

    Fix some tests

commit b676212847654f5e32125474b9f520da29760e2c
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-25T23:06:02Z

    Fix groupbykey to actually give back an iterator

commit 0d395229da708d067fcb3e6ab0d6497e46279b0d
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T00:20:41Z

    I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy

commit 6576ba252288998f1861b5b55b81b276ef6f8458
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T00:21:28Z

    Fix Java API suite

commit 75122e6e1ebc0dfe8854c3670064faacbb119689
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T00:21:45Z

    hmmm try and fix up basic operation suite

commit a6e24ed0c3b1636ba19e2a59f349016e3283d675
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T00:22:20Z

    Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
    
    This reverts commit df9afbec7e9fb558cf75d4e8dc94d8f44f101301.

commit e17ac9b28e660a4de6ba140d101cc55957023fbf
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T00:45:13Z

    org.apache.spark.rdd.PairRDDFunctionsSuite passes

commit 0ec6bd701f4170f16a655f60ce372485bf9b2cae
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T01:29:18Z

    core/tests now pass

commit cdbd16316aac7d702fbb67180b21eff00aa2c594
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T01:37:32Z

    Fix some of the types in the streaming JavaAPI suite. Probably still needs more work

commit 100b21ac2cef532df13824cd02e16c4d03d5a4fb
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T16:38:46Z

    I really need to stop calling size on iterators, it is the path of sadness.

commit 31ebcf1cc1b599941f57d833f0cae4b8f5a19703
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T16:51:07Z

    Fix style issues

commit 4b088062f9fcdaf419c44dca60c1e57aae2ef194
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T17:59:10Z

    Revert this but for now put things in list pandas

commit b5e0849ad5ce7bba1f77c55da5b3164ed29f6558
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T20:21:41Z

    Add a join based on the problem in SVD

commit 3247db74603e06292d83d166b02d35740482b395
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T20:35:20Z

    Revert me, but we have some stuff to debug
    
    more loggings

commit 315696172ca709f218018da78e409787e91633c2
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T21:11:33Z

    Fix the bug

commit 33f17c93ecc1781e85ab09ad9e8b3f4f325d9b6f
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T22:12:19Z

    Revert

commit 2992d93a10ca1f8917aebdcecb972203b6c90146
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-03-26T23:37:22Z

    Revert "Revert this but for now put things in list pandas"
    
    This reverts commit 4b088062f9fcdaf419c44dca60c1e57aae2ef194.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38948866
  
    @pwendell I removed the WIP tag as the python changes are now done too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11398836
  
    --- Diff: python/pyspark/resultitr.py ---
    @@ -0,0 +1,33 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +__all__ = ["ResultItr"]
    +
    +import collections
    +
    +class ResultItr(collections.Iterable):
    --- End diff --
    
    Spell out the name on this (ResultIterable)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39626476
  
    @mateiz whoops yes, commented on the wrong pr


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39898158
  
    Build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39912158
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38887449
  
    @pwendell true, but its also probably easier to do quickly if you have a lot of code, and the Java people can't use toSeq. That being said I'd rather not add the legacy functions unless there is a strong use case for them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39914647
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39372669
  
    BTW this should also make the transition from Seq and Collection easier for users.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39814557
  
    Build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11380438
  
    --- Diff: extras/java8-tests/src/test/java/org/apache/spark/Java8APISuite.java ---
    @@ -60,6 +60,16 @@ public void tearDown() {
         System.clearProperty("spark.driver.port");
       }
     
    +  private int iterableSize(Iterable<?> a) {
    --- End diff --
    
    Can be removed I think


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39845685
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13891/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39845684
  
    Build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38869634
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39902366
  
    @holdenk this needs to be up-merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39373102
  
    The python one is allready an iterable. I'll refactor this if @rxin / @aarondav don't have any comments :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39888437
  
    Build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38765207
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38765654
  
    Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39891603
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13903/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39814558
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13878/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39372516
  
    Hey @holdenk, I actually thought of a problem with using Iterator here -- users won't be able to cache() an RDD of Iterators in a meaningful way. (After the first time you read it, all iterators will be exhausted). We should use Iterable instead, both in Scala and in Java. This will make the implementation tougher when we do go to on-disk storage, and we may still have to do special tricks for caching (cause we don't just want to cache an object that knows how to read data across the network), but it's a more appropriate API.
    
    You should do something equivalent in Python; it already has a concept of iterables (http://www.shutupandship.com/2012/01/understanding-python-iterables-and.html).
    
    CC @rxin, @aarondav in case there are other ideas.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38889781
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13538/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39882201
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11398816
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
    @@ -421,12 +421,12 @@ class ALS private (
        * Compute the new feature vectors for a block of the users matrix given the list of factors
        * it received from each product and its InLinkBlock.
        */
    -  private def updateBlock(messages: Seq[(Int, Array[Array[Double]])], inLinkBlock: InLinkBlock,
    +  private def updateBlock(messages: Iterable[(Int, Array[Array[Double]])], inLinkBlock: InLinkBlock,
           rank: Int, lambda: Double, alpha: Double, YtY: Option[Broadcast[DoubleMatrix]])
         : Array[Array[Double]] =
       {
         // Sort the incoming block factor messages by block ID and make them an array
    -    val blockFactors = messages.sortBy(_._1).map(_._2).toArray // Array[Array[Double]]
    +    val blockFactors = messages.toArray.sortBy(_._1).map(_._2) // Array[Array[Double]]
    --- End diff --
    
    It sort of is, but is necessary for sorting. I don't think it will take up that much space compared to the blocks themselves.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38864494
  
     Merged build triggered. One or more automated tests failed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38887022
  
     Merged build triggered. One or more automated tests failed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39892649
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38756242
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38880557
  
    @mridulm if not method overloading - do you have a specific proposal in mind? Downstream consumers will have to add `toSe`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39890392
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11380355
  
    --- Diff: core/src/test/scala/org/apache/spark/FailureSuite.scala ---
    @@ -72,7 +72,8 @@ class FailureSuite extends FunSuite with LocalSparkContext {
                 throw new Exception("Intentional task failure")
               }
             }
    -        (k, v(0) * v(0))
    +        val vHead = v.iterator.next()
    +        (k, vHead * vHead)
    --- End diff --
    
    Can express this in one line: `(k, v.head * v.head)`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39919855
  
    Thanks @holdenk - merged!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38765714
  
    Jenkins, retest this please
    
    On Wednesday, March 26, 2014, UCB AMPLab <no...@github.com> wrote:
    
    > Merged build finished.
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/242#issuecomment-38765654>
    > .
    >
    
    
    -- 
    Cell : 425-233-8271


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11380312
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -78,6 +79,38 @@ public int compare(Integer a, Integer b) {
         }
       }
     
    +  private int iteratorSize(Iterator<?> a) {
    --- End diff --
    
    Here I think you can use `Iterators.size` and `Iterables.size` from guava:
    http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterators.html#size(java.util.Iterator)
    http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#size(java.lang.Iterable)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11380250
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -328,9 +330,10 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
           : RDD[(K, (Option[V], W))] = {
         this.cogroup(other, partitioner).flatMapValues { case (vs, ws) =>
           if (vs.isEmpty) {
    -        ws.iterator.map(w => (None, w))
    +        ws.map(w => (None, w))
           } else {
    -        for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), w)
    +        val wlist = ws.toList
    +        for (v <- vs; w <- wlist) yield (Some(v), w)
    --- End diff --
    
    ```
            for (v <- vs; w <- ws) yield (v, Some(w))
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39892406
  
    Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39914648
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13911/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38773795
  
    It would be a bit difficult to preserve the other return type since the call signatures are the same. Its probably worth calling out in the release notes given that I had to make changes to some code which compiled fine against the changes but failed at run-time (mostly in Scala the iterator type has a size method which while it works consumes all of the elements).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39912147
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11380428
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/JavaPageRank.java ---
    @@ -86,12 +87,18 @@ public Double call(List<String> rs) {
         for (int current = 0; current < Integer.parseInt(args[2]); current++) {
           // Calculates URL contributions to the rank of other URLs.
           JavaPairRDD<String, Double> contribs = links.join(ranks).values()
    -        .flatMapToPair(new PairFlatMapFunction<Tuple2<List<String>, Double>, String, Double>() {
    +        .flatMapToPair(new PairFlatMapFunction<Tuple2<Iterable<String>, Double>, String, Double>() {
               @Override
    -          public Iterable<Tuple2<String, Double>> call(Tuple2<List<String>, Double> s) {
    +          public Iterable<Tuple2<String, Double>> call(Tuple2<Iterable<String>, Double> s) {
    +	    int urlCount = 0;
    +	    Iterator<String> urls = s._1.iterator();
    +            while (urls.hasNext()) {
    --- End diff --
    
    could you use `Iterators.size` here too?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39595898
  
    @mateiz what about pulling the test utility in defined here? https://github.com/apache/spark/pull/119/files#diff-80212629295b483ab30a3436a653e97aR65
    
    Some other people have already asked me for that to write other tests. I could submit it as its own PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39841083
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38769996
  
    We are trying to cement the API for 1.0. This will allow us to provide an actual iterator-based implementation later that does not materialize all the values for a single group-by key at once, for instance. With our current Seq-based solution, we have basically no choice but to OOM now and for any future implementation if the values are too large to fit in memory.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38765208
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38759436
  
    Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39404342
  
    @pwendell @holdenk i dont have a good solution to keep compatibility, was hoping you guys might :-)
    This will require some rewrite, but I am hoping it wont be more than method signature changes ...
    
    
    @mateiz that is exactly the kind of usecase I was thinking about
    if we had support for something like output collector, would have made even more sense (stream relation 1 from iterator, replay iterator 2 repeatedly, and output goes to a collector which could be disk backed to remove memory pressure). But even without that, I can see the benefits of using Iterable instead of Iterator (my scala collection knowledge is nil, thanks for pointing out existing interface !) ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39888439
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13896/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39841075
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39603162
  
    @pwendell I assume you meant to put that on the class loader PR, not this one? Anyway that utility would be good. The other thing we can do is add a Maven project that *just* generates some test classes in a known directory (we'd never publish them anywhere), but it seems high-overhead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38756241
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11380471
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
    @@ -421,12 +421,12 @@ class ALS private (
        * Compute the new feature vectors for a block of the users matrix given the list of factors
        * it received from each product and its InLinkBlock.
        */
    -  private def updateBlock(messages: Seq[(Int, Array[Array[Double]])], inLinkBlock: InLinkBlock,
    +  private def updateBlock(messages: Iterable[(Int, Array[Array[Double]])], inLinkBlock: InLinkBlock,
           rank: Int, lambda: Double, alpha: Double, YtY: Option[Broadcast[DoubleMatrix]])
         : Array[Array[Double]] =
       {
         // Sort the incoming block factor messages by block ID and make them an array
    -    val blockFactors = messages.sortBy(_._1).map(_._2).toArray // Array[Array[Double]]
    +    val blockFactors = messages.toArray.sortBy(_._1).map(_._2) // Array[Array[Double]]
    --- End diff --
    
    @mateiz is this an expensive thing here? this will make an extra copy of this sequence.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38767649
  
    what is the immediate need for this change ?
    i am not disagreeing with it, and this would definitely be needed for disk-backed data, etc : but what is the current need for this ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39811251
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38771396
  
    fair enough, just wanted to ensure it was being done in preparation for disk backed iterator (or atleast that would be something which can be implemented using this).
    Actually, instead of Iterator, would be good if we exposed a restartable iterator - this allows for interesting usecases.
    
    Btw, incompatible api change right ? Any attempts to preserve (with deprecated warning) existing behavior ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11379907
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -298,7 +298,8 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
        */
       def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
         this.cogroup(other, partitioner).flatMapValues { case (vs, ws) =>
    -      for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
    +      val wlist = ws.toList
    --- End diff --
    
    could this just be one line?
    
    ```
          for (v <- vs; w <- ws) yield (v, w)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38887028
  
    Merged build started. One or more automated tests failed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39374474
  
    No ideas here, though Iterable will probably make the implementation of actual iterator-backed cogrouping significantly more difficult unless we always store the result to disk like we do for shuffles (which has performance implications). That might be good anyway, though, because otherwise we have to keep some sort of stream running to satisfy the iterator at any point (as it could be read well after the job finishes).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38886975
  
    @mridulm Did you want something like named legacyGroupByKey which would have the old behaviour?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38869216
  
    Do we also want these changes for the python API?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38765656
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13492/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39378171
  
    The Python one is indeed an iterable, but it also seems to be an iterator. That might be confusing to users, because if they use it as an iterator and cache it, they can't reuse it. Maybe make it only act as an iterable? In that case it will essentially be a wrapper over an array that only provides the __iter__ method, kind of weird but good for ensuring that this will be compatible with on-disk storage later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39811244
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39909639
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38886937
  
    So I made the python API match as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39906452
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38869635
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13529/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39906464
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38889780
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39891602
  
    Build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38864509
  
    Merged build started. One or more automated tests failed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38864553
  
    @mridulm I think type erasure will make it difficult to preserve compatiblity.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39898160
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13905/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38759440
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13484/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/242#discussion_r11406772
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
    @@ -421,12 +421,12 @@ class ALS private (
        * Compute the new feature vectors for a block of the users matrix given the list of factors
        * it received from each product and its InLinkBlock.
        */
    -  private def updateBlock(messages: Seq[(Int, Array[Array[Double]])], inLinkBlock: InLinkBlock,
    +  private def updateBlock(messages: Iterable[(Int, Array[Array[Double]])], inLinkBlock: InLinkBlock,
           rank: Int, lambda: Double, alpha: Double, YtY: Option[Broadcast[DoubleMatrix]])
         : Array[Array[Double]] =
       {
         // Sort the incoming block factor messages by block ID and make them an array
    -    val blockFactors = messages.sortBy(_._1).map(_._2).toArray // Array[Array[Double]]
    +    val blockFactors = messages.toArray.sortBy(_._1).map(_._2) // Array[Array[Double]]
    --- End diff --
    
    So looking at the behaviour of Seq in the scala shell, it looks like we can replace this toArray with toSeq and not have the performance hit for now (until we start using something other than Seq's as iterables at which point we actually get benefits from the change).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38887023
  
    I think moving your function to `legacyGroupByKey` is more annoying than writing `toSeq` :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39882219
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39909641
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13910/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/242


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39892627
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38880515
  
    @pwendell agree, I was not referring to method overloading.
    There is a lot of code already written which will require rewrite if interface changes (how significant, I am not sure : but cost would be same as Seq does today unfortunately).
    I am thinking of how to minimize cost of migration compatibility.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] Spark 1271 (1320) cogroup and groupby sh...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-38881836
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark 1271 (1320) cogroup and groupby should p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/242#issuecomment-39890412
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---