Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/08/22 06:55:27 UTC

[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/2093

    [SPARK-2871] [PySpark] add RDD.lookup(key)

    RDD.lookup(key)
    
            Return the list of values in the RDD for key `key`. This operation
            is done efficiently if the RDD has a known partitioner by only
            searching the partition that the key maps to.
    
            >>> l = range(1000)
            >>> rdd = sc.parallelize(zip(l, l), 10)
            >>> rdd.lookup(42)  # slow
            [42]
            >>> sorted = rdd.sortByKey()
            >>> sorted.lookup(42)  # fast
            [42]
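
    For context, here is a minimal sketch of how a partitioner-aware lookup can avoid scanning every partition. This is an illustration only, not the exact code from this patch; it assumes the Python 2 era PySpark internals (`self.ctx.runJob`, a `partitioner` attribute on the RDD, and tuple-unpacking lambdas):

        def lookup(self, key):
            # keep only the values whose key matches
            values = self.filter(lambda (k, v): k == key).values()
            if self.partitioner is not None:
                # a known partitioner tells us which single partition can
                # hold the key, so run the job on that partition only
                return self.ctx.runJob(values, lambda it: list(it),
                                       [self.partitioner(key)], False)
            # no known partitioner: every partition has to be scanned
            return values.collect()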

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark lookup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2093.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2093
    
----
commit eb1305df44e030b086ca3cf518bdda3bef772109
Author: Davies Liu <da...@gmail.com>
Date:   2014-08-22T04:48:50Z

    add RDD.lookup(key)

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53451126
  
    @ScrapCodes I have run pyspark with PyPy successfully; I will send out a PR and some benchmarks later.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53182020
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19123/consoleFull) for   PR 2093 at commit [`2871b80`](https://github.com/apache/spark/commit/2871b802a9a7145af1ae93594fbf6b01bd2bb1b7).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class BoundedFloat(float):`
      * `class JoinedRow2 extends Row `
      * `class JoinedRow3 extends Row `
      * `class JoinedRow4 extends Row `
      * `class JoinedRow5 extends Row `
      * `class GenericRow(protected[sql] val values: Array[Any]) extends Row `
      * `abstract class MutableValue extends Serializable `
      * `final class MutableInt extends MutableValue `
      * `final class MutableFloat extends MutableValue `
      * `final class MutableBoolean extends MutableValue `
      * `final class MutableDouble extends MutableValue `
      * `final class MutableShort extends MutableValue `
      * `final class MutableLong extends MutableValue `
      * `final class MutableByte extends MutableValue `
      * `final class MutableAny extends MutableValue `
      * `final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow `
      * `case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate `
      * `case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression `
      * `case class CollectHashSetFunction(`
      * `case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression `
      * `case class CombineSetsAndCountFunction(`
      * `case class CountDistinctFunction(`
      * `case class MaxOf(left: Expression, right: Expression) extends Expression `
      * `case class NewSet(elementType: DataType) extends LeafExpression `
      * `case class AddItemToSet(item: Expression, set: Expression) extends Expression `
      * `case class CombineSets(left: Expression, right: Expression) extends BinaryExpression `
      * `case class CountSet(child: Expression) extends UnaryExpression `





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53190051
  
    i'm never a fan of code reformatting, whitespace changes, or refactoring mixed in with functional changes, e.g. `if (k, v)` -> `if k, v`; `v if k not in m else func(m[k], v)` -> `func(m[k], v) if k in m else v`; the MaxHeapQ change; `if not x: return None` -> `if x: return y`

    while all those are good changes, it's good practice (and future you, me, and others will thank you) to bundle those changes with their own justification, for instance -

    question: why did you change to using count() instead of collect() to force evaluation?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53381277
  
    In PyPy 2.3 the result is reversed: MaxHeapQ is 3x faster than nsmallest(). There they are both implemented in pure Python, and nsmallest() does more work than MaxHeapQ because it keeps the result stable.

    Even if nsmallest() did not do a stable sort, MaxHeapQ would still be 30% faster, because nsmallest() will try to call __le__ if there is no __lt__ (while MaxHeapQ will fail if an object has no __gt__).

    BTW, PyPy does very well at optimizing these algorithms.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53025922
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19074/consoleFull) for   PR 2093 at commit [`eb1305d`](https://github.com/apache/spark/commit/eb1305df44e030b086ca3cf518bdda3bef772109).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53306345
  
    @mattf I agree with you that it's generally a good idea to separate functional vs. cosmetic changes (I've weighed in against several PRs that _only_ perform minor code formatting changes, especially ones that touch tens or hundreds of files).
    
    Ideally, this PR would have only made the necessary changes for `lookup()`, since that makes things easier to review.  However, I think we do want to eventually make most of the other changes, so I don't mind trying to review those changes here (the alternative is having to review another PR, which creates more work for me now that I've already reviewed the code here).




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53181857
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19122/consoleFull) for   PR 2093 at commit [`2871b80`](https://github.com/apache/spark/commit/2871b802a9a7145af1ae93594fbf6b01bd2bb1b7).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class BoundedFloat(float):`
      * `class JoinedRow2 extends Row `
      * `class JoinedRow3 extends Row `
      * `class JoinedRow4 extends Row `
      * `class JoinedRow5 extends Row `
      * `class GenericRow(protected[sql] val values: Array[Any]) extends Row `
      * `abstract class MutableValue extends Serializable `
      * `final class MutableInt extends MutableValue `
      * `final class MutableFloat extends MutableValue `
      * `final class MutableBoolean extends MutableValue `
      * `final class MutableDouble extends MutableValue `
      * `final class MutableShort extends MutableValue `
      * `final class MutableLong extends MutableValue `
      * `final class MutableByte extends MutableValue `
      * `final class MutableAny extends MutableValue `
      * `final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow `
      * `case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate `
      * `case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression `
      * `case class CollectHashSetFunction(`
      * `case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression `
      * `case class CombineSetsAndCountFunction(`
      * `case class CountDistinctFunction(`
      * `case class MaxOf(left: Expression, right: Expression) extends Expression `
      * `case class NewSet(elementType: DataType) extends LeafExpression `
      * `case class AddItemToSet(item: Expression, set: Expression) extends Expression `
      * `case class CombineSets(left: Expression, right: Expression) extends BinaryExpression `
      * `case class CountSet(child: Expression) extends UnaryExpression `





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53180512
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19122/consoleFull) for   PR 2093 at commit [`2871b80`](https://github.com/apache/spark/commit/2871b802a9a7145af1ae93594fbf6b01bd2bb1b7).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53451465
  
    > Could we merge this, maybe it can catch the last train of 1.1?
    
    i understand, but not my call.
    
    i'll probably have a stronger opinion in coming weeks *smile*




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53450874
  
    @mattf I have only recently started working on Spark; I'm trying to follow the same process as everyone else and have made some mistakes along the way. I hope to do better, thanks.

    There is a lot to do right now (especially for the 1.1 release); quality and process are both things we should take care of.
    
    Could we merge this, maybe it can catch the last train of 1.1?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53201526
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19126/consoleFull) for   PR 2093 at commit [`1789cd4`](https://github.com/apache/spark/commit/1789cd46e38d1426deb6a4b14bddcbb8c751f585).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2093




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53377163
  
    Hey, so I think heapq.nsmallest does the same thing, and it should be correct to use it. From a quick look at its source, though, I was doubtful about its performance, since each nsmallest call ends up sorting. For larger N, I *think* MaxHeapQ will perform better. What do you think?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2093#discussion_r16633870
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -746,18 +678,23 @@ def reduce(self, f):
             15
             >>> sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)
             10
    +        >>> sc.parallelize([]).reduce(add)
    +        Traceback (most recent call last):
    +            ...
    +        ValueError: Can not reduce() of empty RDD
    --- End diff --
    
    Minor nit, but I'd drop the 'of' and just say "Cannot reduce() empty RDD"




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53145616
  
    The doc tests should cover all the code paths; do we still need more tests?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53139620
  
    are you planning to add tests for this?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2093#discussion_r16633882
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1612,9 +1535,9 @@ def subtractByKey(self, other, numPartitions=None):
             [('b', 4), ('b', 5)]
             """
             def filter_func((key, vals)):
    -            return len(vals[0]) > 0 and len(vals[1]) == 0
    +            return vals[0] and not vals[1]
             map_func = lambda (key, vals): [(key, val) for val in vals[0]]
    --- End diff --
    
    I think you can delete this line, now that it's unused.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53180657
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19123/consoleFull) for   PR 2093 at commit [`2871b80`](https://github.com/apache/spark/commit/2871b802a9a7145af1ae93594fbf6b01bd2bb1b7).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53379531
  
    Some quick benchmarks:
    
        >>> from pyspark.rdd import MaxHeapQ
        >>> import heapq, random, timeit
        >>> l = range(1<<13)
        >>> random.shuffle(l)
        >>>
        >>> def take1():
        ...     q = MaxHeapQ(100)
        ...     for i in l:
        ...         q.insert(i)
        ...     return q.getElements()
        ...
        >>> def take2():
        ...     return heapq.nsmallest(100, l)
        ...
        >>> # for S >> N
        >>> print timeit.timeit("take1()", "from __main__ import *", number=100)
        0.748146057129
        >>> print timeit.timeit("take2()", "from __main__ import *", number=100)
        0.142593860626
        >>> # for N > S
        >>> l = range(80)
        >>> random.shuffle(l)
        >>> print timeit.timeit("take1()", "from __main__ import *", number=1000)
        0.156821012497
        >>> print timeit.timeit("take2()", "from __main__ import *", number=1000)
        0.00907206535339
    
    Whether S < N or S > N, nsmallest() is much faster than MaxHeapQ.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53415478
  
    > count() is supposed to be cheaper than collect(); in Scala/Java we call count() instead of collect() to trigger the computation, so it seemed better to keep the same style in Python.
    >
    > But in PySpark, count() currently depends on collect(), which dumps the results to disk and loads them back into Python. In the future this may change, so that count() just returns a number from the JVM.
    >
    > Right now there is no strong reason to change collect() to count(); should I revert it?
    
    thank you for the explanation, it wasn't clear from the code. my preference is for isolated changes, so i'd suggest reverting and doing it separately. however, others may not agree, so i'd say at least add a comment about why count() is used -- someone might come along and change it back to collect() without knowing they shouldn't.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53306550
  
    @ScrapCodes Can you explain the original motivation for MaxHeapQ?  Is there a reason why you used it instead of `heapq.nsmallest`?  If there's a corner-case that the old code supported but that this PR breaks, can you submit a failing test?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53378259
  
    > Only when N (top) > S (total) does nsmallest() fall back to sorting; it's faster than the heap because the sort is implemented in C.
    
    My point was that the MaxHeap implementation does not need to sort. For most values it may just replace the top, or at worst do logN swaps, right? But I think even though the sort is a C-based implementation, it would be NlogN (at best). My Python knowledge is definitely limited; I will be happy to be corrected.
    
    > btw, MaxHeapQ is actually a min heap; why do we call it MaxHeapQ?
    
    From what I have learnt, a MaxHeap keeps the largest element at the top, right? Is that wrong?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53200221
  
    @mattf While I was scanning the whole file line by line to find all the issues related to preservesPartitioning, I reformatted lines at the same time whenever they did not look nice to *me*. It's a completely personal judgement, so maybe it does not make sense to others.
    
    It's not a good idea to do this kind of reformatting inside a PR; I was also thinking of doing it as a separate PR, or not doing it at all if there is no compelling reason.
    
    Should I remove these unrelated changes?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53151737
  
    > The doc tests should covered all the code paths, do we still need more tests?
    
    it's worth including a lookup for 1000 or 1234, which won't be found
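    
    For illustration, the kind of doctest being suggested could look like this (reusing the `rdd` from the PR description; a key outside 0..999 is never present, so an empty list comes back):
    
        >>> rdd.lookup(1234)
        []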




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2093#discussion_r16633886
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1777,10 +1698,28 @@ def _defaultReducePartitions(self):
             else:
                 return self.getNumPartitions()
     
    -    # TODO: `lookup` is disabled because we can't make direct comparisons based
    -    # on the key; we need to compare the hash of the key to the hash of the
    -    # keys in the pairs.  This could be an expensive operation, since those
    -    # hashes aren't retained.
    +    def lookup(self, key):
    +        """
    +        Return the list of values in the RDD for key `key`. This operation
    +        is done efficiently if the RDD has a known partitioner by only
    +        searching the partition that the key maps to.
    +
    +        >>> l = range(1000)
    +        >>> rdd = sc.parallelize(zip(l, l), 10)
    +        >>> rdd.lookup(42)  # slow
    +        [42]
    +        >>> sorted = rdd.sortByKey()
    +        >>> sorted.lookup(42)  # fast
    +        [42]
    +        >>> sorted.lookup(1024)
    +        []
    +        """
    +        self = self.filter(lambda (k, v): k == key).values()
    --- End diff --
    
    This reassignment to `self` seems potentially confusing.  Could you use a different name for the filtered RDD, such as `filtered`?  I think even a one-letter variable name, like `f`, is preferable to re-assigning `self`.
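    
    For illustration, the suggested rename would turn the quoted line into something like (hypothetical variable name, following the reviewer's suggestion):
    
        filtered = self.filter(lambda (k, v): k == key).values()
        # ...the rest of lookup() then operates on `filtered` instead of `self`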




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53161474
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19109/consoleFull) for   PR 2093 at commit [`0f1bce8`](https://github.com/apache/spark/commit/0f1bce8bbf6ca8ccec04f7030707f5a01f3a15ae).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53023121
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19074/consoleFull) for   PR 2093 at commit [`eb1305d`](https://github.com/apache/spark/commit/eb1305df44e030b086ca3cf518bdda3bef772109).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53200504
  
    count() is supposed to be cheaper than collect(); in Scala/Java we call count() instead of collect() to trigger the computation, so it seemed better to keep the same style in Python.
    
    But in PySpark, count() currently depends on collect(), which dumps the results to disk and loads them back into Python. In the future this may change, so that count() just returns a number from the JVM.
    
    Right now there is no strong reason to change collect() to count(); should I revert it?
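    
    As a sketch of the pattern being discussed (the `expensive` function is hypothetical; the point is only which action forces evaluation):
    
        rdd = sc.parallelize(range(10000)).map(expensive).cache()
        
        rdd.count()    # forces the computation; in Scala/Java only a number comes back
        rdd.collect()  # also forces it, but ships every element to the driver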




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53159927
  
    @mattf I have added a test case for it, thx.
    
    I have done a fair amount of refactoring in this PR; please re-review it, thanks.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53377821
  
    btw, MaxHeapQ is actually a min heap; why do we call it `MaxHeapQ`?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53415133
  
    > @mattf While I was scanning the whole file line by line to find all the issues related to preservesPartitioning, I reformatted lines at the same time whenever they did not look nice to me. It's a completely personal judgement, so maybe it does not make sense to others.
    >
    > It's not a good idea to do this kind of reformatting inside a PR; I was also thinking of doing it as a separate PR, or not doing it at all if there is no compelling reason.
    >
    > Should I remove these unrelated changes?
    
    if it were up to me, i'd say yes. it's not though, so i'll go with the flow.
    
    i'm still trying to get a feel for what the spark community likes in its PRs and JIRAs.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53203062
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19126/consoleFull) for   PR 2093 at commit [`1789cd4`](https://github.com/apache/spark/commit/1789cd46e38d1426deb6a4b14bddcbb8c751f585).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`
      * `    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53180510
  
    Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53386059
  
    Looks like I understand what you mean now; I am still curious to benchmark the takeOrdered function with both approaches. There is definitely a 3x difference between the C and Python implementations (w/o PyPy). Can we use PyPy with Spark?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53146874
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19097/consoleFull) for   PR 2093 at commit [`be0e8ba`](https://github.com/apache/spark/commit/be0e8bae494805faafe70456866cd9fa2bf5a3ef).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53634095
  
    This looks good to me.  Thanks for discussing MaxHeapQ and verifying that its behavior is the same as `nsmallest`.
    
    I'm going to merge this into `master` for now.  Thanks!




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53529444
  
    @davies Thanks, just saw your patch. 




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53378632
  
    If N (top) > S (total), sort() is S log S; MaxHeapQ keeps all of the elements, so it is also S log S (log S per swap). They have the same algorithmic CPU cost.
    
    @ScrapCodes Sorry, I misunderstood it. You are right, it's a max heap.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53145628
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19097/consoleFull) for   PR 2093 at commit [`be0e8ba`](https://github.com/apache/spark/commit/be0e8bae494805faafe70456866cd9fa2bf5a3ef).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53159943
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19109/consoleFull) for   PR 2093 at commit [`0f1bce8`](https://github.com/apache/spark/commit/0f1bce8bbf6ca8ccec04f7030707f5a01f3a15ae).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53377764
  
    Only when N (top) > S (total) does nsmallest() fall back to sorting; it's faster than the heap because the sort is implemented in C.
    
    For N << S they should be similar (the same algorithm).
    
    nsmallest() is stable: for the same key, the values keep the order they had in the original RDD.
    
    nsmallest(n, it, key=None) / nlargest(n, it, key=None) have been available since Python 2.5.
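    
    A small interactive illustration of the stability claim, using made-up (key, value) pairs; ties on the key keep their original input order:
    
        >>> import heapq
        >>> pairs = [(1, 'b'), (0, 'x'), (1, 'a'), (0, 'y')]
        >>> heapq.nsmallest(3, pairs, key=lambda kv: kv[0])
        [(0, 'x'), (0, 'y'), (1, 'b')]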




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53376562
  
    Simply because heapq is a min-heap implementation, and to track the "top N" I needed a max heap. #97 has some discussion. Let me see if I can come up with something.
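    
    For readers following along: the usual way to get max-heap behaviour out of heapq is to push negated keys, so the worst of the current candidates sits at the root and can be replaced cheaply. A minimal sketch for numeric items (the general trick, not the MaxHeapQ code from pyspark/rdd.py):
    
        import heapq
        
        def smallest_n(n, iterable):
            # heap holds negated candidates, so heap[0] is the largest
            # (i.e. worst) of the current n smallest items
            heap = []
            for x in iterable:
                if len(heap) < n:
                    heapq.heappush(heap, -x)
                elif x < -heap[0]:
                    # x beats the current worst candidate, replace it
                    heapq.heapreplace(heap, -x)
            return sorted(-v for v in heap)
    
    For numeric data this gives the same result as heapq.nsmallest(n, iterable), just with the bounded-heap shape of the original MaxHeapQ approach.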




[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2093#issuecomment-53174680
  
    This PR is great.  I like how you systematically addressed all of the bugs in `preservesPartitioning`; the minor cleanups are good, too.
    
    It looks like there's a merge conflict due to me merging another one of your patches.  Once you fix that and address my comments, I'll merge this into master (and hopefully `branch-1.1`, since the partitioner preservation could be a huge performance win).

