You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by AiHe <gi...@git.apache.org> on 2015/04/20 08:04:58 UTC

[GitHub] spark pull request: [PYSPARK] Fix a typo in "fold" function in rdd...

GitHub user AiHe opened a pull request:

    https://github.com/apache/spark/pull/5587

    [PYSPARK] Fix a typo in "fold" function in rdd.py

    This will make the “fold” function consistent with the "fold" in rdd.scala and other "aggregate" functions where “acc” goes first. Otherwise, users have to write a lambda function like “lambda x, y: op(y, x)” if they want to use “zeroValue” to change the result type.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AiHe/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5587.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5587
    
----
commit 53b54cb6ba69d7313bbddbf1b291dc0371acfaca
Author: Alain <ai...@usc.edu>
Date:   2015-04-20T05:59:04Z

    [PYSPARK] Fix a typo in "fold" function in rdd.py
    
    This will make the “fold” function consistent with the "fold" in
    rdd.scala and other "aggregate" functions where “acc” goes first.
    Otherwise, users have to write a lambda function like “lambda x, y:
    op(y, x)” if they want to use “zeroValue” to the result type.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [PYSPARK] Fix a typo in "fold" function in rdd...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5587#issuecomment-94414485
  
    I think that's right that it would be more consistent, in that this is how the Scala function that's ultimately called for `RDD.fold` looks:
    
    ```
      def foldLeft[B](z: B)(op: (B, A) => B): B = {
        var result = z
        this.seq foreach (x => result = op(result, x))
        result
      }
    ```
    
    Elements are passed as the second arguments. The operation is commutative but it matters here since the operation is only supposed to modify the first arg.
    
    Why does this affect your ability to make a differently typed result? You need `aggregate` for that anyway I thought.
    
    CC @JoshRosen as initial author


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [PYSPARK] Fix a typo in "fold" function in rdd...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5587#issuecomment-94366385
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [PYSPARK] Fix a typo in "fold" function in rdd...

Posted by AiHe <gi...@git.apache.org>.

Github user AiHe commented on the pull request:

    https://github.com/apache/spark/pull/5587#issuecomment-94533106
  
    Thanks for replying. I guess it might be better to remain it unchanged and point it out at docs. I will be trying to do the modification at the doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [PYSPARK] Fix a typo in "fold" function in rdd...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5587#issuecomment-94529376
  
    This feels like it should be tracked as a JIRA, even if one targeted at version 2+ only to actually change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [PYSPARK] Fix a typo in "fold" function in rdd...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5587#issuecomment-94528642
  
    I agree that swapping the order of the function arguments makes more sense, but at this point I wonder whether it's safe to change this.  The old code has been around since Spark 0.7 or something and it's possible that someone could be relying on the existing confusing behavior.  Maybe we should just point out this inconsistency in the docs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [PYSPARK] Fix a typo in "fold" function in rdd...

Posted by AiHe <gi...@git.apache.org>.

Github user AiHe commented on the pull request:

    https://github.com/apache/spark/pull/5587#issuecomment-94509911
  
    @JoshRosen 
    
    I just follow the example of NAStatCounter in the book of "Advanced Analysis with Spark". NAStatCounter is supposed to get stats of a dataset where some missing values exist.
    
    Instead of using scala the same as the book, I use python to reimplement that. My code follows.
    
    ```python
    class NAStatCounter(object):
        
        def __init__(self):
            self.stats = StatCounter()
            self.missing = 0
        
        def add(self, x):
            if not x:
                self.missing += 1
            elif isinstance(x, NAStatCounter):
                self.stats.mergeStats(x.stats)
                self.missing += x.missing
            else:
                self.stats.merge(float(str(x)))
            return self
        
        def __str__(self):
            return 'stats: ' + str(self.stats) + ' missing: ' + str(self.missing)
        
        def __repr__(self):
            return self.__str__()
    ```
    
    Here I make up a dummy dataset and the do the stats.
    ```python
    rdd = sc.parallelize(['1,,2', ',3,1', '1,2,'])
    na_stat = rdd.map(lambda x: x.split(','))
    z = [NAStatCounter for i in xrange(3)]
    op = lambda x, y: map(lambda a: a[0].add(a[1]), zip(x, y)
    result = na_stat.fold(z, op)
    ```
    
    Then I get the error like "'str' object has no attribute 'add'" because it has op('1', NAStatCounter()) in the "fold" implementation. In the specified lambda function, it becomes '1'.add(NAStatCounter()). However, it's expected to be NAStatCounter().add('1').
    
    As you mentioned, only the first argument can be modified and I guess it should be the provided "zeroValue" and the element are the second argument which is not allowed to be changed.
    
    Intuitively, users specify the "zeroValue" as "x" and elements as "y" in the lambda function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [PYSPARK] Fix a typo in "fold" function in rdd...

Posted by AiHe <gi...@git.apache.org>.

Github user AiHe closed the pull request at:

    https://github.com/apache/spark/pull/5587


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org