Posted to reviews@spark.apache.org by andy327 <gi...@git.apache.org> on 2014/07/31 20:42:29 UTC

[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

GitHub user andy327 opened a pull request:

    https://github.com/apache/spark/pull/1698

    Add normalizeByCol method to mllib.util.MLUtils.

    Adds the ability to compute the mean and standard deviation of each vector (LabeledPoint) component and to normalize each vector in the RDD, using only RDD transformations. The result is an RDD of Vectors in which each column has a mean of zero and a standard deviation of one.
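    A minimal pure-Python sketch of the standardization described above (function name hypothetical; the actual PR operates on Spark RDDs of Vectors, not Python lists):

    ```python
    import math

    def normalize_by_col(rows):
        """Standardize each column to mean 0 and (population) std dev 1."""
        n = len(rows)
        dims = len(rows[0])
        # First pass: per-column mean.
        means = [sum(row[j] for row in rows) / n for j in range(dims)]
        # Second pass: per-column standard deviation.
        stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n)
                for j in range(dims)]
        # Third pass: shift and scale every component.
        return [[(row[j] - means[j]) / stds[j] for j in range(dims)]
                for row in rows]
    ```

    In the PR, each of these passes corresponds to an RDD transformation, so nothing is computed until an action is run on the result.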

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tresata/spark feat-normalize-cols

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1698.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1698
    
----
commit 89e538b2a0ba65e9b7829ba3a5f706373211f3c8
Author: Andres Perez <an...@tresata.com>
Date:   2014-07-31T16:35:44Z

    Add normalizeByCol method to mllib.util.MLUtils.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50892213
  
    They are not the same. We use treeReduce to avoid having all executors send data to the driver, which reduceByKey does not offer. Broadcast is also different from cartesian. This solution cannot avoid duplicate computations of `rdd`: when the computation is triggered, we still need to visit `rdd` twice. One difference arises if someone calls `normalizeByCol` but never uses the normalized RDD.
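    The tree pattern treeReduce relies on, combining per-partition partial results pairwise in rounds rather than funnelling every partial straight to the driver, can be sketched roughly as (plain Python, names hypothetical):

    ```python
    def tree_reduce(partials, combine):
        """Combine partial results pairwise in ~log2(n) rounds, mimicking
        the tree aggregation pattern used to keep load off a single node."""
        level = list(partials)
        while len(level) > 1:
            nxt = []
            for i in range(0, len(level) - 1, 2):
                nxt.append(combine(level[i], level[i + 1]))
            if len(level) % 2:          # odd element carries to the next round
                nxt.append(level[-1])
            level = nxt
        return level[0]
    ```

    Each round halves the number of live partial results, so no single node ever receives more than a handful of them at once.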



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50896323
  
    Why do you use treeReduce + broadcast? The data per partition is small, no? Only a few aggregates per partition.



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by andy327 <gi...@git.apache.org>.
Github user andy327 commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50821196
  
    I see that #1207 covers re-scaling in mllib.util.FeatureScaling, but from what I can tell, it calls RowMatrix.computeColumnSummaryStatistics, making it not a lazy transformation. Would there be a benefit to implementing feature scaling without calling any RDD actions?



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50890613
  
    reduceByKey being the same as reduce, and cartesian being the same as broadcast, is the whole point; the difference is that reduceByKey and cartesian are evaluated lazily.
    
    Eager evaluation is often unexpected to the end user and can lead to duplicate calculations, since the user does not anticipate them and so does not guard against them with rdd.cache calls.



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by andy327 <gi...@git.apache.org>.
Github user andy327 commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50804551
  
    See Jira issue: https://issues.apache.org/jira/browse/SPARK-2776



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50814109
  
    @andy327 This is covered in @dbtsai's PR: https://github.com/apache/spark/pull/1207 , which is in review.



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50899568
  
    What if you have 10M columns? I agree that not sending data to the driver is a good practice. But the current operations `reduceByKey` and `cartesian` are not optimized for very big data. Please test it on a cluster with many partitions and you should see the bottleneck.



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-53944079
  
    @andy327 Do you mind closing this PR for now? I definitely buy the idea of freeing up the master, but the current set of Core APIs doesn't provide an easy and efficient way to do it. We could revisit this and other implementations once we have the right set of tools. Thanks @andy327 @koertkuipers for the PR and the discussion!



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50900786
  
    Yes, I tried to implement AllReduce without having the driver in the middle in https://github.com/apache/spark/pull/506, but it introduced complex dependencies, so I fell back to the treeReduce + torrent broadcast approach. I hope this can be improved, maybe in v1.2.



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50801763
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50849409
  
    Your implementation calls `reduceByKey` and `cartesian`. Those are not cheap, streamlined operations. `map(x => (1, x)).reduceByKey` is the same as `reduce`, except that it reduces to some executor instead of the driver. Then `cartesian` is the same as `broadcast`, but `broadcast` is more efficient with TorrentBroadcast. You can compare the performance and see the difference. `OnlineSummarizer` also uses a more accurate approach to compute the variance.
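    The more accurate variance computation alluded to here is, presumably, a Welford-style online update; a rough sketch under that assumption (plain Python, names hypothetical):

    ```python
    def online_mean_variance(xs):
        """Welford's one-pass update: numerically stabler than the naive
        E[x^2] - E[x]^2 formula, and its state merges across partitions."""
        count, mean, m2 = 0, 0.0, 0.0
        for x in xs:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)   # second factor uses the updated mean
        variance = m2 / count if count else 0.0
        return mean, variance
    ```

    The naive formula subtracts two large, nearly equal quantities and can lose most of its significant digits; the incremental update avoids that cancellation.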



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50900320
  
    I can see your point about 10M columns.
    
    It would be really nice if we had a lazy and efficient
    allReduce(RDD[T], (T, T) => T): RDD[T].
    
    An RDD transform that is not lazy, leading to multiple Spark actions the
    user did not explicitly start, is tricky to me. It's already difficult
    enough to get the cache and unpersist logic right without unexpected
    actions.
    
    



[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1698


