Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/04/07 23:23:25 UTC

[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce

    [ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231097#comment-15231097 ] 

Joseph K. Bradley edited comment on SPARK-14408 at 4/7/16 9:22 PM:
-------------------------------------------------------------------

StandardScaler: This may be two confounded issues.  MLlib's StandardScaler uses the unbiased sample std to rescale, whereas sklearn uses the biased sample std.
* *Q*: [sklearn.preprocessing.StandardScaler | http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html] uses the biased sample std.  R's [scale function | https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the unbiased sample std.  I'm used to seeing the biased sample std used in ML, probably because it is handy for proofs to know that the rescaled columns have mean square exactly 1 (L2 norm sqrt(n)).  My main question is: What does glmnet do?  This is important since we compare against it in the MLlib GLM unit tests.  The difference might be insignificant, though, for GLMs and the datasets we are testing on.
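
To make the size of the discrepancy concrete, here is a minimal sketch in plain Python (no Spark, sklearn, or R involved). The helper {{standardize}} and the sample column are hypothetical and exist only for illustration; {{ddof=0}} stands in for the biased convention and {{ddof=1}} for the unbiased one.

```python
import math

def standardize(column, ddof):
    """Center a column and divide it by its sample std for the given ddof."""
    n = len(column)
    mean = sum(column) / n
    # ddof=0 divides by n (biased std); ddof=1 divides by n - 1 (unbiased std)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - ddof))
    return [(x - mean) / std for x in column]

# Hypothetical feature column, chosen only for illustration
column = [1.0, 2.0, 4.0, 7.0]
n = len(column)

biased = standardize(column, ddof=0)
unbiased = standardize(column, ddof=1)

# Mean square of the rescaled column under each convention
mean_sq_biased = sum(z * z for z in biased) / n      # equals 1 up to rounding
mean_sq_unbiased = sum(z * z for z in unbiased) / n  # equals (n - 1) / n up to rounding

# The two rescaled columns differ only by the constant factor sqrt(n / (n - 1))
ratio = biased[0] / unbiased[0]
```

The two conventions differ by a factor of sqrt(n / (n - 1)), which tends to 1 as n grows. That is consistent with the guess above that the difference may be insignificant on larger datasets, though it does not settle what glmnet does.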



> Update RDD.treeAggregate not to use reduce
> ------------------------------------------
>
>                 Key: SPARK-14408
>                 URL: https://issues.apache.org/jira/browse/SPARK-14408
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, Spark Core
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and {{combOp}} functions to modify and return their first argument, just like {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.
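
The general hazard behind this decision can be sketched in plain Python. This is not Spark code and does not reproduce the {{treeAggregate}} implementation; {{comb_op}} and {{make_partials}} are hypothetical names, and the lists stand in for per-partition results. The sketch only shows why a combine function that modifies and returns its first argument is safer with an explicit zero value (fold-style) than without one (reduce-style).

```python
from functools import reduce

def comb_op(acc, other):
    """Merge `other` into `acc` in place and return `acc` (mutates its first argument)."""
    acc.extend(other)
    return acc

def make_partials():
    # Hypothetical per-partition results; in Spark these would come from seqOp
    return [[1, 2], [3], [4, 5]]

# reduce-style: there is no zero value, so the first partial result itself
# becomes the accumulator and is mutated by comb_op
partials = make_partials()
reduced = reduce(comb_op, partials)
first_partial_was_mutated = partials[0] is reduced

# fold-style: a fresh zero value is the accumulator, so comb_op only ever
# mutates that zero value and the partial results are left as they were
partials = make_partials()
folded = reduce(comb_op, partials, [])
inputs_untouched = partials == make_partials()
```

Both styles produce the same combined result here. The difference is which object gets mutated: with a fold, only the zero value supplied by the caller is modified.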


