Posted to issues@spark.apache.org by "Seth Hendrickson (JIRA)" <ji...@apache.org> on 2017/03/27 22:24:41 UTC

[jira] [Comment Edited] (SPARK-19634) Feature parity for descriptive statistics in MLlib

    [ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944030#comment-15944030 ] 

Seth Hendrickson edited comment on SPARK-19634 at 3/27/17 10:23 PM:
--------------------------------------------------------------------

I'm coming to this a bit late, but I'm finding things a bit hard to follow. Reading the design doc, it seems that the original plan was to implement two interfaces: an RDD one that provides the same performance as the current {{MultivariateOnlineSummarizer}}, and a DataFrame interface using UDAF.

from design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a (more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it was pivoted away from UDAF, but the design doc does not reflect that. Also, if there is to be an RDD interface, what is the JIRA for it and what will it look like?

Also, there are several concerns raised in the design doc about this Catalyst aggregate approach being less efficient, and the consensus seemed to be: provide an initial API with a "slow" implementation that will be improved upon in the future. Is that correct? I'm not that familiar with the Catalyst optimizer, but are we sure there is a good way to implement the tree-reduce type aggregation, and if so could we document that? I'd prefer to get the details hashed out further rather than rushing to provide an API and an initial slow implementation; that way we can make sure that we get this correct in the long term. I would really appreciate some clarification, and my apologies if I have missed any of the details/discussion.
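
To make the tree-reduce question concrete, here is a minimal sketch in plain Python (not Spark or Catalyst code) of the property such an aggregation relies on: each partition builds a partial summary, and partial summaries are merged pairwise, level by level, instead of all at once on the driver. The names {{Summary}}, {{summarize_partition}} and {{tree_merge}} are hypothetical and are not taken from the PR or the design doc; the merge formula is the standard pairwise mean/variance combination, assumed here for illustration only.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List, Sequence


@dataclass
class Summary:
    """Hypothetical mergeable running statistics for one numeric column."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def add(self, x: float) -> "Summary":
        # Online (Welford-style) update with a single value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other: "Summary") -> "Summary":
        # Pairwise combination of two partial summaries.
        if other.n == 0:
            return self
        if self.n == 0:
            return Summary(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; 0.0 when fewer than two values were seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize_partition(values: Iterable[float]) -> Summary:
    # Per-partition step: fold every value into one partial summary.
    return reduce(lambda acc, x: acc.add(x), values, Summary())


def tree_merge(parts: Sequence[Summary]) -> Summary:
    # Merge partial summaries two at a time, one tree level per loop pass.
    level: List[Summary] = list(parts)
    if not level:
        return Summary()
    while len(level) > 1:
        nxt: List[Summary] = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(level[i].merge(level[i + 1]))
            else:
                nxt.append(level[i])  # odd element is carried up unchanged
        level = nxt
    return level[0]
```

Whether the Catalyst aggregate path can schedule merges in this multi-level shape is exactly the open question above; the sketch only shows that the summarizer state itself is mergeable in any grouping.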


was (Author: sethah):
I'm coming to this a bit late, but I'm finding things a bit hard to follow. Reading the design doc, it seems that the original plan was to implement two interfaces: an RDD one that provides the same performance as the current {{MultivariateOnlineSummarizer}}, and a DataFrame interface using UDAF.

from design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a (more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it was pivoted away from UDAF, but the design doc does not reflect that. Also, if there is to be an RDD interface, what is the JIRA for it and what will it look like?

Also, there are several concerns raised in the design doc about this Catalyst aggregate approach being less efficient, and the consensus seemed to be: provide an initial API with a "slow" implementation that will be improved upon in the future. Is that correct? I'm not that familiar with the Catalyst optimizer, but are we sure there is a good way to implement the tree-reduce type aggregation, and if so could we document that? If this is still targeted at 2.2, why? I'd prefer to get the details hashed out further rather than rushing to provide an API and an initial slow implementation; that way we can make sure that we get this correct in the long term. I would really appreciate some clarification, and my apologies if I have missed any of the details/discussion.

> Feature parity for descriptive statistics in MLlib
> --------------------------------------------------
>
>                 Key: SPARK-19634
>                 URL: https://issues.apache.org/jira/browse/SPARK-19634
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Timothy Hunter
>            Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208. Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
