Posted to dev@spark.apache.org by Tim Hunter <ti...@databricks.com> on 2017/02/16 21:00:04 UTC

Design document - MLlib's statistical package for DataFrames

Hello all,

I have been looking at some of the missing items for complete feature
parity between spark.ml and spark.mllib. Here is a proposal for
porting mllib.stats, the descriptive statistics package:

https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit?usp=sharing

The umbrella ticket for this task is:
https://issues.apache.org/jira/browse/SPARK-4591

Please comment on the document. Also, if you want to work on one of
the algorithms, the design doc and the umbrella ticket have subtasks
that you can assign yourself to.

The cutoff deadline for Spark 2.2 is rapidly approaching, and it would
be great if we could claim parity for this release!

Cheers

Tim

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Design document - MLlib's statistical package for DataFrames

Posted by Holden Karau <ho...@pigscanfly.ca>.
It's at the bottom of every message (although some mail clients hide it for
some reason). To unsubscribe, send an email to dev-unsubscribe@spark.apache.org

On Sat, Feb 18, 2017 at 11:07 AM Pritish Nawlakhe <
pritish@nirvana-international.com> wrote:

> Hi
>
> Would anyone know how to unsubscribe from this list?
>
>
>
> Thank you!!
>
> Regards
> Pritish
> Nirvana International Inc.
>
> Big Data, Hadoop, Oracle EBS and IT Solutions
> VA - SWaM, MD - MBE Certified Company
> pritish@nirvana-international.com
> http://www.nirvana-international.com
> Twitter: @nirvanainternat
>
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

RE: Design document - MLlib's statistical package for DataFrames

Posted by Pritish Nawlakhe <pr...@nirvana-international.com>.
Hi 

Would anyone know how to unsubscribe from this list?



Thank you!!

Regards
Pritish
Nirvana International Inc.

Big Data, Hadoop, Oracle EBS and IT Solutions
VA - SWaM, MD - MBE Certified Company
pritish@nirvana-international.com 
http://www.nirvana-international.com 
Twitter: @nirvanainternat 



Re: Design document - MLlib's statistical package for DataFrames

Posted by Tim Hunter <ti...@databricks.com>.
Hi Brad,

This task focuses on moving the existing algorithms, so that we
are not held up by parity issues.

Do you have some paper suggestions for cardinality? I do not think
there is a feature request on JIRA either.

Tim

On Thu, Feb 16, 2017 at 2:21 PM, bradc <br...@oracle.com> wrote:
> Hi,
>
> While it is also missing in spark.mllib, I'd suggest adding cardinality as
> part of the Simple descriptive statistics for both spark.ml and spark.mllib?
> This is useful even for data in double precision FP to understand the
> "uniqueness" of the feature data.
>
> Cheers,
> Brad


Re: Design document - MLlib's statistical package for DataFrames

Posted by bradc <br...@oracle.com>.
Hi,

While it is also missing in spark.mllib, I'd suggest adding cardinality as
part of the Simple descriptive statistics for both spark.ml and spark.mllib?
This is useful even for data in double precision FP to understand the
"uniqueness" of the feature data.

Cheers,
Brad



