Posted to user@spark.apache.org by 诺铁 <no...@gmail.com> on 2015/07/08 05:26:27 UTC

how to use DoubleRDDFunctions on mllib Vector?

Hi,

There are some useful functions in DoubleRDDFunctions, e.g. mean and
variance, which I can use if I have an RDD[Double].

Vector doesn't have such methods. How can I convert a Vector to an
RDD[Double]? Or, better, can I call mean directly on a Vector?

Re: Re: how to use DoubleRDDFunctions on mllib Vector?

Posted by 诺铁 <no...@gmail.com>.
OK, got it, thanks.

On Thu, Jul 9, 2015 at 12:02 PM, prosp4300 <pr...@163.com> wrote:

>
>
> It seems Feynman was pointing to the source code rather than the
> documentation; vectorMean is private. See:
>
> https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala

Re: Re: how to use DoubleRDDFunctions on mllib Vector?

Posted by prosp4300 <pr...@163.com>.


It seems Feynman was pointing to the source code rather than the documentation; vectorMean is private. See:
https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala


At 2015-07-09 10:10:58, "诺铁" <no...@gmail.com> wrote:

Thanks, I understand now.
But I can't find mllib.clustering.GaussianMixture#vectorMean. Which version of Spark are you using?


Re: how to use DoubleRDDFunctions on mllib Vector?

Posted by 诺铁 <no...@gmail.com>.
Thanks, I understand now.
But I can't find mllib.clustering.GaussianMixture#vectorMean. Which
version of Spark are you using?

On Thu, Jul 9, 2015 at 1:16 AM, Feynman Liang <fl...@databricks.com> wrote:

> An RDD[Double] is an abstraction for a large collection of doubles,
> possibly distributed across multiple nodes. The DoubleRDDFunctions are
> there for performing mean and variance calculations across this distributed
> dataset.
>
> In contrast, a Vector is not distributed and fits on your local machine.
> You would be better off computing these quantities on the Vector directly
> (see mllib.clustering.GaussianMixture#vectorMean for an example of how to
> compute the mean of a vector).

Re: how to use DoubleRDDFunctions on mllib Vector?

Posted by Feynman Liang <fl...@databricks.com>.
An RDD[Double] is an abstraction for a large collection of doubles, possibly
distributed across multiple nodes. The DoubleRDDFunctions are there for
performing mean and variance calculations across this distributed dataset.

In contrast, a Vector is not distributed and fits on your local machine.
You would be better off computing these quantities on the Vector directly
(see mllib.clustering.GaussianMixture#vectorMean for an example of how to
compute the mean of a vector).
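
A minimal sketch of computing these quantities locally, assuming the
Vector's values are first copied out with toArray (which returns an
Array[Double]). LocalStats and its methods are hypothetical names used for
illustration, not Spark APIs, and this is not the private vectorMean helper
mentioned above. The variance below is the population variance (divide by
n), which as far as I know is also what DoubleRDDFunctions.variance reports:

```scala
// Hypothetical helper (not part of Spark): statistics over a local array,
// e.g. the Array[Double] returned by an mllib Vector's toArray.
object LocalStats {

  /** Arithmetic mean; requires a non-empty array. */
  def mean(xs: Array[Double]): Double = {
    require(xs.nonEmpty, "mean of an empty array is undefined")
    xs.sum / xs.length
  }

  /** Population variance (divides by n, not n - 1). */
  def variance(xs: Array[Double]): Double = {
    val m = mean(xs) // also rejects empty input
    xs.map(x => (x - m) * (x - m)).sum / xs.length
  }
}
```

With an mllib Vector v, this would be called as LocalStats.mean(v.toArray).
If an RDD[Double] is really needed, sc.parallelize(v.toArray) should produce
one (assuming a SparkContext named sc), but for a vector that already fits
in local memory that mostly adds overhead.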

On Tue, Jul 7, 2015 at 8:26 PM, 诺铁 <no...@gmail.com> wrote:

> Hi,
>
> There are some useful functions in DoubleRDDFunctions, e.g. mean and
> variance, which I can use if I have an RDD[Double].
>
> Vector doesn't have such methods. How can I convert a Vector to an
> RDD[Double]? Or, better, can I call mean directly on a Vector?
>