Posted to user@spark.apache.org by Artemis User <ar...@dtechspace.com> on 2021/11/29 20:49:23 UTC

Equivalent Function in ml for computeCost()

The RDD-based org.apache.spark.mllib.clustering.KMeansModel class 
defines a method called computeCost that is used to calculate the WCSS 
error of K-Means clusters 
(https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/clustering/KMeansModel.html). 
Is there an equivalent method of computeCost in the new ml library for 
K-Means?
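
For context, this is roughly how we use it today with the RDD-based API
(the data path and parameters below are just placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one whitespace-separated feature vector per line
val data = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, k = 3, maxIterations = 20)

// WCSS: sum of squared distances from each point to its nearest centroid
val wcss = model.computeCost(data)
println(s"Within-cluster sum of squares = $wcss")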

Thanks in advance!

-- ND

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Equivalent Function in ml for computeCost()

Posted by Sean Owen <sr...@gmail.com>.
I knew I was forgetting something, right. Feel free to make an update for
the docs!

On Mon, Nov 29, 2021, 4:49 PM Artemis User <ar...@dtechspace.com> wrote:

> Thanks Sean!  After a bit of digging through the source code, it seems
> that the computeCost method has been replaced by the trainingCost method in
> the KMeansSummary class.  This is the hidden comment in the source code for
> the trainingCost method (somehow it wasn't propagated to the online Spark
> API doc):
>
> @param trainingCost K-means cost (sum of squared distances to the nearest
> centroid for all points in the training dataset). This is equivalent to
> sklearn's inertia.
>
> Inertia actually means the same as within-cluster sum of squares (WCSS).
> Just wish Spark's documentation could be made better...
>
> -- ND
>
> On 11/29/21 3:57 PM, Sean Owen wrote:
>
> I don't believe there is, directly, though there is ClusteringMetrics to
> evaluate clusterings in .ml. I'm kinda confused that it doesn't expose sum
> of squared distances though; it computes silhouette only?
> You can compute it directly, pretty easily, in any event, either by just
> writing up a few lines of code or using the .mllib model inside the .ml
> model object anyway.
>
> On Mon, Nov 29, 2021 at 2:50 PM Artemis User <ar...@dtechspace.com>
> wrote:
>
>> The RDD-based org.apache.spark.mllib.clustering.KMeansModel class
>> defines a method called computeCost that is used to calculate the WCSS
>> error of K-Means clusters
>> (
>> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/clustering/KMeansModel.html).
>>
>> Is there an equivalent method of computeCost in the new ml library for
>> K-Means?
>>
>> Thanks in advance!
>>
>> -- ND
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>

Re: Equivalent Function in ml for computeCost()

Posted by Artemis User <ar...@dtechspace.com>.
Thanks Sean!  After a bit of digging through the source code, it
seems that the computeCost method has been replaced by the trainingCost
method in the KMeansSummary class.  This is the hidden comment in the
source code for the trainingCost method (somehow it wasn't propagated
to the online Spark API doc):

@param trainingCost K-means cost (sum of squared distances to the 
nearest centroid for all points in the training dataset). This is 
equivalent to sklearn's inertia.

Inertia actually means the same as within-cluster sum of squares
(WCSS).  Just wish Spark's documentation could be made better...
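
In case it helps anyone searching the archives later, reading it off the
new API looks roughly like this (assuming a training DataFrame named
dataset with a "features" column; the names are just illustrative):

import org.apache.spark.ml.clustering.KMeans

// Fit K-Means with the DataFrame-based API, then read the WCSS
// (equivalent to sklearn's inertia) off the training summary
val model = new KMeans().setK(3).setSeed(1L).fit(dataset)
val wcss = model.summary.trainingCost
println(s"Training cost (WCSS) = $wcss")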

-- ND

On 11/29/21 3:57 PM, Sean Owen wrote:
> I don't believe there is, directly, though there is ClusteringMetrics 
> to evaluate clusterings in .ml. I'm kinda confused that it doesn't 
> expose sum of squared distances though; it computes silhouette only?
> You can compute it directly, pretty easily, in any event, either by 
> just writing up a few lines of code or using the .mllib model inside 
> the .ml model object anyway.
>
> On Mon, Nov 29, 2021 at 2:50 PM Artemis User <ar...@dtechspace.com> 
> wrote:
>
>     The RDD-based org.apache.spark.mllib.clustering.KMeansModel class
>     defines a method called computeCost that is used to calculate the
>     WCSS
>     error of K-Means clusters
>     (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/clustering/KMeansModel.html).
>
>     Is there an equivalent method of computeCost in the new ml library
>     for
>     K-Means?
>
>     Thanks in advance!
>
>     -- ND
>
>     ---------------------------------------------------------------------
>     To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

Re: Equivalent Function in ml for computeCost()

Posted by Sean Owen <sr...@gmail.com>.
I don't believe there is, directly, though there is ClusteringMetrics to
evaluate clusterings in .ml. I'm kinda confused that it doesn't expose sum
of squared distances though; it computes silhouette only?
You can compute it directly, pretty easily, in any event, either by just
writing up a few lines of code or using the .mllib model inside the .ml
model object anyway.
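
Untested sketch of the direct computation (assumes a fitted ml KMeansModel
named model and its training DataFrame dataset with a "features" column;
transform adds the "prediction" column):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, sum, udf}

val centers = model.clusterCenters

// Squared Euclidean distance from each point to its assigned center
val sqDistToCenter = udf { (features: Vector, cluster: Int) =>
  Vectors.sqdist(features, centers(cluster))
}

val wcss = model.transform(dataset)
  .withColumn("sqDist", sqDistToCenter(col("features"), col("prediction")))
  .agg(sum("sqDist"))
  .first()
  .getDouble(0)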

On Mon, Nov 29, 2021 at 2:50 PM Artemis User <ar...@dtechspace.com> wrote:

> The RDD-based org.apache.spark.mllib.clustering.KMeansModel class
> defines a method called computeCost that is used to calculate the WCSS
> error of K-Means clusters
> (
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/clustering/KMeansModel.html).
>
> Is there an equivalent method of computeCost in the new ml library for
> K-Means?
>
> Thanks in advance!
>
> -- ND
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>