Posted to user@spark.apache.org by st553 <st...@gmail.com> on 2014/09/17 21:21:38 UTC

How to run kmeans after pca?

I would like to reduce the dimensionality of my data before running kmeans.
The problem I'm having is that both RowMatrix.computePrincipalComponents()
and RowMatrix.computeSVD() return a DenseMatrix whereas KMeans.train()
requires an RDD[Vector]. Does MLlib provide a way to do this conversion?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473.html


Re: How to run kmeans after pca?

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is doing both the multiply and the first pass of
KMeans at once. To isolate which part is slow, you can force the cached
rows to be materialized before you run KMeans - e.g. with
cachedRows.count() (an RDD[Vector] has no numRows(); count() does the job).
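
For example (a sketch I haven't run; newData, k and maxIterations stand in
for the projected RowMatrix and your own KMeans parameters):

// Materialize the projected rows first, so the multiply cost isn't
// folded into the timing of the first KMeans iteration.
val cachedRows = newData.rows.persist()
cachedRows.count() // triggers the multiply and populates the cache
val model = KMeans.train(cachedRows, k, maxIterations)
cachedRows.unpersist()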

Also, KMeans is optimized to run quickly on both sparse and dense data. The
result of PCA is going to be dense, but if your input data has #nnz ~=
size(pca data), performance might be about the same. (I haven't actually
verified this last point.)
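
One rough way to check (again just a sketch; assumes Vector.numNonzeros,
which newer MLlib releases provide, and k as your PCA dimensionality):

// Compare the average non-zeros per input row to the PCA output
// dimensionality; if they're similar, per-point work is about the same.
val avgNnz = data.rows.map(_.numNonzeros.toDouble).mean()
println(s"avg nnz per row: $avgNnz vs PCA dimensionality: $k")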

Finally, speed is partly going to depend on how much data you have
relative to scheduler overheads - if your input data is small, the cost of
distributing your tasks can be greater than the time spent actually
computing. This usually manifests as the stages taking about the same
amount of time even though you're passing datasets of different
dimensionality.

On Tue, Sep 30, 2014 at 9:00 AM, st553 <st...@gmail.com> wrote:

> Thanks for your response Burak, it was very helpful.
>
> I am noticing that if I run PCA before KMeans, the KMeans algorithm
> actually takes longer to run than if I had just run KMeans without PCA. I
> was hoping that using PCA first would actually speed up the KMeans
> algorithm.
>
> I have followed the steps you've outlined, but I'm wondering if I need to
> cache/persist the RDD[Vector] rows of the RowMatrix returned after
> multiplying. Something like:
>
> val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
> val cachedRows = newData.rows.persist()
> val model = KMeans.train(cachedRows, k, maxIterations)
> cachedRows.unpersist()
>
> It doesn't seem intuitive to me that a lower-dimensional version of my
> data set would take longer for KMeans... unless I'm missing something?
>
> Thanks!
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473p15409.html
>
>

Re: How to run kmeans after pca?

Posted by st553 <st...@gmail.com>.
Thanks for your response Burak, it was very helpful.

I am noticing that if I run PCA before KMeans, the KMeans algorithm
actually takes longer to run than if I had just run KMeans without PCA. I
was hoping that using PCA first would actually speed up the KMeans
algorithm.

I have followed the steps you've outlined, but I'm wondering if I need to
cache/persist the RDD[Vector] rows of the RowMatrix returned after
multiplying. Something like:

val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
val cachedRows = newData.rows.persist()
val model = KMeans.train(cachedRows, k, maxIterations)
cachedRows.unpersist()

It doesn't seem intuitive to me that a lower-dimensional version of my
data set would take longer for KMeans... unless I'm missing something?

Thanks!




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473p15409.html


Re: How to run kmeans after pca?

Posted by Burak Yavuz <by...@stanford.edu>.
To properly perform PCA, you must multiply the original RowMatrix by the
resulting DenseMatrix of principal components. The result will also be a
RowMatrix, so you can easily access its rows via .rows, and train KMeans
on that.

Don't forget to broadcast the DenseMatrix returned from
RowMatrix.computePrincipalComponents(k), otherwise you'll get an OOME.

Here's how to do it in Scala (I didn't run the code, but it should be
something like this):

val data: RowMatrix = ...
// computePrincipalComponents(k) returns a local matrix of the top k
// components; broadcast it so each executor gets a single copy.
val bcPrincipalComponents = data.rows.context.broadcast(data.computePrincipalComponents(k))

// Project the original rows onto the principal components.
val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)

// A RowMatrix exposes its rows as an RDD[Vector] via .rows, not .values.
KMeans.train(newData.rows, k, maxIterations)

Best,
Burak



----- Original Message -----
From: "st553" <st...@gmail.com>
To: user@spark.incubator.apache.org
Sent: Wednesday, September 17, 2014 12:21:38 PM
Subject: How to run kmeans after pca?

I would like to reduce the dimensionality of my data before running kmeans.
The problem I'm having is that both RowMatrix.computePrincipalComponents()
and RowMatrix.computeSVD() return a DenseMatrix whereas KMeans.train()
requires an RDD[Vector]. Does MLlib provide a way to do this conversion?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473.html


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org