Posted to issues@spark.apache.org by "Matt Saunders (JIRA)" <ji...@apache.org> on 2018/10/19 19:27:00 UTC
[jira] [Created] (SPARK-25782) Add PCA Aggregator to support grouping
Matt Saunders created SPARK-25782:
-------------------------------------
Summary: Add PCA Aggregator to support grouping
Key: SPARK-25782
URL: https://issues.apache.org/jira/browse/SPARK-25782
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 2.3.2
Reporter: Matt Saunders
I built an Aggregator that computes PCA on grouped datasets. I wanted to use the PCA functions provided by MLlib, but they only work on a full dataset, and I needed to compute PCA separately for each group (for example, on a RelationalGroupedDataset).
Here's an example of how the Aggregator is called:
{noformat}
val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
// For each grouping, compute a PCA matrix/vector
val pcaModels = inputData
  .groupBy(keys: _*)
  .agg(pcaAggregation.as(pcaOutput))
{noformat}
I used the same algorithms under the hood as RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works directly on Datasets without converting to an RDD first.
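For readers unfamiliar with the pattern, here is a minimal sketch of what such an aggregator could look like. It is not the code proposed in this issue: the names PCAAggregatorSketch, PCABuffer and PCAResult are hypothetical, and it assumes the column holds org.apache.spark.ml.linalg.Vector values of equal length with at least two rows per group. Each group accumulates a row count, per-feature sums and the sum of outer products; finish() turns these into a sample covariance matrix and takes its SVD, which is the general approach RowMatrix follows.

```scala
import breeze.linalg.{DenseMatrix => BDM, svd => brzSvd}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer: row count, per-feature sums, flattened d x d sum of outer products.
case class PCABuffer(n: Long, sum: Array[Double], outer: Array[Double])

// Hypothetical output: d x d principal components (column-major, one component
// per column) and the fraction of variance explained by each component.
case class PCAResult(components: Array[Double], explainedVariance: Array[Double])

class PCAAggregatorSketch(vectorColumn: String)
    extends Aggregator[Row, PCABuffer, PCAResult] {

  // The dimension is unknown until the first row arrives, so start empty.
  def zero: PCABuffer = PCABuffer(0L, Array.empty[Double], Array.empty[Double])

  def reduce(b: PCABuffer, row: Row): PCABuffer = {
    val x = row.getAs[Vector](vectorColumn).toArray
    val d = x.length
    val s = if (b.n == 0) new Array[Double](d) else b.sum
    val o = if (b.n == 0) new Array[Double](d * d) else b.outer
    var i = 0
    while (i < d) {
      s(i) += x(i)
      var j = 0
      while (j < d) { o(i * d + j) += x(i) * x(j); j += 1 }
      i += 1
    }
    PCABuffer(b.n + 1, s, o)
  }

  def merge(a: PCABuffer, b: PCABuffer): PCABuffer =
    if (a.n == 0) b
    else if (b.n == 0) a
    else PCABuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (p, q) => p + q },
      a.outer.zip(b.outer).map { case (p, q) => p + q })

  def finish(b: PCABuffer): PCAResult = {
    require(b.n > 1, "need at least two rows per group to estimate covariance")
    val d = b.sum.length
    val mean = b.sum.map(_ / b.n)
    // Sample covariance: (sum of outer products - n * mean * mean^T) / (n - 1).
    val cov = Array.tabulate(d * d) { k =>
      val (i, j) = (k / d, k % d)
      (b.outer(k) - b.n * mean(i) * mean(j)) / (b.n - 1)
    }
    // The covariance matrix is symmetric, so row- vs column-major layout is irrelevant here.
    val brzSvd.SVD(u, s, _) = brzSvd(new BDM(d, d, cov))
    val total = s.toArray.sum
    PCAResult(u.toArray, s.toArray.map(_ / total))
  }

  def bufferEncoder: Encoder[PCABuffer] = Encoders.product[PCABuffer]
  def outputEncoder: Encoder[PCAResult] = Encoders.product[PCAResult]
}
```

Accumulating raw sums keeps the buffer mergeable across partitions, at the cost of some numerical precision when feature means are large relative to their variance; a production version would likely also truncate the output to the top k components.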
I've seen others who wanted this ability (for example on Stack Overflow) so I'd like to contribute it if it would be a benefit to the larger community. If there is interest, I will prepare the code for a pull request.