Posted to issues@spark.apache.org by "Matt Saunders (JIRA)" <ji...@apache.org> on 2018/10/19 19:27:00 UTC

[jira] [Created] (SPARK-25782) Add PCA Aggregator to support grouping

Matt Saunders created SPARK-25782:
-------------------------------------

             Summary: Add PCA Aggregator to support grouping
                 Key: SPARK-25782
                 URL: https://issues.apache.org/jira/browse/SPARK-25782
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.3.2
            Reporter: Matt Saunders


I built an Aggregator that computes PCA on grouped datasets. I wanted to use the PCA functions provided by MLlib, but they only operate on a full dataset, and I needed to run PCA separately for each group of a grouped dataset (such as a RelationalGroupedDataset).

Here's an example of how the Aggregator is called:
{noformat}
val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn

// For each grouping, compute a PCA matrix/vector
val pcaModels = inputData
  .groupBy(keys: _*)
  .agg(pcaAggregation.as(pcaOutput))
{noformat}
I used the same algorithm under the hood as RowMatrix.computePrincipalComponentsAndExplainedVariance, but this works directly on Datasets without converting to an RDD first.
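
For illustration only, here is a minimal sketch of the accumulation step that a grouped PCA aggregator of this kind could use. It is not the reporter's code: the name CovBuffer and its methods are hypothetical, and plain Scala collections stand in for Spark's vector types. The buffer keeps a row count, a per-column sum, and a sum of outer products, which is enough to recover the sample covariance matrix that RowMatrix builds before taking its SVD. The final decomposition step is omitted here and would need a linear algebra library.

```scala
// Hypothetical sketch: a mergeable buffer for per-group covariance.
// n = row count, sum = per-column sums, gram = sum of outer products x * x^T.
final case class CovBuffer(n: Long, sum: Vector[Double], gram: Vector[Vector[Double]])

object CovBuffer {
  // Empty buffer for d-dimensional rows (plays the role of an Aggregator's `zero`).
  def zero(d: Int): CovBuffer =
    CovBuffer(0L, Vector.fill(d)(0.0), Vector.fill(d, d)(0.0))

  // Fold one row into the buffer (plays the role of `reduce`).
  def reduce(b: CovBuffer, x: Vector[Double]): CovBuffer =
    CovBuffer(
      b.n + 1,
      b.sum.zip(x).map { case (s, v) => s + v },
      Vector.tabulate(x.length, x.length)((i, j) => b.gram(i)(j) + x(i) * x(j)))

  // Combine two partial buffers (plays the role of `merge`).
  def merge(a: CovBuffer, b: CovBuffer): CovBuffer =
    CovBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (s, t) => s + t },
      a.gram.zip(b.gram).map { case (r, q) => r.zip(q).map { case (u, v) => u + v } })

  // Sample covariance: (gram(i)(j) - sum(i) * sum(j) / n) / (n - 1).
  // A full `finish` would go on to decompose this matrix to get the components.
  def covariance(b: CovBuffer): Vector[Vector[Double]] = {
    require(b.n > 1, "need at least two rows")
    val n = b.n.toDouble
    val d = b.sum.length
    Vector.tabulate(d, d)((i, j) => (b.gram(i)(j) - b.sum(i) * b.sum(j) / n) / (n - 1))
  }
}
```

Because merge is associative and commutative, partial buffers computed on different partitions can be combined in any order, which is what lets the computation run per group without first collecting each group's rows.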

I've seen others who wanted this ability (for example on Stack Overflow) so I'd like to contribute it if it would be a benefit to the larger community. If there is interest, I will prepare the code for a pull request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
