Posted to dev@spark.apache.org by Matt Saunders <ma...@saunders.net> on 2018/10/17 23:53:59 UTC

[MLlib] PCA Aggregator

I built an Aggregator that computes PCA on grouped datasets. I wanted to
use the PCA functions provided by MLlib, but they only work on a full
dataset, and I needed to do it on a grouped dataset (like a
RelationalGroupedDataset).

So I built a little Aggregator that can do that; here’s an example of how
it’s called:

    val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn

    // For each grouping, compute a PCA matrix/vector
    val pcaModels = inputData
      .groupBy(keys:_*)
      .agg(pcaAggregation.as(pcaOutput))

Under the hood I used the same algorithms as
RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
directly on Datasets without converting to an RDD first.
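
Roughly, the shape of the Aggregator is something like the simplified
sketch below. This is only an illustration, not the exact code: the
Row-based input, the explicit k parameter, and the buffer layout are
assumptions I'm making here to keep the sketch self-contained.

    import breeze.linalg.{eigSym, DenseMatrix => BDM, DenseVector => BDV}
    import org.apache.spark.ml.linalg.{DenseMatrix, Matrix, Vector}
    import org.apache.spark.sql.{Encoder, Encoders, Row}
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    import org.apache.spark.sql.expressions.Aggregator

    // Per-group running state: row count, per-feature sums, and the Gram
    // matrix (sum of outer products), kept flat so it encodes as a case class.
    case class PCABuffer(n: Long, sums: Array[Double], gram: Array[Double])

    class PCAAggregator(vectorCol: String, k: Int)
        extends Aggregator[Row, PCABuffer, Matrix] {

      override def zero: PCABuffer = PCABuffer(0L, Array.empty, Array.empty)

      override def reduce(b: PCABuffer, row: Row): PCABuffer = {
        val v = row.getAs[Vector](vectorCol)
        val d = v.size
        val sums = if (b.n == 0L) new Array[Double](d) else b.sums
        val gram = if (b.n == 0L) new Array[Double](d * d) else b.gram
        var i = 0
        while (i < d) {
          sums(i) += v(i)
          var j = 0
          while (j < d) { gram(i * d + j) += v(i) * v(j); j += 1 }
          i += 1
        }
        PCABuffer(b.n + 1, sums, gram)
      }

      override def merge(a: PCABuffer, b: PCABuffer): PCABuffer =
        if (a.n == 0L) b
        else if (b.n == 0L) a
        else PCABuffer(
          a.n + b.n,
          a.sums.zip(b.sums).map { case (x, y) => x + y },
          a.gram.zip(b.gram).map { case (x, y) => x + y })

      override def finish(buf: PCABuffer): Matrix = {
        val n = buf.n.toDouble
        val d = buf.sums.length
        require(k <= d, s"k=$k must not exceed the number of features $d")
        val mean = BDV(buf.sums) * (1.0 / n)
        // Sample covariance from the Gram matrix: (G - n * mu * mu^T) / (n - 1)
        val cov = (new BDM(d, d, buf.gram) - (mean * mean.t) * n) *
          (1.0 / (n - 1.0))
        val es = eigSym(cov) // Breeze returns eigenvalues in ascending order
        val topK = (d - 1 to d - k by -1).map(i => es.eigenvectors(::, i).toArray)
        new DenseMatrix(d, k, topK.flatten.toArray) // columns = top-k axes
      }

      override def bufferEncoder: Encoder[PCABuffer] = Encoders.product[PCABuffer]
      // ml.linalg.Matrix has a registered UDT, so a reflective encoder
      // should work for the output.
      override def outputEncoder: Encoder[Matrix] = ExpressionEncoder[Matrix]()
    }

Accumulating the Gram matrix keeps reduce and merge to a single pass with
O(d^2) state per group, which is what makes PCA workable as an aggregation.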

I’ve seen others who wanted this ability (for example on Stack Overflow) so
I’d like to contribute it if it would be a benefit to the larger community.

So, is this something worth contributing to MLlib? And if so, what are the
next steps to start the process?

thanks!

Re: [MLlib] PCA Aggregator

Posted by Sean Owen <sr...@gmail.com>.
I think this is great info and context to put in the JIRA.

On Fri, Oct 19, 2018, 6:53 PM Matt Saunders <ma...@saunders.net> wrote:

> Hi Sean, thanks for your feedback. I saw this as a missing feature in the
> existing PCA implementation in MLlib. I suspect the use case is a common
> one: you have data from different entities (could be different users,
> different locations, or different products, for example) and you need to
> model them separately since they behave differently--perhaps their features
> run in different ranges, or perhaps they have completely different
> features.
>
> For example if you were modeling the weather in different parts of the
> world for a given time period, and the features were things like
> temperature, humidity, wind speed, pressure, etc. With the current
> PCA/RowMatrix options, you can only calculate PCA on the entire dataset,
> when you really want to model the weather in New York separately from the
> weather in Buenos Aires. Today your options are to collect the data from
> each city and calculate PCA using some other library like Breeze, or use
> the PCA implementation from MLlib but only on one key at a time.
>
> The reason I thought it would be useful in Spark is that it makes the PCA
> offering in MLlib useful to more people. As it stands today, I wasn't able
> to use it for much and I suspect others had the same experience, for
> example:
>
> https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark
>
> This isn't really big enough to warrant its own library--it's just a
> single class. But if you think it's better to publish it externally I can
> certainly do that.
>
> thanks again,
> --Matt
>
>
> On Fri, Oct 19, 2018 at 4:14 PM Sean Owen <sr...@gmail.com> wrote:
>
>> It's OK to open a JIRA though I generally doubt any new functionality
>> will be added. This might be viewed as a small worthwhile enhancement,
>> haven't looked at it. It's always more compelling if you can sketch the use
>> case for it and why it is more meaningful in spark than outside it.
>>
>> There is spark-packages for recording third party packages but it is not
>> required nor even necessarily a comprehensive list. You can just self
>> publish like any git or Maven project, if you develop a third party library
>>
>> On Fri, Oct 19, 2018, 2:32 PM Matt Saunders <ma...@saunders.net> wrote:
>>
>>> Thanks, Eric. I went ahead and created SPARK-25782 for this improvement
>>> since it is a feature I and others have looked for in MLlib, but doesn't
>>> seem to exist yet. Also, while searching for PCA-related issues in JIRA I
>>> noticed that someone added grouping support for PCA to the MADlib project a
>>> while back (see MADLIB-947), so there does seem to be a demand for it.
>>>
>>> thanks!
>>> --Matt
>>>
>>>
>>> On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <ee...@redhat.com>
>>> wrote:
>>>
>>>> Hi Matt!
>>>>
>>>> There are a couple ways to do this. If you want to submit it for
>>>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>>>> pull request.   Another possibility is to publish it as your own 3rd party
>>>> library, which I have done for aggregators before.
>>>>
>>>>
>>>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <ma...@saunders.net>
>>>> wrote:
>>>>
>>>>> I built an Aggregator that computes PCA on grouped datasets. I wanted
>>>>> to use the PCA functions provided by MLlib, but they only work on a full
>>>>> dataset, and I needed to do it on a grouped dataset (like a
>>>>> RelationalGroupedDataset).
>>>>>
>>>>> So I built a little Aggregator that can do that, here’s an example of
>>>>> how it’s called:
>>>>>
>>>>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>>>
>>>>>     // For each grouping, compute a PCA matrix/vector
>>>>>     val pcaModels = inputData
>>>>>       .groupBy(keys:_*)
>>>>>       .agg(pcaAggregation.as(pcaOutput))
>>>>>
>>>>> I used the same algorithms under the hood as
>>>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>>>>> directly on Datasets without converting to RDD first.
>>>>>
>>>>> I’ve seen others who wanted this ability (for example on Stack
>>>>> Overflow) so I’d like to contribute it if it would be a benefit to the
>>>>> larger community.
>>>>>
>>>>> So.. is this something worth contributing to MLlib? And if so, what
>>>>> are the next steps to start the process?
>>>>>
>>>>> thanks!
>>>>>
>>>>

Re: [MLlib] PCA Aggregator

Posted by Matt Saunders <ma...@saunders.net>.
Hi Sean, thanks for your feedback. I saw this as a missing feature in the
existing PCA implementation in MLlib. I suspect the use case is a common
one: you have data from different entities (could be different users,
different locations, or different products, for example) and you need to
model them separately since they behave differently--perhaps their features
run in different ranges, or perhaps they have completely different
features.

For example, suppose you were modeling the weather in different parts of
the world for a given time period, with features like temperature,
humidity, wind speed, and pressure. With the current PCA/RowMatrix options,
you can only compute PCA on the entire dataset, when you really want to
model the weather in New York separately from the weather in Buenos Aires.
Today your options are to collect the data for each city and compute PCA
with some other library like Breeze, or to use the PCA implementation from
MLlib on only one key at a time.
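
To make that concrete, the one-key-at-a-time route looks roughly like the
sketch below (the column names "city" and "features" are just placeholders
for this example):

    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.ml.linalg.DenseMatrix
    import org.apache.spark.sql.DataFrame

    // Hypothetical schema: a grouping column "city" and an ML Vector column
    // "features" holding temperature, humidity, wind speed, pressure, ...
    def pcaPerCity(weather: DataFrame, k: Int): Map[String, DenseMatrix] = {
      val cities = weather.select("city").distinct().collect().map(_.getString(0))
      cities.map { city =>
        // A separate filter + model fit (and Spark job) for every key.
        val model = new PCA()
          .setInputCol("features")
          .setOutputCol("pcaFeatures")
          .setK(k)
          .fit(weather.filter(weather("city") === city))
        city -> model.pc
      }.toMap
    }

That works for a handful of keys, but every fit is its own job over a
filtered copy of the data, which is exactly what a grouped aggregation
avoids.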

The reason I thought it would be useful in Spark is that it makes the PCA
offering in MLlib useful to more people. As it stands today, I wasn't able
to use it for much and I suspect others had the same experience, for
example:
https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark

This isn't really big enough to warrant its own library--it's just a single
class. But if you think it's better to publish it externally I can
certainly do that.

thanks again,
--Matt


On Fri, Oct 19, 2018 at 4:14 PM Sean Owen <sr...@gmail.com> wrote:

> It's OK to open a JIRA though I generally doubt any new functionality will
> be added. This might be viewed as a small worthwhile enhancement, haven't
> looked at it. It's always more compelling if you can sketch the use case
> for it and why it is more meaningful in spark than outside it.
>
> There is spark-packages for recording third party packages but it is not
> required nor even necessarily a comprehensive list. You can just self
> publish like any git or Maven project, if you develop a third party library
>
> On Fri, Oct 19, 2018, 2:32 PM Matt Saunders <ma...@saunders.net> wrote:
>
>> Thanks, Eric. I went ahead and created SPARK-25782 for this improvement
>> since it is a feature I and others have looked for in MLlib, but doesn't
>> seem to exist yet. Also, while searching for PCA-related issues in JIRA I
>> noticed that someone added grouping support for PCA to the MADlib project a
>> while back (see MADLIB-947), so there does seem to be a demand for it.
>>
>> thanks!
>> --Matt
>>
>>
>> On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <ee...@redhat.com>
>> wrote:
>>
>>> Hi Matt!
>>>
>>> There are a couple ways to do this. If you want to submit it for
>>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>>> pull request.   Another possibility is to publish it as your own 3rd party
>>> library, which I have done for aggregators before.
>>>
>>>
>>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <ma...@saunders.net> wrote:
>>>
>>>> I built an Aggregator that computes PCA on grouped datasets. I wanted
>>>> to use the PCA functions provided by MLlib, but they only work on a full
>>>> dataset, and I needed to do it on a grouped dataset (like a
>>>> RelationalGroupedDataset).
>>>>
>>>> So I built a little Aggregator that can do that, here’s an example of
>>>> how it’s called:
>>>>
>>>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>>
>>>>     // For each grouping, compute a PCA matrix/vector
>>>>     val pcaModels = inputData
>>>>       .groupBy(keys:_*)
>>>>       .agg(pcaAggregation.as(pcaOutput))
>>>>
>>>> I used the same algorithms under the hood as
>>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>>>> directly on Datasets without converting to RDD first.
>>>>
>>>> I’ve seen others who wanted this ability (for example on Stack
>>>> Overflow) so I’d like to contribute it if it would be a benefit to the
>>>> larger community.
>>>>
>>>> So.. is this something worth contributing to MLlib? And if so, what are
>>>> the next steps to start the process?
>>>>
>>>> thanks!
>>>>
>>>

Re: [MLlib] PCA Aggregator

Posted by Sean Owen <sr...@gmail.com>.
It's OK to open a JIRA, though I generally doubt any new functionality will
be added. This might be viewed as a small but worthwhile enhancement; I
haven't looked at it yet. It's always more compelling if you can sketch the
use case for it and why it is more meaningful in Spark than outside it.

There is spark-packages for recording third-party packages, but it is
neither required nor necessarily a comprehensive list. You can just
self-publish like any Git or Maven project if you develop a third-party
library.

On Fri, Oct 19, 2018, 2:32 PM Matt Saunders <ma...@saunders.net> wrote:

> Thanks, Eric. I went ahead and created SPARK-25782 for this improvement
> since it is a feature I and others have looked for in MLlib, but doesn't
> seem to exist yet. Also, while searching for PCA-related issues in JIRA I
> noticed that someone added grouping support for PCA to the MADlib project a
> while back (see MADLIB-947), so there does seem to be a demand for it.
>
> thanks!
> --Matt
>
>
> On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <ee...@redhat.com>
> wrote:
>
>> Hi Matt!
>>
>> There are a couple ways to do this. If you want to submit it for
>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>> pull request.   Another possibility is to publish it as your own 3rd party
>> library, which I have done for aggregators before.
>>
>>
>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <ma...@saunders.net> wrote:
>>
>>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>>> use the PCA functions provided by MLlib, but they only work on a full
>>> dataset, and I needed to do it on a grouped dataset (like a
>>> RelationalGroupedDataset).
>>>
>>> So I built a little Aggregator that can do that, here’s an example of
>>> how it’s called:
>>>
>>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>
>>>     // For each grouping, compute a PCA matrix/vector
>>>     val pcaModels = inputData
>>>       .groupBy(keys:_*)
>>>       .agg(pcaAggregation.as(pcaOutput))
>>>
>>> I used the same algorithms under the hood as
>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>>> directly on Datasets without converting to RDD first.
>>>
>>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>>> so I’d like to contribute it if it would be a benefit to the larger
>>> community.
>>>
>>> So.. is this something worth contributing to MLlib? And if so, what are
>>> the next steps to start the process?
>>>
>>> thanks!
>>>
>>

Re: [MLlib] PCA Aggregator

Posted by Matt Saunders <ma...@saunders.net>.
Thanks, Erik. I went ahead and created SPARK-25782 for this improvement,
since it is a feature I and others have looked for in MLlib but that
doesn't seem to exist yet. Also, while searching for PCA-related issues in
JIRA, I noticed that someone added grouping support for PCA to the MADlib
project a while back (see MADLIB-947), so there does seem to be demand for
it.

thanks!
--Matt


On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <ee...@redhat.com> wrote:

> Hi Matt!
>
> There are a couple ways to do this. If you want to submit it for inclusion
> in Spark, you should start by filing a JIRA for it, and then a pull
> request.   Another possibility is to publish it as your own 3rd party
> library, which I have done for aggregators before.
>
>
> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <ma...@saunders.net> wrote:
>
>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>> use the PCA functions provided by MLlib, but they only work on a full
>> dataset, and I needed to do it on a grouped dataset (like a
>> RelationalGroupedDataset).
>>
>> So I built a little Aggregator that can do that, here’s an example of how
>> it’s called:
>>
>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>
>>     // For each grouping, compute a PCA matrix/vector
>>     val pcaModels = inputData
>>       .groupBy(keys:_*)
>>       .agg(pcaAggregation.as(pcaOutput))
>>
>> I used the same algorithms under the hood as
>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>> directly on Datasets without converting to RDD first.
>>
>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>> so I’d like to contribute it if it would be a benefit to the larger
>> community.
>>
>> So.. is this something worth contributing to MLlib? And if so, what are
>> the next steps to start the process?
>>
>> thanks!
>>
>

Re: [MLlib] PCA Aggregator

Posted by Erik Erlandson <ee...@redhat.com>.
For 3rd-party libs, I have been publishing independently, for example at
isarn-sketches-spark or silex:
https://github.com/isarn/isarn-sketches-spark
https://github.com/radanalyticsio/silex

Either of these repos provides some good working examples of publishing a
Spark UDAF or ML library for the JVM and PySpark.
(If anyone is interested in contributing new components to either of these,
feel free to reach out)
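
For what it's worth, the build setup for that kind of library is small.
Here is a minimal build.sbt sketch (coordinates and versions are
placeholders), with Spark marked "provided" so users supply their own
Spark at runtime:

    // build.sbt -- minimal self-published Spark library (placeholder names)
    name         := "pca-aggregator"
    organization := "com.example"
    version      := "0.1.0"
    scalaVersion := "2.11.12"

    // Compile against Spark, but let the user's cluster provide it at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"   % "2.3.2" % "provided",
      "org.apache.spark" %% "spark-mllib" % "2.3.2" % "provided",
      "org.scalatest"    %% "scalatest"   % "3.0.5" % "test"
    )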

For people new to Spark library dev, Will Benton and I recently gave a talk
at Spark+AI Summit EU on publishing Spark libraries:
https://databricks.com/session/apache-spark-for-library-developers-2

Cheers,
Erik

On Fri, Oct 19, 2018 at 9:40 AM Stephen Boesch <ja...@gmail.com> wrote:

> Erik - is there a current locale for approved/recommended third party
> additions?  The spark-packages has been stale for years it seems.
>
> Am Fr., 19. Okt. 2018 um 07:06 Uhr schrieb Erik Erlandson <
> eerlands@redhat.com>:
>
>> Hi Matt!
>>
>> There are a couple ways to do this. If you want to submit it for
>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>> pull request.   Another possibility is to publish it as your own 3rd party
>> library, which I have done for aggregators before.
>>
>>
>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <ma...@saunders.net> wrote:
>>
>>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>>> use the PCA functions provided by MLlib, but they only work on a full
>>> dataset, and I needed to do it on a grouped dataset (like a
>>> RelationalGroupedDataset).
>>>
>>> So I built a little Aggregator that can do that, here’s an example of
>>> how it’s called:
>>>
>>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>
>>>     // For each grouping, compute a PCA matrix/vector
>>>     val pcaModels = inputData
>>>       .groupBy(keys:_*)
>>>       .agg(pcaAggregation.as(pcaOutput))
>>>
>>> I used the same algorithms under the hood as
>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>>> directly on Datasets without converting to RDD first.
>>>
>>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>>> so I’d like to contribute it if it would be a benefit to the larger
>>> community.
>>>
>>> So.. is this something worth contributing to MLlib? And if so, what are
>>> the next steps to start the process?
>>>
>>> thanks!
>>>
>>

Re: [MLlib] PCA Aggregator

Posted by Stephen Boesch <ja...@gmail.com>.
Erik - is there a current venue for approved/recommended third-party
additions? The spark-packages site seems to have been stale for years.

Am Fr., 19. Okt. 2018 um 07:06 Uhr schrieb Erik Erlandson <
eerlands@redhat.com>:

> Hi Matt!
>
> There are a couple ways to do this. If you want to submit it for inclusion
> in Spark, you should start by filing a JIRA for it, and then a pull
> request.   Another possibility is to publish it as your own 3rd party
> library, which I have done for aggregators before.
>
>
> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <ma...@saunders.net> wrote:
>
>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>> use the PCA functions provided by MLlib, but they only work on a full
>> dataset, and I needed to do it on a grouped dataset (like a
>> RelationalGroupedDataset).
>>
>> So I built a little Aggregator that can do that, here’s an example of how
>> it’s called:
>>
>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>
>>     // For each grouping, compute a PCA matrix/vector
>>     val pcaModels = inputData
>>       .groupBy(keys:_*)
>>       .agg(pcaAggregation.as(pcaOutput))
>>
>> I used the same algorithms under the hood as
>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>> directly on Datasets without converting to RDD first.
>>
>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>> so I’d like to contribute it if it would be a benefit to the larger
>> community.
>>
>> So.. is this something worth contributing to MLlib? And if so, what are
>> the next steps to start the process?
>>
>> thanks!
>>
>

Re: [MLlib] PCA Aggregator

Posted by Erik Erlandson <ee...@redhat.com>.
Hi Matt!

There are a couple of ways to do this. If you want to submit it for
inclusion in Spark, you should start by filing a JIRA for it and then
opening a pull request. Another possibility is to publish it as your own
3rd-party library, which I have done for aggregators before.


On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <ma...@saunders.net> wrote:

> I built an Aggregator that computes PCA on grouped datasets. I wanted to
> use the PCA functions provided by MLlib, but they only work on a full
> dataset, and I needed to do it on a grouped dataset (like a
> RelationalGroupedDataset).
>
> So I built a little Aggregator that can do that, here’s an example of how
> it’s called:
>
>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>
>     // For each grouping, compute a PCA matrix/vector
>     val pcaModels = inputData
>       .groupBy(keys:_*)
>       .agg(pcaAggregation.as(pcaOutput))
>
> I used the same algorithms under the hood as
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
> directly on Datasets without converting to RDD first.
>
> I’ve seen others who wanted this ability (for example on Stack Overflow)
> so I’d like to contribute it if it would be a benefit to the larger
> community.
>
> So.. is this something worth contributing to MLlib? And if so, what are
> the next steps to start the process?
>
> thanks!
>