Posted to dev@mahout.apache.org by Raphael Cendrillon <ce...@gmail.com> on 2011/12/05 19:33:36 UTC

Re: When is PCA expected to be fully implemented into Mahout?

Given that this request seems to come up frequently, would it be worth
putting this approach under mahout-examples?  Initially it could use the
brute-force approach together with SSVD, and be updated later once support
for mean-subtraction within SSVD is ready.

I could put something together if there's interest.

On Mon, Dec 5, 2011 at 9:40 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I am working on the additions to the SSVD algorithms, and the mods to the
> current solver will probably emerge within a month, my schedule permitting.
>
> However, a brute-force approach is already possible. If your input is of
> moderate size, or if it is already dense, you could compute the mean and
> subtract it yourself very easily, then shove the result into the SSVD
> solver while requesting it to produce either U or V, depending on whether
> you subtract the column-wise or the row-wise mean.
>
> The only problem with the brute-force approach is that it would densify
> an originally sparse input. Depending on your problem and the number of
> machine nodes you can spare, that may or may not be a problem.
> On Dec 4, 2011 7:59 PM, "magicalo" <ma...@yahoo.com> wrote:
>
> > Hello,
> >
> > Is there an expected release date for the PCA algorithm as part of
> Mahout?
> > Tx!
> >
> >
>
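For reference, the brute-force route described above can be sketched in a few
lines (an illustrative NumPy sketch, not the Mahout API; `np.linalg.svd`
stands in for the SSVD solver, and the function name is made up):

```python
import numpy as np

def brute_force_pca(A, k):
    """Brute-force PCA: subtract the column-wise mean, then take an SVD.

    np.linalg.svd stands in here for Mahout's SSVD solver. Note that the
    subtraction densifies a sparse input, which is exactly the drawback
    discussed in the thread.
    """
    mean = A.mean(axis=0)           # column-wise mean (mean of all rows)
    centered = A - mean             # this step densifies sparse inputs
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k].T      # project onto the top-k principal axes

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
scores = brute_force_pca(A, k=1)
```

Since the rows of `A` here are collinear, a single component captures all of
the variance, and the projected scores are centered around zero.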

Re: When is PCA expected to be fully implemented into Mahout?

Posted by magicalo <ma...@yahoo.com>.
Having this added to the examples would be really awesome! Tx


Re: When is PCA expected to be fully implemented into Mahout?

Posted by Ted Dunning <te...@gmail.com>.
My guess is that a single reducer is simpler and will be good enough.

On Mon, Dec 5, 2011 at 1:02 PM, Raphael Cendrillon <cendrillon1978@gmail.com
> wrote:

> By the way, currently I have the rows being added at the combiner, and
> then the results of the combiners added in a single reducer. Do you think
> this is sufficient, or should multiple reducers be used (per column) to
> further spread the load?

Re: When is PCA expected to be fully implemented into Mahout?

Posted by Raphael Cendrillon <ce...@gmail.com>.
I've put a patch for column-wise mean (row average) calculation under
MAHOUT-880. It's also up on the review board here:

https://reviews.apache.org/r/2955/diff/3/

Could you please take a look?

Re: When is PCA expected to be fully implemented into Mahout?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ted, yes, one can defeat it. I usually take proper care, but rarely
completely. Even if you bulk up the values to reduce the number of keys,
there is still going to be sorting. The other way to do it is to keep
things in many small map output files that correspond to the order of
rows in the original splits and reuse that structure later. That's also
something I do in the map-only QR step, thus bypassing the sort overhead
completely.

But with general multiplication I think one can only partially defeat
it, by using bulk block values.


Re: When is PCA expected to be fully implemented into Mahout?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Sometimes, when I care.

In SSVD I just play with outer products of (k+p)x(k+p) size, accumulate
them in the mappers directly, reduce them to a manageable number
(<=1000), and finalize in the front end in a single thread.

With power iterations I also use vertical blocks of significant size as
values when I compute AB'. Vertical blocking is easy because I know the
product rows are short. That alone allowed me to slash the
multiplication step there by an order of magnitude compared to when I
was using per-row summation of outer products (I guess that's the way
DRM does multiplication, but it can't assume vertical blocking). Anyway,
the key is to store fewer values for shuffle-and-sort, in bigger blocks
if needed, instead of doing it on a per-vector basis.

-d
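The outer-product accumulation trick can be illustrated as follows (a NumPy
sketch under assumed dimensions; the split/partial structure is hypothetical,
not SSVD's actual code). Because Y'Y is the sum of the per-split Y_i'Y_i,
each "mapper" can emit one small (k+p)x(k+p) matrix instead of one outer
product per row:

```python
import numpy as np

rng = np.random.default_rng(0)
k_plus_p = 4
Y = rng.standard_normal((1000, k_plus_p))   # tall-and-thin intermediate

# Each "mapper" accumulates the (k+p)x(k+p) outer-product sum for its own
# split directly, so only one small matrix per split hits shuffle-and-sort.
splits = np.array_split(Y, 10)
partials = [S.T @ S for S in splits]        # one (k+p)x(k+p) value per split

# "Front end" finalization: sum the manageable number of partials.
G = sum(partials)
```

The result is identical to computing Y'Y in one shot, but only ten small
values ever need to be shuffled.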


Re: When is PCA expected to be fully implemented into Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Dmitriy,

Are you sure that you can't defeat this sorting?


Re: When is PCA expected to be fully implemented into Mahout?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
What I usually do for small products is accumulate them in the map
without combiners and then send them to the reducers to combine there.
Using combiners creates a local sort for something you don't really
want to sort, unless you can't keep it in memory.

You also have the choice of forcing one reducer to ensure one product,
or (what I do) building a loader that just goes over all the vectors
and reduces them on the fly whenever needed.

MR is fundamentally, and too often, not so great when it comes to
linear algebra operations. This is a major deficiency as far as I see
it, because it forces you to sort output that already has an inner
structure one can capitalize on.

Simple example: suppose you want to deinterlace a TV signal. You have
two fields, one for the odd lines and another for the even lines. So
naturally, what you do is create a simple merging algorithm that reads
lines from each of the inputs alternately and outputs one single
deinterlaced frame. O(n).

Now, the MapReduce way to deal with it is to shuffle and sort all the
lines by their line number and then output the total order (O(n log n)).

So dealing with large blocks is like deinterlacing: you sort, but you
absolutely don't have to, because you already have structure in the
output (even if the blocks don't fit in memory).
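The deinterlacing example in code (an illustrative sketch; the field lists
here are made up): merging two fields that are already in line order is a
single O(n) pass with no shuffle-and-sort at all:

```python
def deinterlace(even_field, odd_field):
    """Merge two fields that are already in line order.

    even_field holds lines 0, 2, 4, ... and odd_field holds lines
    1, 3, 5, ... A single alternating pass produces the frame in O(n),
    exploiting the structure the inputs already have instead of sorting.
    """
    frame = []
    for even_line, odd_line in zip(even_field, odd_field):
        frame.append(even_line)
        frame.append(odd_line)
    return frame

frame = deinterlace(["line0", "line2"], ["line1", "line3"])
```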



Re: When is PCA expected to be fully implemented into Mahout?

Posted by Raphael Cendrillon <ce...@gmail.com>.
Thanks for clarifying. I think we're all on the same page on this, although using different terms. I'll package up the job I currently have for this and submit a patch. 

By the way, currently I have the rows being added at the combiner, and then the results of the combiners added in a single reducer. Do you think this is sufficient, or should multiple reducers be used (per column) to further spread the load?
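The combiner-plus-single-reducer dataflow described here can be simulated in
a few lines (a plain-Python sketch of the structure, not the actual
MAHOUT-880 job): each combiner emits one partial (sum, count) pair per
split, so the lone reducer only has to add a handful of small vectors:

```python
import numpy as np

def combine(rows):
    """Combiner: add the rows of one split into a single partial sum."""
    rows = np.asarray(rows)
    return rows.sum(axis=0), len(rows)

def reduce_means(partials):
    """Single reducer: add the few combiner outputs, then divide."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

splits = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
column_means = reduce_means([combine(s) for s in splits])
```

Since the reducer sees one small vector per combiner rather than one per row,
a single reducer is rarely the bottleneck, which matches Ted's suggestion.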


Re: When is PCA expected to be fully implemented into Mahout?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
OK, column-wise mean (the mean across all rows).


Re: When is PCA expected to be fully implemented into Mahout?

Posted by Raphael Cendrillon <ce...@gmail.com>.
I guess we should define what we mean by 'row' here. I'm thinking of the 'distributed row matrix' row. Is that also what you have in mind?

I've been working on row-wise mean computation as part of MAHOUT-880. I could add a column-wise mean job as well.

By the way, if you have any comments on the current patch (available at https://reviews.apache.org/r/2955/diff/2/) they would be much appreciated!


Re: When is PCA expected to be fully implemented into Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Row-wise mean usually means that a mean of each row is computed.

I think that most PCA users would want column-wise means for subtraction.
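In NumPy terms the distinction is just the axis argument (an illustrative
aside): `axis=1` averages within each row, while `axis=0` gives the
column-wise means that PCA subtracts, and broadcasting does the centering:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

row_means = A.mean(axis=1)  # mean of each row  -> one value per row
col_means = A.mean(axis=0)  # column-wise means -> what PCA subtracts
centered = A - col_means    # broadcasting subtracts the column means
```

After centering this way, every column of `centered` has mean zero, which is
the precondition for the covariance interpretation of the SVD.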


Re: When is PCA expected to be fully implemented into Mahout?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
We probably need a row-wise mean computation job anyway, as a separate MR
step. Wanna take a stab?