Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2011/11/26 00:42:11 UTC

Soliciting SSVD documentation review

Hi,

I put a usage and overview doc for SSVD onto the wiki. I'd appreciate it
if somebody else could look through it to check for completeness and
offer suggestions.

I tried to approach it as user-facing documentation, i.e. I tried to
avoid discussing any implementation specifics.

Several users and Nathan Halko have tried it out and commented
favorably on its scalability vs. Lanczos, but I don't know first hand
of any production use. Even our own use is fairly limited (in terms of
the input volume we have ever processed) and has actually diverged
somewhat from this Mahout implementation. Perhaps putting it more in
front of users will help us get more feedback.

Thanks.
-Dmitriy

Re: Soliciting SSVD documentation review

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Thank you, Nathan.

On Wed, Nov 30, 2011 at 9:49 AM, Nathan Halko <na...@spotinfluence.com> wrote:

> Yes, I will time the phases.  My largest dataset is only a couple of gigs
> currently; I ran into the 5G limit on Amazon S3 and need to find a
> workaround.  I figured that might be large enough to see scaling using the
> small instances, but maybe not.  I will work on these issues and see what
> happens. Thanks for your help, Dmitriy.
>
> Nathan

Re: Soliciting SSVD documentation review

Posted by Nathan Halko <na...@spotinfluence.com>.

Yes, I will time the phases.  My largest dataset is only a couple of gigs
currently; I ran into the 5G limit on Amazon S3 and need to find a
workaround.  I figured that might be large enough to see scaling using the
small instances, but maybe not.  I will work on these issues and see what
happens. Thanks for your help, Dmitriy.

Nathan

On Tue, Nov 29, 2011 at 3:24 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> OK, thanks. I will file an issue for the default p.
>
> I also updated the docs re: --reduceTasks.
>
> If you think there's a performance issue, it would be nice if you could
> log the time for the map and reduce phases for all tasks in each case (it
> is reported in the MR web UI at namenode:50030 by default). That would at
> least allow us to narrow any problem down to a particular part of the
> computation. My datasets are too small (~10G), and I run them for a rather
> small k; at that size I don't see any visible irregularities.
>
> Thanks.
> -Dmitriy

Re: Soliciting SSVD documentation review

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

OK, thanks. I will file an issue for the default p.

I also updated the docs re: --reduceTasks.

If you think there's a performance issue, it would be nice if you could
log the time for the map and reduce phases for all tasks in each case (it
is reported in the MR web UI at namenode:50030 by default). That would at
least allow us to narrow any problem down to a particular part of the
computation. My datasets are too small (~10G), and I run them for a rather
small k; at that size I don't see any visible irregularities.

Thanks.
-Dmitriy

On Tue, Nov 29, 2011 at 2:12 PM, Nathan Halko <na...@spotinfluence.com> wrote:
> Thanks for the heads up about numReduceTasks.  I haven't changed the
> parameters much from the defaults yet, so this is probably my problem.
>
> By slave I mean machine: I'm running an m1.small as master and either
> m1.smalls or m1.larges as slaves (datanode, tasktracker, child).
>
> p depends mostly on the decay of the singular values rather than on the
> rank k.  In fact (in the analysis at least) it is completely independent
> of k.  The quantity of interest is sig_k/sig_(k+p) (the signal-to-noise
> ratio); this should be large.  Ideally we would set p as a function of
> this ratio, which depends on the matrix (and is unknown until we have
> already solved the problem  :-) ).  I suggest 25 since, for example,
> tf-idf matrices have a low signal-to-noise ratio.  You could probably use
> less in some cases; if you need p to be larger, you probably need a power
> iteration, so 25 seems to be a good default.  Also, the parameter is not
> an initial point for an optimization, so erring on the larger side is
> fine.  After all, the Lanczos method suggests that only 1/3 of the
> singular triplets are accurate, which corresponds to p=2k, which is very
> large.  Basically, the result is insensitive to the exact value of p so
> long as it is large 'enough'.

Re: Soliciting SSVD documentation review

Posted by Nathan Halko <na...@spotinfluence.com>.

Thanks for the heads up about numReduceTasks.  I haven't changed the
parameters much from the defaults yet, so this is probably my problem.

By slave I mean machine: I'm running an m1.small as master and either
m1.smalls or m1.larges as slaves (datanode, tasktracker, child).

p depends mostly on the decay of the singular values rather than on the
rank k.  In fact (in the analysis at least) it is completely independent
of k.  The quantity of interest is sig_k/sig_(k+p) (the signal-to-noise
ratio); this should be large.  Ideally we would set p as a function of
this ratio, which depends on the matrix (and is unknown until we have
already solved the problem  :-) ).  I suggest 25 since, for example,
tf-idf matrices have a low signal-to-noise ratio.  You could probably use
less in some cases; if you need p to be larger, you probably need a power
iteration, so 25 seems to be a good default.  Also, the parameter is not
an initial point for an optimization, so erring on the larger side is
fine.  After all, the Lanczos method suggests that only 1/3 of the
singular triplets are accurate, which corresponds to p=2k, which is very
large.  Basically, the result is insensitive to the exact value of p so
long as it is large 'enough'.
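
To make the role of p concrete, here is a minimal NumPy sketch of a
randomized SVD with oversampling and optional power iterations. It is not
the Mahout SSVD code: the function names (randomized_svd, decay_ratio), the
synthetic matrix with singular values 1/i, and the defaults are all
assumptions made for illustration. It samples k + p random directions,
solves the small projected problem, and keeps only the leading k triplets.

```python
import numpy as np


def randomized_svd(a, k, p=25, q=0, seed=0):
    """Approximate the top-k singular triplets of `a`.

    k: requested rank, p: oversampling, q: number of power iterations.
    Illustrative only; requires k + p <= min(a.shape).
    """
    rng = np.random.default_rng(seed)
    n = a.shape[1]
    # Project onto k + p random directions instead of just k.
    omega = rng.standard_normal((n, k + p))
    basis, _ = np.linalg.qr(a @ omega)
    for _ in range(q):
        # Each power iteration sharpens the basis when singular values
        # decay slowly; re-orthogonalize at every step for stability.
        z, _ = np.linalg.qr(a.T @ basis)
        basis, _ = np.linalg.qr(a @ z)
    # Small (k + p) x n problem, solved exactly.
    b = basis.T @ a
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    u = basis @ u_b
    # Discard the p extra triplets; only the leading k are kept.
    return u[:, :k], s[:k], vt[:k, :]


def decay_ratio(singular_values, k, p):
    """sig_k / sig_(k+p), using 1-based indices as in the discussion."""
    return singular_values[k - 1] / singular_values[k + p - 1]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    m, n, k = 300, 120, 10
    # Hypothetical test matrix with slowly decaying singular values 1/i.
    left, _ = np.linalg.qr(rng.standard_normal((m, n)))
    right, _ = np.linalg.qr(rng.standard_normal((n, n)))
    sigma = 1.0 / np.arange(1, n + 1)
    a = (left * sigma) @ right.T
    for p in (2, 25):
        _, s_est, _ = randomized_svd(a, k, p=p)
        worst = np.max(np.abs(s_est - sigma[:k]) / sigma[:k])
        print(f"p={p:2d}  sig_k/sig_(k+p)={decay_ratio(sigma, k, p):.2f}  "
              f"worst relative error={worst:.3f}")
```

Running the script prints the ratio and the worst relative error in the
top-k singular values for a small and a larger p; passing q=1 or q=2 adds
power iterations for matrices whose spectrum decays slowly.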

On Tue, Nov 29, 2011 at 12:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> PPS: Also make sure you specify numReduceTasks. The default is, I
> believe, 1, which will not scale at the multiplication steps at all.

Re: Soliciting SSVD documentation review

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

PPS: Also make sure you specify numReduceTasks. The default is, I
believe, 1, which will not scale at the multiplication steps at all.
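
The effect of leaving the reducer count at 1 can be illustrated with a toy
model of how MapReduce assigns keys to reduce tasks. This is a sketch in
Python, not Mahout or Hadoop code: the function names are hypothetical, and
integer keys stand in for whatever keys the multiplication steps emit. It
assumes hash-modulo partitioning, which is what Hadoop's default
HashPartitioner does.

```python
from collections import Counter


def reducer_loads(num_keys, num_reducers):
    """Count how many keys land on each reducer under key % num_reducers."""
    loads = Counter(key % num_reducers for key in range(num_keys))
    return [loads.get(r, 0) for r in range(num_reducers)]


def busiest_fraction(num_keys, num_reducers):
    """Fraction of all keys handled by the most heavily loaded reducer."""
    return max(reducer_loads(num_keys, num_reducers)) / num_keys


if __name__ == "__main__":
    for reducers in (1, 4, 16):
        share = busiest_fraction(1000, reducers)
        print(f"{reducers:2d} reducer(s): busiest one handles {share:.1%}")
```

With a single reducer, that one task handles every key no matter how many
machines are in the cluster, which matches the symptom of seeing no
difference between one slave and many.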

On Tue, Nov 29, 2011 at 10:15 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> PS: Actually, I think it should scale horizontally a little better than
> vertically, but that's just a guess.

Re: Soliciting SSVD documentation review

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

PS: Actually, I think it should scale horizontally a little better than
vertically, but that's just a guess.

On Tue, Nov 29, 2011 at 10:10 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Tue, Nov 29, 2011 at 9:56 AM, Nathan Halko <na...@spotinfluence.com> wrote:
>>
>> The docs look great Dmitriy.  Has anyone considered giving oversampling
>> ssvd over lanczos which is promising.  Trying to scale out horizontally but
>> not seeing any difference between using one slave or many slaves.  Any
>> ideas? (I won't go into detail about the setup here but if sounds familiar
>> I'd like to talk more).
>
> What do you mean by a slave? A mapper? A machine?
>
> Whether you increase the input horizontally or vertically, you should see
> more mappers. If your cluster has enough capacity to schedule all
> mappers right away, I believe you will get almost the same time (i.e.
> almost linear scaling) for most of the jobs.
>
>> The basic problem with Lanczos in the distributed
>> environment seems to be that a matrix-vector multiply is not enough work to
>> offset the setup costs. Also, there is no distributed orthogonalization
>> with Lanczos and I'm getting OOMs, making it difficult to scale.  I would
>> still like to contribute what results I have found, but I'm short on time,
>> so nothing besides work directly related to the completion of my thesis
>> will happen until that is done.
>>
>
>> On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> > I attached the LaTeX source as well (LyX, actually). I would've used
>> > the wiki if it supported MathJax. So anyone can modify the usage if need
>> > be. (Anyone who has LyX, anyway.)
>> >
>> > Dev docs were attached to several JIRA issues (and I had blog
>> > entries); if you want more recent copies of them moved over
>> > to the wiki, I'd be happy to do that. Mainly, so far there are 2 working
>> > notes, one for the original method and another for power iterations,
>> > attached to the corresponding JIRAs.
>> >
>> >
>> > On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <gs...@apache.org>
>> > wrote:
>> > > I hooked it into the Algorithms page.
>> > >
>> > > How do you intend to keep the PDF up to date?  I like the focus more on
>> > the user, but it would also be good to have some dev docs.
>> > >
>> > > Also, with both Lanczos and this it would be good if we could hook them
>> > into some real examples.
>> > >
>> > > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> I put a usage and overview doc for SSVD onto wiki. I'd appreciate if
>> > >> somebody else could look thru it, to scan for completeness and
>> > >> suggestions.
>> > >>
>> > >> I tried to approach it as a user-facing documentation, i.e. I tried to
>> > >> avoid discussing any implementation specifics .
>> > >>
>> > >> I had several users and Nathan Halko trying it out and actually
>> > >> favorably commenting on its scalability vs. Lanczos but i don't know
>> > >> first hand of any production use (even our own use is fairly limited
>> > >> (in terms of input volume we ever processed) and actually somewhat
>> > >> diverged from this Mahout implementation. Perhaps putting it more in
>> > >> front of users will help to receive more feedback.
>> > >>
>> > >> Thanks.
>> > >> -Dmitriy
>> > >
>> > > --------------------------------------------
>> > > Grant Ingersoll
>> > > http://www.lucidimagination.com
>> > >
>> > >
>> > >
>> > >
>> >

Re: Soliciting SSVD documentation review

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Tue, Nov 29, 2011 at 9:56 AM, Nathan Halko <na...@spotinfluence.com> wrote:
>
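> The docs look great Dmitriy.  Has anyone considered giving oversampling
> parameter p a default value? Say p = 25.  Slightly high but I imagine most
> use cases are noisy and could benefit from the larger value.  I have been
> testing ssvd and lanczos svd on Amazon EMR.  Seeing about a 15x speedup in
> ssvd over lanczos which is promising.  Trying to scale out horizontally but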
> not seeing any difference between using one slave or many slaves.  Any
> ideas? (I won't go into detail about the setup here but if sounds familiar
> I'd like to talk more).

What do you mean by a slave? A mapper? A machine?

Whether you increase input horizontally or vertically, you should see
more mappers. If your cluster has enough capacity to schedule all
mappers right away, I believe you will get almost the same time (i.e.
almost linear scaling) for most of the jobs.
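
To make the reasoning above concrete, here is a rough back-of-the-envelope
sketch. It assumes one mapper per input split, a uniform running time per
mapper, and a fixed number of map slots in the cluster; the function name and
all parameters are hypothetical and only illustrate the argument, they are
not part of Mahout or Hadoop.

```python
import math

def estimated_map_phase(input_bytes, split_bytes, map_slots, secs_per_mapper):
    """Idealized map-phase estimate (ignores reducers, setup and skew).

    Returns (mappers, waves, seconds), where a "wave" is one round of
    mappers that the cluster can schedule at the same time.
    """
    mappers = math.ceil(input_bytes / split_bytes)  # one mapper per split (assumed)
    waves = math.ceil(mappers / map_slots)          # rounds needed to run them all
    return mappers, waves, waves * secs_per_mapper

GIB = 2 ** 30
MIB = 2 ** 20

# Doubling the input doubles the mappers; with enough slots the time stays flat.
small = estimated_map_phase(10 * GIB, 64 * MIB, map_slots=320, secs_per_mapper=60)
large = estimated_map_phase(20 * GIB, 64 * MIB, map_slots=320, secs_per_mapper=60)

# With too few slots the same input needs a second wave and takes longer.
tight = estimated_map_phase(20 * GIB, 64 * MIB, map_slots=160, secs_per_mapper=60)
```

In this model the time only grows once the mappers no longer fit into a
single wave, which is the "enough capacity to schedule all mappers right
away" condition above.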

> The basic problem with lanczos in the distributed
> environment seems to be that a matrix-vector multiply is not enough work to
> offset any setup costs, also there is not a distributed orthogonalization
> with lanczos and I'm getting OOM's making it difficult to scale.  I would
> still like to contribute what results I have found but I'm short on time so
> nothing besides work directly related to the completion of my thesis will
> happen until that is done.
>

> On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> > I attached the latex source as well (lyx, actually). I would've used
> > Wiki if it supported mathjax. So anyone can modify the usage if need
> > be. (Anyone who has lyx anyway).
> >
> > Dev docs were attached to several jira issues (and i had blog
> > entries), if you want to move more recent copies of them moved  over
> > to wiki, i'd be happy to. Mainly, so far there are 2 working notes,
> > one for original method, and another for power iterations, attached to
> > corresponding jiras.
> >
> >
> > On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <gs...@apache.org>
> > wrote:
> > > I hooked it into the Algorithms page.
> > >
> > > How do you intend to keep the PDF up to date?  I like the focus more on
> > the user, but it would also be good to have some dev docs.
> > >
> > > Also, with both Lanczos and this it would be good if we could hook them
> > into some real examples.
> > >
> > > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
> > >
> > >> Hi,
> > >>
> > >> I put a usage and overview doc for SSVD onto wiki. I'd appreciate if
> > >> somebody else could look thru it, to scan for completeness and
> > >> suggestions.
> > >>
> > >> I tried to approach it as a user-facing documentation, i.e. I tried to
> > >> avoid discussing any implementation specifics .
> > >>
> > >> I had several users and Nathan Halko trying it out and actually
> > >> favorably commenting on its scalability vs. Lanczos but i don't know
> > >> first hand of any production use (even our own use is fairly limited
> > >> (in terms of input volume we ever processed) and actually somewhat
> > >> diverged from this Mahout implementation. Perhaps putting it more in
> > >> front of users will help to receive more feedback.
> > >>
> > >> Thanks.
> > >> -Dmitriy
> > >
> > > --------------------------------------------
> > > Grant Ingersoll
> > > http://www.lucidimagination.com
> > >
> > >
> > >
> > >
> >

Re: Soliciting SSVD documentation review

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Tue, Nov 29, 2011 at 9:56 AM, Nathan Halko <na...@spotinfluence.com> wrote:
> The docs look great Dmitriy.  Has anyone considered giving oversampling
> parameter p a default value? Say p = 25.  Slightly high but I imagine most
> use cases are noisy and could benefit from the larger value.

Yes, that's a good idea that did not occur to me. This parameter might get a
default value.

But wouldn't a good default also depend on k? Say, if you ask for k=100
then perhaps p=15 is enough, but if you ask for k=500 then p=25 sounds
about right. Perhaps we could come up with a heuristic here, with the
default being p = some f(k).
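
Purely as an illustration of what such an f(k) might look like: the sketch
below interpolates linearly between the two points mentioned above (k=100
with p=15, k=500 with p=25) and clamps outside that range. The name
`default_p`, the linear form and the clamping are all assumptions made for
this sketch; nothing like it exists in Mahout.

```python
def default_p(k, k_lo=100, p_lo=15, k_hi=500, p_hi=25):
    """Hypothetical default oversampling p as a function of the rank k.

    Linear interpolation between (k_lo, p_lo) and (k_hi, p_hi),
    clamped to p_lo below k_lo and to p_hi above k_hi.
    """
    if k <= k_lo:
        return p_lo
    if k >= k_hi:
        return p_hi
    frac = (k - k_lo) / (k_hi - k_lo)  # position of k between the two anchors
    return round(p_lo + frac * (p_hi - p_lo))
```

Any monotone function through roughly those two points would do equally
well; the linear form is just the simplest one to reason about.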

> I have been
> testing ssvd and lanczos svd on Amazon EMR.  Seeing about a 15x speedup in
> ssvd over lanczos which is promising.  Trying to scale out horizontally but
> not seeing any difference between using one slave or many slaves.  Any
> ideas? (I won't go into detail about the setup here but if sounds familiar
> I'd like to talk more).  The basic problem with lanczos in the distributed
> environment seems to be that a matrix-vector multiply is not enough work to
> offset any setup costs, also there is not a distributed orthogonalization
> with lanczos and I'm getting OOM's making it difficult to scale.  I would
> still like to contribute what results I have found but I'm short on time so
> nothing besides work directly related to the completion of my thesis will
> happen until that is done.
>
> On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> I attached the latex source as well (lyx, actually). I would've used
>> Wiki if it supported mathjax. So anyone can modify the usage if need
>> be. (Anyone who has lyx anyway).
>>
>> Dev docs were attached to several jira issues (and i had blog
>> entries), if you want to move more recent copies of them moved  over
>> to wiki, i'd be happy to. Mainly, so far there are 2 working notes,
>> one for original method, and another for power iterations, attached to
>> corresponding jiras.
>>
>>
>> On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <gs...@apache.org>
>> wrote:
>> > I hooked it into the Algorithms page.
>> >
>> > How do you intend to keep the PDF up to date?  I like the focus more on
>> the user, but it would also be good to have some dev docs.
>> >
>> > Also, with both Lanczos and this it would be good if we could hook them
>> into some real examples.
>> >
>> > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
>> >
>> >> Hi,
>> >>
>> >> I put a usage and overview doc for SSVD onto wiki. I'd appreciate if
>> >> somebody else could look thru it, to scan for completeness and
>> >> suggestions.
>> >>
>> >> I tried to approach it as a user-facing documentation, i.e. I tried to
>> >> avoid discussing any implementation specifics .
>> >>
>> >> I had several users and Nathan Halko trying it out and actually
>> >> favorably commenting on its scalability vs. Lanczos but i don't know
>> >> first hand of any production use (even our own use is fairly limited
>> >> (in terms of input volume we ever processed) and actually somewhat
>> >> diverged from this Mahout implementation. Perhaps putting it more in
>> >> front of users will help to receive more feedback.
>> >>
>> >> Thanks.
>> >> -Dmitriy
>> >
>> > --------------------------------------------
>> > Grant Ingersoll
>> > http://www.lucidimagination.com
>> >
>> >
>> >
>> >
>>

Re: Soliciting SSVD documentation review

Posted by Nathan Halko <na...@spotinfluence.com>.

The docs look great Dmitriy.  Has anyone considered giving oversampling
parameter p a default value? Say p = 25.  Slightly high but I imagine most
use cases are noisy and could benefit from the larger value.  I have been
testing ssvd and lanczos svd on Amazon EMR.  Seeing about a 15x speedup in
ssvd over lanczos which is promising.  Trying to scale out horizontally but
not seeing any difference between using one slave or many slaves.  Any
ideas? (I won't go into detail about the setup here but if this sounds familiar
I'd like to talk more).  The basic problem with lanczos in the distributed
environment seems to be that a matrix-vector multiply is not enough work to
offset any setup costs.  Also, there is no distributed orthogonalization
with lanczos and I'm getting OOMs, making it difficult to scale.  I would
still like to contribute what results I have found but I'm short on time so
nothing besides work directly related to the completion of my thesis will
happen until that is done.
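
The setup-cost point above can be put into a small arithmetic sketch. It
assumes that each Lanczos iteration is launched as its own job with a fixed
setup cost on top of the useful multiply work; the function names and the
numbers in the usage lines are made up for illustration and are not
measurements.

```python
def overhead_fraction(setup_secs, work_secs):
    """Share of one job's wall time spent on setup rather than useful work."""
    return setup_secs / (setup_secs + work_secs)

def total_time(num_jobs, setup_secs, work_secs):
    """Wall time for num_jobs sequential jobs, each paying the setup cost."""
    return num_jobs * (setup_secs + work_secs)

# Illustrative only: 30 s of setup per job against 10 s of multiply work.
frac = overhead_fraction(setup_secs=30, work_secs=10)
many_small_jobs = total_time(num_jobs=100, setup_secs=30, work_secs=10)
```

When the work per job is small compared to the setup, adding machines
shrinks only the work term, so the total barely moves.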

On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I attached the latex source as well (lyx, actually). I would've used
> Wiki if it supported mathjax. So anyone can modify the usage if need
> be. (Anyone who has lyx anyway).
>
> Dev docs were attached to several jira issues (and i had blog
> entries), if you want to move more recent copies of them moved  over
> to wiki, i'd be happy to. Mainly, so far there are 2 working notes,
> one for original method, and another for power iterations, attached to
> corresponding jiras.
>
>
> On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <gs...@apache.org>
> wrote:
> > I hooked it into the Algorithms page.
> >
> > How do you intend to keep the PDF up to date?  I like the focus more on
> the user, but it would also be good to have some dev docs.
> >
> > Also, with both Lanczos and this it would be good if we could hook them
> into some real examples.
> >
> > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
> >
> >> Hi,
> >>
> >> I put a usage and overview doc for SSVD onto wiki. I'd appreciate if
> >> somebody else could look thru it, to scan for completeness and
> >> suggestions.
> >>
> >> I tried to approach it as a user-facing documentation, i.e. I tried to
> >> avoid discussing any implementation specifics .
> >>
> >> I had several users and Nathan Halko trying it out and actually
> >> favorably commenting on its scalability vs. Lanczos but i don't know
> >> first hand of any production use (even our own use is fairly limited
> >> (in terms of input volume we ever processed) and actually somewhat
> >> diverged from this Mahout implementation. Perhaps putting it more in
> >> front of users will help to receive more feedback.
> >>
> >> Thanks.
> >> -Dmitriy
> >
> > --------------------------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> >
> >
> >
> >
>

Re: Soliciting SSVD documentation review

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

I attached the latex source as well (lyx, actually). I would've used
Wiki if it supported mathjax. So anyone can modify the usage if need
be. (Anyone who has lyx anyway).

Dev docs were attached to several jira issues (and I had blog
entries). If you want more recent copies of them moved over
to the wiki, I'd be happy to do that. Mainly, so far there are 2 working notes,
one for the original method and another for power iterations, attached to
the corresponding jiras.


On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <gs...@apache.org> wrote:
> I hooked it into the Algorithms page.
>
> How do you intend to keep the PDF up to date?  I like the focus more on the user, but it would also be good to have some dev docs.
>
> Also, with both Lanczos and this it would be good if we could hook them into some real examples.
>
> On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
>
>> Hi,
>>
>> I put a usage and overview doc for SSVD onto wiki. I'd appreciate if
>> somebody else could look thru it, to scan for completeness and
>> suggestions.
>>
>> I tried to approach it as a user-facing documentation, i.e. I tried to
>> avoid discussing any implementation specifics .
>>
>> I had several users and Nathan Halko trying it out and actually
>> favorably commenting on its scalability vs. Lanczos but i don't know
>> first hand of any production use (even our own use is fairly limited
>> (in terms of input volume we ever processed) and actually somewhat
>> diverged from this Mahout implementation. Perhaps putting it more in
>> front of users will help to receive more feedback.
>>
>> Thanks.
>> -Dmitriy
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
>
>

Re: Soliciting SSVD documentation review

Posted by Grant Ingersoll <gs...@apache.org>.

I hooked it into the Algorithms page.  

How do you intend to keep the PDF up to date?  I like the focus more on the user, but it would also be good to have some dev docs.

Also, with both Lanczos and this it would be good if we could hook them into some real examples.

On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:

> Hi,
> 
> I put a usage and overview doc for SSVD onto wiki. I'd appreciate if
> somebody else could look thru it, to scan for completeness and
> suggestions.
> 
> I tried to approach it as a user-facing documentation, i.e. I tried to
> avoid discussing any implementation specifics .
> 
> I had several users and Nathan Halko trying it out and actually
> favorably commenting on its scalability vs. Lanczos but i don't know
> first hand of any production use (even our own use is fairly limited
> (in terms of input volume we ever processed) and actually somewhat
> diverged from this Mahout implementation. Perhaps putting it more in
> front of users will help to receive more feedback.
> 
> Thanks.
> -Dmitriy

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com