Posted to user@mahout.apache.org by Radim Rehurek <Ra...@seznam.cz> on 2011/12/18 18:39:46 UTC

SVD in Mahout (was: Mahout Lanczos SVD complexity)

By the way, are there similar experimental results for performance/accuracy of SVD in Mahout? 

I believe there are several SVD implementations available in Mahout (Lanczos/randomized, distributed/local); I would be interested in seeing a comparison of their speed and accuracy on some reasonably large public dataset.

Cheers,
Radim


> ------------ Original message ------------
> From: Radim Rehurek <Ra...@seznam.cz>
> Subject: Re: Mahout Lanczos SVD complexity
> Date: 18.12.2011 09:41:23
> ----------------------------------------
> Hi Ted,
> 
> appreciate your comments!
> 
> > You are correct that the randomized projection methods are of limited
> > accuracy when the singular values are not decreasing rapidly, but I think
> > that you were asking for too many singular values (500) from a relatively
> > small matrix to make much sense.  My argument for saying this is that the
> > singular values at this point have become quite small and are likely to be
> > modeling the randomized variation in your corpus rather than the systematic
> > patterns of occurrence.  As such, the "accuracy" of the decomposition is of
> > little interest.
> 
> Agreed that the small factors are modelling random variations and could likely
> be replaced by random direction vectors (of comparable magnitude). 
> 
> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
> Indexing (one of the applications of SVD), and users explicitly ask for SVD
> accuracy -- so there it is, hard numbers :)
> Also note that a few million documents, with a few 10k-100k vocabulary, is by
> far the most common use-case for gensim users. That's why I picked the English
> wikipedia to test on. If use-cases of Mahout SVD target millions of features on
> billions of documents, YMMV.
> 
> 
> > It may well be that using a high number of singular values increases
> > apparent accuracy in your application over using a smaller number, but it
> > is likely that augmenting a small decomposition with random vectors instead
> > of true singular vectors would provide virtually identical performance.  As
> > such, computing these extra singular vectors more or less accurately is
> > hardly of the essence.
> > 
> > In order to determine whether this is the case, the interesting comparison
> > to make is whether you get higher application accuracy on held out data
> > from, say, 100 singular vectors + 400 random vectors or your 500 singular
> > vectors.  It is common with LSI-like applications to test 100 singular
> > vectors against 500 singular vectors, but that does not really measure how much
> > information is being extracted in the singular decomposition.
> 
> But those tests didn't measure application accuracy! (though I very much urge
> users to do that). They simply measure SVD reconstruction error (not LSI
> retrieval error or anything).
> 
> Again, I completely agree that real app accuracy beats some theoretical nuisance
> metric (such as reconstruction error by Frobenius norm).
> 
> 
> Best,
> Radim
> 
> 
> > Regarding the accuracy of the decomposition, your choice of k is well below
> > the bound suggested by the theoretical considerations in the Halko, et al,
> > paper.  For small matrices with the size measured in thousands or tens of
> > thousands and very tight desired accuracies, this bound is not all that
> > helpful and it isn't surprising that you may have some problems.
> > 
> > In any case, I think that your experience is divergent enough from the
> > applications being considered in Mahout, that conclusions should be drawn
> > with some caution.
> > 
> > On Sat, Dec 17, 2011 at 10:00 PM, Radim Rehurek
> <Ra...@seznam.cz>wrote:
> > 
> > > Ted Dunning wrote:
> > > > Also, it isn't entirely clear yet whether power iterations are more
> > > > efficient than simply increasing the fudge factor p. Power iterations are
> > > > very effective, and increasing p increases costs in the cube, but running
> > > > MR passes is expensive enough that some increase in p might be sufficient
> > > > and still faster than a power iteration.
> > >
> > > Don't even think about it -- on large matrices (significant truncation),
> > > the randomized SVD without power iterations is complete rubbish, regardless
> > > of any sane value of `p`.
> > >
> > > (assuming your fudge factor p = oversampling parameter `p` from Halko et
> > > al.).
> > >
> > > You can see some accuracy experiments (wikipedia, images) here:
> > > http://groups.google.com/group/gensim/msg/1f3106b85837ce9c and here
> > > http://groups.google.com/group/gensim/msg/240a348b70f18b30 .
> > >
> > > Also see the end of that (longish) thread for some notes on accuracy in
> > > face of many power iterations (~ beware of numeric overflows).
> > >
> > > Best,
> > > Radim
> > >
> > 
> > 
> > 
> 
> 
> 

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Wed, Dec 21, 2011 at 3:14 AM, Radim Rehurek <Ra...@seznam.cz> wrote:

>
> Also interesting: I'm not sure it's understood, but distributing SVD (the math) doesn't bring you anything here. The cost is dominated by the number of passes over input data, which is dominated by I/O (for such small values of `k`). This is an area where Mahout can truly shine, because of HDFS -- if the data is already pre-distributed to workers, the cost of IO can be shared. If, on the other hand, you'd need to read the data first, then distribute them further to nodes for processing, then a sequential algo will be faster.
>

Also, to that point: in my experiments the cost is not dominated by I/O
when running on bare-metal clusters. (On Amazon EC2 or other kinds of
VMs, YMMV, but not on a bare-metal rack.)
In fact, in terms of performance, if anything, the splitting might need
to be even more fine-grained for optimal horizontal scaling than it is
now when used with extra-sparse inputs. I.e. we could easily run more
tasks (but they are collocated in many cases). Performance is an
aspect where we actually may make a significant difference with
Hadoop. Improving horizontal scale is tricky, though, because Hadoop
splits based on input size by default, but flops don't scale
proportionally to input size -- they grow somewhat faster. Therefore,
it would need some hack of the default Hadoop splitting mechanism,
generating more clones of splits to run. If that's done, then indeed
almost constant running time can be achieved, provided the cluster has
the capacity to take them.

Hope it is useful for your inquiry.

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Radim Rehurek <Ra...@seznam.cz>.
Hi Sean,

good to hear from a user! :)


> That's great info. Do you have a distributed version of this? :) I was
> actually hoping you would...

Actually, yes. It's only small-scale though (a handful of company/lab computers, a few billion non-zeroes), to scratch my own itch. We never had hundreds or thousands of machines; the robustness requirements are quite different there. That's exactly why I say I'm interested in the Mahout experiments. Trust me, I get no kicks out of prodding other implementations, I just want to learn about their results, stories and pitfalls. Me saying that "no evaluation needed, we are different" is not an acceptable answer doesn't mean that I'm upset.

Also interesting: I'm not sure it's understood, but distributing SVD (the math) doesn't bring you anything here. The cost is dominated by the number of passes over input data, which is dominated by I/O (for such small values of `k`). This is an area where Mahout can truly shine, because of HDFS -- if the data is already pre-distributed to workers, the cost of IO can be shared. If, on the other hand, you'd need to read the data first, then distribute them further to nodes for processing, then a sequential algo will be faster.


> Your interests are not the same as, say, mine, as a user of the SVD
> for recs. A reconstruction with small error is good all else equal,
> but all else is not equal. The quality of my output does not scale
> proportionally with accuracy. Unfortunately what you suggest simply
> doesn't exist in a form I can use at scale, which is a big issue!


The form is to simply use more power iterations. It already exists, right there, in Mahout, bar the numerical issues I pointed out. It's not like I'm suggesting some crazy to-epsilon accuracy settings, but I doubt any application can be happy with a decomposition that most resembles a baseline of "return zero matrices" in quality.
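
To make that concrete, here is a minimal NumPy sketch of the scheme from
Halko et al. -- an illustration only, not the Mahout SSVD code -- with a QR
re-orthonormalization after each multiply, which is what avoids the
overflow/precision problems of a naive (A*A')^q implementation:

import numpy as np

def randomized_svd(A, k, p=5, q=2, seed=0):
    # k singular triplets, p oversampling columns, q power iterations.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Stage 1: orthonormal basis Q whose range approximates range(A).
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k + p)))
    for _ in range(q):
        # Re-orthonormalize after every multiply so the columns don't all
        # collapse onto the dominant singular direction.
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Stage 2: solve the small (k+p) x n projected problem exactly.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Toy usage; on matrices with slow spectral decay, accuracy improves
# sharply as q goes from 0 to 1 to 2.
A = np.random.default_rng(1).standard_normal((1000, 300))
U, s, Vt = randomized_svd(A, k=10, q=2)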


> I don't think anyone pinged you to claim what's in the project now is
> even finished, let alone optimal. For example I'm not sure if you've
> seen the improvement Dmitriy has done in this area on a few open JIRA
> issues? 

Cool -- hence my plea for cc on progress.

Best of luck,
Radim


Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Sean Owen <sr...@gmail.com>.
2011/12/21 Radim Rehurek <Ra...@seznam.cz>:
> Heh. Is that the official Mahout philosophy for other algorithms as well? "Who cares about correctness, we are happy we can run it on some data at all, so shut up?" I hope you're not serious Ted.

"No", and calm down Radim, this isn't useful. I think it's easy to
read Ted's messages with the wrong voice. I don't think it's meant
that way.


> The results are completely in line with Martinsson et al.'s [1] as well as my previous experiments: no power iteration steps with massive truncation = rubbish output. Accuracy improves exponentially with increasing no. of iteration steps (but see my initial warning re. numerical issues with higher number of steps if implemented naively).

That's great info. Do you have a distributed version of this? :) I was
actually hoping you would...


> From all the dev replies here -- no users actually replied -- I get the vibe that the accuracy discussion annoys you. Now, I dropped by to give a friendly hint about possible serious accuracy concerns, based on experience with mid-scale (billions of non-zeroes) SVD computations in one specific domain (term-document matrices in NLP). And possibly learning about your issues on tera-feature scale datasets in return, which I'm very interested in. Apparently neither of us is getting anything out of this, so I'll stop here.

Your interests are not the same as, say, mine, as a user of the SVD
for recs. A reconstruction with small error is good all else equal,
but all else is not equal. The quality of my output does not scale
proportionally with accuracy. Unfortunately what you suggest simply
doesn't exist in a form I can use at scale, which is a big issue!

I don't think anyone pinged you to claim what's in the project now is
even finished, let alone optimal. For example I'm not sure if you've
seen the improvement Dmitriy has done in this area on a few open JIRA
issues? I am sure it's fair to call it a work in progress, as and when
volunteers want to create something better. But I'm sure that
suggesting that this is not useful or needs warnings is misguided... at
least you may not be considering how this is actually used in practice
for, say, recs.

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Radim,

thank you for your candor.

The estimate for the method's accuracy is already given in the study. I
also studied accuracy on good-decay and bad-decay spectra for
in-memory datasets. I did not compare Frobenius norms but rather
eigenvalues, against another method and against synthetic values. For
good decay it's great, for non-existent decay it's bad. So that's a
laugh check, and the rest is in the study (there's an exact formula in
there for the accuracy error based on singular value decay, and the
probability of such). I did the laugh check, I also did the R
simulation and compared to it, and I think some other folks who worked
on it did the same. But I don't think anyone set out to re-validate the
original study's estimates all that exactly.
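
(Something like the following toy NumPy check -- not the actual study and
not the Mahout code -- shows the same qualitative picture: with no power
iterations, the leading singular values come out fine on a fast-decaying
spectrum and badly on a flat one.)

import numpy as np

rng = np.random.default_rng(0)

def synthetic(m, n, spectrum):
    # Matrix with a prescribed singular value spectrum.
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return U @ np.diag(spectrum) @ V.T

def leading_sv_randomized(A, k, p=5):
    # Plain randomized projection, no power iterations.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((A.shape[1], k + p)))
    return np.linalg.svd(Q.T @ A, compute_uv=False)[:k]

n, k = 300, 10
good = synthetic(2000, n, 100.0 * 0.5 ** np.arange(n))  # fast decay
flat = synthetic(2000, n, np.full(n, 100.0))             # no decay at all

for name, A in [("good decay", good), ("no decay", flat)]:
    exact = np.linalg.svd(A, compute_uv=False)[:k]
    approx = leading_sv_randomized(A, k)
    print(name, "max relative error:", np.max(np.abs(approx - exact) / exact))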

So that's the method: it trades off some accuracy, especially with
bad-decay spectra, for speed and the ability to take on larger
problems. Nobody ever argued with that. Nobody ever marketed it as an
accurate method. In fact, I have continuously warned users that they
will not get good results with a bad decay spectrum (though
pragmatically, why would they want to study essentially random inputs
at all). It is also the first thing listed in the trade-offs list on
the wiki.

That said, the implementation is believed to be faithful to the method
description. There are probably nuances in terms of which solver to use
for the projected problem in the front end -- whether to solve the SVD
of B' or the eigendecomposition of BB', and whether to use a Cholesky
decomposition instead of QR -- so that may warrant some analysis later.
We do simulate corner cases at small volumes and assert laugh-check
accuracy (even in unit tests). We don't do the same for bigger volumes
that we can't verify with alternative solvers.
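
(For the curious, a toy NumPy sketch of those two choices for the small
projected problem -- the generic scheme only, not the Mahout SSVD code;
B here is Q'A from a plain randomized range finder:)

import numpy as np

rng = np.random.default_rng(0)
m, n, l = 2000, 500, 20                   # l = k + p, toy sizes
A = rng.standard_normal((m, n))

# Randomized range finder: Q approximately spans the range of A.
Q, _ = np.linalg.qr(A @ rng.standard_normal((n, l)))
B = Q.T @ A                               # small l x n projected problem

# Option 1: thin SVD of B directly.
Ub1, s_svd, Vt = np.linalg.svd(B, full_matrices=False)

# Option 2: symmetric eigenproblem on B B'; singular values are the
# square roots of its eigenvalues (returned in ascending order by eigh).
evals, Ub2 = np.linalg.eigh(B @ B.T)
s_eig = np.sqrt(np.maximum(evals[::-1], 0.0))

print(np.allclose(s_svd, s_eig))          # True, up to round-off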

That said, I feel we need to keep people fully informed, provide a
reference to the original research work for those who want to dig
deeper, and humbly hope that it will be useful for some practical
problems (and it has been successfully used here and there).

2011/12/21 Radim Rehurek <Ra...@seznam.cz>:
>> From: Ted Dunning <te...@gmail.com>
>> Subject: Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)
>> Date: 19.12.2011 16:58:57
>> ----------------------------------------
>> The users of SVD on Mahout are more interested in their own application
>> level metrics than the metrics associated with the decomposition itself.
>>  Moreover, at the scale people have been working with, getting any
>> decomposition at all is refreshing.
>
>
> Heh. Is that the official Mahout philosophy for other algorithms as well? "Who cares about correctness, we are happy we can run it on some data at all, so shut up?" I hope you're not serious Ted.
>
> Aren't you afraid people will draw wrong conclusions about an SVD application, using your (possibly wildly inaccurate) SVD implementation? Retract publications?
>
> By all means, use whatever decomposition suits you. But SVD already has a well-established meaning in linear algebra and using that acronym comes with certain expectations. People unfamiliar with the pitfalls of your implementation may assume they're really getting SVD (or at least a version that's "reasonably close" -- in the numerical computing sense). A big fat accuracy warning is in order here. Nobody expects more or less random vectors, even if these happen to perform better than the real truncated SVD in your app [citation needed].
>
>> The examples that you gave in your thread involved walking *way* down the
>> spectrum to the smaller singular values.  That is absolutely not the
>> interest with most Mahout users because that would involve fairly massive
>> over-fitting.
>
>
> Too many opinions, too little data. Instead, I decided to run the English wikipedia experiments with factors=10 and oversampling=5, as per your concerns.
>
> (cross-posting to the gensim mailing list, as this might be of interest to gensim users as well)
>
> Data: English Wikipedia as term-document matrix (0.42B non-zeroes, 3.5M documents, 10K features).
> Requesting top 10 factors (1% of the total singular value mass), not 500 factors like before (15% total mass). Accuracy is evaluated by comparing reconstruction error to Lapack's DSYEV in-core routine on A*A^T. Error = |A*A^T-U*S^2*U^T| / |A*A^T-U_lapack*Sigma_lapack*U_lapack^T|.
>
> batch algo    error
> --------------+------
> baseline*      1.986
> 0 power iters  1.877
> 1 power iter   1.094
> 2 power iters  1.009
> 4 power iters  1.0005
> 6 power iters  1.00009
>
> The results are completely in line with Martinsson et al.'s [1] as well as my previous experiments: no power iteration steps with massive truncation = rubbish output. Accuracy improves exponentially with increasing no. of iteration steps (but see my initial warning re. numerical issues with higher number of steps if implemented naively).
>
> So, your worry that the SVD inaccuracy is somehow due to asking too many factors and irrelevant for thinner SVDs is without substance. Your users certainly deserve to know that without power iterations, the SVD output is on par with baseline, which is a "decomposition" where all factors are simply zero.
>
> From all the dev replies here -- no users actually replied -- I get the vibe that the accuracy discussion annoys you. Now, I dropped by to give a friendly hint about possible serious accuracy concerns, based on experience with mid-scale (billions of non-zeroes) SVD computations in one specific domain (term-document matrices in NLP). And possibly learning about your issues on tera-feature scale datasets in return, which I'm very interested in. Apparently neither of us is getting anything out of this, so I'll stop here.
>
> Best,
> Radim
>
> [1] http://www.sci.ccny.cuny.edu/~szlam/npisvdnipsshort.pdf
>
>
>
>>
>> 2011/12/19 Radim Rehurek <Ra...@seznam.cz>
>>
>> > No problem.
>> >
>> > If you decide to run some SSVD accuracy experiments in the future (don't
>> > your users ask for that?), please include me in cc -- just in case I miss
>> > the post on this mailing list.
>> >
>>
>>
>>

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Ted Dunning <te...@gmail.com>.
2011/12/21 Radim Rehurek <Ra...@seznam.cz>

> > From: Ted Dunning <te...@gmail.com>
> > Subject: Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)
> > Date: 19.12.2011 16:58:57
> > ----------------------------------------
> > The users of SVD on Mahout are more interested in their own application
> > level metrics than the metrics associated with the decomposition itself.
> >  Moreover, at the scale people have been working with, getting any
> > decomposition at all is refreshing.
>
>
> Heh. Is that the official Mahout philosophy for other algorithms as well?
> "Who cares about correctness, we are happy we can run it on some data at
> all, so shut up?" I hope you're not serious Ted.
>

Approximate algorithms are used all the time for all kinds of purposes.
Stochastic gradient descent has order-1 convergence and is rarely run to
the point where the model is at the precise optimum, but instead is run
to the point where practical benefit is optimized.

Practical considerations have a major impact in large-scale data analysis.
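
(As a toy illustration of that point -- purely synthetic, nothing to do
with any particular Mahout code path -- a few lines of SGD on a toy
logistic regression, stopped when the held-out loss plateaus rather than
run to the exact optimum:)

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 20))
y = (X @ rng.standard_normal(20) + rng.standard_normal(2000) > 0).astype(float)
Xtr, ytr, Xho, yho = X[:1500], y[:1500], X[1500:], y[1500:]

def heldout_loss(w):
    p = np.clip(1.0 / (1.0 + np.exp(-(Xho @ w))), 1e-9, 1 - 1e-9)
    return -np.mean(yho * np.log(p) + (1 - yho) * np.log(1 - p))

w, best, stale = np.zeros(20), np.inf, 0
for epoch in range(200):
    for i in rng.permutation(len(ytr)):
        p = 1.0 / (1.0 + np.exp(-(Xtr[i] @ w)))
        w -= 0.05 * (p - ytr[i]) * Xtr[i]          # one SGD step
    loss = heldout_loss(w)
    best, stale = (loss, 0) if loss < best - 1e-4 else (best, stale + 1)
    if stale >= 3:          # stop once the practical benefit plateaus,
        break               # well before the exact optimum is reached
print("stopped after epoch", epoch, "held-out loss", round(best, 4))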

> Aren't you afraid people will draw wrong conclusions about an SVD
> application, using your (possibly wildly inaccurate) SVD implementation?
> Retract publications?
>

Radim, I think you are over-blowing the risk here.  I expect anybody using
any software to determine whether the level of accuracy provided is
sufficient for their needs.


> By all means, use whatever decomposition suits you. But SVD already has a
> well-established meaning in linear algebra and using that acronym comes
> with certain expectations.


And if you look at the recommendation literature you will find that the
meaning is considerably relaxed.


> People unfamiliar with the pitfalls of your implementation may assume
> they're really getting SVD (or at least a version that's "reasonably close"
> -- in the numerical computing sense). A big fat accuracy warning is in
> order here. Nobody expects more or less random vectors, even if these
> happen to perform better than the real truncated SVD in your app [citation
> needed].
>

Anybody who thinks that a "real SVD" produces anything but random numbers
for the more extreme singular vectors when applied to small count data
needs to have their head examined as well.
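
(A quick way to see this -- a purely synthetic NumPy illustration, nothing
to do with Mahout: re-sample a small-count matrix from the same underlying
rates and check how reproducible each singular vector is.)

import numpy as np

rng = np.random.default_rng(1)

# Systematic structure (here just a rank-1 expectation) observed through
# small Poisson counts -- a crude stand-in for small-count text data.
expected = 0.5 * np.outer(rng.gamma(2.0, 1.0, 300), rng.gamma(2.0, 1.0, 200))
A1 = rng.poisson(expected).astype(float)   # one sampled "corpus"
A2 = rng.poisson(expected).astype(float)   # an independent re-sample

U1, _, _ = np.linalg.svd(A1, full_matrices=False)
U2, _, _ = np.linalg.svd(A2, full_matrices=False)

# |cosine| between matching left singular vectors of the two samples:
# near 1 = a reproducible, systematic direction; near 0 = essentially noise.
overlap = np.abs(np.sum(U1 * U2, axis=0))
print("dominant direction:", round(float(overlap[0]), 2))
print("further down the spectrum:", np.round(overlap[50:55], 2))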


> > The examples that you gave in your thread involved walking *way* down the
> > spectrum to the smaller singular values.  That is absolutely not the
> > interest with most Mahout users because that would involve fairly massive
> > over-fitting.
>
> Too many opinions, too little data. Instead, I decided to run the English
> wikipedia experiments with factors=10 and oversampling=5, as per your
> concerns.
>

Too much over-heated rhetoric as well.  I wouldn't mind if you tone it down
a bit.


> (cross-posting to the gensim mailing list, as this might be of interest to
> gensim users as well)
>
> Data: English Wikipedia as term-document matrix (0.42B non-zeroes, 3.5M
> documents, 10K features).
> Requesting top 10 factors (1% of the total singular value mass), not 500
> factors like before (15% total mass). Accuracy is evaluated by comparing
> reconstruction error to Lapack's DSYEV in-core routine on A*A^T. Error =
> |A*A^T-U*S^2*U^T| / |A*A^T-U_lapack*Sigma_lapack*U_lapack^T|.
>
> batch algo    error
> --------------+------
> baseline*      1.986
> 0 power iters  1.877
> 1 power iter   1.094
> 2 power iters  1.009
> 4 power iters  1.0005
> 6 power iters  1.00009
>
> The results are completely in line with Martinsson et al.'s [1] as well as
> my previous experiments: no power iteration steps with massive truncation =
> rubbish output. Accuracy improves exponentially with increasing no. of
> iteration steps (but see my initial warning re. numerical issues with
> higher number of steps if implemented naively).
>
> So, your worry that the SVD inaccuracy is somehow due to asking too many
> factors and irrelevant for thinner SVDs is without substance.


Radim, you might go back to my original comment.  You are inventing a
statement here.

My original comment was that it is not well known whether increasing p
would compensate for lack of a power iteration.  You are using a smaller
value of p than the default and have not examined the question that I posed.

I never said that a power iteration was a bad idea.


> ...
> From all the dev replies here -- no users actually replied -- I get the
> vibe that the accuracy discussion annoys you.


No.  It doesn't.

But I think that it can be done in a less confrontational way.


> Now, I dropped by to give a friendly hint about possible serious accuracy
> concerns, based on experience with mid-scale (billions of non-zeroes) SVD
> computations in one specific domain (term-document matrices in NLP). And
> possibly learning about your issues on tera-feature scale datasets in
> return, which I'm very interested in. Apparently neither of us is getting
> anything out of this, so I'll stop here.
>

As you like.

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Radim Rehurek <Ra...@seznam.cz>.
> From: Ted Dunning <te...@gmail.com>
> Subject: Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)
> Date: 19.12.2011 16:58:57
> ----------------------------------------
> The users of SVD on Mahout are more interested in their own application
> level metrics than the metrics associated with the decomposition itself.
>  Moreover, at the scale people have been working with, getting any
> decomposition at all is refreshing.


Heh. Is that the official Mahout philosophy for other algorithms as well? "Who cares about correctness, we are happy we can run it on some data at all, so shut up?" I hope you're not serious Ted.

Aren't you afraid people will draw wrong conclusions about an SVD application, using your (possibly wildly inaccurate) SVD implementation? Retract publications?

By all means, use whatever decomposition suits you. But SVD already has a well-established meaning in linear algebra and using that acronym comes with certain expectations. People unfamiliar with the pitfalls of your implementation may assume they're really getting SVD (or at least a version that's "reasonably close" -- in the numerical computing sense). A big fat accuracy warning is in order here. Nobody expects more or less random vectors, even if these happen to perform better than the real truncated SVD in your app [citation needed].

> The examples that you gave in your thread involved walking *way* down the
> spectrum to the smaller singular values.  That is absolutely not the
> interest with most Mahout users because that would involve fairly massive
> over-fitting.


Too many opinions, too little data. Instead, I decided to run the English wikipedia experiments with factors=10 and oversampling=5, as per your concerns.

(cross-posting to the gensim mailing list, as this might be of interest to gensim users as well)

Data: English Wikipedia as term-document matrix (0.42B non-zeroes, 3.5M documents, 10K features).
Requesting top 10 factors (1% of the total singular value mass), not 500 factors like before (15% total mass). Accuracy is evaluated by comparing reconstruction error to Lapack's DSYEV in-core routine on A*A^T. Error = |A*A^T-U*S^2*U^T| / |A*A^T-U_lapack*Sigma_lapack*U_lapack^T|.

batch algo    error
--------------+------
baseline*      1.986
0 power iters  1.877
1 power iter   1.094
2 power iters  1.009
4 power iters  1.0005
6 power iters  1.00009

The results are completely in line with Martinsson et al.'s [1] as well as my previous experiments: no power iteration steps with massive truncation = rubbish output. Accuracy improves exponentially with increasing no. of iteration steps (but see my initial warning re. numerical issues with higher number of steps if implemented naively).

So, your worry that the SVD inaccuracy is somehow due to asking too many factors and irrelevant for thinner SVDs is without substance. Your users certainly deserve to know that without power iterations, the SVD output is on par with baseline, which is a "decomposition" where all factors are simply zero.
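
For clarity, a minimal NumPy sketch of how that error ratio is computed,
on a toy dense matrix (illustration only; the baseline row above
corresponds to plugging in all-zero factors):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 200))      # toy stand-in for the corpus matrix
k = 10

# Reference: truncated eigendecomposition of A*A^T (the in-core DSYEV part).
AAT = A @ A.T
evals, evecs = np.linalg.eigh(AAT)       # ascending eigenvalues
U_ref, S2_ref = evecs[:, -k:], evals[-k:]
denom = np.linalg.norm(AAT - U_ref @ np.diag(S2_ref) @ U_ref.T, "fro")

def error_ratio(U, S):
    # |A*A^T - U*S^2*U^T| / |A*A^T - U_lapack*Sigma_lapack*U_lapack^T|
    return np.linalg.norm(AAT - U @ np.diag(S ** 2) @ U.T, "fro") / denom

# Baseline "decomposition": all factors zero.
print(error_ratio(np.zeros((A.shape[0], k)), np.zeros(k)))

# An exact truncated SVD of A gives a ratio of 1.0 (up to round-off).
U, S, _ = np.linalg.svd(A, full_matrices=False)
print(error_ratio(U[:, :k], S[:k]))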

From all the dev replies here -- no users actually replied -- I get the vibe that the accuracy discussion annoys you. Now, I dropped by to give a friendly hint about possible serious accuracy concerns, based on experience with mid-scale (billions of non-zeroes) SVD computations in one specific domain (term-document matrices in NLP). And possibly learning about your issues on tera-feature scale datasets in return, which I'm very interested in. Apparently neither of us is getting anything out of this, so I'll stop here.

Best,
Radim

[1] http://www.sci.ccny.cuny.edu/~szlam/npisvdnipsshort.pdf 



> 
> 2011/12/19 Radim Rehurek <Ra...@seznam.cz>
> 
> > No problem.
> >
> > If you decide to run some SSVD accuracy experiments in the future (don't
> > your users ask for that?), please include me in cc -- just in case I miss
> > the post on this mailing list.
> >
> 
> 
> 

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Ted Dunning <te...@gmail.com>.
The users of SVD on Mahout are more interested in their own application
level metrics than the metrics associated with the decomposition itself.
 Moreover, at the scale people have been working with, getting any
decomposition at all is refreshing.

The examples that you gave in your thread involved walking *way* down the
spectrum to the smaller singular values.  That is absolutely not the
interest with most Mahout users because that would involve fairly massive
over-fitting.

2011/12/19 Radim Rehurek <Ra...@seznam.cz>

> No problem.
>
> If you decide to run some SSVD accuracy experiments in the future (don't
> your users ask for that?), please include me in cc -- just in case I miss
> the post on this mailing list.
>

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Radim Rehurek <Ra...@seznam.cz>.
No problem.

If you decide to run some SSVD accuracy experiments in the future (don't your users ask for that?), please include me in cc -- just in case I miss the post on this mailing list.

Best,
Radim


> ------------ Original message ------------
> From: Dmitriy Lyubimov <dl...@gmail.com>
> Subject: Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)
> Date: 18.12.2011 23:41:51
> ----------------------------------------
> Radim,
> 
> And thanks for the pointer to accuracy studies. I haven't read them frankly.
> 
> It would seem to me that compiling 500 values very accurately is
> currently either prohibitively costly or not very plausible in large
> scale solvers regardless of approach.it would be interesting to
> compare if they were. Also I am dubious if at scale it really matters
> that much as even double precision arithmetic would perhaps accumulate
> enough errors from that many flops. In practical terms, people don't
> care so much even if they skip a couple documents or even hundreds of
> documents if they are to process millions, which may be a disturbance
> far greater than a precision error accumulated. But like you said,
> it's just my opinion.
> 
> 
> 

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Radim,

And thanks for the pointer to accuracy studies. I haven't read them frankly.

It would seem to me that compiling 500 values very accurately is
currently either prohibitively costly or not very plausible in
large-scale solvers regardless of approach. It would be interesting to
compare if they were. Also, I am dubious whether at scale it really
matters that much, as even double-precision arithmetic would perhaps
accumulate enough error from that many flops. In practical terms,
people don't care so much even if they skip a couple of documents, or
even hundreds of documents, if they are processing millions -- which
may be a disturbance far greater than the accumulated precision error.
But like you said, it's just my opinion.

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
>
> Not sure what you mean by "my solution".

Sorry, in this context it means "one's LSI solution", as in "one's
implementation", as in "anyone who builds an LSI implementation".

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Radim Rehurek <Ra...@seznam.cz>.
Hello Dmitriy,

> ------------ Original message ------------
> From: Dmitriy Lyubimov <dl...@gmail.com>
> Subject: Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)
> Date: 18.12.2011 22:36:11
> ----------------------------------------
> >> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
> >> Indexing (one of the applications of SVD), and users explicitly ask for SVD
> >> accuracy -- so there it is, hard numbers :)
> >> Also note that a few million documents, with a few 10k-100k vocabulary, is
> by
> >> far the most common use-case for gensim users. That's why I picked the
> English
> >> wikipedia to test on. If use-cases of Mahout SVD target millions of features
> on
> >> billions of documents, YMMV.
> 
> If I remember it correctly, Dumais and Deerwester were speaking of
> ~200 s.v. in their experiments.

Actually, there have been many experiments with setting the optimal dimensionality since. See e.g. "Bradford, 2008: An empirical study of required dimensionality for large-scale latent semantic indexing applications" or "Zha et al. 1998: Large-scale SVD and subspace-based methods for information retrieval" with an MDL approach for more info.


> As far as lsi is concerned, why would one be interested in that many?
> The measures you get are going to greatly depend on the corpus you are
> picking. So your solution for "topics" is biased to begin with. ( The
> mental model for it that i kind of like to think of is that every
> person would have a slightly different meaning of what "politeness"
> means, depending on his upbringing and experience, i.e. on his
> personal "training corpus") .


Not sure what you mean by "my solution". Latent semantic analysis was developed in the 80s (not by me). For whatever reason, it still seems to be popular, and people ask for tools that implement LSA efficiently. I kinda thought perhaps some people used the SVD in Mahout for similar goals, that's why I brought it up.

Also, the discussion is about truncated SVD accuracy. This can be measured down to machine precision -- no need to resort to opinions :) I'm genuinely curious about the Mahout implementation, whatever your domain of application is (incl. rocket boosters).

Best,
Radim


> So in many cases data is rather biased to begin with. that's why lsi
> is not the same as trying to compute geometry of a rocket booster.
> 
> 
> 

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
As far as accuracy goes, I only did experiments on small-ish inputs
that I could hold in memory for control of things, and with a good
decay in s.v. it was quite good with even a single power iteration.
Now, this is not the same as doing stuff with completely random
inputs, which I did too, and then sure enough the problems were
evident.

However, if you are looking for trends, and the trends are there, you
will find them. And if they are not, then the results are pretty
useless regardless of which way they are computed.

I guess there are problems out there that try to get rid of
low-frequency but high-amplitude noise -- I think I saw a post from
someone recently -- but they require a modified approach anyway.
Direct SVD doesn't help much either.

On Sun, Dec 18, 2011 at 1:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> or, in terms of practical problems: if I work in the computer industry
> and want LSI to figure out that "java coffee" and "java code" are
> completely orthogonal concepts despite the common terms present, I just
> throw in a mixture of texts mentioning both uses, and as long as it
> tells me those are different things with a high degree of confidence, I
> don't care about the absolute value of that confidence. Which it does.
>
> On Sun, Dec 18, 2011 at 1:35 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
>>>> Indexing (one of the applications of SVD), and users explicitly ask for SVD
>>>> accuracy -- so there it is, hard numbers :)
>>>> Also note that a few million documents, with a few 10k-100k vocabulary, is by
>>>> far the most common use-case for gensim users. That's why I picked the English
>>>> wikipedia to test on. If use-cases of Mahout SVD target millions of features on
>>>> billions of documents, YMMV.
>>
>> If I remember it correctly, Dumais and Deerwester were speaking of
>> ~200 s.v. in their experiments.
>>
>> As far as lsi is concerned, why would one be interested in that many?
>> The measures you get are going to greatly depend on the corpus you are
>> picking. So your solution for "topics" is biased to begin with. ( The
>> mental model for it that i kind of like to think of is that every
>> person would have a slightly different meaning of what "politeness"
>> means, depending on his upbringing and experience, i.e. on his
>> personal "training corpus") .
>>
>> So in many cases data is rather biased to begin with. that's why lsi
>> is not the same as trying to compute geometry of a rocket booster.

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Or, in terms of practical problems: if I work in the computer industry
and want LSI to figure out that "java coffee" and "java code" are
completely orthogonal concepts despite the common terms present, I just
throw in a mixture of texts mentioning both uses, and as long as it
tells me those are different things with a high degree of confidence, I
don't care about the absolute value of that confidence. Which it does.
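
(A toy NumPy illustration of that point, nothing to do with Mahout: a tiny
term-document matrix where rank-2 LSI tells the two senses of "java" apart
even though the term is shared.)

import numpy as np

docs = np.array([
    # java coffee brew roast code compile class
    [2, 3, 1, 1, 0, 0, 0],   # coffee document
    [1, 2, 2, 0, 0, 0, 0],   # coffee document
    [1, 1, 0, 2, 0, 0, 0],   # coffee document
    [2, 0, 0, 0, 3, 1, 1],   # programming document
    [1, 0, 0, 0, 2, 2, 0],   # programming document
    [1, 0, 0, 0, 1, 1, 2],   # programming document
], dtype=float).T            # term-document matrix, terms x docs

U, S, Vt = np.linalg.svd(docs, full_matrices=False)
doc_vecs = (S[:2, None] * Vt[:2]).T      # documents in the 2-d LSI space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cos(doc_vecs[0], doc_vecs[1]), 2))  # coffee vs coffee: high
print(round(cos(doc_vecs[0], doc_vecs[3]), 2))  # coffee vs code: much lower,
                                                # despite the shared "java"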

On Sun, Dec 18, 2011 at 1:35 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
>>> Indexing (one of the applications of SVD), and users explicitly ask for SVD
>>> accuracy -- so there it is, hard numbers :)
>>> Also note that a few million documents, with a few 10k-100k vocabulary, is by
>>> far the most common use-case for gensim users. That's why I picked the English
>>> wikipedia to test on. If use-cases of Mahout SVD target millions of features on
>>> billions of documents, YMMV.
>
> If I remember it correctly, Dumais and Deerwester were speaking of
> ~200 s.v. in their experiments.
>
> As far as lsi is concerned, why would one be interested in that many?
> The measures you get are going to greatly depend on the corpus you are
> picking. So your solution for "topics" is biased to begin with. ( The
> mental model for it that i kind of like to think of is that every
> person would have a slightly different meaning of what "politeness"
> means, depending on his upbringing and experience, i.e. on his
> personal "training corpus") .
>
> So in many cases data is rather biased to begin with. that's why lsi
> is not the same as trying to compute geometry of a rocket booster.

Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
>> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
>> Indexing (one of the applications of SVD), and users explicitly ask for SVD
>> accuracy -- so there it is, hard numbers :)
>> Also note that a few million documents, with a few 10k-100k vocabulary, is by
>> far the most common use-case for gensim users. That's why I picked the English
>> wikipedia to test on. If use-cases of Mahout SVD target millions of features on
>> billions of documents, YMMV.

If I remember it correctly, Dumais and Deerwester were speaking of
~200 s.v. in their experiments.

As far as LSI is concerned, why would one be interested in that many?
The measures you get are going to depend greatly on the corpus you are
picking. So your solution for "topics" is biased to begin with. (The
mental model for it that I kind of like to think of is that every
person would have a slightly different meaning of what "politeness"
means, depending on his upbringing and experience, i.e. on his
personal "training corpus".)

So in many cases the data is rather biased to begin with. That's why LSI
is not the same as trying to compute the geometry of a rocket booster.