Posted to user@mahout.apache.org by Dan Brickley <da...@danbri.org> on 2012/03/27 11:01:10 UTC

options for finding smallest eigenvectors

If one wanted the *smallest* eigenvectors in a (potentially rather
large but sparse) matrix, what are the options in Mahout currently?
Lanczos? ssvd?

Specifically the eigenvector corresponding to the second-smallest
eigenvalue... (there's always a boring zero one) of the Laplacian
re-representation of an affinity matrix, in my case.

Reading around,
http://www.sfu.ca/personal/archives/richards/Pages/NAS.AJS-WDR.pdf

"While the Laplacian spectrum has many attractive theoretical
properties which make it very useful for partitioning large sparse
arrays, it has at least one drawback: the Fiedler eigenvector is
associated with the second smallest eigenvalue. The current method of
choice for finding a few eigenvectors of very large sparse matrices is
the Lanczos method, which tends not to find small eigenvectors as
quickly as it finds the largest eigenvectors (Pothen, et al, 1990). "

That paper also suggests a workaround for this: "The definition of L
shows that there is no loss of sparsity (except for the diagonal) and
that the sparse methods mentioned earlier can be applied to find all
or some of the eigenpairs. The requirement that we must find the
smallest eigenpairs is easily overcome by subtracting a suitably large
constant from the diagonal of -L (which subtracts that constant from
the eigenvalues without changing the eigenvectors). This guarantees
that the first eigenpairs returned by the Power Method or Lanczos
iteration are associated with the smallest eigenvalues of L."
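
To make that shift concrete, here is a minimal numpy sketch of the
trick (my own illustration, not Mahout code; the toy graph and the
Gershgorin-based choice of the constant c are assumptions for the
demo). With M = c*I - L and c >= lambda_max(L), the smallest
eigenvalues of L become the largest eigenvalues of M, so plain power
iteration finds the Fiedler pair once the trivial constant eigenvector
is deflated away:

    import numpy as np

    # Toy affinity matrix: two triangles joined by one bridge edge (2-3).
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian

    # Gershgorin bound: lambda_max(L) <= 2 * max degree, so this c is safe.
    c = 2.0 * A.sum(axis=1).max()
    M = c * np.eye(len(A)) - L              # eigenvalues become c - lambda_i

    ones = np.ones(len(A)) / np.sqrt(len(A))  # trivial eigenvector of L
    x = np.random.default_rng(0).standard_normal(len(A))
    for _ in range(500):
        x -= (x @ ones) * ones              # deflate the constant vector
        x = M @ x
        x /= np.linalg.norm(x)

    fiedler_value = c - x @ M @ x           # undo the shift: lambda_2 of L
    print(fiedler_value, np.sign(x))        # signs split the two triangles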

There's a thesis at http://am.vsb.cz/theses/kabelikova_ing.pdf that
seems to try this (or a related) method, but I didn't dig into the
details.


If I understand the literature correctly (I've been poking around
spectral graph theory e.g. see
http://www.cs.purdue.edu/homes/dgleich/demos/matlab/spectral/spectral.html
for a fairly accessible Matlab example), this second-smallest (aka
Fiedler) eigenvector can be very interesting for finding structure in
graph datasets, even if intuitively we're drawn more towards the
larger eigenvectors for explaining bulk variation in the data. This is
all closely related to Mahout's spectral clustering component, but
from talking with Shannon I don't believe we currently do anything to
find the smallest eigenvectors there. Are Lanczos and SSVD Mahout's
two natural starting points for this kind of thing, or am I forgetting
some other machinery?
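
As a tiny end-to-end illustration of the structure-finding point
(numpy again, not Mahout; same toy graph as in the sketch above):
numpy.linalg.eigh returns eigenpairs in ascending order, so column 0
is the trivial zero eigenvector and column 1 is the Fiedler vector,
whose sign pattern bisects the graph across the bridge edge:

    import numpy as np

    # Two triangles {0,1,2} and {3,4,5} joined by the bridge edge 2-3.
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A

    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]             # second-smallest eigenpair
    print(vals[1], fiedler > 0)      # e.g. lambda_2 and [F F F T T T]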

Thanks for any pointers or fixes to my understanding,

Dan

Re: options for finding smallest eigenvectors

Posted by Isabel Drost <is...@apache.org>.
On 28.03.2012 Dmitriy Lyubimov wrote:
> Nathan Halko's thesis did detailed comparisons of singular values
> between Mahout's Lanczos and SSVD. You can look up a link to his
> dissertation on this list archive. (or perhaps he mentioned it @dev,
> can't remember off the top of my head).

When you find it, could you please add it here:

https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks

Maybe add a separate section for scientific publications involving Mahout.


Isabel

Re: options for finding smallest eigenvectors

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Tue, Mar 27, 2012 at 4:41 PM, Dan Brickley <da...@danbri.org> wrote:
> Maybe this is as good an excuse as any to get my hands dirty with
> SSVD... Is there any strong reason to prefer Lanczos for this sort of
> thing? Depends on app? (mine won't care massively about super
> precision; speed would generally be more of a concern)

Reasons to prefer SSVD are mainly speed and a higher eigenvalue count.
Also, the speed/precision trade-off is adjustable with the -q and -p
parameters. With q=1 I don't notice a significant difference in
eigenvalues (significant as in eyeballing the plots of those values),
and it looks like SSVD is preferable even on precision beyond the
40th eigenvalue on a Wikipedia dataset.
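
For intuition about what those knobs control, here is a tiny numpy
sketch of the randomized SVD scheme that SSVD is based on (the
function and parameter names below are mine, not Mahout's CLI, and
production SSVD re-orthonormalizes between power iterations, which
this toy skips):

    import numpy as np

    def rand_svd(A, k, p=15, q=1, seed=0):
        """Randomized SVD: rank k, oversampling p, power iterations q."""
        rng = np.random.default_rng(seed)
        Y = A @ rng.standard_normal((A.shape[1], k + p))  # sample range of A
        for _ in range(q):
            Y = A @ (A.T @ Y)        # each pass sharpens the subspace
        Q, _ = np.linalg.qr(Y)
        U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        return (Q @ U)[:, :k], s[:k], Vt[:k]

    A = np.random.default_rng(1).standard_normal((1000, 300))
    exact = np.linalg.svd(A, compute_uv=False)[:40]
    for q in (0, 1, 2):
        _, s, _ = rand_svd(A, k=40, q=q)
        print(q, np.max(np.abs(s - exact) / exact))  # error shrinks as q grows

More power iterations mean extra passes over the input (slower) in
exchange for precision, which is the -q trade-off; p pads the random
projection so the top k directions are captured reliably.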


>
>
> cheers,
>
>
> Dan

Re: options for finding smallest eigenvectors

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Dan,

Nathan Halko's thesis did detailed comparisons of singular values
between Mahout's Lanczos and SSVD. You can look up a link to his
dissertation on this list archive. (or perhaps he mentioned it @dev,
can't remember off the top of my head).

Bottom line, the way I read the thesis, Mahout's Lanczos loses
precision in his experiments around the 40th eigenvalue when used on
the Wikipedia dataset, and runs out of memory around the 60th
eigenvalue. SSVD results on the other hand look good with q=1 and
outstanding with q=2, and there's virtually no difference between q=2
and q=3. He did not go beyond 100 or 200 eigenvalues, though, I think.
Doing 200 eigenvalues and beyond is fairly flops-intensive, but I
think one could do it if they wanted to, with little evidence that
precision would suffer at a faster rate than it did for the first 100
values.

It looks like Nathan tested it with a pre-0.6 release, as he ran into
some deficiencies with power iterations which were largely fixed in
0.6. But I can't speak for him.

If you find anything to add on top of this, it would certainly be
welcome testimony.

-d

On Tue, Mar 27, 2012 at 4:41 PM, Dan Brickley <da...@danbri.org> wrote:
> On 27 March 2012 18:22, Ted Dunning <te...@gmail.com> wrote:
>> The smallest eigenvalues are always problematic in large matrices.
>>
>> Any trick to expose them (such as the diagonal subtraction that you
>> mention) should work with any of our stuff as well.
>
> Thanks, Ted; much appreciated! I'll see if I can get a stronger grasp
> on those tricks via matlab/R/etc first, then take a look at the Mahout
> options (Lanczos or SSVD). Fairly nearby is a discussion with Shannon
> about the Spectral Clustering code, and whether investigating
> migration of that to SSVD might make sense. (cf.
> https://issues.apache.org/jira/browse/MAHOUT-986 )
>
> Maybe this is as good an excuse as any to get my hands dirty with
> SSVD... Is there any strong reason to prefer Lanczos for this sort of
> thing? Depends on app? (mine won't care massively about super
> precision; speed would generally be more of a concern)
>
>
> cheers,
>
>
> Dan

Re: options for finding smallest eigenvectors

Posted by Dan Brickley <da...@danbri.org>.
On 27 March 2012 18:22, Ted Dunning <te...@gmail.com> wrote:
> The smallest eigenvalues are always problematic in large matrices.
>
> Any trick to expose them (such as the diagonal subtraction that you
> mention) should work with any of our stuff as well.

Thanks, Ted; much appreciated! I'll see if I can get a stronger grasp
on those tricks via matlab/R/etc first, then take a look at the Mahout
options (Lanczos or SSVD). Fairly nearby is a discussion with Shannon
about the Spectral Clustering code, and whether investigating
migration of that to SSVD might make sense. (cf.
https://issues.apache.org/jira/browse/MAHOUT-986 )

Maybe this is as good an excuse as any to get my hands dirty with
SSVD... Is there any strong reason to prefer Lanczos for this sort of
thing? Depends on app? (mine won't care massively about super
precision; speed would generally be more of a concern)


cheers,


Dan

Re: options for finding smallest eigenvectors

Posted by Ted Dunning <te...@gmail.com>.
The smallest eigenvalues are always problematic in large matrices.

Any trick to expose them (such as the diagonal subtraction that you
mention) should work with any of our stuff as well.

On Tue, Mar 27, 2012 at 2:01 AM, Dan Brickley <da...@danbri.org> wrote:

> If one wanted the *smallest* eigenvectors in a (potentially rather
> large but sparse) matrix, what are the options in Mahout currently?
> Lanczos? ssvd?
>
> Specifically the eigenvector corresponding to the second-smallest
> eigenvalue... (there's always a boring zero one) of the Laplacian
> re-representation of an affinity matrix, in my case.
>
> Reading around,
> http://www.sfu.ca/personal/archives/richards/Pages/NAS.AJS-WDR.pdf
>
> "While the Laplacian spectrum has many attractive theoretical
> properties which make it very useful for partitioning large sparse
> arrays, it has at least one drawback: the Fiedler eigenvector is
> associated with the second smallest eigenvalue. The current method of
> choice for finding a few eigenvectors of very large sparse matrices is
> the Lanczos method, which tends not to find small eigenvectors as
> quickly as it finds the largest eigenvectors (Pothen, et al, 1990). "
>
> That paper also suggests a workaround for this: "The definition of L
> shows that there is no loss of sparsity (except for the diagonal) and
> that the sparse methods mentioned earlier can be applied to find all
> or some of the eigenpairs. The requirement that we must find the
> smallest eigenpairs is easily overcome by subtracting a suitably large
> constant from the diagonal of -L (which subtracts that constant from
> the eigenvalues without changing the eigenvectors). This guarantees
> that the first eigenpairs returned by the Power Method or Lanczos
> iteration are associated with the smallest eigenvalues of L."
>
> There's a thesis at http://am.vsb.cz/theses/kabelikova_ing.pdf that
> seems to try this (or a related) method, but I didn't dig into the
> details.
>
>
> If I understand the literature correctly (I've been poking around
> spectral graph theory e.g. see
> http://www.cs.purdue.edu/homes/dgleich/demos/matlab/spectral/spectral.html
> for a fairly accessible Matlab example), this second-smallest (aka
> Fiedler) eigenvector can be very interesting for finding structure in
> graph datasets, even if intuitively we're drawn more towards the
> larger eigenvectors for explaining bulk variation in the data. This is
> all closely related to Mahout's spectral clustering component, but
> from talking with Shannon I don't believe we currently do anything to
> find the smallest eigenvectors there. Are Lanczos and SSVD Mahout's
> two natural starting points for this kind of thing, or am I forgetting
> some other machinery?
>
> Thanks for any pointers or fixes to my understanding,
>
> Dan
>