Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/07/05 15:43:40 UTC

SVD and Clustering

Hi,

I think Jake mentioned earlier that he recommends running SVD before running clustering.  I was wondering if anyone had any pointers on papers, etc. that discuss this, or is it just seen as a general way of doing feature reduction and therefore it makes sense to do.

Thanks,
Grant

Re: SVD and Clustering

Posted by Ted Dunning <te...@gmail.com>.
Practically speaking, term weighting is important, but you also have to
watch out for eigen-spoke behavior.

https://research.sprintlabs.com/publications/uploads/icdm-09-ldmta-camera-ready.pdf

This can arise when you have strong clique phenomena in your data (not
likely in your case) or when you have strong mechanical effects in your
data.  This latter problem can be due to fixed phrases or boilerplate in
text.

One way of dealing with eigen-spokes is to normalize your vectors to the
unit sphere after deriving them using SVD.  This reduces an eigenspoke to a
very tight cluster.
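
For concreteness, here is a minimal numpy sketch of that post-SVD step; the
toy matrix, the choice of k, and the variable names are all made up for
illustration (this is not Mahout code):

import numpy as np

# A: rows are weighted document vectors (random toy data, purely illustrative)
A = np.random.rand(100, 500)

# Rank-k reduction via SVD
k = 20
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = U[:, :k] * s[:k]          # document coordinates in the k-dim latent space

# Project each reduced vector onto the unit sphere.  Documents lying along an
# eigenspoke differ mostly in magnitude, so after this step they collapse into
# a very tight cluster.
norms = np.linalg.norm(docs_k, axis=1, keepdims=True)
docs_k = docs_k / np.maximum(norms, 1e-12)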

Even after normalization, it isn't uncommon to see power law cluster
membership.  Whether this is a problem for you depends on what the purpose
of your clustering is.

Do also note that clustering SVD-reduced representations is pretty much just
the simplest form of spectral clustering, which is what the Eigencuts Summer
of Code project is all about.  Shannon seems pretty well informed on these
techniques, so you might talk to him.
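
For reference, the generic normalized-cut style recipe that "SVD then cluster"
approximates looks roughly like the sketch below; this is the textbook
algorithm, not the Eigencuts code, it assumes a symmetric nonnegative affinity
matrix W, and it leans on scikit-learn's KMeans purely as a convenient
stand-in clusterer:

import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(W, k):
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])

    # The k eigenvectors with the smallest eigenvalues carry the cluster structure
    vals, vecs = np.linalg.eigh(L)      # eigenvalues returned in ascending order
    X = vecs[:, :k]

    # Row-normalize (the unit-sphere step again), then run a plain clusterer
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)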

On Mon, Jul 5, 2010 at 9:57 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Actually, what I'm really after is which techniques people find successful
> in producing good results.  It's pretty clear to me in testing that things
> like good stopword removal (obviously for text clustering) are helpful, but
> what else?  I see a number of papers on the subject (search for "feature
> selection for clustering") but I'm interested in what people have found on
> the practical side of things.
>
>
> -Grant
>
> On Jul 5, 2010, at 9:43 AM, Grant Ingersoll wrote:
>
> > Hi,
> >
> > I think Jake mentioned earlier that he recommends running SVD before
> running clustering.  I was wondering if anyone had any pointers on papers,
> etc. that discuss this, or is it just seen as a general way of doing feature
> reduction and therefore it makes sense to do.
> >
> > Thanks,
> > Grant
>
>

Re: SVD and Clustering

Posted by Grant Ingersoll <gs...@apache.org>.
Actually, what I'm really after is which techniques people find successful in producing good results.  It's pretty clear to me in testing that things like good stopword removal (obviously for text clustering) are helpful, but what else?  I see a number of papers on the subject (search for "feature selection for clustering"), but I'm interested in what people have found on the practical side of things.


-Grant

On Jul 5, 2010, at 9:43 AM, Grant Ingersoll wrote:

> Hi,
> 
> I think Jake mentioned earlier that he recommends running SVD before running clustering.  I was wondering if anyone had any pointers on papers, etc. that discuss this, or is it just seen as a general way of doing feature reduction and therefore it makes sense to do.
> 
> Thanks,
> Grant


Re: SVD and Clustering

Posted by Ted Dunning <te...@gmail.com>.
The logic is that all documents are in the same language, so the most
common words in that language will provide a profile that dominates all
content-specific patterns.  The vector for each document should therefore be
roughly a scaled version of the content-neutral, language-specific vector
plus some corrections for the actual content variations.  This dominant
vector should be the first eigenvector.

As I look at it again, this argument is not so strong.  Or perhaps, we
should consider it "subtle".

First, for linear term weighting (traditional tf.idf), all of the components
will scale with length and the argument is hooey.

For log(tf+1) . idf, on the other hand, doubling the length of the document
will give common words a roughly constant offset of about idf * log(2),
because they already have tf > 1 so log(2*tf + 1) \approx log(2) + log(tf + 1).
For less common words, we will instead see a transition from tf=0 to tf=1 as
the length grows.  This will definitely come much closer to isolating length
and universal word frequency patterns in the first eigenvector.
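
A tiny numeric illustration of the difference (the counts and idf values
below are invented; only the shape of the effect matters):

import numpy as np

tf  = np.array([40.0, 25.0, 3.0, 1.0])   # very common ... fairly rare terms
idf = np.array([0.1,  0.2,  2.0, 2.5])   # made-up idf values

def linear_tfidf(tf):
    return tf * idf                       # traditional tf.idf

def log_tfidf(tf):
    return np.log1p(tf) * idf             # log(tf + 1) . idf

doubled = 2 * tf                          # the same document pasted twice

# Linear weighting: every component exactly doubles, so only the vector's
# length changes, not its direction.
print(linear_tfidf(doubled) / linear_tfidf(tf))     # [2. 2. 2. 2.]

# Log weighting: common terms (tf >> 1) shift by roughly idf * log(2), a
# near-constant offset, while rare terms move differently.
print(log_tfidf(doubled) - log_tfidf(tf))
print(np.log(2) * idf)                              # the constant-offset approximation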

A second problem could come about if your stop word list really is removing
all non-content words.  Obviously, for a short list, it won't be removing
all of the words common across all documents, so you will still have the
remnant of the pattern I mentioned for log(tf+1), but it might conceivably
be decimated to the point of not being the first eigenvector.  As such, you
might view dropping the first eigenvector as roughly equivalent to using a
stop word list.
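
In code, the equivalent trick is just to skip the first singular direction
when building the reduced representation; a numpy sketch on toy data (all
names and sizes here are illustrative):

import numpy as np

# A: rows are log(tf+1) . idf document vectors (random toy data)
A = np.random.rand(200, 1000)

k = 50
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Skip the first singular direction -- the one that tends to carry length and
# the shared common-word profile -- and keep the next k, which is roughly the
# stop-word-list effect described above.
docs_reduced = U[:, 1:k + 1] * s[1:k + 1]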


On Tue, Jul 6, 2010 at 12:04 AM, Jake Mannix <ja...@gmail.com> wrote:

> Hey Ted,
>
>  What is the reasoning again that dropping the first eigenvector is
> equivalent
> to forgetting about normalization / centering?  I have heard this before,
> but
> don't know if the math pops out at me... (isn't the first right singular
> vector
> also effectively the "TextRank" vector, for suitable input documents?)
>
>  -jake
>
> On Tue, Jul 6, 2010 at 8:27 AM, Ted Dunning <te...@gmail.com> wrote:
>
> > Related to normalization, the original LSA team claimed better results
> with
> > tf.idf weighting.  I would tend to use log(1+tf) . idf instead.  I think
> > that term weighting of this sort is quite common.
> >
> > Document level normalization is a bit less common.  It is common
> practice,
> > however, to not normalize documents but instead to drop the first
> > eigenvector, on the theory that the document norm winds up there
> anyway.
> >  I would imagine that normalizing documents to some degree would make the
> > numerics of computing the SVD a bit better and save the extra work of
> > computing and then throwing away that eigenvector.  The first eigenvector
> > also takes the load of centering the documents.
> >
> > I do know that I have forgotten to toss that first eigenvector on several
> > occasions and been mystified for a time at how my results weren't as
> good.
> >
> > On Mon, Jul 5, 2010 at 11:16 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> >
> > > In my own experience, things like graphs (including bipartite graphs
> like
> > > ratings matrices) I normalized before *and* after, but text I don't
> > (unit)
> > > normalize before, but do normalize after.
> > >
> > > The reasoning I use is that normalizing the rows of graphs has
> > > a meaning in the context of the graph (you're doing the PageRank-like
> > > thing of normalizing outflowing probability when looking at random
> > > walks, for example, or for ratings matrices, you're saying that
> > > everyone gets "one vote" to distribute amongst the things they've
> > > rated [these apply for doing L_1 normalization, which isn't always
> > > appropriate]), while I don't know if I buy the similar description of
> > > what pre-normalizing the rows of a text corpus would mean.
> > >
> >
>

Re: SVD and Clustering

Posted by Jake Mannix <ja...@gmail.com>.
Hey Ted,

  What is the reasoning again that dropping the first eigenvector is
equivalent
to forgetting about normalization / centering?  I have heard this before,
but
don't know if the math pops out at me... (isn't the first right singular
vector
also effectively the "TextRank" vector, for suitable input documents?)

  -jake

On Tue, Jul 6, 2010 at 8:27 AM, Ted Dunning <te...@gmail.com> wrote:

> Related to normalization, the original LSA team claimed better results with
> tf.idf weighting.  I would tend to use log(1+tf) . idf instead.  I think
> that term weighting of this sort is quite common.
>
> Document level normalization is a bit less common.  It is common practice,
> however, to not normalize documents but instead to drop the first
> eigenvector, on the theory that the document norm winds up there anyway.
>  I would imagine that normalizing documents to some degree would make the
> numerics of computing the SVD a bit better and save the extra work of
> computing and then throwing away that eigenvector.  The first eigenvector
> also takes the load of centering the documents.
>
> I do know that I have forgotten to toss that first eigenvector on several
> occasions and been mystified for a time at how my results weren't as good.
>
> On Mon, Jul 5, 2010 at 11:16 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > In my own experience, things like graphs (including bipartite graphs like
> > ratings matrices) I normalized before *and* after, but text I don't
> (unit)
> > normalize before, but do normalize after.
> >
> > The reasoning I use is that normalizing the rows of graphs has
> > a meaning in the context of the graph (you're doing the PageRank-like
> > thing of normalizing outflowing probability when looking at random
> > walks, for example, or for ratings matrices, you're saying that
> > everyone gets "one vote" to distribute amongst the things they've
> > rated [these apply for doing L_1 normalization, which isn't always
> > appropriate]), while I don't know if I buy the similar description of
> > what pre-normalizing the rows of a text corpus would mean.
> >
>

Re: SVD and Clustering

Posted by Ted Dunning <te...@gmail.com>.
Related to normalization, the original LSA team claimed better results with
tf.idf weighting.  I would tend to use log(1+tf) . idf instead.  I think
that term weighting of this sort is quite common.

Document level normalization is a bit less common.  It is common practice,
however, to not normalize documents but instead to drop the first
eigenvector, on the theory that the document norm winds up there anyway.
 I would imagine that normalizing documents to some degree would make the
numerics of computing the SVD a bit better and save the extra work of
computing and then throwing away that eigenvector.  The first eigenvector
also takes the load of centering the documents.

I do know that I have forgotten to toss that first eigenvector on several
occasions and been mystified for a time at how my results weren't as good.

On Mon, Jul 5, 2010 at 11:16 PM, Jake Mannix <ja...@gmail.com> wrote:

> In my own experience, things like graphs (including bipartite graphs like
> ratings matrices) I normalized before *and* after, but text I don't (unit)
> normalize before, but do normalize after.
>
> The reasoning I use is that normalizing the rows of graphs has
> a meaning in the context of the graph (you're doing the PageRank-like
> thing of normalizing outflowing probability when looking at random
> walks, for example, or for ratings matrices, you're saying that
> everyone gets "one vote" to distribute amongst the things they've
> rated [these apply for doing L_1 normalization, which isn't always
> appropriate]), while I don't know if I buy the similar description of
> what pre-normalizing the rows of a text corpus would mean.
>

Re: SVD and Clustering

Posted by Jake Mannix <ja...@gmail.com>.
In my own experience, things like graphs (including bipartite graphs like
ratings matrices) I normalized before *and* after, but text I don't (unit)
normalize before, but do normalize after.

The reasoning I use is that normalizing the rows of graphs has
a meaning in the context of the graph (you're doing the PageRank-like
thing of normalizing outflowing probability when looking at random
walks, for example, or for ratings matrices, you're saying that
everyone gets "one vote" to distribute amongst the things they've
rated [these apply for doing L_1 normalization, which isn't always
appropriate]), while I don't know if I buy the similar description of
what pre-normalizing the rows of a text corpus would mean.
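
As a toy sketch of that recipe on a made-up ratings matrix (L_1 on the rows
before, SVD, then L_2 on the reduced vectors after; numpy here just for
illustration):

import numpy as np

# Toy user x item ratings matrix
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 0., 5., 4.]])

# Before: L_1-normalize rows so each user distributes "one vote" -- the
# random-walk / PageRank-flavored interpretation.
R_pre = R / np.maximum(R.sum(axis=1, keepdims=True), 1e-12)

# Reduce with SVD.
k = 2
U, s, Vt = np.linalg.svd(R_pre, full_matrices=False)
users_k = U[:, :k] * s[:k]

# After: L_2-normalize the reduced vectors (for text, only this step applies).
users_k = users_k / np.maximum(np.linalg.norm(users_k, axis=1, keepdims=True), 1e-12)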

  -jake


On Tue, Jul 6, 2010 at 1:08 AM, Ted Dunning <te...@gmail.com> wrote:

> On Mon, Jul 5, 2010 at 12:34 PM, Grant Ingersoll <gsingers@apache.org
> >wrote:
>
> >
> > On Jul 5, 2010, at 1:17 PM, Ted Dunning wrote:
> >
> > > Yes to this.
> > >
> > > On Mon, Jul 5, 2010 at 6:43 AM, Grant Ingersoll <gs...@apache.org>
> > wrote:
> > >
> > >> is it just seen as a general way of doing feature reduction and
> > therefore
> > >> it makes sense to do.
> >
> > Should I normalize my vectors before doing SVD or after or not at all?
>
>
> Yes.  :-)
>
> Any of these can help.  Normalizing before will probably not have a huge
> effect, but could be helpful if you have certain kinds of odd documents.
>  Normalizing document vectors after SVD may be critical to avoid problems
> with eigenspokes.  Avoiding normalization is important in certain other
> situations.
>
> So the answer to your two binary questions expressed as four possible
> options is "Yes".
>
> Try it and apply the laugh test to each option.
>

Re: SVD and Clustering

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jul 5, 2010 at 12:34 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jul 5, 2010, at 1:17 PM, Ted Dunning wrote:
>
> > Yes to this.
> >
> > On Mon, Jul 5, 2010 at 6:43 AM, Grant Ingersoll <gs...@apache.org>
> wrote:
> >
> >> is it just seen as a general way of doing feature reduction and
> therefore
> >> it makes sense to do.
>
> Should I normalize my vectors before doing SVD or after or not at all?


Yes.  :-)

Any of these can help.  Normalizing before will probably not have a huge
effect, but could be helpful if you have certain kinds of odd documents.
 Normalizing document vectors after SVD may be critical to avoid problems
with eigenspokes.  Avoiding normalization is important in certain other
situations.

So the answer to your two binary questions expressed as four possible
options is "Yes".

Try it and apply the laugh test to each option.

Re: SVD and Clustering

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 5, 2010, at 1:17 PM, Ted Dunning wrote:

> Yes to this.
> 
> On Mon, Jul 5, 2010 at 6:43 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> is it just seen as a general way of doing feature reduction and therefore
>> it makes sense to do.

Should I normalize my vectors before doing SVD or after or not at all?

Re: SVD and Clustering

Posted by Ted Dunning <te...@gmail.com>.
Yes to this.

On Mon, Jul 5, 2010 at 6:43 AM, Grant Ingersoll <gs...@apache.org> wrote:

> is it just seen as a general way of doing feature reduction and therefore
> it makes sense to do.