Posted to user@mahout.apache.org by Paul Jones <pa...@yahoo.co.uk> on 2009/06/24 02:01:48 UTC

LSI, cosine and others which use vectors

Yes another question, am going through a rapid learning curve...

All these vector based systems, which require you to build a term-doc matrix etc., are they of any use in a system where the data is changing, i.e. let's assume the docs are webpages which are being crawled and hence updated? Surely if there is a vector diagram being formed, then the position of these vectors changes based on the changes (size, content) of the entire matrix, or am I missing something here?

If the above is correct, then in an actual live project how is this done? Are distances worked out on a per-day type of basis, and the indexes then updated?

Paul



      

Re: LSI, cosine and others which use vectors

Posted by Ted Dunning <te...@gmail.com>.
The changes in the weightings are exceedingly small and can be neglected in
practice for quite a long time.
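
As a rough, back-of-the-envelope sketch (purely illustrative: I am assuming an
IDF-style weight, idf(t) = log(N / df(t)), and the corpus numbers below are made
up), adding an hour's worth of crawled documents to a large corpus barely moves
the weight of any term:

    // Hypothetical sketch: how much does an IDF-style weight move when new
    // documents arrive?  The figures are invented for illustration only.
    public class IdfDrift {
        static double idf(long numDocs, long docFreq) {
            return Math.log((double) numDocs / docFreq);
        }

        public static void main(String[] args) {
            long n = 10_000_000L;     // documents before the update
            long df = 250_000L;       // documents containing some term

            // suppose an hour of crawling adds 1000 docs, 30 of which contain the term
            double before = idf(n, df);
            double after = idf(n + 1_000L, df + 30L);

            System.out.printf("idf before = %.6f, after = %.6f, delta = %.2e%n",
                    before, after, after - before);
            // the delta is on the order of 1e-5 -- negligible next to the weight itself (~3.7)
        }
    }

So recomputing the weights occasionally in batch, rather than on every update,
costs you essentially nothing.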

On Wed, Jun 24, 2009 at 6:02 AM, Paul Jones <pa...@yahoo.co.uk>wrote:

> >>>> I don't follow how you would do just part of the matrix, i.e. if you
> have created the term-doc matrix, surely the data would need to be weighted
> across the whole set in order to give you relevance; I am not sure how you
> would slice out the subset of the matrix you want to work with.
>

Re: LSI, cosine and others which use vectors

Posted by Ted Dunning <te...@gmail.com>.
The matrix algebra is just a compact notation for a pattern of arithmetic
operations.

Let's actually take documents as rows and words as columns since that is
more common practice.  If you look at the definition of the matrix product
A' A, (where A' is the transpose of A, i.e. it is term by document rather
than document by term), then we get this:

   b_ij = sum_k a_ki a_kj

If all of the elements of A are binary, then the product inside the sum will
be 1 where both a_ki and a_kj are 1.  That means the sum will be a count of
the documents which have both word i and word j.  This is the cooccurrence
count for words i and j.  If A is not binary but is weighted, then this sum
is the weighted similarity between words i and j.
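
Here is a toy sketch of that arithmetic in plain Java (data invented for
illustration, nothing Mahout-specific), just to show that B = A'A is nothing
more than a brute-force triple loop:

    // Toy illustration of b_ij = sum_k a_ki a_kj for a binary doc x term matrix.
    public class Cooccurrence {

        // rows of a are documents, columns are terms; returns the term x term matrix A'A
        static int[][] cooccurrence(int[][] a) {
            int docs = a.length, terms = a[0].length;
            int[][] b = new int[terms][terms];
            for (int i = 0; i < terms; i++) {
                for (int j = 0; j < terms; j++) {
                    int sum = 0;
                    for (int k = 0; k < docs; k++) {
                        sum += a[k][i] * a[k][j];   // contributes 1 when doc k has both word i and word j
                    }
                    b[i][j] = sum;
                }
            }
            return b;
        }

        public static void main(String[] args) {
            int[][] a = {            // 3 documents x 4 terms, binary occurrence
                {1, 1, 0, 0},
                {1, 0, 1, 0},
                {0, 1, 1, 1},
            };
            int[][] b = cooccurrence(a);
            System.out.println(java.util.Arrays.deepToString(b));
            // b[0][1] == 1: exactly one document contains both term 0 and term 1
        }
    }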

Repeating this trick, you find that B'B = (A'A)' (A'A) is a term by term
matrix that measures how similar the cooccurrence vectors are for two words.
If you expand it out, you will see that B'B is nothing but a bunch of
dot-products (i.e. cosines of angles multiplied by magnitudes).  You may
want to normalize the rows of A'A if you are using weighted arithmetic, or
sparsify A'A if you are using counts, but the pattern of operations is the
same.
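
To make the dot-product remark concrete, here is a small illustrative sketch
(again my own toy code, not anything from Mahout): the cosine between two rows
of A'A is exactly one entry of B'B once the rows have been normalized.

    // Cosine between rows i and j of a (square, term x term) cooccurrence matrix B = A'A.
    // Dividing by the row norms is the "normalize the rows" step mentioned above.
    public class CooccurrenceCosine {
        static double rowCosine(int[][] b, int i, int j) {
            double dot = 0, ni = 0, nj = 0;
            for (int k = 0; k < b.length; k++) {
                dot += (double) b[i][k] * b[j][k];
                ni  += (double) b[i][k] * b[i][k];
                nj  += (double) b[j][k] * b[j][k];
            }
            return dot / (Math.sqrt(ni) * Math.sqrt(nj));
        }

        public static void main(String[] args) {
            // this is A'A for the toy A in the previous sketch
            int[][] b = { {2, 1, 1, 0}, {1, 2, 1, 1}, {1, 1, 2, 1}, {0, 1, 1, 1} };
            System.out.println(rowCosine(b, 0, 1));   // ~0.77: terms 0 and 1 have similar cooccurrence patterns
        }
    }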

Again, there is nothing magical about the matrix notation.  It is just a
compact way of describing a bunch of arithmetic.  It also lets us tap into
a wealth of results that have been derived for linear algebra that we can
either misuse for our purposes or from which we can derive inspiration.

On Wed, Jun 24, 2009 at 6:02 AM, Paul Jones <pa...@yahoo.co.uk>wrote:

> >>>> Okay aside from being confused with matrix algebra :-), am confused
> with the "easy to implement using a doc x term matrix", i.e. not sure how a
> doc-term matrix would work out the similarity between words, is it not
> working out the occurrence of words in a doc. Maybe I am
> misunderstanding... Let's say I have a matrix built, where the docs are the
> columns, and the words as rows, now my limited understanding from what I
> have read says that this matrix can be represented as a number of vectors,
> e.g. let's say we have one document, with 3 words, then the x/y/z axis will
> represent each word and its freq of occurrence, and hence the point in space
> forming the vector depicts this word in relation to that document.
>
> And this can be expanded. Now if we have 2 documents, with 2 more words, we
> have another point. The distance between them shows how similar they are,
> and hence how similar the documents are to each other.
>
> So far so good, but I am unsure how this translates into showing how similar
> the words are themselves, i.e. co-occurrence; would that not have to be a
> term-term matrix?
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: LSI, cosine and others which use vectors

Posted by Paul Jones <pa...@yahoo.co.uk>.
>>> inline




________________________________
From: Ted Dunning <te...@gmail.com>
To: mahout-user@lucene.apache.org
Sent: Wednesday, 24 June, 2009 7:37:38
Subject: Re: LSI, cosine and others which use vectors

On Tue, Jun 23, 2009 at 8:05 PM, Paul Jones <pa...@yahoo.co.uk>wrote:

> tks Ted, but if it's a live system, and you have 10 million documents, then
> isn't the computation on the fly going to be a pain, if you add say 1000
> docs per hour or whatever, which is why I was assuming that it's a batch
> process.
>

That really depends on what you are doing.  Mostly, computing on the fly turns
out vastly better.  For instance, if you have 10^7 documents as you say, then you have
10^14 distances to think about.   In fact, most of those distances are large
enough that you can ignore them so it is nice to have a (notional/virtual)
sparse similarity matrix instead.  The cost of computing the entire doc-doc
matrix is substantial and more than you can typically do on the fly, but you
almost never need the entire similarity matrix.  So it is cheaper in many
cases to just compute the part that you need on the fly.

>>>> I don't follow how you would do just part of the matrix, i.e. if you have created the term-doc matrix, surely the data would need to be weighted across the whole set in order to give you relevance; I am not sure how you would slice out the subset of the matrix you want to work with.

There are exceptions, notably in recommendations where it pays to compute a
large number of item-item similarities, sparsify the result and store it as
a binary vector.

[snip]

Indeed.

If you look at your problem again, there are some interesting connections
between what you want to do and matrix algebra.  For instance, let's call
your document x term matrix of counts A.  Then the document level
cooccurrence matrix is A' A.  It often helps to use inverse-document
frequency scaled values instead of counts here, or to post-process the raw
counts to find anomalously large ones.  The next step, however, is to not
just find words that co-occur, but to find words that have similar patterns
of cooccurrence with each other.  If two words are interchangeable, it is
reasonable to expect that they wouldn't ever occur together, but you would
expect that they would co-occur with other words in just the same way.  You
can find words that have similar cooccurrence patterns by comparing rows (or
columns) of the cooccurrence matrix with other rows of the same matrix.
Using a matrix product does this wholesale:

           (A' A)' (A' A) = A'A A'A

If you decompose A using singular values, you get A = U' s V, so A' A = V' s
U U' s V = V' s^2 V (since U U' = I) and A' A A' A = V' s^4 V.  What will
happen here is that the vectors with the largest few singular values will
dominate, so you can approximate this result by doing a term level random
walk.

     while (many repetitions) {
         Term t = random term
         d = random document containing t
         u1 = random word in d
         d = random document containing u1
         u2 = random word in d
         count([t,u2])
     }

Pairs that have lots of counts will be the related words you want.  Note how
it is very easy to implement this using a doc x term matrix, especially one
implemented using something like Lucene.


>>>> Okay aside from being confused with matrix algebra :-), am confused with the "easy to implement using a doc x term matrix", i.e. not sure how a doc-term matrix would work out the similarity between words, is it not working out the occurrence of words in a doc. Maybe I am misunderstanding... Let's say I have a matrix built, where the docs are the columns, and the words as rows, now my limited understanding from what I have read says that this matrix can be represented as a number of vectors, e.g. let's say we have one document, with 3 words, then the x/y/z axis will represent each word and its freq of occurrence, and hence the point in space forming the vector depicts this word in relation to that document.

And this can be expanded. Now if we have 2 documents, with 2 more words, we have another point. The distance between them shows how similar they are, and hence how similar the documents are to each other.

So far so good, but I am unsure how this translates into showing how similar the words are themselves, i.e. co-occurrence; would that not have to be a term-term matrix?
[snip]


Tks again, 

Paul



      

Re: LSI, cosine and others which use vectors

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 23, 2009 at 8:05 PM, Paul Jones <pa...@yahoo.co.uk>wrote:

> tks Ted, but if it's a live system, and you have 10 million documents, then
> isn't the computation on the fly going to be a pain, if you add say 1000
> docs per hour or whatever, which is why I was assuming that it's a batch
> process.
>

That really depends on what you are doing.  Mostly, computing on the fly turns
out vastly better.  For instance, if you have 10^7 documents as you say, then you have
10^14 distances to think about.   In fact, most of those distances are large
enough that you can ignore them so it is nice to have a (notional/virtual)
sparse similarity matrix instead.  The cost of computing the entire doc-doc
matrix is substantial and more than you can typically do on the fly, but you
almost never need the entire similarity matrix.  So it is cheaper in many
cases to just compute the part that you need on the fly.
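
As a minimal sketch of the on-the-fly case (hypothetical names, and sparse maps
standing in for whatever vector representation you actually use), you score one
incoming document against a short candidate list instead of materializing the
10^14-entry doc-doc matrix:

    import java.util.HashMap;
    import java.util.Map;

    // Cosine similarity between two sparse term-weight vectors (term -> weight).
    // In practice the candidate list would come from an inverted index (e.g. Lucene),
    // i.e. only documents sharing at least one term with the document in hand.
    public class OnTheFlySimilarity {

        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double w = b.get(e.getKey());
                if (w != null) {
                    dot += e.getValue() * w;
                }
            }
            return dot / (norm(a) * norm(b));
        }

        static double norm(Map<String, Double> v) {
            double s = 0;
            for (double w : v.values()) {
                s += w * w;
            }
            return Math.sqrt(s);
        }

        public static void main(String[] args) {
            Map<String, Double> incoming = new HashMap<>();
            incoming.put("matrix", 2.0);
            incoming.put("vector", 1.0);
            Map<String, Double> candidate = new HashMap<>();
            candidate.put("matrix", 1.0);
            candidate.put("cosine", 3.0);
            System.out.println(cosine(incoming, candidate));   // ~0.28
        }
    }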

There are exceptions, notably in recommendations where it pays to compute a
large number of item-item similarities, sparsify the result and store it as
a binary vector.


> Also I think I have worked out what I meant about the relationships between
> the words themselves. I think I was looking to build a term-term matrix
> instead of a term-doc matrix, whereby I have the freq of occurrence of each
> word alongside each other word in a doc. (I guess an easy way to start is
> that the two words can co-occur anywhere in the doc.)
>

This is an excellent first step for word relationships.


> If done, hopefully the 'distance' between the two vectors should give me a
> relative relationship. I realise there are lots of problems with this
> approach, i.e. I don't know how the words are related... I just know that
> they are.
>

Indeed.

If you look at your problem again, there are some interesting connections
between what you want to do and matrix algebra.  For instance, let's call
your document x term matrix of counts A.  Then the document level
cooccurrence matrix is A' A.  It often helps to use inverse-document
frequency scaled values instead of counts here, or to post-process the raw
counts to find anomalously large ones.  The next step, however, is to not
just find words that co-occur, but to find words that have similar patterns
of cooccurrence with each other.  If two words are interchangeable, it is
reasonable to expect that they wouldn't ever occur together, but you would
expect that they would co-occur with other words in just the same way.  You
can find words that have similar cooccurrence patterns by comparing rows (or
columns) of the cooccurrence matrix with other rows of the same matrix.
Using a matrix product does this wholesale:

           (A' A)' (A' A) = A'A A'A

If you decompose A using singular values, you get A = U' s V, so A' A = V' s
U U' s V = V' s^2 V (since U U' = I) and A' A A' A = V' s^4 V.  What will
happen here is that the vectors with the largest few singular values will
dominate, so you can approximate this result by doing a term level random
walk.

     while (many repetitions) {
         Term t = random term
         d = random document containing t
         u1 = random word in d
         d = random document containing u1
         u2 = random word in d
         count([t,u2])
     }

Pairs that have lots of counts will be the related words you want.  Note how
it is very easy to implement this using a doc x term matrix, especially one
implemented using something like Lucene.

(note that you probably want to select documents preferentially that contain
the word of interest in excess of expected rates and you should pick new
words preferentially that occur anomalously often in the document.
Otherwise, this algorithm will focus in on common words)
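
For what it is worth, here is a concrete toy version of that sketch in plain
Java (entirely hypothetical names and data, and uniform sampling everywhere, so
per the note above common words will tend to dominate):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    // Toy version of the random walk: docs maps a document id to the list of
    // words it contains.  Everything is sampled uniformly; a real implementation
    // would bias the document and word choices toward anomalously frequent ones.
    public class RandomWalkCooccurrence {

        static Map<String, Integer> relatedTerms(Map<String, List<String>> docs, int reps, long seed) {
            // invert the doc -> words map into term -> docs containing that term
            Map<String, List<String>> postings = new HashMap<>();
            for (Map.Entry<String, List<String>> e : docs.entrySet()) {
                for (String term : new HashSet<>(e.getValue())) {
                    postings.computeIfAbsent(term, k -> new ArrayList<>()).add(e.getKey());
                }
            }
            List<String> terms = new ArrayList<>(postings.keySet());
            Random rnd = new Random(seed);
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < reps; i++) {
                String t = terms.get(rnd.nextInt(terms.size()));   // Term t = random term
                String d = pick(postings.get(t), rnd);             // d = random document containing t
                String u1 = pick(docs.get(d), rnd);                // u1 = random word in d
                d = pick(postings.get(u1), rnd);                   // d = random document containing u1
                String u2 = pick(docs.get(d), rnd);                // u2 = random word in d
                counts.merge(t + " | " + u2, 1, Integer::sum);     // count([t, u2])
            }
            return counts;
        }

        static String pick(List<String> xs, Random rnd) {
            return xs.get(rnd.nextInt(xs.size()));
        }

        public static void main(String[] args) {
            Map<String, List<String>> docs = new HashMap<>();
            docs.put("d1", Arrays.asList("cat", "dog", "pet"));
            docs.put("d2", Arrays.asList("cat", "pet", "food"));
            docs.put("d3", Arrays.asList("dog", "food", "bowl"));
            relatedTerms(docs, 10_000, 42L).entrySet().stream()
                .sorted((x, y) -> y.getValue() - x.getValue())
                .limit(5)
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
        }
    }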

Re: LSI, cosine and others which use vectors

Posted by Paul Jones <pa...@yahoo.co.uk>.
tks Ted, but if it's a live system, and you have 10 million documents, then isn't the computation on the fly going to be a pain, if you add say 1000 docs per hour or whatever, which is why I was assuming that it's a batch process.

Also I think I have worked out what I meant about the relationships between the words themselves. I think I was looking to build a term-term matrix instead of a term-doc matrix, whereby I have the freq of occurrence of each word alongside each other word in a doc. (I guess an easy way to start is that the two words can co-occur anywhere in the doc.) If done, hopefully the 'distance' between the two vectors should give me a relative relationship. I realise there are lots of problems with this approach, i.e. I don't know how the words are related... I just know that they are.

Paul




________________________________
From: Ted Dunning <te...@gmail.com>
To: mahout-user@lucene.apache.org
Sent: Wednesday, 24 June, 2009 1:52:41
Subject: Re: LSI, cosine and others which use vectors

There are two kinds of changes here.

The first kind is when a single document changes.  That will change the
distances between that document and others, but it won't change the
distances between two other documents.  Most importantly, it won't change
the distance between queries and other documents.

The second kind of change is due to the first and is relatively
unavoidable.  When a document changes, almost inevitably the corpus word
frequencies will change as a result.  This changes the weightings applied to
particular terms in documents.  When you have many documents, of which few
change, these changes will be small enough to ignore.

In practice, you don't much care about what has changed because a live
system computes all similarities or distances on the fly based on the
current state.   If the similarities that you have not yet computed change,
you don't care.

On Tue, Jun 23, 2009 at 5:01 PM, Paul Jones <pa...@yahoo.co.uk>wrote:

> Yes another question, am going through a rapid learning curve...
>
> All these vector based systems, which require you to build a term-doc etc,
> are they of any use in a system where the data is changing, i.e lets assume
> the docs are webpages, which are being crawled, and hence updated. Surely if
> there is a vector diagram being formed, then the position of these vectors
> changes based on the changes (size, content) of the entire matrix, or am I
> missing something here.
>
> If the above is correct, then in an actual live project how is this done,
> are distances worked out on a per-day type of basis, and the indexes then
> updated ?
>
> Paul
>
>
>
>




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)



      

Re: LSI, cosine and others which use vectors

Posted by Ted Dunning <te...@gmail.com>.
There are two kinds of changes here.

The first kind is when a single document changes.  That will change the
distances between that document and others, but it won't change the
distances between two other documents.  Most importantly, it won't change
the distance between queries and other documents.

The second kind of change is due to the first and is relatively
unavoidable.  When a document changes, almost inevitably the corpus word
frequencies will change as a result.  This changes the weightings applied to
particular terms in documents.  When you have many documents, of which few
change, these changes will be small enough to ignore.

In practice, you don't much care about what has changed because a live
system computes all similarities or distances on the fly based on the
current state.   If the similarities that you have not yet computed change,
you don't care.

On Tue, Jun 23, 2009 at 5:01 PM, Paul Jones <pa...@yahoo.co.uk>wrote:

> Yes another question, am going through a rapid learning curve...
>
> All these vector based systems, which require you to build a term-doc etc,
> are they of any use in a system where the data is changing, i.e lets assume
> the docs are webpages, which are being crawled, and hence updated. Surely if
> there is a vector diagram being formed, then the position of these vectors
> changes based on the changes (size, content) of the entire matrix, or am I
> missing something here.
>
> If the above is correct, then in an actual live project how is this done,
> are distances worked out on a per-day type of basis, and the indexes then
> updated ?
>
> Paul
>
>
>
>




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)