Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/05/27 16:47:12 UTC

Collocation and Seq2Sparse Questions

Hi,

I'm running the Collocation stuff (https://cwiki.apache.org/confluence/display/MAHOUT/Collocations) and have a few questions.

Here's what I am doing for now:

I have the Reuters stuff as TXT files.  I convert that to a Seq File.  Then I'm running seq2sparse:
 ./mahout seq2sparse --input ./content/reuters/seqfiles3 --output ./content/reuters/vectors2  --maxNGramSize 3

I then want to index my content into Solr/Lucene and I wish to supplement the main content with a new field that contains the top collocations for each document.  I see a couple of things that I'm not sure of how to proceed with:

1. I need labels on the vectors so that I can look up/associate my input document with the appropriate vector that was created by Mahout.  It doesn't seem like Seq2Sparse supports NamedVector, so how would I do this?

2. How can I, given a vector, get the top collocations for that Vector, as ranked by LLR?

Perhaps I should be using the CollocDriver directly?

Am I off base in wanting to do something like this? 

Thanks,
Grant

Re: Collocation and Seq2Sparse Questions

Posted by Drew Farris <dr...@gmail.com>.
On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <gs...@apache.org>wrote:

>
>
> 1. I need labels on the vectors so that I can look up/associate my input
> document with the appropriate vector that was created by Mahout.  It doesn't
> seem like Seq2Sparse supports NamedVector, so how would I do this?
>

Seq2Sparse produces sequence files of <Key,Vector> -- could you use the key
in lieu of a NamedVector here?

If it helps I've updated MAHOUT-401 with a patch that addresses the
NamedVector business in seq2sparse
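
To make the key-based lookup concrete, here's a small Python sketch of using the <Key,Vector> pairing to pull the top-weighted ngrams per document (a toy illustration only: the real vectors come out of seq2sparse as Hadoop SequenceFiles, and all file names, terms, and weights below are invented):

```python
# Toy stand-in for seq2sparse output: each document key maps to a sparse
# vector of {term index: TF-IDF weight}, and a dictionary maps index -> term.
def top_ngrams(vector, dictionary, n=5):
    """Return the n highest-weighted dictionary terms for one document."""
    ranked = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)
    return [dictionary[idx] for idx, _ in ranked[:n]]

dictionary = {0: "oil", 1: "crude oil", 2: "opec", 3: "barrel price"}
doc_vectors = {                      # key -> sparse vector, as in <Key,Vector>
    "reut-0001.txt": {0: 0.2, 1: 1.4, 3: 0.9},
    "reut-0002.txt": {2: 1.1, 0: 0.5},
}
best = {key: top_ngrams(vec, dictionary, 2) for key, vec in doc_vectors.items()}
# best["reut-0001.txt"] -> ["crude oil", "barrel price"]
```

The key then doubles as the join field when feeding a Solr document its collocation field.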

Re: Collocation and Seq2Sparse Questions

Posted by Grant Ingersoll <gs...@apache.org>.
On May 27, 2010, at 11:52 AM, Drew Farris wrote:

> 
> Not at all.
> 
> The alternative that's been discussed here in the past would involve some
> custom analyzer work. The general idea is to load the output from the
> CollocDriver into a bloom filter and then when processing documents at
> indexing time, set up a field where you generate shingles and only index
> those that appear in the bloom filter. This way you wind up getting a set of
> ngrams indexed that are ranked high across the entire corpus instead of
> simply the best ones for each document.
> 

I'd be happy with each doc at this point

Re: Collocation and Seq2Sparse Questions

Posted by Grant Ingersoll <gs...@apache.org>.
On May 27, 2010, at 1:05 PM, Ted Dunning wrote:

> A bit off topic, but what you really want is collocations that bring
> different information to the party than the constituent words.  

What I'm after right now are things of the nature of:

Here's what you can do _right now_ (i.e. very little coding) by combining Mahout (and other open source tools, but preferably Mahout) with search at some point in the chain (either indexing or searching).  Collocations seemed like a nice fit since I know phrases and collocations are often a pretty decent win in search without too much work.  Obviously, many other things fit here, too.

I am also trying to answer the question of: Here's what you can do right now with Mahout as part of an "intelligent application", not necessarily search based (but still might use search under the hood).  So, this leans a bit towards more BI-ish type things, like analytics, trend analysis, etc.   So, things like tracking topics, phrases, keywords over time are often useful as well as the more obvious stuff like clustering, classification, etc.

FWIW, the case for Mahout is already probably 5X what it was just 6 months ago.  That is just beautiful.

Suggestions welcome.

> That is, you
> need to detect cases where the "meaning" of the collocation is not
> compositionally predicted by the meanings of the words in the collocation.
> Simple collocation statistics really can't tell you that.  Instead, you
> need to look at the contexts in which the words appear.  Context statistics
> generally require a bit of smoothing, however, so you begin to step outside
> of where LLR type methods will really help you out.  SVD and random indexing

Random indexing?  Am I reading too much into that phrase beyond its obvious meaning?  If so, reference please.

> are more likely to be what you need.  The question becomes whether the
> semantic vector for the pair is significantly different from the semantic
> vector of either word or the average of the two.  If so, the pair is
> valuable.
> 
> This computation is WAY more intense than collocation counting,
> unfortunately, but LLR can be used to screen for the word pairs that are
> candidates for this.  At that point, the workload is plausible since you can
> use something like an inverted index phrase search to get the statistics you
> need.

As always, Ted, it makes brilliant sense.

-Grant

Re: Collocation and Seq2Sparse Questions

Posted by Ted Dunning <te...@gmail.com>.
A bit off topic, but what you really want is collocations that bring
different information to the party than the constituent words.  That is, you
need to detect cases where the "meaning" of the collocation is not
compositionally predicted by the meanings of the words in the collocation.
 Simple collocation statistics really can't tell you that.  Instead, you
need to look at the contexts in which the words appear.  Context statistics
generally require a bit of smoothing, however, so you begin to step outside
of where LLR type methods will really help you out.  SVD and random indexing
are more likely to be what you need.  The question becomes whether the
semantic vector for the pair is significantly different from the semantic
vector of either word or the average of the two.  If so, the pair is
valuable.

This computation is WAY more intense than collocation counting,
unfortunately, but LLR can be used to screen for the word pairs that are
candidates for this.  At that point, the workload is plausible since you can
use something like an inverted index phrase search to get the statistics you
need.
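
The compositionality test described above can be sketched in a few lines of Python (a toy illustration: the context vectors here are made up, and a real system would derive them from corpus co-occurrence counts via SVD or random indexing):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_noncompositional(pair_vec, w1_vec, w2_vec, threshold=0.5):
    """Flag a collocation whose context vector diverges from the average
    of its constituents' context vectors."""
    avg = [(a + b) / 2 for a, b in zip(w1_vec, w2_vec)]
    return cosine(pair_vec, avg) < threshold

# Hypothetical context counts: "hot dog" occurs in contexts unlike the
# average of "hot" and "dog", so it carries non-compositional meaning.
hot = [0.9, 0.1, 0.0, 0.2]
dog = [0.1, 0.8, 0.1, 0.2]
hot_dog = [0.0, 0.1, 0.9, 0.8]
compositional = [0.5, 0.5, 0.05, 0.2]  # close to the average of hot & dog
```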

On Thu, May 27, 2010 at 9:54 AM, Drew Farris <dr...@gmail.com> wrote:

> On Thu, May 27, 2010 at 12:03 PM, Grant Ingersoll <gsingers@apache.org
> >wrote:
>
> >
> > I just want to supplement my docs with some "high quality" collocations.
> >  TF-IDF is good enough, just not clear how best to get them out at this
> > point, on a per doc basis.
> >
>
> You could use the CollocDriver to get a sense of the LLR range for your
> corpus and then provide a minLLR as an argument to  seq2sparse -- that
> said,
> it doesn't necessarily address the issue of collocations that have a high LLR
> but are made up of words with a high frequency in the corpus. This might
> not
> be an issue for you however.
>

Re: Collocation and Seq2Sparse Questions

Posted by Drew Farris <dr...@gmail.com>.
On Thu, May 27, 2010 at 12:03 PM, Grant Ingersoll <gs...@apache.org>wrote:

>
> I just want to supplement my docs with some "high quality" collocations.
>  TF-IDF is good enough, just not clear how best to get them out at this
> point, on a per doc basis.
>

You could use the CollocDriver to get a sense of the LLR range for your
corpus and then provide a minLLR as an argument to seq2sparse -- that said,
it doesn't necessarily address the issue of collocations that have a high
LLR but are made up of words with a high frequency in the corpus. This
might not be an issue for you, however.
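
Choosing that minLLR cutoff from the observed range could look something like this (the ngram scores below are invented; in practice you'd parse them out of the CollocDriver's output):

```python
# Hypothetical (ngram, LLR) pairs, as if parsed from CollocDriver output.
collocations = [("crude oil", 812.4), ("interest rate", 430.2), ("of the", 95.1)]

def pick_min_llr(scores, keep_fraction):
    """Choose a minLLR cutoff that keeps roughly the top fraction of ngrams."""
    ranked = sorted(scores, reverse=True)
    cutoff_index = max(0, int(len(ranked) * keep_fraction) - 1)
    return ranked[cutoff_index]

min_llr = pick_min_llr([llr for _, llr in collocations], keep_fraction=2 / 3)
kept = [ngram for ngram, llr in collocations if llr >= min_llr]
# keeps "crude oil" and "interest rate", drops "of the"
```

The resulting value would then be passed as seq2sparse's minLLR argument.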

Re: Collocation and Seq2Sparse Questions

Posted by Ted Dunning <te...@gmail.com>.
For retrieval, I have had very good results in just retaining high LLR
collocations and letting any subsequent processing deal with the weighting.

On the other hand, I just saw this article which tested collocations for
spam detection and got no lift because the individual constituent words were
carrying the weight already.  (
http://www.aueb.gr/users/ion/docs/TR2004_updated.pdf linked from
http://aclweb.org/aclwiki/index.php?title=Spam_filtering_datasets)

On Thu, May 27, 2010 at 9:03 AM, Grant Ingersoll <gs...@apache.org>wrote:

> > There may be use cases for keeping LLR if only for diagnostic purposes.
>
> I just want to supplement my docs with some "high quality" collocations.
>  TF-IDF is good enough, just not clear how best to get them out at this
> point, on a per doc basis.

Re: Collocation and Seq2Sparse Questions

Posted by Grant Ingersoll <gs...@apache.org>.
On May 27, 2010, at 11:58 AM, Ted Dunning wrote:

> Just to forestall some effort on this, LLR is very good for threshold, but
> the value is bad as a score so substituting TF or TFIDF is entirely
> appropriate.

Good to know.

> 
> There may be use cases for keeping LLR if only for diagnostic purposes.

I just want to supplement my docs with some "high quality" collocations.  TF-IDF is good enough, just not clear how best to get them out at this point, on a per doc basis.

> 
> On Thu, May 27, 2010 at 8:52 AM, Drew Farris <dr...@gmail.com> wrote:
> 
>>> 2. How can I, given a vector, get the top collocations for that Vector,
>> as
>>> ranked by LLR?
>>> 
>> 
>> If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
>> TF or TFIDF depending on the nature of the vectors being generated.
>> Meanwhile, CollocDriver simply emits a list of collocations in a collection
>> ranked by llr, so neither is strictly what you are interested in. Is there
>> a
>> good way to include both something like TF >and< LLR in the output of
>> seq2sparse -- would it be necessary to resort to emitting 2 separate sets
>> of
>> vectors?
>> 



Re: Collocation and Seq2Sparse Questions

Posted by Ted Dunning <te...@gmail.com>.
Just to forestall some effort on this, LLR is very good for threshold, but
the value is bad as a score so substituting TF or TFIDF is entirely
appropriate.

There may be use cases for keeping LLR if only for diagnostic purposes.
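
For reference, the LLR score under discussion is Dunning's log-likelihood ratio over a 2x2 contingency table of bigram counts. A self-contained Python version of the entropy formulation (mirroring, as best I recall, what Mahout's LogLikelihood class computes):

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy: N*log N minus sum of k*log k."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 bigram contingency table:
    k11 = count(A B), k12 = count(A, not B), k21 = count(not A, B),
    k22 = count(neither)."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# Independent words score ~0; a strongly associated pair scores high,
# which is why LLR works well as a threshold.
```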

On Thu, May 27, 2010 at 8:52 AM, Drew Farris <dr...@gmail.com> wrote:

> > 2. How can I, given a vector, get the top collocations for that Vector,
> as
> > ranked by LLR?
> >
>
> If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
> TF or TFIDF depending on the nature of the vectors being generated.
> Meanwhile, CollocDriver simply emits a list of collocations in a collection
> ranked by llr, so neither is strictly what you are interested in. Is there
> a
> good way to include both something like TF >and< LLR in the output of
> seq2sparse -- would it be necessary to resort to emitting 2 separate sets
> of
> vectors?
>

Re: Collocation and Seq2Sparse Questions

Posted by Ted Dunning <te...@gmail.com>.
It uses MurmurHash, correct?  If so, it should be good.  If not, we should
upgrade to that.  (I think I remember that it does)

Apparently a big (almost accidental) win in our collections relative to
Trove was fewer collisions and thus better memory use and speed.  This can
be attributed to MurmurHash, I think.  (I can't remember where I saw these
benchmarks)

On Thu, May 27, 2010 at 1:42 PM, Jake Mannix <ja...@gmail.com> wrote:

> I think a general purpose one like the one in hadoop should be fine,
> personally.  Plus we already have it accessible on the classpath. :)
>

Re: Collocation and Seq2Sparse Questions

Posted by Jake Mannix <ja...@gmail.com>.
I think a general purpose one like the one in hadoop should be fine,
personally.  Plus we already have it accessible on the classpath. :)

On Thu, May 27, 2010 at 1:40 PM, Drew Farris <dr...@gmail.com> wrote:

> Should be easy enough to add.
>
> From what I can tell hadoop's DynamicBloomFilter and hbase's
> DynamicByteBloomFilter have similar origins but have diverged a bit. I
> haven't had time to do anything but skim them at this point, does anyone
> else have a sense as to which would be better to use?
>
>
> http://svn.apache.org/viewvc/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/DynamicByteBloomFilter.java?view=markup
>
>
> http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/bloom/DynamicBloomFilter.java?view=markup
>
>
> On Thu, May 27, 2010 at 3:25 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Now that we talk about it, it would totally make sense for the
> CollocDriver
> > to optionally spit out a (serialized) BloomFilter at the end of its
> > processing.  You can even do it in parallel and then OR the separate
> pieces
> > together...
> >
> >  -jake
> >
> > On May 27, 2010 12:09 PM, "Drew Farris" <dr...@gmail.com> wrote:
> >
> > On Thu, May 27, 2010 at 2:59 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> > > Ditto this. I though...
> > Not that I know of.
> >
> > There are a couple implementations in hbase too, not sure how similar
> these
> > are to the one in hadoop:
> >
> >
> >
> http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/
> >
> > ByteBloomFilter and DynamicByteBloomFilter
> >
>

Re: Collocation and Seq2Sparse Questions

Posted by Drew Farris <dr...@gmail.com>.
Should be easy enough to add.

From what I can tell hadoop's DynamicBloomFilter and hbase's
DynamicByteBloomFilter have similar origins but have diverged a bit. I
haven't had time to do anything but skim them at this point, does anyone
else have a sense as to which would be better to use?

http://svn.apache.org/viewvc/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/DynamicByteBloomFilter.java?view=markup

http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/bloom/DynamicBloomFilter.java?view=markup


On Thu, May 27, 2010 at 3:25 PM, Jake Mannix <ja...@gmail.com> wrote:

> Now that we talk about it, it would totally make sense for the CollocDriver
> to optionally spit out a (serialized) BloomFilter at the end of its
> processing.  You can even do it in parallel and then OR the separate pieces
> together...
>
>  -jake
>
> On May 27, 2010 12:09 PM, "Drew Farris" <dr...@gmail.com> wrote:
>
> On Thu, May 27, 2010 at 2:59 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > Ditto this. I though...
> Not that I know of.
>
> There are a couple implementations in hbase too, not sure how similar these
> are to the one in hadoop:
>
>
> http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/
>
> ByteBloomFilter and DynamicByteBloomFilter
>

Re: Collocation and Seq2Sparse Questions

Posted by Ted Dunning <te...@gmail.com>.
+1 !

Excellent format for the output.

On Thu, May 27, 2010 at 12:25 PM, Jake Mannix <ja...@gmail.com> wrote:

> Now that we talk about it, it would totally make sense for the CollocDriver
> to optionally spit out a (serialized) BloomFilter at the end of its
> processing.  You can even do it in parallel and then OR the separate pieces
> together...
>

Re: Collocation and Seq2Sparse Questions

Posted by Jake Mannix <ja...@gmail.com>.
Now that we talk about it, it would totally make sense for the CollocDriver
to optionally spit out a (serialized) BloomFilter at the end of its
processing.  You can even do it in parallel and then OR the separate pieces
together...

  -jake

On May 27, 2010 12:09 PM, "Drew Farris" <dr...@gmail.com> wrote:

On Thu, May 27, 2010 at 2:59 PM, Jake Mannix <ja...@gmail.com> wrote:
> Ditto this. I though...
Not that I know of.

There are a couple implementations in hbase too, not sure how similar these
are to the one in hadoop:

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/

ByteBloomFilter and DynamicByteBloomFilter

Re: Collocation and Seq2Sparse Questions

Posted by Drew Farris <dr...@gmail.com>.
On Thu, May 27, 2010 at 2:59 PM, Jake Mannix <ja...@gmail.com> wrote:

> Ditto this.  I thought we already had one in mahout somewhere too?
>

Not that I know of.

There are a couple implementations in hbase too, not sure how similar these
are to the one in hadoop:

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/

ByteBloomFilter and DynamicByteBloomFilter

Re: Collocation and Seq2Sparse Questions

Posted by Jake Mannix <ja...@gmail.com>.
Ditto this.  I thought we already had one in mahout somewhere too?

On May 27, 2010 11:46 AM, "Boris Aleksandrovsky" <ba...@gmail.com> wrote:

You can use one which comes with Hadoop
- org.apache.hadoop.util.bloom.DynamicBloomFilter. It is in the core jar.

On Thu, May 27, 2010 at 11:41 AM, Grant Ingersoll <gsingers@apache.org
>wrote:

> So, do we have a Bloom Filter handy in Mahout? I see a BSD licensed one at
> http://wwwse.inf.tu...

Re: Collocation and Seq2Sparse Questions

Posted by Boris Aleksandrovsky <ba...@gmail.com>.
You can use one which comes with Hadoop
- org.apache.hadoop.util.bloom.DynamicBloomFilter. It is in the core jar.

On Thu, May 27, 2010 at 11:41 AM, Grant Ingersoll <gs...@apache.org>wrote:

> So, do we have a Bloom Filter handy in Mahout?  I see a BSD licensed one at
> http://wwwse.inf.tu-dresden.de/xsiena/bloom_filter, but don't have any
> idea on its performance.
>
> On May 27, 2010, at 2:27 PM, Jake Mannix wrote:
>
> > Grant,
> >
> >  At LinkedIn, we do something very similar to what Drew is describing
> here
> > as part of our content-based recommender: use effectively the
> CollocDriver
> > to get the highest N (ranked by LLR) collocations in one job, then load
> > those into a bloom filter which is inserted into a simple custom Analyzer
> > which is chained with a ShingleAnalyzer to index (using lucene
> > normalization, not LLR score!) the "good" phrases for each document.
> >
> >  There are various tricks and techniques for cleaning this up better, but
> > even just the above does a pretty good job, and is very little on top of
> > Mahout's current codebase.
> >
> >  -jake
> >
> >
> > On May 27, 2010 8:53 AM, "Drew Farris" <dr...@gmail.com> wrote:
> >
> > On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <gsingers@apache.org
> >> wrote:
> >
> >> Hi, > > I'm running the Collocation stuff ( >
> > https://cwiki.apache.org/confluence/display/MAHOUT/...
> > Delroy/Jeff recently ran into this, but I'm having problems finding the
> > thread in the archive that I can link to. I'll open a jira with the patch
> > Jeff posted.
> >
> >> 2. How can I, given a vector, get the top collocations for that Vector,
> as
> >> ranked by LLR? >
> > If I recall correctly, the LLR score gets dropped in seq2sparse in favor
> of
> > TF or TFIDF depending on the nature of the vectors being generated.
> > Meanwhile, CollocDriver simply emits a list of collocations in a
> collection
> > ranked by llr, so neither is strictly what you are interested in. Is
> there a
> > good way to include both something like TF >and< LLR in the output of
> > seq2sparse -- would it be necessary to resort to emitting 2 separate sets
> of
> > vectors?
> >
> > Am I off base in wanting to do something like this? >
> > Not at all.
> >
> > The alternative that's been discussed here in the past would involve some
> > custom analyzer work. The general idea is to load the output from the
> > CollocDriver into a bloom filter and then when processing documents at
> > indexing time, set up a field where you generate shingles and only index
> > those that appear in the bloom filter. This way you wind up getting a set
> of
> > ngrams indexed that are ranked high across the entire corpus instead of
> > simply the best ones for each document.
> >
> > You'll want to take a look at the ngram list emitted from the
> CollocDriver,
> > ngrams composed of high frequency terms tend to get a high LLR score. For
> > some of the work I've done, filtering out ngrams composed of two or more
> > terms in the StandardAnalyzer's stoplist worked pretty well although
> there
> > always seem to be corpus-specific high frequency terms worth filtering
> out
> > as well.
> >
> > Hope this helps,
> >
> > Drew
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Collocation and Seq2Sparse Questions

Posted by Grant Ingersoll <gs...@apache.org>.
So, do we have a Bloom Filter handy in Mahout?  I see a BSD licensed one at http://wwwse.inf.tu-dresden.de/xsiena/bloom_filter, but don't have any idea on its performance.

On May 27, 2010, at 2:27 PM, Jake Mannix wrote:

> Grant,
> 
>  At LinkedIn, we do something very similar to what Drew is describing here
> as part of our content-based recommender: use effectively the CollocDriver
> to get the highest N (ranked by LLR) collocations in one job, then load
> those into a bloom filter which is inserted into a simple custom Analyzer
> which is chained with a ShingleAnalyzer to index (using lucene
> normalization, not LLR score!) the "good" phrases for each document.
> 
>  There are various tricks and techniques for cleaning this up better, but
> even just the above does a pretty good job, and is very little on top of
> Mahout's current codebase.
> 
>  -jake
> 
> 
> On May 27, 2010 8:53 AM, "Drew Farris" <dr...@gmail.com> wrote:
> 
> On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <gsingers@apache.org
>> wrote:
> 
>> Hi, > > I'm running the Collocation stuff ( >
> https://cwiki.apache.org/confluence/display/MAHOUT/...
> Delroy/Jeff recently ran into this, but I'm having problems finding the
> thread in the archive that I can link to. I'll open a jira with the patch
> Jeff posted.
> 
>> 2. How can I, given a vector, get the top collocations for that Vector, as
>> ranked by LLR? >
> If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
> TF or TFIDF depending on the nature of the vectors being generated.
> Meanwhile, CollocDriver simply emits a list of collocations in a collection
> ranked by llr, so neither is strictly what you are interested in. Is there a
> good way to include both something like TF >and< LLR in the output of
> seq2sparse -- would it be necessary to resort to emitting 2 separate sets of
> vectors?
> 
> Am I off base in wanting to do something like this? >
> Not at all.
> 
> The alternative that's been discussed here in the past would involve some
> custom analyzer work. The general idea is to load the output from the
> CollocDriver into a bloom filter and then when processing documents at
> indexing time, set up a field where you generate shingles and only index
> those that appear in the bloom filter. This way you wind up getting a set of
> ngrams indexed that are ranked high across the entire corpus instead of
> simply the best ones for each document.
> 
> You'll want to take a look at the ngram list emitted from the CollocDriver,
> ngrams composed of high frequency terms tend to get a high LLR score. For
> some of the work I've done, filtering out ngrams composed of two or more
> terms in the StandardAnalyzer's stoplist worked pretty well although there
> always seem to be corpus-specific high frequency terms worth filtering out
> as well.
> 
> Hope this helps,
> 
> Drew

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Collocation and Seq2Sparse Questions

Posted by Jake Mannix <ja...@gmail.com>.
Grant,

  At LinkedIn, we do something very similar to what Drew is describing here
as part of our content-based recommender: use effectively the CollocDriver
to get the highest N (ranked by LLR) collocations in one job, then load
those into a bloom filter which is inserted into a simple custom Analyzer
which is chained with a ShingleAnalyzer to index (using lucene
normalization, not LLR score!) the "good" phrases for each document.

  There are various tricks and techniques for cleaning this up better, but
even just the above does a pretty good job, and is very little on top of
Mahout's current codebase.

  -jake


On May 27, 2010 8:53 AM, "Drew Farris" <dr...@gmail.com> wrote:

On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <gsingers@apache.org
>wrote:

> Hi, > > I'm running the Collocation stuff ( >
https://cwiki.apache.org/confluence/display/MAHOUT/...
Delroy/Jeff recently ran into this, but I'm having problems finding the
thread in the archive that I can link to. I'll open a jira with the patch
Jeff posted.

> 2. How can I, given a vector, get the top collocations for that Vector, as
> ranked by LLR? >
If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
TF or TFIDF depending on the nature of the vectors being generated.
Meanwhile, CollocDriver simply emits a list of collocations in a collection
ranked by llr, so neither is strictly what you are interested in. Is there a
good way to include both something like TF >and< LLR in the output of
seq2sparse -- would it be necessary to resort to emitting 2 separate sets of
vectors?

Am I off base in wanting to do something like this? >
Not at all.

The alternative that's been discussed here in the past would involve some
custom analyzer work. The general idea is to load the output from the
CollocDriver into a bloom filter and then when processing documents at
indexing time, set up a field where you generate shingles and only index
those that appear in the bloom filter. This way you wind up getting a set of
ngrams indexed that are ranked high across the entire corpus instead of
simply the best ones for each document.

You'll want to take a look at the ngram list emitted from the CollocDriver;
ngrams composed of high-frequency terms tend to get a high LLR score. For
some of the work I've done, filtering out ngrams composed of two or more
terms in the StandardAnalyzer's stoplist worked pretty well, although there
always seem to be corpus-specific high-frequency terms worth filtering out
as well.

Hope this helps,

Drew

Re: Collocation and Seq2Sparse Questions

Posted by Drew Farris <dr...@gmail.com>.
On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <gs...@apache.org>wrote:

> Hi,
>
> I'm running the Collocation stuff (
> https://cwiki.apache.org/confluence/display/MAHOUT/Collocations) and have
> a few questions.
>
> Here's what I am doing for now:
>
> I have the Reuters stuff as TXT files.  I convert that to a Seq File.  Then
> I'm running seq2sparse:
>  ./mahout seq2sparse --input ./content/reuters/seqfiles3 --output
> ./content/reuters/vectors2  --maxNGramSize 3
>
> I then want to index my content into Solr/Lucene and I wish to supplement
> the main content with a new field that contains the top collocations for
> each document.  I see a couple of things that I'm not sure of how to proceed
> with:
>
> 1. I need labels on the vectors so that I can look up/associate my input
> document with the appropriate vector that was created by Mahout.  It doesn't
> seem like Seq2Sparse supports NamedVector, so how would I do this?
>

Delroy/Jeff recently ran into this, but I'm having problems finding the
thread in the archive that I can link to. I'll open a jira with the patch
Jeff posted.


> 2. How can I, given a vector, get the top collocations for that Vector, as
> ranked by LLR?
>

If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
TF or TFIDF depending on the nature of the vectors being generated.
Meanwhile, CollocDriver simply emits a list of collocations in a collection
ranked by llr, so neither is strictly what you are interested in. Is there a
good way to include both something like TF >and< LLR in the output of
seq2sparse -- would it be necessary to resort to emitting 2 separate sets of
vectors?

Am I off base in wanting to do something like this?
>

Not at all.

The alternative that's been discussed here in the past would involve some
custom analyzer work. The general idea is to load the output from the
CollocDriver into a bloom filter and then when processing documents at
indexing time, set up a field where you generate shingles and only index
those that appear in the bloom filter. This way you wind up getting a set of
ngrams indexed that are ranked high across the entire corpus instead of
simply the best ones for each document.

You'll want to take a look at the ngram list emitted from the CollocDriver;
ngrams composed of high-frequency terms tend to get a high LLR score. For
some of the work I've done, filtering out ngrams composed of two or more
terms in the StandardAnalyzer's stoplist worked pretty well, although there
always seem to be corpus-specific high-frequency terms worth filtering out
as well.

Hope this helps,

Drew
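
To tie the thread together, here is a toy Python sketch of the bloom-filter shingling idea discussed above (the filter class is a deliberately naive stand-in for Hadoop's org.apache.hadoop.util.bloom.DynamicBloomFilter, and the phrases and document text are made up):

```python
import hashlib

class ToyBloomFilter:
    """Naive Bloom filter over an int bitmask, for illustration only."""
    def __init__(self, size=4096, num_hashes=3):
        self.size, self.num_hashes, self.bits = size, num_hashes, 0

    def _positions(self, item):
        # Derive num_hashes bit positions from salted MD5 digests.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits & (1 << p) for p in self._positions(item))

    def union(self, other):
        """OR together filters built in parallel, per Jake's suggestion."""
        self.bits |= other.bits

def shingles(tokens, n=2):
    """Word n-gram shingles, like Lucene's shingle filter would emit."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# 1. Load the top-LLR collocations (e.g. from CollocDriver output) into the filter.
good = ToyBloomFilter()
for phrase in ["machine learning", "bloom filter"]:
    good.add(phrase)

# 2. At indexing time, shingle each document and index only shingles that
#    pass the membership test.
doc = "a bloom filter is handy for machine learning pipelines".split()
kept = [s for s in shingles(doc) if s in good]
# "bloom filter" and "machine learning" survive; the other shingles drop out
# (modulo the small false-positive rate inherent to Bloom filters).
```

In the real pipeline the membership test would live inside a custom Lucene Analyzer chained after the shingle filter, as Jake and Drew describe.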