Posted to user@mahout.apache.org by zaki rahaman <za...@gmail.com> on 2010/01/05 18:02:00 UTC

Collocations in Mahout?

Pardon my ignorance as this is probably best handled by an NLP package like
GATE or LingPipe, but does Mahout provide anything for collocations? Or does
anyone know of a MapReducible way to calculate something like t-values for
tokens in N-grams? I've got quite a large collection that I have to prune,
filter, and preprocess, but I still expect it to be a significant size.

-- 
Zaki Rahaman

Re: Collocations in Mahout?

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 12:18 PM, Ted Dunning <te...@gmail.com> wrote:

> No.  We really don't.
>
> The most straightforward implementation does a separate pass for computing
> the overall total, for counting the unigrams and then counting the bigrams.
> It is cooler, of course, to count all sizes of ngrams in one pass and
> output
> them to separate files.  Then a second pass can do a map-side join if the
> unigram table is small enough (it usually is) and compute the results.  All
> of this is very straightforward programming and is a great introduction to
> map-reduce programming.
>

Oh you and your map-side join!  That's no fun - to pass my interview you need
to do it in the case where the unigrams *don't* fit in memory (because let's
say you wanted to compute a variation of LLR for trigrams which used the
sub-bigram counts in the computation).  It's much more fun to do an n-way
self-join with composite keys and secondary sorts! :)

  -jake

Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
I should point out that with only hundreds to review, you can eliminate
laugh-inducing phrases by hand.

If you have hundreds of thousands, it is a different problem.

On Fri, Jan 8, 2010 at 4:44 AM, Shashikant Kore <sh...@gmail.com>wrote:

> ...
> With a corpus of a million documents, if I calculate the LLR score of terms
> in a set of, say, 50,000 documents, I get hundreds of terms with a score of
> more than 50, many of which are not "useful."
>
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Very good practice.

On Fri, Jan 8, 2010 at 11:55 AM, Jake Mannix <ja...@gmail.com> wrote:

>  so doesn't really have a good
> scale-independent measure, only relative - which is why I've always
> just said "gimme the top 0.1% to 1% (ordered by LLR) ngrams"
> out of my set.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Jan 8, 2010 at 4:44 AM, Shashikant Kore <sh...@gmail.com>wrote:

> On Fri, Jan 8, 2010 at 10:36 AM, Robin Anil <ro...@gmail.com> wrote:
> >
> > One interesting thing I found was that any ngram with LLR < 1 is
> > practically junk, and anything with LLR > 50 is pretty awesome. Between
> > 1 and 50, it's always debatable. This holds approximately true for both
> > large and small datasets.
> >
>
> I don't think the absolute value of the LLR score is an indicator of the
> importance of a term across all datasets.
>
> With a corpus of a million documents, if I calculate the LLR score of terms
> in a set of, say, 50,000 documents, I get hundreds of terms with a score of
> more than 50, many of which are not "useful."
>

In my case, when doing LLR on bigrams on the corpus of all 50M+ LinkedIn
profiles, if you order by LLR descending, they start out *huge* (10^5 or so,
for specialized bigrams like "myocardial infarction" which is about as
non-independent as it gets), and go down gradually from there.

Since the form of the math for LLR for bigrams is a sum of
count * log(probability), the overall size of the corpus is partly a
multiplicative factor in the score, and so doesn't really have a good
scale-independent measure, only relative - which is why I've always
just said "gimme the top 0.1% to 1% (ordered by LLR) ngrams"
out of my set.
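
To make that concrete, here is a minimal sketch (mine, not an existing Mahout
job) against the o.a.m.math.stats.LogLikelihood class Grant mentioned,
assuming its static logLikelihoodRatio(k11, k12, k21, k22) entry point.  It
just shows that scaling every cell of the 2x2 table by 10 scales the score by
10 as well, which is why only relative rankings carry across corpus sizes:

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrScaleSketch {
  public static void main(String[] args) {
    // 2x2 contingency table for a bigram "A B":
    // k11 = count(A B), k12 = count(A, not B),
    // k21 = count(not A, B), k22 = count(not A, not B)
    int k11 = 100, k12 = 900, k21 = 4000, k22 = 95000;
    double base = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    // Same proportions, ten times the corpus: the score also grows tenfold.
    double scaled = LogLikelihood.logLikelihoodRatio(10 * k11, 10 * k12,
                                                     10 * k21, 10 * k22);
    System.out.println("LLR at 1x corpus:  " + base);
    System.out.println("LLR at 10x corpus: " + scaled);
  }
}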

  -jake

Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Shashikant is correct that LLR becomes more and more sensitive with larger
corpora.

Whether this is good or bad depends on what the use is.  If you are using
these as ML features, hundreds of extra phrases is probably neutral to
slightly helpful.

If these phrases are intended to be user visible, some additional filtering
is likely to be required for very large corpus applications.  This can be
linguistic (sentence and phrase boundary limits) or can be based on
statistical filtering in the particular situation (such as
over-representation in a cluster).

I have used LLR for feature detection and description for fairly large
corpora in the past, but typically had a test for over-representation in
place which prevented dubious phrases from being presented to the user.

On Fri, Jan 8, 2010 at 4:44 AM, Shashikant Kore <sh...@gmail.com>wrote:

> I don't think the absolute value of the LLR score is an indicator of the
> importance of a term across all datasets.
>
> With a corpus of a million documents, if I calculate the LLR score of terms
> in a set of, say, 50,000 documents, I get hundreds of terms with a score of
> more than 50, many of which are not "useful."
>
> Ted, can you please comment on Robin's observation?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Shashikant Kore <sh...@gmail.com>.
On Fri, Jan 8, 2010 at 10:36 AM, Robin Anil <ro...@gmail.com> wrote:
>
> One interesting thing I found was that any ngram with LLR < 1 is practically
> junk, and anything with LLR > 50 is pretty awesome. Between 1 and 50, it's
> always debatable. This holds approximately true for both large and small
> datasets.
>

I don't think the absolute value of the LLR score is an indicator of the
importance of a term across all datasets.

With a corpus of a million documents, if I calculate the LLR score of terms
in a set of, say, 50,000 documents, I get hundreds of terms with a score of
more than 50, many of which are not "useful."

Ted, can you please comment on Robin's observation?

--shashi

Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
The idea comes from Markov language models which model the probability of a
sequence of words as the product of conditional probabilities for the next
word based on the context of the previous n-1 words.  The real virtue of
this kind of model is that it provides a very simple and tractable way to
combine information from overlapping n-grams.

Significant n-grams are then computed by comparing predictions from an order
n-1 model (which involves n-gram and (n-1)-gram counts for estimation) and
an order n-2 model (which involves (n-1)-gram and (n-2)-gram counts).  The
overall comparison can be broken down into smaller comparisons which look at
whether particular words can be predicted significantly better with more
leading context.  These smaller comparisons can be done using familiar 2x2
contingency table analysis via the LLR statistic already in wide use for
finding interesting bigrams.

Even though the mathematical form of such a model looks particularly limited
because it involves only left-context, there is considerably more generality
present than there appears to be.

In many applications, finding interesting n-grams is not necessary since
whatever learning algorithm you are using may be able to sort out which
features (terms, bigrams or n-grams) are useful and which are not.  This
technique works well with words in text (10^5 - 10^6 features) and
reasonably well with bigrams (10^10 - 10^12 possible features), but as you
move into trigrams and beyond, the curse of dimensionality can become
serious.  Some algorithms, such as the specialist algorithms used by Vowpal
Wabbit or confidence-weighted learning, are more dependent on the number of
non-zero features d than on the ultimate potential number of features D.  This
average number of non-trivial features is multiplied, however, by the order
of n-grams in use, which can lead to serious dimensionality problems.

Moreover, limiting the number of longer phrases used is a useful way of
doing semi-supervised learning.  A very large unmarked corpus can be used to
generate interesting phrasal features which are then used for supervised
learning on a much smaller marked corpus.  Because the number of interesting
phrases is strictly limited (possibly to a smaller number than the number of
primitive terms), this average number of non-zero features can be not much
larger than the number of raw terms and the power of the supervised learning
is enhanced by the phrasal features with limited deleterious effect due to
dimensionality.

Does that make the reasoning more clear?

On Fri, Jan 8, 2010 at 9:15 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com
> wrote:

> I think I missed this.  Could you please explain the n-1 gram thinking and
> why that is better than thinking about n-grams as n-grams?




-- 
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
----- Original Message ----
> From: Drew Farris <dr...@gmail.com>
> 
> On Fri, Jan 8, 2010 at 12:06 AM, Robin Anil wrote:
> 
> > I like the Formulation that Drew made, using n-1 grams to generate n-grams.
> 
> I think Ted first mentioned n-1 grams, and I ran with it. It is very
> useful to think about the problem this way.


I think I missed this.  Could you please explain the n-1 gram thinking and why that is better than thinking about n-grams as n-grams?

Thanks,
Otis

> One question about the concept of n-1 grams, however. When n is 3 for
> example, are we really interested in the collocation of bigrams, or
> are we interested in non-overlapping tokens? For example, given the
> tri-gram 'click and clack', should we be looking at 'click and' and
> 'and clack', or should we be analyzing 'click' and 'and clack', or
> 'click and' and 'clack'? I suspect it is the first form because that
> extends easily to values larger than 3, but it's worth confirming.


Re: Collocations in Mahout?

Posted by Drew Farris <dr...@gmail.com>.
Thanks for the detailed explanation, Ted. In light of the first case, I will
provide a parameter that controls ngram size and calculate the LLR values
based on the occurrences of the leading (n-1)gram and the following token.

The second case is pretty interesting too. It would be nice to have
something like this in Mahout as well. Perhaps it would be useful for
auto-evaluating clustering output, for example. It sounds like it would be
better achieved in a separate m/r impl.

On Jan 9, 2010 3:46 AM, "Ted Dunning" <te...@gmail.com> wrote:

There are a couple of ways to handle this.

One is to view the text as a limited horizon Markov process and look for
exceptions.  Thus, we might build a bigram language model and look for cases
where trigrams would do better.  That implies we would be looking for cases
where "clack" occurs after "click and" anomalously more than would be
expected from the number of times "clack" appears after "and".  This comes
down to comparing the counts of "clack" and all other words in the context
of "click and" versus "anything-but-click and".  Since "clack" is probably a
small fraction of the words that appear in the second context, but exhibits
an overwhelming over abundance in the context of "click and", we would
conclude that "click and clack" is an important trigram.  The contingency
table is

                        clack    -clack
            click, and    k11      k12
            -click, and   k21      k22

Theoretically speaking, this test is part of a likelihood ratio test that
compares a Markov model against a restricted form of the same Markov model
and is an extension of the simpler test for interesting binomials.

A second approach is to consider all overlapping n-grams that are in or out
of some context like a known category, or a cluster or a data source.  Then
we can do a normal LLR test to find items that are over-represented in some
category, cluster or whatever.   The size of these things doesn't actually
matter all that much.   This technique can be quick because you handle all
lengths of n-grams at the same time as opposed to building things up bit by
bit.   It is limited by the availability of categories that form reasonable
comparison sets.

On Fri, Jan 8, 2010 at 5:13 PM, Drew Farris <dr...@gmail.com> wrote: >
On Fri, Jan 8, 2010 a...
--
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
There are a couple of ways to handle this.

One is to view the text as a limited horizon Markov process and look for
exceptions.  Thus, we might build a bigram language model and look for cases
where trigrams would do better.  That implies we would be looking for cases
where "clack" occurs after "click and" anomalously more than would be
expected from the number of times "clack" appears after "and".  This comes
down to comparing the counts of "clack" and all other words in the context
of "click and" versus "anything-but-click and".  Since "clack" is probably a
small fraction of the words that appear in the second context, but exhibits
an overwhelming over abundance in the context of "click and", we would
conclude that "click and clack" is an important trigram.  The contingency
table is

                         clack    -clack
             click, and    k11      k12
             -click, and   k21      k22

Theoretically speaking, this test is part of a likelihood ratio test that
compares a Markov model against a restricted form of the same Markov model
and is an extension of the simpler test for interesting binomials.
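
For concreteness, here is a small sketch (my own, written from the standard
G-squared form of the test rather than quoted from Mahout's code) that scores
the k11..k22 cells of a table like the one above:

public class Gsquared {
  // LLR (G^2) for a 2x2 contingency table: 2 * sum over cells of
  // k * ln(k / expected), where expected = rowTotal * colTotal / N.
  static double llr(double k11, double k12, double k21, double k22) {
    double n = k11 + k12 + k21 + k22;
    return 2.0 * (term(k11, (k11 + k12) * (k11 + k21) / n)
                + term(k12, (k11 + k12) * (k12 + k22) / n)
                + term(k21, (k21 + k22) * (k11 + k21) / n)
                + term(k22, (k21 + k22) * (k12 + k22) / n));
  }

  // A zero cell contributes nothing (the limit of k*ln(k/e) as k -> 0).
  static double term(double k, double expected) {
    return k == 0.0 ? 0.0 : k * Math.log(k / expected);
  }

  public static void main(String[] args) {
    // made-up counts: "clack" vs. everything else, after "click and"
    // vs. after any other "... and"
    System.out.println(llr(40, 10, 60, 500000));
  }
}

With counts like those, "clack" is wildly over-represented after "click and"
and the score comes out very large, which is the signal that "click and clack"
is an interesting trigram.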

A second approach is to consider all overlapping n-grams that are in or out
of some context like a known category, or a cluster or a data source.  Then
we can do a normal LLR test to find items that are over-represented in some
category, cluster or whatever.   The size of these things doesn't actually
matter all that much.   This technique can be quick because you handle all
lengths of n-grams at the same time as opposed to building things up bit by
bit.   It is limited by the availability of categories that form reasonable
comparison sets.

On Fri, Jan 8, 2010 at 5:13 PM, Drew Farris <dr...@gmail.com> wrote:

> On Fri, Jan 8, 2010 at 12:06 AM, Robin Anil <ro...@gmail.com> wrote:
>
> > I like the Formulation that Drew made, using n-1 grams to generate
> n-grams.
>
> I think Ted first mentioned n-1 grams, and I ran with it. It is very
> useful to think about the problem this way.
>
> One question about the concept of n-1 grams, however. When n is 3 for
> example, are we really interested in the collocation of bigrams, or
> are we interested in non-overlapping tokens? For example, given the
> tri-gram 'click and clack', should we be looking at 'click and' and
> 'and clack', or should we be analyzing 'click' and 'and clack', or
> 'click and' and 'clack'? I suspect it is the first form because that
> extends easily to values larger than 3, but it's worth confirming.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Drew Farris <dr...@gmail.com>.
On Fri, Jan 8, 2010 at 12:06 AM, Robin Anil <ro...@gmail.com> wrote:

> I like the Formulation that Drew made, using n-1 grams to generate n-grams.

I think Ted first mentioned n-1 grams, and I ran with it. It is very
useful to think about the problem this way.

One question about the concept of n-1 grams, however. When n is 3 for
example, are we really interested in the collocation of bigrams, or
are we interested in non-overlapping tokens? For example, given the
tri-gram 'click and clack', should we be looking at 'click and' and
'and clack', or should we be analyzing 'click' and 'and clack', or
'click and' and 'clack'? I suspect it is the first form because that
extends easily to values larger than 3, but it's worth confirming.

Re: Collocations in Mahout?

Posted by Robin Anil <ro...@gmail.com>.
On Fri, Jan 8, 2010 at 7:03 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jan 7, 2010, at 7:57 PM, Ted Dunning wrote:
>
> > The pieces are laying around.
> >
> > I had a framework like this for recs and text analysis at Veoh, Jake has
> > something in LinkedIn.
> >
> > But the amount of code is relatively small and probably could be
> rewritten
> > before Jake can get clearance to release anything.
> >
> > The first step is to just count n-grams.  I think that the input should
> be
> > relatively flexible and if you assume parametrized use of Lucene
> analyzers,
> > then all that is necessary is a small step up from word counting.
>
> The classification stuff has this already, in MR form, independent of
> Lucene.
>
> > This
> > should count all n-grams from 0 up to a limit.  It should also allow
> > suppression of output of any counts less than a threshold.  Total number
> of
> > n-grams of each size observed should be accumulated.
>
> I believe it does this, too.  Robin?
>
Yeah, brute-force ngram generation is done by the Bayes classifier. Beware:
it's practically a combinatorial explosion of data, but enough machines can
tame it well.

Take a look at the DictionaryVectorizer. If an LLR job could be added in a
chain, I could use that information while creating vectors.
https://issues.apache.org/jira/browse/MAHOUT-237

I like the formulation that Drew made, using n-1 grams to generate n-grams.
It was the same approach I used to generate n-grams here:
http://thinking.me/ (Himanshu and I built it when I was still in college).
But that was just a PHP script which iterates over a sample of Twitter data
:).
One interesting thing I found was that any ngram with LLR < 1 is practically
junk, and anything with LLR > 50 is pretty awesome. Between 1 and 50, it's
always debatable. This holds approximately true for both large and small
datasets.

I will be really happy if Drew can work on the LLR-based bigram generation
code and help me attach it to the rest of the DictionaryVectorizer.

Also, I would prefer if the entire Mahout code agreed upon a single format
for document input.  I would suggest we stick to SequenceFiles with the key
as the docid and the value as the document content. That way, we leave the
creation of the sequence files to the user.
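
As a rough sketch of that input convention (my example, with a made-up path
and docid scheme, not an agreed-upon Mahout utility), writing such a file is
just:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocsToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("docs.seq"); // hypothetical output path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // key = docid, value = raw document content
      writer.append(new Text("doc-1"), new Text("it was the best of times"));
      writer.append(new Text("doc-2"), new Text("it was the worst of times"));
    } finally {
      writer.close();
    }
  }
}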


Robin

Re: Collocations in Mahout?

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 7, 2010, at 7:57 PM, Ted Dunning wrote:

> The pieces are laying around.
> 
> I had a framework like this for recs and text analysis at Veoh, Jake has
> something in LinkedIn.
> 
> But the amount of code is relatively small and probably could be rewritten
> before Jake can get clearance to release anything.
> 
> The first step is to just count n-grams.  I think that the input should be
> relatively flexible and if you assume parametrized use of Lucene analyzers,
> then all that is necessary is a small step up from word counting.  

The classification stuff has this already, in MR form, independent of Lucene.

> This
> should count all n-grams from 0 up to a limit.  It should also allow
> suppression of output of any counts less than a threshold.  Total number of
> n-grams of each size observed should be accumulated.  

I believe it does this, too.  Robin?

> There should also be
> some provision for counting cooccurrence pairs within windows or between two
> fields.
> 
> The second step is to detect interesting n-grams.  This is done using the
> counts of words and (n-1)-grams and the relevant totals as input for the LLR
> code.
> 
> The final (optional) step is creation of a Bloom filter table.  Options
> should control size of the table and number of probes.
> 
> Building up all these pieces and connecting them is a truly worthy task.
> 
> On Thu, Jan 7, 2010 at 3:44 PM, zaki rahaman <za...@gmail.com> wrote:
> 
>> @Ted, where is the partial framework you're referring to. And yes this is
>> definitely something I would like to work on if pointed in the right
>> direction. I wasn't quite sure though just b/c I remember a long-winded
>> discussion/debate a while back on the listserv about what Mahout's purpose
>> should be. N-gram LLR for collocations seems like a very NLP type of thing
>> to have (obviously it could also be used in other applications as well but
>> by itself its NLP to me) and from my understanding the "consensus" is that
>> Mahout should focus on scalable machine learning.
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve


Re: Collocations in Mahout?

Posted by Drew Farris <dr...@gmail.com>.
Just created an issue on JIRA and attached a first attempt at an LLR-based
collocation identifier:

See: https://issues.apache.org/jira/browse/MAHOUT-242

On Fri, Jan 8, 2010 at 8:14 PM, Drew Farris <dr...@gmail.com> wrote:

> On Fri, Jan 8, 2010 at 5:55 PM, zaki rahaman <za...@gmail.com>
> wrote:
> > Hey Drew,
> >
> > Let me know when you post a JIRA/any help you might want?
> >
>
> Hopefully I'll have something up for everyone to look at sometime this
> weekend.
>

Re: Collocations in Mahout?

Posted by Drew Farris <dr...@gmail.com>.
On Fri, Jan 8, 2010 at 5:55 PM, zaki rahaman <za...@gmail.com> wrote:
> Hey Drew,
>
> Let me know when you post a JIRA/any help you might want?
>

Hopefully I'll have something up for everyone to look at sometime this weekend.

Re: Collocations in Mahout?

Posted by zaki rahaman <za...@gmail.com>.
Hey Drew,

Let me know when you post a JIRA/any help you might want?

On Fri, Jan 8, 2010 at 5:03 PM, Drew Farris <dr...@gmail.com> wrote:

> Jake, thanks for the review, running narrative and comments. The
> Analyzer in use should be up to the user, so there will be flexibility
> to mess around with lots of alternative there, but it will be nice to
> provide reasonable defaults and include this sort of discussion in the
> wiki page for the algo. I'll finish up the rest of the code for it and
> post a patch to JIRA.
>
> Robin, I'll take a look at the dictionaryVectorizer, and see how they
> can work together. I think something like SequenceFiles<documentId,
> Text or BytesWritable> make sense as input for this job and it's
> probably easier to work with than what I had to whip up to slurp in
> files whole.
>
> Does anyone know if there is a stream based alternative to Text or
> BytesWritable?
>
> On Thu, Jan 7, 2010 at 11:46 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > Ok, I lied - I think what you described here is way *faster* than what I
> > was doing, because I wasn't starting with the original corpus, I had
> > something like google's ngram terabyte data (a massive HDFS file with
> > just "ngram ngram-frequency" on each line), which mean I had to do
> > a multi-way join (which is where I needed to do a secondary sort by
> > value).
> >
> > Starting with the corpus itself (the case we're talking about) you have
> > some nice tricks in here:
> >
> > On Thu, Jan 7, 2010 at 6:46 PM, Drew Farris <dr...@gmail.com>
> wrote:
> >>
> >>
> >> The output of that map task is something like:
> >>
> >> k:(n-1)gram v:ngram
> >>
> >
> > This is great right here - it helps you kill two birds with one stone:
> the
> > join
> > and the wordcount phases.
> >
> >
> >> k:ngram,ngram-frequency v:(n-1)gram,(n-1) gram freq
> >>
> >> e.g:
> >> k:the best,1 v:best,2
> >> k:best of,1 v:best,2
> >> k:best of,1 v:of,2
> >> k:of times,1 v:of,2
> >> k:the best,1 v:the,1
> >> k:of times,1 v:times,1
> >>
> >
> > Yeah, once you're here, you're home free.  This should be really a rather
> > quick set of jobs, even on really big data, and even dealing with it as
> > text.
> >
> >
> >> I'm also wondering about the best way to handle input. Line by line
> >> processing would miss ngrams spanning lines, but full document
> >> processing with the StandardAnalyzer+ShingleFilter will form ngrams
> >> across sentence boundaries.
> >>
> >
> > These effects are just minor issues: you lose a little bit of signal on
> > line endings, and you pick up some noise catching ngrams across
> > sentence boundaries, but it's fractional compared to your whole set.
> > Don't try to be too fancy and cram tons of lines together.  If your
> > data comes in different chunks than just one huge HDFS text file, you
> > could certainly chunk it into bigger chunks (10, 100, 1000 lines, maybe)
> > to reduce the newline error if necessary, but it's probably not needed.
> > The sentence boundary part gets washed out in the LLR step anyways
> > (because they'll almost always turn out to have a low LLR score).
> >
> > What I've found I've had to do sometimes, is something with stop words.
> > If you don't use stop words at all, you end up getting a lot of
> relatively
> > high LLR scoring ngrams like "up into", "he would", and in general
> pairings
> > of a relatively rare unigram with a pronoun or preposition.  Maybe there
> are
> > other ways of avoiding that, but I've found that you do need to take some
> > care with the stop words (but removing them altogether leads to some
> > weird looking ngrams if you want to display them somewhere).
> >
> >
> >> I'm interested in whether there's a more efficient way to structure
> >> the M/R passes. It feels a little funny to no-op a whole map cycle. It
> >> would almost be better if one could chain two reduces together.
> >>
> >
> > Beware premature optimization - try this on a nice big monster set on
> > a real cluster, and see how long it takes.  I have a feeling you'll be
> > pleasantly surprised.  But even before that - show us a patch, maybe
> > someone will have easy low-hanging fruit optimization tricks.
> >
> >  -jake
> >
>



-- 
Zaki Rahaman

Re: Collocations in Mahout?

Posted by Drew Farris <dr...@gmail.com>.
Jake, thanks for the review, running narrative and comments. The
Analyzer in use should be up to the user, so there will be flexibility
to mess around with lots of alternatives there, but it will be nice to
provide reasonable defaults and include this sort of discussion in the
wiki page for the algo. I'll finish up the rest of the code for it and
post a patch to JIRA.

Robin, I'll take a look at the dictionaryVectorizer, and see how they
can work together. I think something like SequenceFiles<documentId,
Text or BytesWritable> makes sense as input for this job, and it's
probably easier to work with than what I had to whip up to slurp in
files whole.

Does anyone know if there is a stream based alternative to Text or
BytesWritable?

On Thu, Jan 7, 2010 at 11:46 PM, Jake Mannix <ja...@gmail.com> wrote:
> Ok, I lied - I think what you described here is way *faster* than what I
> was doing, because I wasn't starting with the original corpus, I had
> something like google's ngram terabyte data (a massive HDFS file with
> just "ngram ngram-frequency" on each line), which mean I had to do
> a multi-way join (which is where I needed to do a secondary sort by
> value).
>
> Starting with the corpus itself (the case we're talking about) you have
> some nice tricks in here:
>
> On Thu, Jan 7, 2010 at 6:46 PM, Drew Farris <dr...@gmail.com> wrote:
>>
>>
>> The output of that map task is something like:
>>
>> k:(n-1)gram v:ngram
>>
>
> This is great right here - it helps you kill two birds with one stone: the
> join
> and the wordcount phases.
>
>
>> k:ngram,ngram-frequency v:(n-1)gram,(n-1) gram freq
>>
>> e.g:
>> k:the best,1 v:best,2
>> k:best of,1 v:best,2
>> k:best of,1 v:of,2
>> k:of times,1 v:of,2
>> k:the best,1 v:the,1
>> k:of times,1 v:times,1
>>
>
> Yeah, once you're here, you're home free.  This should be really a rather
> quick set of jobs, even on really big data, and even dealing with it as
> text.
>
>
>> I'm also wondering about the best way to handle input. Line by line
>> processing would miss ngrams spanning lines, but full document
>> processing with the StandardAnalyzer+ShingleFilter will form ngrams
>> across sentence boundaries.
>>
>
> These effects are just minor issues: you lose a little bit of signal on
> line endings, and you pick up some noise catching ngrams across
> sentence boundaries, but it's fractional compared to your whole set.
> Don't try to be too fancy and cram tons of lines together.  If your
> data comes in different chunks than just one huge HDFS text file, you
> could certainly chunk it into bigger chunks (10, 100, 1000 lines, maybe)
> to reduce the newline error if necessary, but it's probably not needed.
> The sentence boundary part gets washed out in the LLR step anyways
> (because they'll almost always turn out to have a low LLR score).
>
> What I've found I've had to do sometimes, is something with stop words.
> If you don't use stop words at all, you end up getting a lot of relatively
> high LLR scoring ngrams like "up into", "he would", and in general pairings
> of a relatively rare unigram with a pronoun or preposition.  Maybe there are
> other ways of avoiding that, but I've found that you do need to take some
> care with the stop words (but removing them altogether leads to some
> weird looking ngrams if you want to display them somewhere).
>
>
>> I'm interested in whether there's a more efficient way to structure
>> the M/R passes. It feels a little funny to no-op a whole map cycle. It
>> would almost be better if one could chain two reduces together.
>>
>
> Beware premature optimization - try this on a nice big monster set on
> a real cluster, and see how long it takes.  I have a feeling you'll be
> pleasantly surprised.  But even before that - show us a patch, maybe
> someone will have easy low-hanging fruit optimization tricks.
>
>  -jake
>

Re: Collocations in Mahout?

Posted by Jake Mannix <ja...@gmail.com>.
Ok, I lied - I think what you described here is way *faster* than what I
was doing, because I wasn't starting with the original corpus; I had
something like Google's ngram terabyte data (a massive HDFS file with
just "ngram ngram-frequency" on each line), which meant I had to do
a multi-way join (which is where I needed to do a secondary sort by
value).

Starting with the corpus itself (the case we're talking about) you have
some nice tricks in here:

On Thu, Jan 7, 2010 at 6:46 PM, Drew Farris <dr...@gmail.com> wrote:
>
>
> The output of that map task is something like:
>
> k:(n-1)gram v:ngram
>

This is great right here - it helps you kill two birds with one stone: the
join
and the wordcount phases.


> k:ngram,ngram-frequency v:(n-1)gram,(n-1) gram freq
>
> e.g:
> k:the best,1 v:best,2
> k:best of,1 v:best,2
> k:best of,1 v:of,2
> k:of times,1 v:of,2
> k:the best,1 v:the,1
> k:of times,1 v:times,1
>

Yeah, once you're here, you're home free.  This should be really a rather
quick set of jobs, even on really big data, and even dealing with it as
text.


> I'm also wondering about the best way to handle input. Line by line
> processing would miss ngrams spanning lines, but full document
> processing with the StandardAnalyzer+ShingleFilter will form ngrams
> across sentence boundaries.
>

These effects are just minor issues: you lose a little bit of signal on
line endings, and you pick up some noise catching ngrams across
sentence boundaries, but it's fractional compared to your whole set.
Don't try to be too fancy and cram tons of lines together.  If your
data comes in different chunks than just one huge HDFS text file, you
could certainly chunk it into bigger chunks (10, 100, 1000 lines, maybe)
to reduce the newline error if necessary, but it's probably not needed.
The sentence boundary part gets washed out in the LLR step anyways
(because they'll almost always turn out to have a low LLR score).

What I've found I've had to do sometimes is something with stop words.
If you don't use stop words at all, you end up getting a lot of relatively
high LLR scoring ngrams like "up into", "he would", and in general pairings
of a relatively rare unigram with a pronoun or preposition.  Maybe there are
other ways of avoiding that, but I've found that you do need to take some
care with the stop words (but removing them altogether leads to some
weird looking ngrams if you want to display them somewhere).


> I'm interested in whether there's a more efficient way to structure
> the M/R passes. It feels a little funny to no-op a whole map cycle. It
> would almost be better if one could chain two reduces together.
>

Beware premature optimization - try this on a nice big monster set on
a real cluster, and see how long it takes.  I have a feeling you'll be
pleasantly surprised.  But even before that - show us a patch, maybe
someone will have easy low-hanging fruit optimization tricks.

  -jake

Re: Collocations in Mahout?

Posted by Jake Mannix <ja...@gmail.com>.
Sounds like you've got almost exactly the MR passes I do when I do this,
although I'm not sure because I'm reading this on my phone... I'll look at
it closer when I get home.

I do remember I had at least one (maybe two) identity mappers in my
sequence.  You don't need any custom comparator to do a secondary sort,
which I did... not sure where you are getting around not needing that...

  -jake

On Jan 7, 2010 6:47 PM, "Drew Farris" <dr...@gmail.com> wrote:

This conversation has been pretty inspiring, thanks everyone.

I spent some time thinking about the steps involved in the M/R LLR
job. I'm a bit of a greenhorn when it comes to this, so it would be
great to see what you all think. Here's what I've been able to piece
together so far:

To perform the LLR calculation, we need 4 values for the combinations
of (n-1)grams in ngrams in the input data: A+B, A+!B, !A+B, !A+!B. For
the ngram 'the best', with A=the, B=best, this would mean:

A+B = the number of times the ngram 'the best' appears
A+!B = the number of times 'the' appears in an ngram without 'best'
!A+B = the number of times 'best' appears in an ngram without 'the'
!A+!B = the number of ngrams that contain neither A nor B (are not 'the
best')

It is also necessary to have N,
N = the total number of ngrams

(Ted's blog post referenced in the LLR class really helped me
understand this concretely, thanks!)

Input into the job is done so that the first mapper gets a single
entire text document for each map call. This then gets run through the
Analyzer + Lucene ShingleFilter combo.

The output of that map task is something like:

k:(n-1)gram v:ngram

For the input: 'the best of times'
The mapper output is:

k:the v:the best
k:best v:the best
k:best v:best of
k:of    v:best of
k:of    v:of times
k:times v:of times

As an aside, the shingles were 'the best', 'best of', 'of times' --
we preserve the total number of ngrams/shingles (N). In this case, it
would be 3.

In the reducer, we count the number of ngrams each (n-1)gram appears
in and the number of times each ngram appears. (I wind up counting
each ngram 'n' times, however - There's probably a way around that).
Once we have this info, we can output the following from the reducer:

k:ngram,ngram-frequency v:(n-1)gram,(n-1) gram freq

e.g:
k:the best,1 v:best,2
k:best of,1 v:best,2
k:best of,1 v:of,2
k:of times,1 v:of,2
k:the best,1 v:the,1
k:of times,1 v:times,1

The next mapper could just be a no-op, because we have the data in the
right shape to do the LLR in the next reduction pass:

k:the best,1 v:best,2; v:the,1
k:best of,1 v:best,2; v:of,2
k:of times,1 v:of,2; v:times,1

(n-1)grams sorted

nf = The ngram frequency
ln1f = left (n-1)gram frequency
rn1f = right (n-1)gram frequency
N = Total number of ngrams

A+B = nf
A+!B = ln1f - nf
!A+B = rn1f - nf
!A+!B = N - (ln1f + rn1f - nf)

With these we calculate LLR using the class on o.a.m.stats and the
reducer output can be

k:LLR v:ngram

Does this work, or did I miss something critical?  I've only thought
this through for n=2, but I suspect it extends to other cases. I'm
curious as to whether it has a problem when both 'best of' and 'of
best' occur in the text, but I haven't thought that through yet.

I have the first map/reduce pass implemented using the hadoop 0.19
api, but the code is brutally inefficient in a couple spots,
especially because it keeps the *grams as strings. There should likely
be a pass to convert them to integers or something to be more compact,
right?

I'm also wondering about the best way to handle input. Line by line
processing would miss ngrams spanning lines, but full document
processing with the StandardAnalyzer+ShingleFilter will form ngrams
across sentence boundaries.

I followed the WholeFileInputFormat example from the Hadoop in Action
book to slurp in data, but I don't like the idea of reading the entire
file into a buffer and then wrapping that up into something that can
be accessed via the Reader consumed by Lucene's Analyzer. I would much
rather pass a Reader/Stream into the mapper if possible.

I'm interested in whether there's a more efficient way to structure
the M/R passes. It feels a little funny to no-op a whole map cycle. It
would almost be better if one could chain two reduces together.

On Thu, Jan 7, 2010 at 7:57 PM, Ted Dunning <te...@gmail.com> wrote: >
The pieces are laying ...

Re: Collocations in Mahout?

Posted by Drew Farris <dr...@gmail.com>.
This conversation has been pretty inspiring, thanks everyone.

I spent some time thinking about the steps involved in the M/R LLR
job. I'm a bit of a greenhorn when it comes to this, so it would be
great to see what you all think. Here's what I've been able to piece
together so far:

To perform the LLR calculation, we need 4 values for the combinations
of (n-1)grams in ngrams in the input data: A+B, A+!B, !A+B, !A+!B. For
the ngram 'the best', with A=the, B=best, this would mean:

A+B = the number of times the ngram 'the best' appears
A+!B = the number of times 'the' appears in an ngram without 'best'
!A+B = the number of times 'best' appears in an ngram without 'the'
!A+!B = the number of ngrams that contain neither A nor B (are not 'the best')

It is also necessary to have N,
N = the total number of ngrams

(Ted's blog post referenced in the LLR class really helped me
understand this concretely, thanks!)

Input into the job is done so that the first mapper gets a single
entire text document for each map call. This then gets run through the
Analyzer + Lucene ShingleFilter combo.

The output of that map task is something like:

k:(n-1)gram v:ngram

For the input: 'the best of times'
The mapper output is:

k:the v:the best
k:best v:the best
k:best v:best of
k:of    v:best of
k:of    v:of times
k:times v:of times

As an aside, the shingles were 'the best', 'best of', 'of times' --
we preserve the total number of ngrams/shingles (N). In this case, it
would be 3.
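
As a rough sketch of that mapper body (my own illustration against the Lucene
2.9-era TokenStream API; treat the exact attribute calls as assumptions on my
part), the shingle pass could look like:

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShingleEmitter {
  // Tokenizes one whole document, shingles it into bigrams, and emits
  // ((n-1)gram, ngram) pairs for both halves of each bigram, as above.
  public static void emit(String doc, OutputCollector<Text, Text> output)
      throws IOException {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
    TokenStream ts =
        new ShingleFilter(analyzer.tokenStream("text", new StringReader(doc)), 2);
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      String gram = term.term();
      int space = gram.indexOf(' ');
      if (space < 0) {
        continue; // ShingleFilter also emits unigrams; skip them here
      }
      Text ngram = new Text(gram);
      output.collect(new Text(gram.substring(0, space)), ngram);   // leading token
      output.collect(new Text(gram.substring(space + 1)), ngram);  // trailing token
    }
  }
}

Counting the bigram shingles as they stream by (or bumping a Hadoop counter)
would also give the total N needed later.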

In the reducer, we count the number of ngrams each (n-1)gram appears
in and the number of times each ngram appears. (I wind up counting
each ngram 'n' times, however - There's probably a way around that).
Once we have this info, we can output the following from the reducer:

k:ngram,ngram-frequency v:(n-1)gram,(n-1) gram freq

e.g:
k:the best,1 v:best,2
k:best of,1 v:best,2
k:best of,1 v:of,2
k:of times,1 v:of,2
k:the best,1 v:the,1
k:of times,1 v:times,1

The next mapper could just be a no-op, because we have the data in the
right shape to do the LLR in the next reduction pass:

k:the best,1 v:best,2; v:the,1
k:best of,1 v:best,2; v:of,2
k:of times,1 v:of,2; v:times,1

(n-1)grams sorted

nf = The ngram frequency
ln1f = left (n-1)gram frequency
rn1f = right (n-1)gram frequency
N = Total number of ngrams

A+B = nf
A+!B = ln1f - nf
!A+B = rn1f - nf
!A+!B = N - (ln1f + rn1f - nf)

With these we calculate LLR using the class on o.a.m.stats and the
reducer output can be

k:LLR v:ngram
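
In code, that last step is essentially just a translation into the four
cells (a sketch on my part, assuming the static
logLikelihoodRatio(k11, k12, k21, k22) method on the class Grant checked in):

import org.apache.mahout.math.stats.LogLikelihood;

// nf = ngram frequency, ln1f / rn1f = left / right (n-1)gram frequencies,
// n = total number of ngrams, exactly as defined above.
public final class NgramLlr {
  public static double score(int nf, int ln1f, int rn1f, int n) {
    int k11 = nf;                      // A+B
    int k12 = ln1f - nf;               // A+!B
    int k21 = rn1f - nf;               // !A+B
    int k22 = n - (ln1f + rn1f - nf);  // !A+!B
    return LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
  }
}

The reducer would then emit (LLR, ngram) pairs, or filter on a threshold
before writing.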

Does this work, or did I miss something critical?  I've only thought
this through for n=2, but I suspect it extends to other cases. I'm
curious as to whether it has a problem when both 'best of' and 'of
best' occur in the text, but I haven't thought that through yet.

I have the first map/reduce pass implemented using the Hadoop 0.19
API, but the code is brutally inefficient in a couple of spots,
especially because it keeps the *grams as strings. There should likely
be a pass to convert them to integers or something to be more compact,
right?

I'm also wondering about the best way to handle input. Line by line
processing would miss ngrams spanning lines, but full document
processing with the StandardAnalyzer+ShingleFilter will form ngrams
across sentence boundaries.

I followed the WholeFileInputFormat example from the Hadoop in Action
book to slurp in data, but I don't like the idea of reading the entire
file into a buffer and then wrapping that up into something that can
be accessed via the Reader consumed by Lucene's Analyzer. I would much
rather pass a Reader/Stream into the mapper if possible.

I'm interested in whether there's a more efficient way to structure
the M/R passes. It feels a little funny to no-op a whole map cycle. It
would almost be better if one could chain two reduces together.

On Thu, Jan 7, 2010 at 7:57 PM, Ted Dunning <te...@gmail.com> wrote:
> The pieces are laying around.
>
> I had a framework like this for recs and text analysis at Veoh, Jake has
> something in LinkedIn.
>
> But the amount of code is relatively small and probably could be rewritten
> before Jake can get clearance to release anything.
>
> The first step is to just count n-grams.  I think that the input should be
> relatively flexible and if you assume parametrized use of Lucene analyzers,
> then all that is necessary is a small step up from word counting.  This
> should count all n-grams from 0 up to a limit.  It should also allow
> suppression of output of any counts less than a threshold.  Total number of
> n-grams of each size observed should be accumulated.  There should also be
> some provision for counting cooccurrence pairs within windows or between two
> fields.
>
> The second step is to detect interesting n-grams.  This is done using the
> counts of words and (n-1)-grams and the relevant totals as input for the LLR
> code.
>
> The final (optional) step is creation of a Bloom filter table.  Options
> should control size of the table and number of probes.
>
> Building up all these pieces and connecting them is a truly worthy task.
>
> On Thu, Jan 7, 2010 at 3:44 PM, zaki rahaman <za...@gmail.com> wrote:
>
>> @Ted, where is the partial framework you're referring to. And yes this is
>> definitely something I would like to work on if pointed in the right
>> direction. I wasn't quite sure though just b/c I remember a long-winded
>> discussion/debate a while back on the listserv about what Mahout's purpose
>> should be. N-gram LLR for collocations seems like a very NLP type of thing
>> to have (obviously it could also be used in other applications as well but
>> by itself its NLP to me) and from my understanding the "consensus" is that
>> Mahout should focus on scalable machine learning.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
The pieces are laying around.

I had a framework like this for recs and text analysis at Veoh, Jake has
something in LinkedIn.

But the amount of code is relatively small and probably could be rewritten
before Jake can get clearance to release anything.

The first step is to just count n-grams.  I think that the input should be
relatively flexible and if you assume parametrized use of Lucene analyzers,
then all that is necessary is a small step up from word counting.  This
should count all n-grams from 0 up to a limit.  It should also allow
suppression of output of any counts less than a threshold.  Total number of
n-grams of each size observed should be accumulated.  There should also be
some provision for counting cooccurrence pairs within windows or between two
fields.

The second step is to detect interesting n-grams.  This is done using the
counts of words and (n-1)-grams and the relevant totals as input for the LLR
code.

The final (optional) step is creation of a Bloom filter table.  Options
should control size of the table and number of probes.

Building up all these pieces and connecting them is a truly worthy task.
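
For that optional last step, a minimal Bloom filter sketch (hand-rolled here
just to show the two knobs mentioned above -- table size and number of
probes -- not an existing Mahout class) could be:

import java.util.BitSet;

// Minimal Bloom filter over strings: 'size' controls the table size and
// 'probes' the number of hash functions.
public class NgramBloomFilter {
  private final BitSet bits;
  private final int size;
  private final int probes;

  public NgramBloomFilter(int size, int probes) {
    this.bits = new BitSet(size);
    this.size = size;
    this.probes = probes;
  }

  public void add(String ngram) {
    for (int i = 0; i < probes; i++) {
      bits.set(index(ngram, i));
    }
  }

  // True means "probably present"; false means "definitely absent".
  public boolean mightContain(String ngram) {
    for (int i = 0; i < probes; i++) {
      if (!bits.get(index(ngram, i))) {
        return false;
      }
    }
    return true;
  }

  // Double hashing: derive the i-th probe from two base string hashes.
  private int index(String ngram, int i) {
    int h1 = ngram.hashCode();
    int h2 = 0;
    for (int j = 0; j < ngram.length(); j++) {
      h2 = 127 * h2 + ngram.charAt(j);
    }
    return Math.abs((h1 + i * h2) % size);
  }
}

Loading it with the surviving interesting n-grams gives downstream jobs a
cheap membership test without shipping the whole count table around.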

On Thu, Jan 7, 2010 at 3:44 PM, zaki rahaman <za...@gmail.com> wrote:

> @Ted, where is the partial framework you're referring to. And yes this is
> definitely something I would like to work on if pointed in the right
> direction. I wasn't quite sure though just b/c I remember a long-winded
> discussion/debate a while back on the listserv about what Mahout's purpose
> should be. N-gram LLR for collocations seems like a very NLP type of thing
> to have (obviously it could also be used in other applications as well but
> by itself its NLP to me) and from my understanding the "consensus" is that
> Mahout should focus on scalable machine learning.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
It definitely belongs.

And besides, lots and lots of the data in large-scale machine learning looks
like text.  Friends on LinkedIn, history of traffic violations for
insurance, lists of users who have clicked on an ad; the list goes on
forever.

Basically "text" is an ordered sequence of symbols and you encounter that
all over the place.  Cooccurrence at the window and the document level is
very widely applicable.



On Thu, Jan 7, 2010 at 4:03 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com
> wrote:

> NLP does fall under the Mahout umbrella, I'd say.  Future subproject
> perhaps?




-- 
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Zaki,

NLP does fall under the Mahout umbrella, I'd say.  Future subproject perhaps?

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: zaki rahaman <za...@gmail.com>
> To: mahout-user@lucene.apache.org
> Sent: Thu, January 7, 2010 6:44:22 PM
> Subject: Re: Collocations in Mahout?
> 
> Ideally yea, I think it would be nice to be able to pass in a custom
> analyzer or at least be able to provide some options... I saw the
> LogLikelihood class Grant was referring to in math.stats but I don't seem to
> see any M/R LLR piece, at least not something that's nicely abstracted and
> extracted out.
> 
> @Ted, where is the partial framework you're referring to. And yes this is
> definitely something I would like to work on if pointed in the right
> direction. I wasn't quite sure though just b/c I remember a long-winded
> discussion/debate a while back on the listserv about what Mahout's purpose
> should be. N-gram LLR for collocations seems like a very NLP type of thing
> to have (obviously it could also be used in other applications as well but
> by itself its NLP to me) and from my understanding the "consensus" is that
> Mahout should focus on scalable machine learning.
> 
> On Wed, Jan 6, 2010 at 4:04 PM, Grant Ingersoll wrote:
> 
> >
> > On Jan 6, 2010, at 3:52 PM, Drew Farris wrote:
> >
> > > On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll 
> > wrote:
> > >>
> > >> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
> > >>
> > >>> No.  We really don't.
> > >>
> > >> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some based LLR
> > stuff that we use in utils.lucene.ClusterLabels.  Would be great to see this
> > stuff expanded.
> > >>
> > >
> > > So, doing something like this would involve some number of M/R passes
> > > to do the ngram generation, counting and calculate LLR using
> > > o.a.m.math.stats.LogLikelihood, but what to do about tokenization?
> > >
> > > I've seen the approach of using a list of filenames as input to the
> > > first mapper, which slurps in and tokenizes / generating ngrams for
> > > the text of each file, but is there something that works better?
> > >
> > > Would Lucene's StandardAnalyzer be sufficient for generating tokens?
> >
> > Why not be able to pass in the Analyzer?  I think the classifier stuff
> > does, assuming it takes a no-arg constructor, which many do.  It's the one
> > place, however, where I think we could benefit from something like Spring or
> > Guice.
> 
> 
> 
> 
> -- 
> Zaki Rahaman


Re: Collocations in Mahout?

Posted by zaki rahaman <za...@gmail.com>.
Ideally yea, I think it would be nice to be able to pass in a custom
analyzer or at least be able to provide some options... I saw the
LogLikelihood class Grant was referring to in math.stats but I don't seem to
see any M/R LLR piece, at least not something that's nicely abstracted and
extracted out.

@Ted, where is the partial framework you're referring to. And yes this is
definitely something I would like to work on if pointed in the right
direction. I wasn't quite sure though just b/c I remember a long-winded
discussion/debate a while back on the listserv about what Mahout's purpose
should be. N-gram LLR for collocations seems like a very NLP type of thing
to have (obviously it could also be used in other applications as well but
by itself it's NLP to me) and from my understanding the "consensus" is that
Mahout should focus on scalable machine learning.

On Wed, Jan 6, 2010 at 4:04 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jan 6, 2010, at 3:52 PM, Drew Farris wrote:
>
> > On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll <gs...@apache.org>
> wrote:
> >>
> >> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
> >>
> >>> No.  We really don't.
> >>
> >> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some based LLR
> stuff that we use in utils.lucene.ClusterLabels.  Would be great to see this
> stuff expanded.
> >>
> >
> > So, doing something like this would involve some number of M/R passes
> > to do the ngram generation, counting and calculate LLR using
> > o.a.m.math.stats.LogLikelihood, but what to do about tokenization?
> >
> > I've seen the approach of using a list of filenames as input to the
> > first mapper, which slurps in and tokenizes / generating ngrams for
> > the text of each file, but is there something that works better?
> >
> > Would Lucene's StandardAnalyzer be sufficient for generating tokens?
>
> Why not be able to pass in the Analyzer?  I think the classifier stuff
> does, assuming it takes a no-arg constructor, which many do.  It's the one
> place, however, where I think we could benefit from something like Spring or
> Guice.




-- 
Zaki Rahaman

Re: Collocations in Mahout?

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 6, 2010, at 3:52 PM, Drew Farris wrote:

> On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> 
>> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
>> 
>>> No.  We really don't.
>> 
>> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some based LLR stuff that we use in utils.lucene.ClusterLabels.  Would be great to see this stuff expanded.
>> 
> 
> So, doing something like this would involve some number of M/R passes
> to do the ngram generation, counting and calculate LLR using
> o.a.m.math.stats.LogLikelihood, but what to do about tokenization?
> 
> I've seen the approach of using a list of filenames as input to the
> first mapper, which slurps in and tokenizes / generating ngrams for
> the text of each file, but is there something that works better?
> 
> Would Lucene's StandardAnalyzer be sufficient for generating tokens?

Why not be able to pass in the Analyzer?  I think the classifier stuff does, assuming it takes a no-arg constructor, which many do.  It's the one place, however, where I think we could benefit from something like Spring or Guice.

Re: Collocations in Mahout?

Posted by Drew Farris <dr...@gmail.com>.
On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
>
>> No.  We really don't.
>
> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some based LLR stuff that we use in utils.lucene.ClusterLabels.  Would be great to see this stuff expanded.
>

So, doing something like this would involve some number of M/R passes
to do the ngram generation and counting, and to calculate LLR using
o.a.m.math.stats.LogLikelihood, but what to do about tokenization?

I've seen the approach of using a list of filenames as input to the
first mapper, which slurps in and tokenizes / generates ngrams for
the text of each file, but is there something that works better?

Would Lucene's StandardAnalyzer be sufficient for generating tokens?

Re: Collocations in Mahout?

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:

> No.  We really don't.

FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some based LLR stuff that we use in utils.lucene.ClusterLabels.  Would be great to see this stuff expanded.

> 
> The most straightforward implementation does a separate pass for computing
> the overall total, for counting the unigrams and then counting the bigrams.
> It is cooler, of course, to count all sizes of ngrams in one pass and output
> them to separate files.  Then a second pass can do a map-side join if the
> unigram table is small enough (it usually is) and compute the results.  All
> of this is very straightforward programming and is a great introduction to
> map-reduce programming.
> 
> On Tue, Jan 5, 2010 at 12:09 PM, Jake Mannix <ja...@gmail.com> wrote:
> 
>> Ted, we don't have an MR job to scan through a corpus and output [ngram :
>> LLR]
>> key-value pairs, do we?  I've got one we use at LinkedIn that I could try
>> and pull
>> out if we don't have one.
>> 
>> (I actually used to give this MR job as an interview question, because it's
>> a cute
>> problem you can work out the basics of in not too long).
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve



Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
No.  We really don't.

The most straightforward implementation does a separate pass for computing
the overall total, for counting the unigrams and then counting the bigrams.
It is cooler, of course, to count all sizes of ngrams in one pass and output
them to separate files.  Then a second pass can do a map-side join if the
unigram table is small enough (it usually is) and compute the results.  All
of this is very straightforward programming and is a great introduction to
map-reduce programming.
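
A minimal sketch of that first counting pass (my own illustration, using the
old mapred API and a hypothetical class name, not an existing Mahout job)
could tag each emitted key with its n-gram size so a single pass produces all
the counts:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits ("1|word", 1) and ("2|w1 w2", 1) for every unigram and bigram in a
// line, so one pass yields counts for all sizes; a later step can split on
// the size prefix (or route each size to its own file) and sum the totals.
public class NgramCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tok = new StringTokenizer(value.toString().toLowerCase());
    String prev = null;
    while (tok.hasMoreTokens()) {
      String word = tok.nextToken();
      output.collect(new Text("1|" + word), ONE);
      if (prev != null) {
        output.collect(new Text("2|" + prev + ' ' + word), ONE);
      }
      prev = word;
    }
  }
}

A plain summing reducer finishes the counting; the map-side join against the
(small) unigram table then happens in the second pass described above.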

On Tue, Jan 5, 2010 at 12:09 PM, Jake Mannix <ja...@gmail.com> wrote:

> Ted, we don't have an MR job to scan through a corpus and output [ngram :
> LLR]
> key-value pairs, do we?  I've got one we use at LinkedIn that I could try
> and pull
> out if we don't have one.
>
> (I actually used to give this MR job as an interview question, because it's
> a cute
> problem you can work out the basics of in not too long).
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Collocations in Mahout?

Posted by Jake Mannix <ja...@gmail.com>.
Yeah, doing this kind of thing would be great for us to have, and not hard
given what
we already have.

Ted, we don't have an MR job to scan through a corpus and output [ngram :
LLR]
key-value pairs, do we?  I've got one we use at LinkedIn that I could try
and pull
out if we don't have one.

(I actually used to give this MR job as an interview question, because it's
a cute
problem you can work out the basics of in not too long).

With one job producing the list of best collocations, depending on how many
you want to keep, there are a couple of strategies for then joining that data
into your original corpus...

  -jake

On Tue, Jan 5, 2010 at 11:58 AM, Ted Dunning <te...@gmail.com> wrote:

> We do have a partial framework for this, including log-likelihood ratio test
> computation.
>
> For the most part, we don't have anything that specifically counts bigrams
> and words and arranges the counts in the right order for application, but
> that is relatively easy to write for map-reduce.
>
> I would be happy to provide pointers on the tricks I have seen to make that
> easy to do if you wanted to actually type the semi-colons and such.
>
> On Tue, Jan 5, 2010 at 9:02 AM, zaki rahaman <za...@gmail.com>
> wrote:
>
> > Pardon my ignorance as this is probably best handled by an NLP package
> like
> > GATE or LingPipe, but does Mahout provide anything for collocations? Or
> > does
> > anyone know of a MapReducible way to calculate something like t-values
> for
> > tokens in N-grams? I've got quite a large collection that I have to
> prune,
> > filter, and preprocess, but I still expect it to be a significant size.
> >
> > --
> > Zaki Rahaman
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Collocations in Mahout?

Posted by Ted Dunning <te...@gmail.com>.
We do have a partial framework for this, including log-likelihood ratio test
computation.

For the most part, we don't have anything that specifically counts bigrams
and words and arranges the counts in the right order for application, but
that is relatively easy to write for map-reduce.

I would be happy to provide pointers on the tricks I have seen to make that
easy to do if you wanted to actually type the semi-colons and such.

On Tue, Jan 5, 2010 at 9:02 AM, zaki rahaman <za...@gmail.com> wrote:

> Pardon my ignorance as this is probably best handled by an NLP package like
> GATE or LingPipe, but does Mahout provide anything for collocations? Or
> does
> anyone know of a MapReducible way to calculate something like t-values for
> tokens in N-grams? I've got quite a large collection that I have to prune,
> filter, and preprocess, but I still expect it to be a significant size.
>
> --
> Zaki Rahaman
>



-- 
Ted Dunning, CTO
DeepDyve