Posted to user@mahout.apache.org by John Conwell <jo...@iamjohn.me> on 2012/01/24 21:14:24 UTC

Token filtering and LDA quality

I'm trying to find out if there are any standard best practices for
document tokenization when prepping your data for LDA, in order to get a
higher-quality topic model, and to understand how the feature space affects
topic model quality.

For example, will the topic model be "better" if there is a richer feature
space from not stemming terms, or is it better to have a more normalized
feature space by applying stemming?

Is it better to filter out stop words, or keep them in?

Is it better to include bigrams and/or trigrams of highly correlated terms
in the feature space?

In essence: what characteristics of the feature space that LDA uses as
input will create a higher-quality topic model?

Thanks,
JohnC

Re: Token filtering and LDA quality

Posted by Ted Dunning <te...@gmail.com>.
One of the strong arguments FOR latent Dirichlet allocation (aka LDA) is
that it explicitly maintains a model of counts.  Normalizing the document
throws that critical information away and is a really bad idea.
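
To see concretely what is lost, here is a minimal sketch using Mahout's
math Vector API (the three-term vocabulary and the counts are invented for
illustration): a short and a long document are clearly distinguishable by
their raw counts, but indistinguishable in scale once normalized.

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class NormalizationDemo {
    public static void main(String[] args) {
      // Hypothetical 3-term vocabulary; cells hold raw term counts.
      Vector tweet = new RandomAccessSparseVector(3);
      tweet.set(0, 2.0);
      tweet.set(1, 1.0);                      // ~3 tokens of evidence
      Vector page = new RandomAccessSparseVector(3);
      page.set(0, 1200.0);
      page.set(1, 600.0);
      page.set(2, 200.0);                     // ~2000 tokens of evidence

      // zSum() is the total count -- the evidence a count model feeds on.
      System.out.println(tweet.zSum() + " vs " + page.zSum()); // 3.0 vs 2000.0

      // After L1 normalization both documents are mere proportions;
      // the 3-vs-2000 distinction is thrown away.
      System.out.println(tweet.normalize(1.0).zSum() + " vs "
          + page.normalize(1.0).zSum());                       // 1.0 vs 1.0
    }
  }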

On Thu, Jan 26, 2012 at 7:46 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Thu, Jan 26, 2012 at 4:37 PM, John Conwell <jo...@iamjohn.me> wrote:
>
> > One more question.  What about vector normalization when you vectorize
> > your documents?  Would this help with topic model quality?
> >
>
> No, unless you have reason to feel that document length is definitely *not*
> an indicator of how much topical information is being provided.  So if you're
> building topic models off of webpages, and a page has only 20 words on it,
> do you *want* it to have the same impact on the overall topic model as a
> big page with 2000 words on it?  Maybe you do, if you've got a good reason,
> but I can't think of a domain-independent reason to do that.

Re: Token filtering and LDA quality

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Jan 26, 2012 at 4:37 PM, John Conwell <jo...@iamjohn.me> wrote:

> One more question.  What about vector normalization when you vectorize
> your documents?  Would this help with topic model quality?
>

No, unless you have reason to feel that document length is definitely *not*
an indicator of how much topical information is being provided.  So if you're
building topic models off of webpages, and a page has only 20 words on it,
do you *want* it to have the same impact on the overall topic model as a
big page with 2000 words on it?  Maybe you do, if you've got a good reason,
but I can't think of a domain-independent reason to do that.



Re: Token filtering and LDA quality

Posted by John Conwell <jo...@iamjohn.me>.
One more question.  What about vector normalization when you vectorize your
documents?  Would this help with topic model quality?


-- 

Thanks,
John C

Re: Token filtering and LDA quality

Posted by John Conwell <jo...@iamjohn.me>.
Thanks for all the feedback!  You've been a big help.


-- 

Thanks,
John C

Re: Token filtering and LDA quality

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <jo...@iamjohn.me> wrote:

> Hey Jake,
> Thanks for the tips.  That will definitely help.
>
> One more question, do you know if the topic model quality will be affected
> by the document length?


Yes, very much so.


>  I'm thinking lengths ranging from tweets (~20 words),


Tweets suck.  Trust me on this. ;)


> to emails (hundreds of words),


Fantastic size.


> to whitepapers (thousands of words)
>

Can be pretty great too.


> to books (boatloads of words).


This is too long.  There will often be tons and tons of topics in a book.
But, frankly, I have not tried with huge documents personally, so I can't
say from experience that it won't work.  I'd just not be terribly surprised
if it didn't work well at all.  If I had a bunch of books I wanted to run
LDA on, I'd maybe treat each page or each chapter as a separate document.
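
A minimal sketch of that kind of segmentation (plain Java; the fixed-size
word window, and a size like 500, are arbitrary choices for illustration):

  import java.util.ArrayList;
  import java.util.List;

  public class BookChunker {
    /** Split one long text into roughly chunkSize-word sub-documents. */
    public static List<String> chunk(String text, int chunkSize) {
      String[] words = text.split("\\s+");
      List<String> chunks = new ArrayList<String>();
      StringBuilder current = new StringBuilder();
      for (int i = 0; i < words.length; i++) {
        current.append(words[i]).append(' ');
        if ((i + 1) % chunkSize == 0) {       // window boundary reached
          chunks.add(current.toString().trim());
          current.setLength(0);
        }
      }
      if (current.length() > 0) {             // keep the trailing partial window
        chunks.add(current.toString().trim());
      }
      return chunks;                          // e.g. chunk(bookText, 500)
    }
  }

Real chapter or page boundaries, where you have them, are likely better
split points than a fixed word count, since they tend to line up with
topic shifts.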

  -jake


Re: Token filtering and LDA quality

Posted by John Conwell <jo...@iamjohn.me>.
Hey Jake,
Thanks for the tips.  That will definitely help.

One more question: do you know if the topic model quality will be affected
by the document length?  I'm thinking lengths ranging from tweets (~20
words), to emails (hundreds of words), to whitepapers (thousands of words),
to books (boatloads of words).  What lengths, roughly, would degrade topic
model quality?

I would think tweets would kinda suck, but what about longer docs?  Should
they be segmented into sub-documents?

Thanks,
JohnC


-- 

Thanks,
John C

Re: Token filtering and LDA quality

Posted by Jake Mannix <ja...@gmail.com>.
Hi John,

  I'm not an expert in the field, but I have done a bit of work building
topic models with LDA, and here are some of the "tricks" I've used:

  1) Yes, remove stop words; in fact, remove all words occurring in more
than (say) half (or, more conservatively, 90%) of your documents, as
they'll be noise and will just dominate your topics.
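
A minimal sketch of that document-frequency cutoff (plain Java; the
tokenized-corpus representation and the helper name are assumptions for
illustration):

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  public class DfFilter {
    /** Return the terms appearing in more than maxDfFraction of the docs. */
    public static Set<String> overlyCommonTerms(List<List<String>> docs,
                                                double maxDfFraction) {
      Map<String, Integer> df = new HashMap<String, Integer>();
      for (List<String> doc : docs) {
        // A HashSet so each document counts a term at most once.
        for (String term : new HashSet<String>(doc)) {
          Integer n = df.get(term);
          df.put(term, n == null ? 1 : n + 1);
        }
      }
      Set<String> dropList = new HashSet<String>();
      for (Map.Entry<String, Integer> e : df.entrySet()) {
        if (e.getValue() > maxDfFraction * docs.size()) {
          dropList.add(e.getKey());           // e.g. maxDfFraction = 0.5
        }
      }
      return dropList;
    }
  }

If you vectorize with Mahout's seq2sparse job, its --help lists a maximum
document frequency option that applies the same cutoff for you.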

  2) More features is better, if you have the memory for it (note that
Mahout's LDA currently holds numTopics * numFeatures in memory in the
mapper tasks, which means that you are usually bounded to a few hundred
thousand features, maybe up as high as a million, currently).  So don't
stem, and throw in commonly occurring (or, more importantly, high
log-likelihood) bigrams and trigrams as independent features.
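
For picking those high log-likelihood n-grams, here is a self-contained
version of the log-likelihood ratio test (Mahout ships an equivalent in
org.apache.mahout.math.stats.LogLikelihood; for a bigram "A B", k11 counts
occurrences of "A B", k12 counts A without B, k21 counts B without A, and
k22 counts everything else):

  public class Llr {
    private static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }

    private static double entropy(long... counts) {
      long total = 0;
      double sum = 0.0;
      for (long c : counts) {
        total += c;
        sum += xLogX(c);
      }
      return xLogX(total) - sum;
    }

    /** 2x2 contingency-table log-likelihood ratio (Dunning, 1993). */
    public static double logLikelihoodRatio(long k11, long k12,
                                            long k21, long k22) {
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double colEntropy = entropy(k11 + k21, k12 + k22);
      double matEntropy = entropy(k11, k12, k21, k22);
      double llr = 2.0 * (rowEntropy + colEntropy - matEntropy);
      return llr < 0.0 ? 0.0 : llr;           // guard against rounding noise
    }
  }

Keep only the n-grams whose score clears some threshold; the higher the
score, the less likely the co-occurrence is due to chance.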

  3) Violate the underlying assumption of LDA, that you're talking about
"token occurrences", and weight your vectors not as "tf" but as "tf*idf",
which makes rarer features more prominent, which ends up making your
topics look a lot nicer.
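
A minimal tf*idf reweighting sketch (plain Java; the log(N/df) form of idf
is one common variant, and the count-map representation is assumed for
illustration):

  import java.util.HashMap;
  import java.util.Map;

  public class TfIdf {
    /**
     * Reweight one document's raw term counts (tf) by idf = log(N / df),
     * so that rarer terms stand out.  df holds, per term, the number of
     * documents containing it; numDocs is the corpus size.
     */
    public static Map<String, Double> weigh(Map<String, Integer> tf,
                                            Map<String, Integer> df,
                                            int numDocs) {
      Map<String, Double> weights = new HashMap<String, Double>();
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        // Default df of 1 guards against unseen terms (and divide-by-zero).
        Integer seen = df.get(e.getKey());
        int docFreq = seen == null ? 1 : seen;
        double idf = Math.log((double) numDocs / docFreq);
        weights.put(e.getKey(), e.getValue() * idf);
      }
      return weights;
    }
  }

seq2sparse can also emit tf*idf-weighted vectors directly; see its --help
for the weighting option.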

Those are the main tricks I can think of right now.

If you're using Mahout trunk, try the new LDA impl:

  $MAHOUT_HOME/bin/mahout cvb0 --help

It operates on the same kind of input as the last one (i.e., a corpus which
is a SequenceFile<IntWritable, VectorWritable>).

  -jake
