Posted to dev@mahout.apache.org by Neel Sheyal <la...@gmail.com> on 2011/03/03 14:29:59 UTC

BagofWords and StopList

Hi,

I need to do text clustering in the context of natural language
processing, so word ordering becomes important. Initially, I will be
using an n-gram model (with n = 3).

In Mahout, the Vector and SequenceFileFormat representations do not
take contextual information into consideration (as I understand it). I
know I might need to modify both of them, but is there a bag-of-words
implementation and a stoplist that I can use?

Thanks,
Neel Sheyal

Re: BagofWords and StopList

Posted by Neel Sheyal <la...@gmail.com>.
> Not sure what you mean by "contextual information",
The quality of clustering is dictated by the quality of the vectors. I
want to create vectors that treat statements like "I am a stupid
guy" differently from "I am a stupid guy because I am careless". If the
term "stupid" were a dimension, my method would give less weight to the
second statement than to the first.

Thanks
Neel
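
The weighting described above can be illustrated with a minimal sketch. It assumes plain length-normalized term frequency, which is only one way such a difference could arise; it is not Neel's actual method and not Mahout code, and the function name normalized_tf is hypothetical.

```python
def normalized_tf(text, term):
    """Share of the tokens in `text` that are equal to `term`.

    Assumption: whitespace tokenization and lowercasing are enough
    for this illustration.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


# "stupid" occurs once in each statement, but the second statement is
# longer, so the term takes up a smaller share of it.
first = normalized_tf("I am a stupid guy", "stupid")
second = normalized_tf("I am a stupid guy because I am careless", "stupid")
print(first, second)
```

With raw counts, both statements would get the same value on the "stupid" dimension. Dividing by the statement length is what separates them here.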

RE: BagofWords and StopList

Posted by Jeff Eastman <je...@Narus.com>.
Check the user list from a couple of days ago for "LDA Mahout" for a similar thread. The seq2sparse routine handles n-grams and has a -maxDFPercent option, which handles common terms much like a stoplist would. You can also specify your own analyzer, which could use whatever stoplist you want. Not sure what you mean by "contextual information", but the document term vectors produced by seq2sparse wrap the vector in a NamedVector with the document name. That's about the extent of the context we currently support.
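
To make the three ideas in this reply concrete (word n-grams, an explicit stoplist, and pruning by document frequency), here is a small self-contained sketch. It is plain Python for illustration only and does not call Mahout or Lucene. The stoplist contents and the names tokenize, ngrams and prune_by_df are hypothetical, and the pruning rule is only assumed to resemble what -maxDFPercent does.

```python
from collections import Counter

# Hypothetical stoplist for illustration; not one shipped with Mahout.
STOPLIST = {"a", "the", "is"}


def tokenize(text, stoplist=STOPLIST):
    """Lowercase, split on whitespace, and drop stoplisted words."""
    return [w for w in text.lower().split() if w not in stoplist]


def ngrams(tokens, n=3):
    """Return all contiguous word n-grams, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prune_by_df(docs_terms, max_df_percent):
    """Drop terms that occur in more than max_df_percent of the documents."""
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))  # count each term once per document
    limit = len(docs_terms) * max_df_percent / 100.0
    return [[t for t in terms if df[t] <= limit] for terms in docs_terms]


docs = [
    "I am a stupid guy",
    "I am a stupid guy because I am careless",
    "I am a careful guy",
]

# Trigrams built after stoplist filtering keep local word order.
trigrams = [ngrams(tokenize(d)) for d in docs]

# Document-frequency pruning with no stoplist at all: words that appear
# in every document ("i", "am", "a", "guy") are removed anyway.
unigrams = [tokenize(d, stoplist=set()) for d in docs]
pruned = prune_by_df(unigrams, max_df_percent=70)
print(trigrams[0])
print(pruned)
```

The second half shows why a document-frequency cutoff can stand in for a stoplist: terms present in nearly every document are dropped without anyone having to list them.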
