Posted to dev@mahout.apache.org by Neel Sheyal <la...@gmail.com> on 2011/03/03 14:29:59 UTC
BagofWords and StopList
Hi
I need to do text clustering, but in the context of natural
language processing. Consequently, word ordering becomes important.
Initially, I will be using an n-gram model (with n = 3).
In Mahout, the Vector and SequenceFileFormat representations do not
take contextual information into consideration (as I understand it). I
know I might need to modify both of them, but is there a bag-of-words
and stoplist that I may use?
Thanks,
Neel Sheyal
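
To make the question concrete, here is a minimal sketch of what a bag of word trigrams with a stoplist could look like. It is plain Python and does not use any Mahout API; the names STOPLIST, word_ngrams and ngram_counts, and the stoplist entries themselves, are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stoplist for illustration only; not a list shipped with Mahout.
STOPLIST = frozenset({"a", "an", "the", "of", "to"})


def word_ngrams(text, n=3, stoplist=frozenset()):
    """Return the word n-grams of `text` in order, after dropping stoplisted tokens."""
    # Naive tokenization: lowercase and split on whitespace.
    tokens = [t for t in text.lower().split() if t not in stoplist]
    # Slide a window of n tokens across the sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n=3, stoplist=frozenset()):
    """Bag of n-grams: each n-gram mapped to its count in `text`."""
    return Counter(word_ngrams(text, n, stoplist))
```

Note that removing stoplisted tokens before building the n-grams lets an n-gram span the removed words, so "I am a stupid guy" yields "i am stupid" when "a" is on the stoplist. Whether that is desirable depends on how much word order matters for the task.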
Re: BagofWords and StopList
Posted by Neel Sheyal <la...@gmail.com>.
> Not sure what you mean by "contextual information",
The quality of clustering is dictated by the quality of the vectors. I
want to create vectors that treat statements like "I am a stupid
guy" differently from "I am a stupid guy because I am careless". If the
term "stupid" were a dimension, my method would give the second
statement less weight than the first.
Thanks
Neel
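
The message does not say which weighting method is meant. One simple scheme with the described property is length-normalized term frequency, sketched below in plain Python as an assumption; normalized_tf is a hypothetical helper, not a Mahout function.

```python
from collections import Counter


def normalized_tf(text):
    """Map each token to its count divided by the total number of tokens."""
    # Naive tokenization: lowercase and split on whitespace.
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


short = normalized_tf("I am a stupid guy")
longer = normalized_tf("I am a stupid guy because I am careless")

# "stupid" occurs once in each statement, but the longer statement spreads
# its weight over more tokens, so the "stupid" dimension gets a smaller value.
print(short["stupid"], longer["stupid"])
```

Under this assumption the "stupid" dimension is 1/5 for the first statement and 1/9 for the second, so the two statements no longer look identical along that dimension.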
RE: BagofWords and StopList
Posted by Jeff Eastman <je...@Narus.com>.
Check the user list a couple days ago for "LDA Mahout" for a similar thread. The seq2sparse routine will handle n-grams and has a -maxDFPercent option which will handle common terms much like a stoplist would. You can also specify your own analyzer which could use whatever stoplist you want. Not sure what you mean by "contextual information", but the document term vectors produced by seq2sparse wrap the vector in a NamedVector with the document name. That's about the extent of context which we currently support.
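
The idea behind a maximum document frequency cutoff can be sketched as follows. This is plain Python written for illustration, not Mahout's implementation; prune_by_max_df and its parameters are hypothetical names.

```python
def prune_by_max_df(docs, max_df_percent):
    """Drop terms that occur in more than `max_df_percent` percent of the documents.

    `docs` is a list of token lists. Returns the filtered token lists.
    """
    # Document frequency: number of documents each term appears in.
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    # A term is kept only if its document frequency does not exceed the limit.
    limit = len(docs) * max_df_percent / 100.0
    return [[t for t in tokens if df[t] <= limit] for tokens in docs]
```

With three documents and a cutoff of 70 percent, a term present in all three documents is dropped while a term present in two is kept. Very common terms are removed this way without maintaining an explicit stoplist.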