Posted to users@opennlp.apache.org by Zach Zeman <Za...@carfax.com> on 2013/02/05 16:04:53 UTC

Document Categorizer Custom Feature Generators

It turns out that the BagOfWords feature generator is insufficient for the problem I've been trying to solve using the DocumentCategorizerME. What I need is something that performs like the TokenClassFeatureGenerator, but it does not appear that AdaptiveFeatureGenerators are usable with the categorizer. I'm not entirely sure about that last point, but if it's true, I'm perfectly willing to implement my own version of a token class feature generator.

However, when I was looking at how to implement a FeatureGenerator, I noticed that the text that enters the extractFeatures method has already been split on whitespace. So, is the FeatureGenerator the correct place to change how my incoming training text is broken up into features? Or is there another, more appropriate step that I've missed?

Thanks for any help you guys can provide. I've found OpenNLP very useful overall, but this part is really confusing me.

-Zach

Re: Document Categorizer Custom Feature Generators

Posted by James Kosin <ja...@gmail.com>.

Zach,

Usually the text is first run through the sentence detector and then passed to the tokenizer before any further processing. Other ways of breaking up the text would require retraining with your own training sets and formats.
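
For what it's worth, here is a rough sketch of what a token-class style generator could look like once the text has already been tokenized. It is plain Java with no OpenNLP dependency, so everything in it is an assumption for illustration: the class name TokenClassSketch, the tokenClass helper, the class labels, and the "bow="/"wc=" prefixes are all made up here and are not taken from OpenNLP's own TokenClassFeatureGenerator. To use something like this with the categorizer, it would presumably need to implement the doccat FeatureGenerator interface; the exact extractFeatures signature should be checked against your OpenNLP version.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class TokenClassSketch {

    // Hypothetical, simplified token classes (not OpenNLP's own labels).
    static String tokenClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        boolean allDigits = true;
        boolean allLetters = true;
        boolean allUpper = true;
        boolean allLower = true;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isDigit(c)) allDigits = false;
            if (!Character.isLetter(c)) allLetters = false;
            if (!Character.isUpperCase(c)) allUpper = false;
            if (!Character.isLowerCase(c)) allLower = false;
        }
        if (allDigits) return "num";
        if (allLetters && allLower) return "lower";
        if (allLetters && allUpper) return "upper";
        if (allLetters && Character.isUpperCase(token.charAt(0))) return "initcap";
        return "other";
    }

    // Takes already-tokenized text, like the extractFeatures method
    // described in the original post, and emits one word feature and
    // one token-class feature per token.
    static Collection<String> extractFeatures(String[] text) {
        List<String> features = new ArrayList<>();
        for (String token : text) {
            features.add("bow=" + token);
            features.add("wc=" + tokenClass(token));
        }
        return features;
    }

    public static void main(String[] args) {
        // Whitespace splitting stands in for a real tokenizer here.
        String[] tokens = "Recalled in 2013 by GM".split("\\s+");
        System.out.println(extractFeatures(tokens));
    }
}
```

The point of the sketch is that the generator only decides which features to emit for the tokens it is handed; how the text gets split into tokens happens before that, so changing the splitting means changing the tokenization step rather than the generator.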

James