Posted to users@opennlp.apache.org by Lahiru Sandakith Gallege <sa...@gmail.com> on 2014/08/01 16:32:38 UTC

Improving OpenNLP doccat model accuracy and performance

Hi,

I have a model trained programmatically with OpenNLP doccat, and I am
wondering how I should approach improving its performance. I have around
70 labels and 12,000 entries across my combined training and test dataset.
In my experiments, I use a random 90%/10% training-to-test split.
Currently my model accuracy is around 60%-70%.
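For context, the random 90%/10% split described above can be sketched in
plain Java (the corpus lines and label names here are illustrative, not
from my actual data):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitSketch {
    // Index separating the training portion from the test portion.
    static int cutIndex(int total, double trainRatio) {
        return (int) (total * trainRatio);
    }

    public static void main(String[] args) {
        // Illustrative corpus: one "label<TAB>text" line per sample,
        // which is the one-sample-per-line shape doccat training data uses.
        List<String> samples = new ArrayList<>();
        for (int i = 0; i < 12000; i++) {
            samples.add("label" + (i % 70) + "\tsample text number " + i);
        }
        Collections.shuffle(samples, new Random(42)); // fixed seed so the split is reproducible
        int cut = cutIndex(samples.size(), 0.9);      // 90% train, 10% test
        List<String> train = samples.subList(0, cut);
        List<String> test = samples.subList(cut, samples.size());
        System.out.println(train.size() + " train / " + test.size() + " test");
    }
}
```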

Here are the questions that I have.

* Could dropping stop words improve the model accuracy? I tried it and it
seems to help, but I did not see a significant improvement.
* Does the trained model get skewed if irregular spaces or tabs are
present in the training or test data? E.g., "label" "This car  is made
around  2007"
* Must the spacing between the label and the data be constant? (I hope
the doccat engine trims it, but I wanted to make sure.)
* Is there a way to configure the trainer not to dump output to the console?
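A minimal sketch of the preprocessing the first two bullets describe,
collapsing irregular runs of spaces/tabs and dropping stop words, assuming
a tiny illustrative stop-word list (a real list would be larger and
domain-specific):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class Preprocess {
    // Tiny illustrative stop-word list; not a recommendation for any domain.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "the", "a", "an", "of", "in"));

    // Collapse runs of whitespace and trim, so "This car  is made around  2007"
    // and "This car is made around 2007" yield identical tokens.
    static String normalizeWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    static String dropStopWords(String text) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : normalizeWhitespace(text).split(" ")) {
            if (!STOP_WORDS.contains(token.toLowerCase())) {
                out.add(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dropStopWords("This car \t is made around  2007"));
        // prints: This car made around 2007
    }
}
```

Normalizing whitespace this way before both training and testing would
make the spacing question moot, since the model only ever sees the
canonical form.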

Please let me know if possible.
Thanks in advance.
Lahiru

-- 
Regards
Lahiru Sandakith Gallege