Posted to users@opennlp.apache.org by Lahiru Sandakith Gallege <sa...@gmail.com> on 2014/08/01 16:32:38 UTC
Improving OpenNLP doccat model accuracy and performance
Hi,
I have a model trained programmatically with OpenNLP doccat, and I am
looking for ways to improve its performance. I have around 70 labels and
12000 entries in my training and test data. In my experiments, I split the
data randomly, using 90% for training and 10% for testing. Currently my
model accuracy is around 60% - 70%.
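For concreteness, the kind of random 90/10 split described above can be sketched as follows. This is only an illustration under assumptions: the class name SplitSketch, the split() method, the fixed seed, and the generated sample lines are all made up for the example and are not part of OpenNLP or of my actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not an OpenNLP class.
public class SplitSketch {

    // Shuffles a copy of the samples and cuts it at the given training fraction.
    // Returns a two-element list: index 0 is the training part, index 1 the test part.
    public static List<List<String>> split(List<String> samples, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(samples);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps runs repeatable
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        // Placeholder data: one "label text" line per entry, 70 made-up labels.
        List<String> samples = new ArrayList<>();
        for (int i = 0; i < 12000; i++) {
            samples.add("label" + (i % 70) + " sample text " + i);
        }
        List<List<String>> parts = split(samples, 0.9, 42L);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```

A purely random cut like this does not guarantee that every label appears in both parts, which may matter with around 70 labels.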
Here are the questions that I have.
* Would dropping stop words improve the model accuracy? I tried that, and
it seems it could, but I did not see a significant improvement.
* Does the trained model get skewed if irregular spaces or tabs are
present in the training or test data? E.g., "label" "This car
is made around 2007"
* Does the spacing between the label and the data need to be constant? I
hope the doccat engine trim()s it, but I wanted to make sure.
* Is there a way to configure the model not to dump output to the console?
If possible, please let me know.
Thanks in advance.
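Regarding the whitespace questions above, one way to rule the issue out regardless of what doccat does internally would be to normalize every line before training and testing. The sketch below is only an assumption of how that could look: LineNormalizer and normalize() are hypothetical names, not OpenNLP APIs, and it assumes one sample per line with the label as the first token.

```java
// Hypothetical pre-processing helper, not an OpenNLP class.
public class LineNormalizer {

    // Trims the ends and collapses any run of spaces or tabs into one space,
    // so differently spaced copies of the same sample become identical.
    public static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String raw = "  label \t\t This car   is made around 2007 ";
        System.out.println("[" + normalize(raw) + "]");
    }
}
```

With the lines normalized this way, the spacing between label and text is constant by construction, so any remaining accuracy difference would have to come from something else.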
Lahiru
--
Regards
Lahiru Sandakith Gallege