Posted to users@opennlp.apache.org by Fraser Bowen <fr...@westernacher.com> on 2018/03/07 14:34:50 UTC

Using synthetic data for the Name Finder

Hello OpenNLP community,

We are using the OpenNLP Name Finder to train models on a domain-specific German dataset. However, since upgrading from version 1.6.0 to 1.8.4, I have noticed that the Name Finder model scores much better, but is no longer robust.

With the small amount of data we have, the new version improves the F-score on our test set.
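
For reference, F-scores like these are what OpenNLP's built-in TokenNameFinderEvaluator tool reports; a typical invocation (the model and data file names here are placeholders) looks like:

    bin/opennlp TokenNameFinderEvaluator -model de-ner-custom.bin \
        -data test.txt -encoding UTF-8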

However, in order to boost the small amount of training data I have, I generated some "synthetic" data. One might expect this "unclean" data to confuse the model, but under 1.6.0 it actually improved the F-score. This is no longer the case in 1.8.4: any manipulation of the data appears to confuse the model and causes it to find many false positives.

I'd like to understand a little better what has changed between these two versions, but the release notes aren't very descriptive. Has anybody else experienced any wild changes with the new version?

Many thanks in advance!
Fraser

Re: Using synthetic data for the Name Finder

Posted by Joern Kottmann <ko...@gmail.com>.
Hello,

this is probably the change of the default algorithm from maxent to
perceptron. On many data sets the perceptron outperforms maxent, so it
was decided to make it the default for newly trained models.
Take a look at the lang/ml folder in the distribution; it contains a
params file for training with maxent instead, which can be passed via
the -params CLI argument.
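
For example, a params file along these lines selects maxent (the
iteration and cutoff values are only illustrative defaults, not tuned
recommendations):

    Algorithm=MAXENT
    Iterations=100
    Cutoff=5

It can then be passed to the CLI trainer like this, where the file
names are placeholders:

    bin/opennlp TokenNameFinderTrainer -lang de -params maxent.params \
        -data train.txt -encoding UTF-8 -model de-ner-custom.bin

If you train through the Java API instead, the equivalent is to set the
algorithm on the TrainingParameters object. A minimal, untested sketch
with placeholder file names (exception handling omitted):

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.namefind.*;
    import opennlp.tools.util.*;

    // Read training data in the OpenNLP name finder annotation format.
    InputStreamFactory in =
        new MarkableFileInputStreamFactory(new File("train.txt"));
    ObjectStream<NameSample> samples =
        new NameSampleDataStream(
            new PlainTextByLineStream(in, StandardCharsets.UTF_8));

    // Request maxent explicitly instead of the new perceptron default.
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
    params.put(TrainingParameters.ITERATIONS_PARAM, "100");
    params.put(TrainingParameters.CUTOFF_PARAM, "5");

    TokenNameFinderModel model = NameFinderME.train("de", null, samples,
        params, new TokenNameFinderFactory());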

HTH,
Jörn
