Posted to dev@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2014/10/28 18:19:25 UTC

What should we do with the SF models?

Hi all,

OpenNLP has always come with a couple of trained models that are ready to
use for a few languages. The performance a user gets from those models
depends heavily on the input text.

The English name finder models in particular, which were trained on MUC
6/7 data, perform very poorly these days when run on current news
articles, and even worse on text outside the news domain.

Anyway, we often get judged on how well OpenNLP works just based on the
performance of those models (or maybe people who compare their NLP
systems against OpenNLP just love to have OpenNLP perform badly).

I think we are now at a point where it is questionable whether having
those models is still an advantage for OpenNLP. The SourceForge page is
often blocked due to traffic limits. We definitely have to act somehow.

The old models definitely have some historic value and are used for
testing the release.

What should we do?

We could take them offline and advise our users to train their own
models on one of the various corpora we support. We could also do both:
place a prominent link to our corpora documentation on the download
page, and a link to the historic SF models in a less visible place.
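
To give an idea of what that training looks like, here is a rough sketch
with our command line tools; the file names are placeholders, and the
data has to be in the OpenNLP name sample format (one sentence per line
with <START:person> ... <END> tags):

  # file names below are placeholders for the user's own data
  opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en \
      -data en-ner-person.train -encoding UTF-8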

Jörn


Re: What should we do with the SF models?

Posted by Tommaso Teofili <to...@gmail.com>.
In my opinion the long-term goal should be to work on training new,
Apache2-licensed models and make them available to our users; it probably
makes sense to take the SF models offline in any case, because as long as
they are there people will keep downloading and using them, as that's
just much easier than training new ones.
As a short-term goal I agree we should give more visibility to
instructions on how to train new models using existing corpora.
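
For corpora we already have built-in format support for, that is close to
a one-liner; e.g. with the CoNLL 2003 English data (which users have to
obtain themselves) something along these lines should work, though the
exact arguments may differ between versions:

  # eng.train is the user-supplied CoNLL 2003 training file
  opennlp TokenNameFinderTrainer.conll03 -model en-ner-person.bin \
      -lang en -types per -data eng.train -encoding UTF-8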

My 2 cents,
Tommaso



Re: What should we do with the SF models?

Posted by Rodrigo Agerri <ra...@apache.org>.
In my opinion the models should be documented. In some cases the training
corpus used is stated, but in others it is not. We should also state
which features were used and the results obtained, and on which dataset.
If default features are used we should say so as well.

If we cannot provide such information, we should at least add a
disclaimer about it.
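
The numbers themselves are cheap to produce with the bundled evaluator on
a held-out set, roughly like this (file names are placeholders again):

  # en-ner-person.test is a held-out file in the name sample format
  opennlp TokenNameFinderEvaluator -model en-ner-person.bin \
      -data en-ner-person.test -encoding UTF-8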

Furthermore, I can provide some models trained on the usual corpora for
the POS tagger, the name finder, and the parser.

Cheers

R

Re: What should we do with the SF models?

Posted by Gustavo Knuppe <gu...@gmail.com>.
I believe that models are important for users, since not every user has
access to appropriate data files to train basic models.

My suggestion is to use an alternative service to host these models, like
GitHub, torrents, or some other file sharing service.

GitHub is a good option since it doesn't have any quota or bandwidth
limitation.

Gustavo K.
