Posted to dev@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2017/02/04 18:05:53 UTC

Fwd: Training models for OpenNLP on the OntoNotes corpus

---------- Forwarded message ----------
From: "Joern Kottmann" <jo...@apache.org>
Date: Feb 3, 2017 11:51 AM
Subject: Training models for OpenNLP on the OntoNotes corpus
To: <le...@apache.org>
Cc:

Hello all,

The Apache OpenNLP library is a machine learning based toolkit for the
processing of natural language text. It supports the most common NLP tasks,
such as tokenization, sentence segmentation, part-of-speech tagging, named
entity extraction, chunking and parsing.

Many of the competing solutions offer pre-trained models on various data
sources to their users. We came to the conclusion that we have to do the
same to stay relevant.

The corpora we would like to train on are usually copyright protected or
have licenses which restrict their use.

I would like to know the opinion here on legal-discuss about training
models based on the OntoNotes corpus [1]. Their license can be found here
[2].

The training process does the following with the corpus as input:

- Generates string based features (e.g. about word shape, n-grams, various
combinations, etc.); these features do not contain longer parts of the
corpus text

- Computes weights for those features based on the corpus

The features and weights are stored together in what we call a model, and
we wish to distribute this model under AL 2.0 at Apache OpenNLP.
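
To illustrate what "string based features" means in practice, here is a toy
sketch in Python. The function names and feature formats below are invented
for this example and are not OpenNLP's actual feature generators; the point
is only that the features are short fragments and shapes, never long spans
of corpus text:

```python
# Toy sketch: string-based features for one token in context.
# Illustration only -- not OpenNLP's real feature generation code.

def word_shape(token):
    """Collapse characters into a coarse shape, e.g. 'Jörn' -> 'Xxxx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def char_ngrams(token, n=2):
    """Character n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def features(tokens, i):
    """String features for tokens[i]: current word, shape,
    neighboring words, and character n-grams."""
    tok = tokens[i]
    feats = [
        "w=" + tok.lower(),
        "shape=" + word_shape(tok),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
    ]
    feats += ["ng=" + g for g in char_ngrams(tok)]
    return feats

print(features(["Training", "models", "for", "OpenNLP"], 0))
```

During training, a weight is then computed for each such feature string; the
model file that would be distributed contains only these feature strings and
their weights, not the corpus sentences themselves.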

Would it be ok to do that? Are there any concerns?

Thanks,

Jörn


[1] https://catalog.ldc.upenn.edu/LDC2013T19

[2] https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf