Posted to legal-discuss@apache.org by Joern Kottmann <jo...@apache.org> on 2017/02/03 10:51:22 UTC

Training models for OpenNLP on the OntoNotes corpus

Hello all,

The Apache OpenNLP library is a machine-learning-based toolkit for the
processing of natural language text. It supports the most common NLP tasks,
such as tokenization, sentence segmentation, part-of-speech tagging, named
entity extraction, chunking and parsing.

Many of the competing solutions offer pre-trained models on various data
sources to their users. We came to the conclusion that we have to do the
same to stay relevant.

The corpora we would like to train on are usually copyright protected or
have a license which restricts their use.

I would like to know the opinion here on legal-discuss about training
models on the OntoNotes corpus [1]. Its license can be found here
[2].

The training process does the following with the corpus as input:

- Generates string-based features (e.g. about word shape, n-grams, various
combinations, etc.); those features do not contain longer parts of the
corpus text

- Computes weights for those features based on the corpus

The features and weights are stored together in what we call a model and
this model we wish to distribute under AL 2.0 at Apache OpenNLP.
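
The two training steps above can be sketched as follows. This is a minimal
illustration, not OpenNLP's actual implementation: the feature templates, the
function names (`word_shape`, `extract_features`, `train`, `predict`), the toy
corpus and the use of a plain perceptron to compute the weights are all
assumptions made for the example.

```python
from collections import defaultdict


def word_shape(token):
    """Collapse a token to a coarse shape, e.g. 'Berlin' -> 'Xx', '2017' -> 'd'."""
    shape = []
    for ch in token:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch
        # Keep only one symbol per run of identical character classes.
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)


def extract_features(tokens, i):
    """String features for the token at position i (hypothetical feature set)."""
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    return [
        "w=" + tokens[i].lower(),
        "shape=" + word_shape(tokens[i]),
        "bigram=" + prev_tok + "_" + tokens[i].lower(),
    ]


def train(sentences, epochs=5):
    """Learn one weight per (feature, label) pair with a plain perceptron."""
    weights = defaultdict(float)
    labels = sorted({label for sent in sentences for _, label in sent})
    for _ in range(epochs):
        for sent in sentences:
            tokens = [tok for tok, _ in sent]
            for i, (_, gold) in enumerate(sent):
                feats = extract_features(tokens, i)
                guess = max(labels, key=lambda l: sum(weights[(f, l)] for f in feats))
                if guess != gold:
                    # Standard perceptron update: reward gold, penalize the guess.
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
    return dict(weights)


def predict(model, labels, tokens, i):
    """Pick the label whose summed feature weights are highest."""
    feats = extract_features(tokens, i)
    return max(labels, key=lambda l: sum(model.get((f, l), 0.0) for f in feats))


# Toy stand-in for an annotated corpus (invented, not taken from OntoNotes).
corpus = [
    [("Anna", "NAME"), ("visited", "O"), ("Berlin", "NAME")],
    [("Tom", "NAME"), ("left", "O"), ("Paris", "NAME")],
]
model = train(corpus)
```

In this sketch the artifact that would be distributed is `model`: a mapping
from (feature string, label) pairs to numbers. The longest piece of corpus
text in any key is a pair of adjacent lowercased tokens.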

Would it be ok to do that? Are there any concerns?

Thanks,

Jörn


[1] https://catalog.ldc.upenn.edu/LDC2013T19

[2] https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Re: Training models for OpenNLP on the OntoNotes corpus

Posted by Richard Eckart de Castilho <re...@apache.org>.

Hi Jörn,

thanks for the result! It would also be great if you could let us know when you find a new suitable resource :)

One that might be suitable is the Georgetown University Multilayer Corpus [1], at least the parts from Wikinews and Wikivoyage, which are licensed CC-BY.

Cheers,

-- Richard

[1] https://corpling.uis.georgetown.edu/gum/#license

> On 17.02.2017, at 11:24, Joern Kottmann <ko...@gmail.com> wrote:
> 
> Hello all,
> 
> They replied to me and said the main issue is that their data (or models trained on it) cannot be licensed under any agreement other than their own. This applies to both their research-only and commercial licenses.
> 
> Therefore training on LDC data (even if a member with the commercial license did it) and releasing the model under AL 2.0 (or any other Open Source license) is not allowed.
> On the other hand, they seem to tolerate Open Source projects doing that: when you google for models trained on their data, you can find many examples.
> 
> We will have to look for new sources of data to train our models on.
> 
> Thanks to everyone for helping with this issue.
> 
> Jörn



Re: Training models for OpenNLP on the OntoNotes corpus

Posted by Joern Kottmann <ko...@gmail.com>.

Hello all,

They replied to me and said the main issue is that their data (or models
trained on it) cannot be licensed under any agreement other than their
own. This applies to both their research-only and commercial licenses.

Therefore training on LDC data (even if a member with the commercial
license did it) and releasing the model under AL 2.0 (or any other
Open Source license) is not allowed.
On the other hand, they seem to tolerate Open Source projects doing
that: when you google for models trained on their data, you can find many
examples.

We will have to look for new sources of data to train our models on.

Thanks to everyone for helping with this issue.

Jörn


On Fri, Feb 17, 2017 at 9:06 AM, Peter Kluegl <pk...@gmail.com> wrote:

> Hi Joern,
>
>
> can you share the answer if you get one? I'd really appreciate it :-)
>
>
> Best,
>
>
> Peter

Re: Training models for OpenNLP on the OntoNotes corpus

Posted by Peter Kluegl <pk...@gmail.com>.

Hi Joern,


can you share the answer if you get one? I'd really appreciate it :-)


Best,


Peter


On 09.02.2017 at 15:56, Joern Kottmann wrote:
> Hello,
>
> right, I agree with you, let me ask them.
>
> Thanks,
> Jörn

Re: Training models for OpenNLP on the OntoNotes corpus

Posted by Joern Kottmann <ko...@gmail.com>.

Hello,

right, I agree with you, let me ask them.

Thanks,
Jörn

On Wed, Feb 8, 2017 at 7:07 AM, Henri Yandell <ba...@apache.org> wrote:

> The license says:
>
>     "In the event that User's use of the LDC Databases results in the
> development of a commercial product, User must join...pay fees...".
>
> While I don't think LDC have necessarily considered Apache's use of their
> product, and the license text doesn't appear to consider a situation
> where the two User definitions are different individuals (i.e. Apache the
> first, our users the second), I don't think it's clear that LDC are in
> favour of our using their product. You should contact them to get
> clarification that we can use their product to develop an Apache 2.0
> licensed product which may subsequently be used in our users' commercial
> products.
>
> Hen

Re: Training models for OpenNLP on the OntoNotes corpus

Posted by Henri Yandell <ba...@apache.org>.

The license says:

    "In the event that User's use of the LDC Databases results in the
development of a commercial product, User must join...pay fees...".

While I don't think LDC have necessarily considered Apache's use of their
product, and the license text doesn't appear to consider a situation
where the two User definitions are different individuals (i.e. Apache the
first, our users the second), I don't think it's clear that LDC are in
favour of our using their product. You should contact them to get
clarification that we can use their product to develop an Apache 2.0
licensed product which may subsequently be used in our users' commercial
products.

Hen

On Tue, Feb 7, 2017 at 7:01 AM, Joern Kottmann <ko...@gmail.com> wrote:

> Thanks for your answer!
>
> We would not distribute the content itself in any way. The training
> process will reduce the copyright-protected input material to n-grams
> (which will have a length of at most 2). That work should not be
> copyright-protectable by the original copyright holder, since we don't take
> anything out that is long enough to be copyright protected.
>
> There was a case in the EU that might be relevant for this:
> https://en.wikipedia.org/wiki/Infopaq_International_A/S_v_Danske_Dagblades_Forening
>
> Jörn

Re: Training models for OpenNLP on the OntoNotes corpus

Posted by Joern Kottmann <ko...@gmail.com>.

Thanks for your answer!

We would not distribute the content itself in any way. The training process
will reduce the copyright-protected input material to n-grams (which will
have a length of at most 2). That work should not be copyright-protectable
by the original copyright holder, since we don't take anything out that is
long enough to be copyright protected.
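
To make the size of these fragments concrete, here is a small sketch of
n-gram extraction capped at length 2. It assumes token-level n-grams (the
message does not say whether tokens or characters are meant), and the
function name `ngrams` and the sample sentence are invented for the example.
It only shows how long the extracted pieces are, nothing more.

```python
def ngrams(tokens, max_n=2):
    """Return all token n-grams of length 1..max_n, grouped by length."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[start:start + n]))
    return grams


# Invented sample sentence, not corpus text.
sentence = "The quick brown fox jumps".split()
grams = ngrams(sentence, max_n=2)
longest = max(len(g) for g in grams)

print(len(grams))  # 9 (5 unigrams + 4 bigrams)
print(longest)     # 2
```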

There was a case in the EU that might be relevant for this:
https://en.wikipedia.org/wiki/Infopaq_International_A/S_v_Danske_Dagblades_Forening

Jörn

On Mon, Feb 6, 2017 at 5:35 PM, Henri Yandell <ba...@apache.org> wrote:

> I don't believe this is acceptable.
>
> It's a non-commercial license that would restrict the uses of the
> subsequent Apache product.
>
> Note that the license would also need signing (i.e. it's not something we
> can use off the shelf).
>
> One approach would be to contact LDC to let them know of our interest in
> using it, but make sure they understand that the output would be going into a
> product under the Apache 2.0 license and that they understand our concern.
>
> Hen

Re: Training models for OpenNLP on the OntoNotes corpus

Posted by Henri Yandell <ba...@apache.org>.

I don't believe this is acceptable.

It's a non-commercial license that would restrict the uses of the
subsequent Apache product.

Note that the license would also need signing (i.e. it's not something we
can use off the shelf).

One approach would be to contact LDC to let them know of our interest in
using it, but make sure they understand that the output would be going into a
product under the Apache 2.0 license and that they understand our concern.

Hen

