Posted to users@opennlp.apache.org by Christian Moen <ch...@gmail.com> on 2013/01/25 10:36:30 UTC

OpenNLP models - scores, corpora and licenses

Hello,

I'm exploring the possibility of using OpenNLP in commercial software.  As part of this, I'd like to assess the quality of some of the models available on http://opennlp.sourceforge.net/models-1.5/ and also learn more about the applicable license terms.

My primary interest for now is in the English models for the Tokenizer, Sentence Detector and POS Tagger.

The documentation on http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html provides scores for various models as part of evaluation run examples.  Do these scores generally reflect those of the models on the SourceForge download page?  Are further details on model quality, source corpora, features used, etc. available?

I've seen posts to this list explain, as a general comment, that "the models are subject to the licensing restrictions of the copyright holders of the corpus used to train them."  I understand that the models on SourceForge aren't part of any Apache OpenNLP release, but I'd very much appreciate it if someone in the know could provide further insight into the applicable licensing terms.  I'd be glad to be wrong about this, but my understanding is that the models can't be used commercially.

Many thanks for any insight.


Christian


Re: OpenNLP models - scores, corpora and licenses

Posted by Vuong Dao Nghe <vu...@yahoo.com>.
Interesting question! I would also like to hear the answers on this topic. Does anybody have any idea?

Thanks!
Thomas


Re: OpenNLP models - scores, corpora and licenses

Posted by Nicolas Hernandez <ni...@gmail.com>.
Here you may find a report of an experience where the author used DBpedia
and Wikipedia:

[1] http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html




-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67

Re: OpenNLP models - scores, corpora and licenses

Posted by Christian Moen <cm...@atilika.com>.
Hello,

We've done some experiments trying to synthesise an NER corpus from Wikipedia using various heuristics and link-structure analyses.  However, our models didn't turn out very well when scored against a gold standard tagged by humans.  I'm sure there are many improvements we could consider, but we didn't find pursuing this any further all that promising.  Basically, there were too many issues to address to make the corpus of good quality.  I believe academic research in the field has faced similar challenges.  It was quite a fun little study, though.


Christian Moen
アティリカ株式会社
http://www.atilika.com


Re: OpenNLP models - scores, corpora and licenses

Posted by Svetoslav Marinov <sv...@findwise.com>.
Wikipedia is not a good source for training. I've tried that, but not all
entities in a text are tagged. Sometimes just the first occurrence of an
entity is tagged and the rest are not, or only partially. To me the tagging
seemed so random that it does not pass any criterion for a good corpus. And
then comes the question of how to distinguish people from places from
events or any other entities.

For me, in order to use Wikipedia, one would need to do a lot of extra
processing before some decent quality is achieved.
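
One rough sketch of that extra processing: take the article link markup
and rewrite it into the OpenNLP name-finder training format
(<START:type> ... <END>), assuming a hypothetical title-to-entity-type
lookup, e.g. built from DBpedia. A minimal Python illustration, not a
production pipeline:

```python
import re

# Hypothetical title -> entity-type lookup; in practice this would be
# derived from DBpedia or Wikipedia category analysis.
ENTITY_TYPES = {
    "Barack Obama": "person",
    "Chicago": "location",
}

# Matches [[Title]] and [[Title|anchor text]] wiki links.
LINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def wikitext_to_opennlp(line):
    """Rewrite wiki links as OpenNLP <START:type> ... <END> spans."""
    def repl(m):
        title = m.group(1)
        anchor = m.group(2) or title
        etype = ENTITY_TYPES.get(title)
        if etype is None:
            return anchor  # unknown entity: keep as plain text
        return "<START:%s> %s <END>" % (etype, anchor)
    return LINK.sub(repl, line)

print(wikitext_to_opennlp("[[Barack Obama]] visited [[Chicago|the city]] today."))
# -> <START:person> Barack Obama <END> visited <START:location> the city <END> today.
```

This only handles the easy part; the problems above (untagged later
mentions, deciding the entity type) are exactly what the lookup table
sweeps under the rug.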

Svetoslav






Re: OpenNLP models - scores, corpora and licenses

Posted by Lance Norskog <go...@gmail.com>.
Yes. The Wikipedia XML has person/place/etc. tags in all of the article
text.



Re: OpenNLP models - scores, corpora and licenses

Posted by John Stewart <ca...@gmail.com>.
Lance, could you say more?  Do you mean WP tagging as training data for the
NER task?

Thanks,

jds



Re: OpenNLP models - scores, corpora and licenses

Posted by Lance Norskog <go...@gmail.com>.
The Wikipedia tagging should provide very good training sets. Has 
anybody tried using them?



Re: OpenNLP models - scores, corpora and licenses

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

Well, the main problem with the models on SourceForge is that they were
trained on news data from the 90s and do not perform very well on
today's news articles or out-of-domain data (anything else).

When I speak to our users here and there, I always get the impression
that most people are still happy with the performance of the Tokenizer,
Sentence Splitter and POS Tagger; many are disappointed with the Name
Finder models. That said, the name finder works well when trained on
your own data.

Maybe the OntoNotes Corpus is something worth looking into.

The licensing is a gray area; you can probably get away with using the
models in commercial software. The corpus producers often restrict the
usage of their corpora to research purposes only. The question is
whether they can enforce these restrictive terms on statistical models
built on the data, since the models probably don't violate the
copyright. Sorry for not having a better answer; you probably need to
ask a lawyer.

The evaluations in the documentation are often just samples to
illustrate how to use the tools. Have a look at the test plans in our
wiki; we record the performance of OpenNLP there for every release we
make.
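
For reference, those performance numbers are precision, recall and
F-measure over the spans a model predicts versus a gold standard.
A simplified Python sketch of the computation, just an illustration
rather than OpenNLP's actual evaluator code:

```python
def f_measure(predicted, gold):
    """Precision, recall and F1 over sets of predicted vs. gold spans.

    A span is any hashable item, e.g. a (start, end) token-offset pair.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                       # spans found and correct
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 3 predicted spans, 4 gold spans, 2 in common.
p, r, f = f_measure({(0, 2), (3, 4), (7, 9)},
                    {(0, 2), (3, 4), (5, 6), (10, 12)})
print(round(p, 3), round(r, 3), round(f, 3))
# -> 0.667 0.5 0.571
```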

The models are mostly trained with default feature generation; have a
look at the documentation and our code to get more details about it.
The features are not yet documented well, but a documentation patch to
fix this would be very welcome!

HTH,
Jörn
