Posted to dev@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2017/07/10 17:52:04 UTC

Releasing a Language Detection Model

Hello all,

since Apache OpenNLP 1.8.1 we have a new language detection component
which, like all our components, has to be trained. I think we should
release a pre-built model for it, trained on the Leipzig corpus. This
will allow the majority of our users to get started very quickly with
language detection without needing to figure out how to train it.
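
For a sense of what that quick start would look like, here is a minimal
sketch of loading such a pre-built model via the Java API (the model file
name is a placeholder, not a decided artifact name):

    import java.io.File;

    import opennlp.tools.langdetect.Language;
    import opennlp.tools.langdetect.LanguageDetector;
    import opennlp.tools.langdetect.LanguageDetectorME;
    import opennlp.tools.langdetect.LanguageDetectorModel;

    public class LangDetectQuickStart {
        public static void main(String[] args) throws Exception {
            // Load the pre-built model from disk.
            LanguageDetectorModel model =
                    new LanguageDetectorModel(new File("langdetect.bin"));
            LanguageDetector detector = new LanguageDetectorME(model);

            // Predict the most likely language of a snippet and print
            // its language code and confidence.
            Language best = detector.predictLanguage("Dies ist ein deutscher Satz.");
            System.out.println(best.getLang() + " " + best.getConfidence());
        }
    }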

How should this project release models?

Jörn

Re: Releasing a Language Detection Model

Posted by William Colen <wi...@gmail.com>.
Regarding lang detect, we will release one model covering more than 100
languages. Anyone will be able to reproduce the training or improve it
according to their needs. For example, one could reduce the corpus to
work only with Latin-script languages if that fits their needs; maybe
that works better in some applications.

Today we require a model to be used by at least the OpenNLP version that
built it. For example, a model created with OpenNLP 1.7.1 can run with
OpenNLP 1.8.0 but not with 1.6.0. We can keep it that way. I don't see a
reason to update the models every release, but doing so can help with
testing (F1, accuracy, etc. should not change between releases).
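
As an aside, which version built a model can be checked programmatically;
a small sketch, assuming a local model file:

    import java.io.File;

    import opennlp.tools.langdetect.LanguageDetectorModel;

    public class PrintModelVersion {
        public static void main(String[] args) throws Exception {
            // Read back the OpenNLP version recorded in the model's manifest.
            LanguageDetectorModel model =
                    new LanguageDetectorModel(new File("langdetect.bin"));
            System.out.println("Built with OpenNLP " + model.getVersion());
        }
    }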

It is also not clear to me how default models would work. The idea is
not bad, but making it work properly is hard. I don't think we should
handle this in the library anyway, at least not now.


Re: Releasing a Language Detection Model

Posted by Joern Kottmann <ko...@gmail.com>.
I am also not for default models. We are a library, and people use it
inside other software products; that is the place where meaningful
defaults can be defined. Maybe our lang model works very well: you take
it, hard-code it, and forget about it for the next couple of years. Or
it doesn't work, and you train your own set of models and swap them
depending on your input data source.
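
To illustrate, an application could define its own default along these
lines (a sketch; the class and resource names are hypothetical):

    import java.io.File;
    import java.io.IOException;

    import opennlp.tools.langdetect.LanguageDetector;
    import opennlp.tools.langdetect.LanguageDetectorME;
    import opennlp.tools.langdetect.LanguageDetectorModel;

    public final class AppLanguageDetector {

        // The application, not OpenNLP, decides what the fallback model is.
        private static final String DEFAULT_MODEL = "/models/langdetect.bin";

        public static LanguageDetector create(File userModel) throws IOException {
            LanguageDetectorModel model = (userModel != null)
                    ? new LanguageDetectorModel(userModel)
                    : new LanguageDetectorModel(
                            AppLanguageDetector.class.getResourceAsStream(DEFAULT_MODEL));
            return new LanguageDetectorME(model);
        }
    }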

And then there are solutions out there that people can use to define
configuration for their software projects, such as Spring or Typesafe
Config, and probably something new one day. I am +1 on ensuring that
OpenNLP is easy to use with the most common ones and on accepting PRs
that increase ease of use.

Jörn


Re: Releasing a Language Detection Model

Posted by dr...@apache.org.
+1 for releasing models

as for the rest, I am not sure how I feel. Is there just one model for the Language Detector? I don’t want this to become a versioning issue where langDect.bin version 1 goes with 1.8.1 but version 2 goes with 1.8.2. Can anyone download the Leipzig corpus? Being able to reproduce the model is very powerful, because if you have additional data you can add it to the Leipzig corpus to improve your model.
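
As a rough sketch of what reproducing the training could look like via
the Java API, assuming a corpus file with one "language<TAB>text" sample
per line (file names are illustrative):

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.langdetect.LanguageDetectorFactory;
    import opennlp.tools.langdetect.LanguageDetectorME;
    import opennlp.tools.langdetect.LanguageDetectorModel;
    import opennlp.tools.langdetect.LanguageDetectorSampleStream;
    import opennlp.tools.langdetect.LanguageSample;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.model.ModelUtil;

    public class ReproduceLangDetectModel {
        public static void main(String[] args) throws Exception {
            // Each line of corpus.txt holds a language code, a tab, and text.
            ObjectStream<LanguageSample> samples = new LanguageDetectorSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("corpus.txt")),
                            StandardCharsets.UTF_8));

            LanguageDetectorModel model = LanguageDetectorME.train(samples,
                    ModelUtil.createDefaultTrainingParameters(),
                    new LanguageDetectorFactory());

            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream(new File("langdetect.bin")))) {
                model.serialize(out);
            }
        }
    }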

I am not a big fan of default models, because it is frustrating as a user when unexpected things happen (like when you think you are telling it to use your model, but it uses the default). However, if the code is verbose enough, this is really not a valid concern. I would want to see the use case develop.
Daniel



Re: Releasing a Language Detection Model

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
Great idea!

+1 for releasing models.

+1 to publish models in jars on Maven Central. This is the fastest way to
get somebody started. Moreover, having an extensible mechanism for others
to do it on their own is really helpful. I did this with extJWNL for
packaging WordNet data files. It is also convenient for packaging one's
own custom dictionaries and providing them via repositories. It reuses
existing infrastructure for things like versioning and distribution.
Model metadata has to be thought through, though. Oh, what a mouthful...

+1 for separate download ("no dependency manager" cases)

+1 to publish data/scripts/provenance. The more reproducible it is, the
better.

+1 for some mechanism of loading models from the classpath.

~ +1 to maybe explore the classpath for a "default" model for API (code)
use cases, perhaps similarly to Dictionary.getDefaultResourceInstance()
from extJWNL. But this has to be well thought through, as design mistakes
here might release some demons from jar hell. I didn't face it, but I'm
not sure the extJWNL design is best, as I didn't do much research on
alternatives. And I'd think twice before adding model jars to the main
binary distribution.

+1 to store only the model-building code in the SCM repo. I would not
bloat the SCM with binaries. Maven repositories, though not ideal, are
better for this than SCM (and there are specialized tools like jFrog).

~ -1 on changing the CLI to use models from the classpath. There was no
proposal, but my understanding is that it would be some sort of
classpath:// URL; please correct or clarify. I'd like to see the proposal
and use cases where it is more convenient than the current way of just
pointing to the file. Perhaps it depends. Our models are already zips
with manifests. Jars are zips too. Perhaps we could change the model
packaging layout to make it more "jar-like", or augment it with metadata
for finding default models on the classpath, for the above cases of
distributing through Maven repositories and loading from code, while
leaving the CLI as is: even if your model is technically on the
classpath, in most cases you can point to a jar in the file system and
thus leave the CLI like it is now. It seems that dealing with the
classpath is more suitable (convenient, safer, customary, ...) for
developers fiddling with code than for users fiddling with the command
line.

+1 for mirroring source corpora. The more reproducible things are, the
better. But costs (infrastructure) and licenses (this looks like
redistribution, which is not always allowed) might be an issue.

I'd also propose to augment the model metadata with (optional)
information about source corpora, provenance, as much reproduction
information as possible, etc., mostly for easier reproduction and
provenance tracking. In my experience I had challenges recalling what
y-d-u-en.bin was trained on, on which revision of that corpus, which part
or subset, which language, and whether it also had other annotations (and
respective models) for connecting all the possible models from that
corpus (e.g. sent-tok-pos-chunk-...).
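
As a sketch of both the classpath loading and the metadata idea, assuming
a model bundled at a hypothetical resource path and a hypothetical
"Training-Corpus" manifest entry:

    import opennlp.tools.langdetect.LanguageDetectorModel;

    public class ModelProvenance {
        public static void main(String[] args) throws Exception {
            // Load a model that ships on the classpath (path is illustrative).
            LanguageDetectorModel model = new LanguageDetectorModel(
                    ModelProvenance.class.getResourceAsStream(
                            "/opennlp/models/langdetect.bin"));

            // Manifest entries are generic key/value pairs; "Training-Corpus"
            // is a made-up key standing in for the proposed provenance data.
            System.out.println(model.getManifestProperty("Training-Corpus"));
        }
    }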

Aliaksandr


Re: Releasing a Language Detection Model

Posted by Jeff Zemerick <jz...@apache.org>.
+1 to an opennlp-models jar on Maven Central that contains the models.
+1 to having the models available for download separately (if easily
possible) for users who know what they want.
+1 to having the training data shared somewhere with scripts to generate
the models. It will help protect against losing data, as William mentioned.
I don't think we should depend on others to reliably host the data. I'll
volunteer to help script the model generation to run on a fleet of EC2
instances if it helps.

If the user does not provide a model to use on the CLI, can the CLI tools
look on the classpath for a model whose name fits the needed model (like
en-ner-person.bin) and, if found, use it automatically?
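
A rough sketch of that lookup logic (the names are illustrative, not an
existing OpenNLP API):

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;

    public final class ModelLookup {
        // Prefer an explicitly given model path; otherwise fall back to a
        // conventionally named model on the classpath, e.g. en-ner-person.bin.
        public static InputStream open(String cliPath, String conventionalName)
                throws IOException {
            if (cliPath != null) {
                return new FileInputStream(cliPath);
            }
            InputStream cp = ModelLookup.class.getResourceAsStream("/" + conventionalName);
            if (cp == null) {
                throw new FileNotFoundException(conventionalName + " not found on classpath");
            }
            return cp;
        }
    }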

Jeff




Re: Releasing a Language Detection Model

Posted by Chris Mattmann <ma...@apache.org>.
+1. In terms of releasing models, maybe an opennlp-models package, and then
using a Maven structure of src/main/resources/<package prefix dirs>/*.bin for
putting the models.

Then using an assembly descriptor to bundle the above into a *-bin.jar?

Cheers,
Chris






Re: Releasing a Language Detection Model

Posted by Joern Kottmann <ko...@gmail.com>.
My opinion about this is that we should offer the model as a maven
dependency for users who just want to use it in their projects, and
also offer models for download for people to quickly try out OpenNLP.
If the models can be downloaded, a new user can very quickly test
them via the command line.

I don't really have any thoughts yet on how we should organize it; it
would probably be nice to have some place where we can share all the
training data, and then have the scripts that produce the models checked
in. It should be easy to retrain all the models in case we do a major
release.

If a corpus vanishes we should drop support for it; it must be
obsolete then.

Jörn

On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org> wrote:
> We need to address things such as sharing the evaluation results and how to
> reproduce the training.
>
> There are several possibilities for that, but there are points to consider:
>
> Will we store the model itself in an SCM repository, or only the code that
> can build it?
> Will we deploy the models to a Maven Central repository? It is good for
> people using the Java API but not for the command line interface; should we
> change the CLI to handle models in the classpath?
> Should we keep a copy of the training data or always download it from the
> original provider? We can't guarantee that the corpus will be there
> forever, not only because its license changed, but simply because the
> provider is not keeping the server up anymore.
>
> William

Re: Releasing a Language Detection Model

Posted by Joern Kottmann <ko...@gmail.com>.
1) This is already included in the model by default today. It is also
possible to place more data in it, e.g. a file containing eval results,
a LICENSE and NOTICE file, etc.

2) I would take a "best effort" approach and publish only one model
per task and data set, unless there are really good reasons to publish
multiple. In the case of langdetect, the perceptron and maxent models
perform almost identically, so there is no need to publish both. We
should probably pick the perceptron model because it is slightly
faster. And if a user disagrees with us, that is totally fine; they can
always train a model themselves according to their own preferences.
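
For reference, switching the trainer is just a training parameter; a
sketch of selecting the perceptron instead of the default maxent (the
iteration and cutoff values shown are illustrative):

    import opennlp.tools.ml.perceptron.PerceptronTrainer;
    import opennlp.tools.util.TrainingParameters;

    // Build parameters selecting the perceptron; the same object is then
    // passed to LanguageDetectorME.train(...).
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ALGORITHM_PARAM, PerceptronTrainer.PERCEPTRON_VALUE);
    params.put(TrainingParameters.ITERATIONS_PARAM, "100");
    params.put(TrainingParameters.CUTOFF_PARAM, "0");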

All the knowledge on how to train a model should be accessible via
git, and then it is just a matter of running the right command to
start it.

Jörn


Re: Releasing a Language Detection Model

Posted by Suneel Marthi <sm...@apache.org>.
...one last point before wrapping up this discussion. Is it possible
that you could have more than one lang detect model, each trained with a
different algorithm, say 'MaxEnt', 'Naive Bayes', or 'Perceptron'?

Questions:

1. Do we publish just one model trained with a specific algorithm? If so,
would the metadata carry the algorithm information?

2. Do we publish multiple models for the same task, each trained with a
different algorithm?



On Tue, Jul 11, 2017 at 9:30 AM, Joern Kottmann <ko...@gmail.com> wrote:

> Hello,
>
> right, very good point. I also think that it is very important to be
> able to load a model in one line from the classpath.
>
> I propose we have the following setup:
> - One jar contains one or multiple model packages (that's the zip container)
> - A model name itself should be reasonably unique, e.g. eng-ud-token.bin
> - A user loads the model via: new
> SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream
> then gets closed properly
>
>
> Let's take away three things from this discussion:
> 1) Store the data in a place where the community can access it
> 2) Offer models on our download page, similar to what is done today on
> the SourceForge page
> 3) Release models packed inside a jar file via Maven Central
>
> Jörn
> On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu
> <al...@autayeu.com> wrote:
> > To clarify on models and jars.
> >
> > Putting a model inside a jar might not be a good idea. I mean here
> > things like bla-bla.jar/en-sent.bin. Our models are already zipped, so
> > they are "jars" already in a sense. We're good. However, the current
> > packaging and metadata might not be very classpath friendly.
> >
> > The use case I have in mind is being able to add needed models as
> > dependencies and load them by writing a line of code. For this case,
> > having all models in the root with the same name might not be very
> > convenient. The same goes for the manifest. The name
> > "manifest.properties" is quite generic, and it's not too far-fetched
> > to see some clashes because some other lib also manifests something.
> > It might be better to allow for some flexibility and to adhere to
> > classpath conventions. For example, having manifests in something like
> > org/apache/opennlp/models/manifest.properties. Or
> > opennlp/tools/manifest.properties. And perhaps even allowing the
> > manifest to reference a model, so the model can be put elsewhere. Just
> > in case there are several custom models of the same kind for different
> > pipelines in the same app. For example, processing queries with one
> > pipeline (one set of models) and processing documents with another
> > pipeline (another set of models). In this case, allowing for different
> > classpath locations is needed.
> >
> > Perhaps to illustrate my thinking, something like this (which still
> > keeps a lot of possibilities open):
> > en-sent.bin/opennlp/tools/sentdetect/manifest.properties (perhaps
> > contains a line with something like model =
> > /opennlp/tools/sentdetect/model/sent.model)
> > en-sent.bin/opennlp/tools/sentdetect/model/sent.model
> >
> > This allows including en-sent.bin as a dependency, and then doing
> > something like:
> > SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we
> > want default models in this way. Seems verbose enough to allow for
> > some safety through explicitness. That's if we want any defaults at
> > all.
> > Or something like:
> > SentenceModel sdm = SentenceModel.getResourceModel(
> >     "/opennlp/tools/sentdetect/manifest.properties");
> > Or:
> > SentenceModel sdm = SentenceModel.getResourceModel(
> >     "/opennlp/tools/sentdetect/model/sent.model");
> > Or, more in line with the current style:
> > SentenceModel sdm = new
> > SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though
> > here we commit to interpreting the String as a classpath reference.
> > That's why I'd prefer more explicit method names.
> > Or leave dealing with resources to the users, leave the current code
> > intact, and provide only packaging and distribution:
> > SentenceModel sdm = new
> > SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or
> > model"));
> >
> > And to add to the model metadata also F1/accuracy (at least CV-based,
> > for example 10-fold) for quick reference, or for a quick understanding
> > of what that model is capable of. Could be helpful for those with a
> > bunch of models around, and for others as well, to have better insight
> > into the model in question.
> >
> >
> >
> > On 11 July 2017 at 06:37, Chris Mattmann <ma...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> FWIW, I’ve seen CLI tools, lots in my day, that can load from the CLI
> >> to override an internal classpath dependency. This is for people in
> >> environments who want a sensible, delivered internal classpath default
> >> and the ability to override it at run time without zipping up or
> >> messing with JAR files. Think about people who are using OpenNLP in
> >> both Java and Python environments as an example.
> >>
> >> Cheers,
> >> Chris
> >>
> >>
> >>
> >>
> >> On 7/11/17, 3:25 AM, "Joern Kottmann" <ko...@gmail.com> wrote:
> >>
> >>     I would not change the CLI to load models from jar files. I never
> >>     used or saw a command line tool that expects a file as an input and
> >>     would then also load it from inside a jar file. It will be hard to
> >>     communicate how that works precisely in the CLI usage texts, and
> >>     this is not a feature anyone would expect to be there. The intention
> >>     of the CLI is to give users the ability to quickly test OpenNLP
> >>     before they integrate it into their software, and to train and
> >>     evaluate models.
> >>
> >>     Users who for some reason have a jar file with a model inside can
> >>     just write "unzip model.jar".
> >>
> >>     After all, I think this is quite a bit of complexity we would need
> >>     to add, and it will have very limited use.
> >>
> >>     The use case for publishing jar files is to make the models easily
> >>     available to people who have a build system with dependency
> >>     management: they won't have to download models manually, and when
> >>     they update OpenNLP they can also update the models with a version
> >>     string change.
> >>
> >>     For the command line "quick start" use case we should offer the
> >>     models on a download page as we do today. This page could list both
> >>     the download link and the Maven dependency.
> >>
> >>     Jörn
> >>
> >>     On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org>
> >> wrote:
> >>     > We need to address things such as sharing the evaluation results
> and
> >> how to
> >>     > reproduce the training.
> >>     >
> >>     > There are several possibilities for that, but there are points to
> >> consider:
> >>     >
> >>     > Will we store the model itself in a SCM repository or only the
> code
> >> that
> >>     > can build it?
> >>     > Will we deploy the models to a Maven Central repository? It is
> good
> >> for
> >>     > people using the Java API but not for command line interface,
> should
> >> we
> >>     > change the CLI to handle models in the classpath?
> >>     > Should we keep a copy of the training model or always download
> from
> >> the
> >>     > original provider? We can't guarantee that the corpus will be
> there
> >>     > forever, not only because it changed license, but simple because
> the
> >>     > provider is not keeping the server up anymore.
> >>     >
> >>     > William
> >>     >
> >>     >
> >>     >
> >>     > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <ko...@gmail.com>:
> >>     >
> >>     >> Hello all,
> >>     >>
> >>     >> since Apache OpenNLP 1.8.1 we have a new language detection
> >> component
> >>     >> which like all our components has to be trained. I think we
> should
> >>     >> release a pre-build model for it trained on the Leipzig corpus.
> This
> >>     >> will allow the majority of our users to get started very quickly
> >> with
> >>     >> language detection without the need to figure out on how to train
> >> it.
> >>     >>
> >>     >> How should this project release models?
> >>     >>
> >>     >> Jörn
> >>     >>
> >>
> >>
> >>
> >>
>

Re: Releasing a Language Detection Model

Posted by William Colen <wi...@gmail.com>.
+1


2017-07-11 10:30 GMT-03:00 Joern Kottmann <ko...@gmail.com>:


Re: Releasing a Language Detection Model

Posted by Chris Mattmann <ma...@apache.org>.
Sounds good to me…



On 7/11/17, 9:30 AM, "Joern Kottmann" <ko...@gmail.com> wrote:




Re: Releasing a Language Detection Model

Posted by Joern Kottmann <ko...@gmail.com>.
Hello,

right, very good point, I also think that it is very important to be
able to load a model in one line of code from the classpath.

I propose we have the following setup:
- One jar contains one or multiple model packages (that's the zip container)
- A model name itself should be kind of unique, e.g. eng-ud-token.bin
- A user loads the model via: new
SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream
then gets closed properly (see the sketch below)
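
A minimal sketch of that usage (assuming a model file named
eng-ud-sent.bin sits in the same classpath package as the calling
class; the class and model names are illustrative only, no such model
has been released yet):

    import java.io.IOException;
    import java.net.URL;

    import opennlp.tools.sentdetect.SentenceModel;

    public class LoadFromClasspath {

        public static SentenceModel loadSentenceModel() throws IOException {
            // getResource returns a URL; the SentenceModel(URL) constructor
            // opens and closes the underlying stream itself.
            URL modelUrl = LoadFromClasspath.class.getResource("eng-ud-sent.bin");
            if (modelUrl == null) {
                throw new IOException("eng-ud-sent.bin not found on the classpath");
            }
            return new SentenceModel(modelUrl);
        }
    }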


Let's take away three things from this discussion:
1) Store the data in a place where the community can access it
2) Offer models on our download page, similar to what is done today on
the SourceForge page
3) Release models packed inside a jar file via Maven Central

Jörn







On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu
<al...@autayeu.com> wrote:

Re: Releasing a Language Detection Model

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
To clarify on models and jars.

Putting a model inside a jar might not be a good idea. I mean here things
like bla-bla.jar/en-sent.bin. Our models are already zipped, so they are
"jars" already in a sense. We're good there. However, the current packaging
and metadata might not be very classpath friendly.

The use case I have in mind is being able to add the needed models as
dependencies and load them by writing a single line of code. For this
case, having all models in the root with the same name might not be very
convenient. The same goes for the manifest. The name "manifest.properties"
is quite generic, and it's not too far-fetched to see some clashes because
some other lib also manifests something. It might be better to allow for
some flexibility and to adhere to classpath conventions. For example,
having manifests in something like
org/apache/opennlp/models/manifest.properties. Or
opennlp/tools/manifest.properties. And perhaps even allowing the manifest
to reference a model, so the model can be put elsewhere. Just in case
there are several custom models of the same kind for different pipelines
in the same app. For example, processing queries with one pipeline - one
set of models - and processing documents with another pipeline - another
set of models. In this case, allowing for different classpaths is needed.

Perhaps to illustrate my thinking, something like this (which still keeps a
lot of possibilities open):
en-sent.bin/opennlp/tools/sentdetect/manifest.properties (perhaps contains
a line with something like model =
/opennlp/tools/sentdetect/model/sent.model)
en-sent.bin/opennlp/tools/sentdetect/model/sent.model

This allows including en-sent.bin as a dependency. And then doing something
like
SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we want
default models in this way. Seems verbose enough to allow for some safety
through explicitness. That's if we want any defaults at all.
Or something like:
SentenceModel sdm =
SentenceModel.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");
Or
SentenceModel sdm =
SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");
Or, more in line with the current style:
SentenceModel sdm = new
SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though here
we commit to interpreting the String as a classpath reference. That's why
I'd prefer more explicit method names.
Or leave dealing with resources to the users, leave current code intact and
provide only packaging and distribution:
SentenceModel sdm = new
SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or
model"));


And I would also add F1/accuracy to the model metadata (at least CV-based,
for example 10-fold) for quick reference, or a quick understanding of what
that model is capable of. It could be helpful for those with a bunch of
models around, and for others as well, to have better insight into the
model in question.



On 11 July 2017 at 06:37, Chris Mattmann <ma...@apache.org> wrote:


Re: Releasing a Language Detection Model

Posted by Chris Mattmann <ma...@apache.org>.
Hi,

FWIW, I’ve seen plenty of CLI tools in my day that can load a model from
the CLI to override an internal classpath dependency. This is for people
in environments who want a sensible, delivered internal classpath default,
plus the ability to override it at run time without zipping up or messing
with the JAR file. Think about people who are using OpenNLP in both Java
and Python environments, as an example.
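
A sketch of that pattern under stated assumptions (the class name and the
bundled resource path /models/langdetect.bin are made up for illustration;
the LanguageDetectorModel constructors are the stock OpenNLP ones):

    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;

    import opennlp.tools.langdetect.LanguageDetectorModel;

    public class ModelLoader {

        // Use an explicit file if one was passed on the command line,
        // otherwise fall back to a model bundled on the classpath.
        static LanguageDetectorModel load(String pathOrNull) throws IOException {
            if (pathOrNull != null) {
                return new LanguageDetectorModel(new File(pathOrNull));
            }
            try (InputStream in =
                    ModelLoader.class.getResourceAsStream("/models/langdetect.bin")) {
                if (in == null) {
                    throw new IOException("No bundled default model found");
                }
                return new LanguageDetectorModel(in);
            }
        }
    }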

Cheers,
Chris




On 7/11/17, 3:25 AM, "Joern Kottmann" <ko...@gmail.com> wrote:




Re: Releasing a Language Detection Model

Posted by Joern Kottmann <ko...@gmail.com>.
I would not change the CLI to load models from jar files. I have never
used or seen a command line tool that expects a file as input and would
then also load it from inside a jar file. It would be hard to communicate
precisely how that works in the CLI usage texts, and this is not a feature
anyone would expect to be there. The intention of the CLI is to give users
the ability to quickly test OpenNLP before they integrate it into their
software, and to train and evaluate models.

Users who for some reason have a jar file with a model inside can just
write "unzip model.jar".

After all, I think this is quite a bit of complexity we would need to
add for it, and it would have very limited use.

The use case of publishing jar files is to make the models easily
available to people who have a build system with dependency management:
they won't have to download models manually, and when they update OpenNLP
they can also update the models with a version string change.

For the command line "quick start" use case we should offer the models
on a download page as we do today. This page could list both the
download link and the Maven dependency.
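
For illustration only, such a model dependency might look like this in a
POM; the coordinates and version are hypothetical, since nothing has been
published yet:

    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-models-langdetect</artifactId>
      <version>1.8.1</version>
    </dependency>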

Jörn

On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org> wrote:

Re: Releasing a Language Detection Model

Posted by William Colen <co...@apache.org>.
We need to address things such as sharing the evaluation results and how to
reproduce the training.

There are several possibilities for that, but there are points to consider:

Will we store the model itself in an SCM repository, or only the code
that can build it?
Will we deploy the models to the Maven Central repository? That is good
for people using the Java API but not for the command line interface;
should we change the CLI to handle models in the classpath?
Should we keep a copy of the training data or always download it from the
original provider? We can't guarantee that the corpus will be there
forever, not only because its license might change, but simply because the
provider might not keep the server up anymore.

William



2017-07-10 14:52 GMT-03:00 Joern Kottmann <ko...@gmail.com>:
