Posted to dev@opennlp.apache.org by Jason Baldridge <ja...@gmail.com> on 2011/05/02 23:13:36 UTC

proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

I think the redesign of opennlp.maxent into opennlp.ml should not be pinned
down by the previous API. I say this mainly because the current design has a
lot of obvious problems, including poor encapsulation and a proliferation of
methods for different options that could be handled much more cleanly. So, I
propose a fairly clean break with the past API. Thoughts?

Jason

-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Chris Collins <ch...@yahoo.com>.

Yes that would give ultimate control.

C 
On May 5, 2011, at 9:42 AM, Jörn Kottmann wrote:

> On 5/5/11 6:12 PM, Chris Collins wrote:
>> Right, I guess so:
>> 
>> #Mon Mar 28 12:17:52 PDT 2011
>> Training-Eventhash=d61e8fc9af7e230ff91060f27e0d2959
>> Manifest-Version=1.0
>> Language=de
>> useTokenEnd=true
>> Training-Cutoff=5
>> Training-Iterations=100
>> OpenNLP-Version=1.5.0
>> Timestamp=1301339872213
>> Component-Name=SentenceDetectorME
>> 
>> though I meant also major minor version that the person doing the build can provide for the version of the data not the OpenNLP software (don't forget data location e.g. /Users/chris/model_training/en/me_playing_around_dont_use_in_production :-})
> 
> Maybe we should give the user the freedom to write custom properties into the earlier proposed training file and
> extend the above manifest with automatically generated properties as far as it makes sense.
> 
> I guess that would suit your needs?
> 
> The training data location might not always be available. I for example retrieve my training data from a
> database which contains my corpus. The data is then directly streamed into OpenNLP without ever hitting the disk.
> 
> Jörn

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/5/11 6:12 PM, Chris Collins wrote:
> Right, I guess so:
>
> #Mon Mar 28 12:17:52 PDT 2011
> Training-Eventhash=d61e8fc9af7e230ff91060f27e0d2959
> Manifest-Version=1.0
> Language=de
> useTokenEnd=true
> Training-Cutoff=5
> Training-Iterations=100
> OpenNLP-Version=1.5.0
> Timestamp=1301339872213
> Component-Name=SentenceDetectorME
>
> though I meant also major minor version that the person doing the build can provide for the version of the data not the OpenNLP software (don't forget data location e.g. /Users/chris/model_training/en/me_playing_around_dont_use_in_production :-})

Maybe we should give the user the freedom to write custom properties
into the earlier proposed training file and extend the above manifest
with automatically generated properties as far as it makes sense.

I guess that would suit your needs?

The training data location might not always be available. I, for example,
retrieve my training data from a database which contains my corpus. The data
is then streamed directly into OpenNLP without ever hitting the disk.
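
A minimal sketch of how custom and automatically generated properties could be combined, assuming the manifest is a plain java.util.Properties object. The class name, the method name and the Data-Version key are hypothetical and not part of OpenNLP; letting the generated values win on a key clash is likewise only an assumption. The data location would be an ordinary optional custom entry, so a corpus streamed from a database simply omits it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class ManifestMergeSketch {

    /**
     * Combines user-supplied properties with automatically generated ones.
     * Generated values are written last, so they win on a key clash
     * (an assumption of this sketch, not a documented OpenNLP rule).
     */
    static Properties buildManifest(Map<String, String> generated, Map<String, String> custom) {
        Properties manifest = new Properties();
        custom.forEach(manifest::setProperty);
        generated.forEach(manifest::setProperty);
        return manifest;
    }

    public static void main(String[] args) {
        Map<String, String> generated = new LinkedHashMap<>();
        generated.put("Language", "de");
        generated.put("Training-Cutoff", "5");

        Map<String, String> custom = new LinkedHashMap<>();
        custom.put("Data-Version", "2.1"); // hypothetical custom key
        custom.put("Language", "en");      // clashes with a generated key

        Properties manifest = buildManifest(generated, custom);
        System.out.println("Language=" + manifest.getProperty("Language"));
        System.out.println("Data-Version=" + manifest.getProperty("Data-Version"));
    }
}
```

Writing the generated values last keeps a custom entry from silently overwriting something like the training cutoff; the opposite policy would be just as easy to implement if full user control is preferred.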

Jörn

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Chris Collins <ch...@yahoo.com>.

Right, I guess so:

#Mon Mar 28 12:17:52 PDT 2011
Training-Eventhash=d61e8fc9af7e230ff91060f27e0d2959
Manifest-Version=1.0
Language=de
useTokenEnd=true
Training-Cutoff=5
Training-Iterations=100
OpenNLP-Version=1.5.0
Timestamp=1301339872213
Component-Name=SentenceDetectorME

though I also meant a major/minor version that the person doing the build can provide for the version of the data, not the OpenNLP software (and don't forget the data location, e.g. /Users/chris/model_training/en/me_playing_around_dont_use_in_production :-})

C
On May 5, 2011, at 9:01 AM, Jörn Kottmann wrote:

> On 5/5/11 5:57 PM, Chris Collins wrote:
>> That is a good idea, I would also consider including a few other optional fields and making it human readable.  In the system I work on all our data gets this type of "body tag", we include other things like:
>> 
>> - machine it was built on and perhaps the os user that did the run.
>> - build date
>> - source path to where the input data (in this case training set)
>> - maybe a hash of the training set.
>> - major/ minor version number
>> - maybe the training tool allows you to pass a set of arbitrary key value pairs this way the above could be defined in an ant script or what have you.
>> 
>> This way when you find this model sitting on a disk some day you can actually figure out if you trust it.  Nothing like going into production with something like this to find it was something built on your intern's laptop just as a test that everyone forgot about.
>> 
> 
> That just sounds like what we already write into the model, except the machine name, OS and user.
> The model itself is a zip package, and includes a manifest which includes these values.
> 
> Maybe we should extend the cmd line tooling to display it, then you do not need to unpack
> the zip package.
> 
> Jörn
>

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/5/11 5:57 PM, Chris Collins wrote:
> That is a good idea, I would also consider including a few other optional fields and making it human readable.  In the system I work on all our data gets this type of "body tag", we include other things like:
>
> - machine it was built on and perhaps the os user that did the run.
> - build date
> - source path to where the input data (in this case training set)
> - maybe a hash of the training set.
> - major/ minor version number
> - maybe the training tool allows you to pass a set of arbitrary key value pairs this way the above could be defined in an ant script or what have you.
>
> This way when you find this model sitting on a disk some day you can actually figure out if you trust it.  Nothing like going into production with something like this to find it was something built on your intern's laptop just as a test that everyone forgot about.
>

That just sounds like what we already write into the model, except the
machine name, OS and user.
The model itself is a zip package, and includes a manifest which
includes these values.

Maybe we should extend the command line tooling to display it, so that
you do not need to unpack the zip package.
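
A minimal sketch of such a display tool, using only java.util.zip and java.util.Properties. The entry name manifest.properties and the class name are assumptions made for illustration; check them against the actual model package before relying on this.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ManifestReader {

    // Assumed name of the manifest entry inside the model zip package.
    static final String MANIFEST_ENTRY = "manifest.properties";

    /** Scans the zip stream for the manifest entry and returns its key/value pairs. */
    static Properties readManifest(InputStream modelStream) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(modelStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (MANIFEST_ENTRY.equals(entry.getName())) {
                    Properties manifest = new Properties();
                    manifest.load(zip); // reads up to the end of the current entry
                    return manifest;
                }
            }
        }
        throw new IOException("No " + MANIFEST_ENTRY + " entry found in model package");
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ManifestReader <path-to-model-package>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Properties manifest = readManifest(in);
            for (String key : new TreeSet<>(manifest.stringPropertyNames())) {
                System.out.println(key + "=" + manifest.getProperty(key));
            }
        }
    }
}
```

Because the reader works on an InputStream, the same method could back a command line option without unpacking the package to disk.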

Jörn

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Chris Collins <ch...@yahoo.com>.

That is a good idea. I would also consider including a few other optional fields and making it human readable.  In the system I work on, all our data gets this type of "body tag"; we include other things like:

- machine it was built on and perhaps the OS user that did the run.
- build date
- source path to the input data (in this case the training set)
- maybe a hash of the training set.
- major/minor version number
- maybe the training tool could let you pass a set of arbitrary key/value pairs, so that the above could be defined in an Ant script or what have you.

This way, when you find this model sitting on a disk some day, you can actually figure out if you trust it.  Nothing like going into production with something like this to find it was built on your intern's laptop just as a test that everyone forgot about.
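
A rough sketch of how a few of the fields listed above could be collected, assuming the training samples are available as strings. The class name and the Build-User, Build-OS, Data-Version and Training-Datahash keys are hypothetical, and the MD5-over-lines scheme is only an illustration; it is not claimed to be how OpenNLP computes its own Training-Eventhash. The host name lookup is left out to keep the sketch free of network calls.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Properties;

public class ProvenanceSketch {

    /** Feeds each sample into the digest as it streams by and returns the hash as hex. */
    static String hashSamples(Iterable<String> samples) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        for (String sample : samples) {
            digest.update(sample.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) '\n'); // separator, so ["ab", "c"] and ["a", "bc"] differ
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Collects build provenance under hypothetical key names. */
    static Properties provenance(String dataVersion, String dataHash) {
        Properties p = new Properties();
        p.setProperty("Build-User", System.getProperty("user.name", "unknown"));
        p.setProperty("Build-OS", System.getProperty("os.name", "unknown"));
        p.setProperty("Data-Version", dataVersion);
        p.setProperty("Training-Datahash", dataHash);
        return p;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String hash = hashSamples(Arrays.asList("First sentence.", "Second sentence."));
        provenance("1.0", hash).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Hashing while streaming means the data never has to exist as a file, which matters when the corpus comes from a database.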

Best

C

On May 5, 2011, at 6:39 AM, Jörn Kottmann wrote:

> On 5/3/11 5:05 PM, Jason Baldridge wrote:
>> Sure. But that proposal will involve blasting things apart. ;)
>> 
> 
> What do you think about defining some kind of training attribute file,
> which specifies all the parameters which are needed to train a model.
> 
> This file could contain the training algorithm combined with several attributes,
> e.g. cutoff, iterations, etc. The attributes could also be algorithm dependent,
> e.g. for Perceptron there could be a property which defines the number of iterations
> where the accuracy must be identical in order to stop.
> 
> Such a file would make our code simpler in some places, e.g command line argument
> handling, writing of these attributes into the model packages, simple APIs for training
> with all kinds of parameters, etc.
> 
> Any opinions?
> 
> Jörn

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jason Baldridge <ja...@gmail.com>.

+1

On Thu, May 5, 2011 at 8:39 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 5/3/11 5:05 PM, Jason Baldridge wrote:
>
>> Sure. But that proposal will involve blasting things apart. ;)
>>
>>
> What do you think about defining some kind of training attribute file,
> which specifies all the parameters which are needed to train a model.
>
> This file could contain the training algorithm combined with several
> attributes,
> e.g. cutoff, iterations, etc. The attributes could also be algorithm
> dependent,
> e.g. for Perceptron there could be a property which defines the number of
> iterations
> where the accuracy must be identical in order to stop.
>
> Such a file would make our code simpler in some places, e.g command line
> argument
> handling, writing of these attributes into the model packages, simple APIs
> for training
> with all kinds of parameters, etc.
>
> Any opinions?
>
> Jörn
>




Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/3/11 5:05 PM, Jason Baldridge wrote:
> Sure. But that proposal will involve blasting things apart. ;)
>

What do you think about defining some kind of training attribute file,
which specifies all the parameters which are needed to train a model?

This file could contain the training algorithm combined with several
attributes, e.g. cutoff, iterations, etc. The attributes could also be
algorithm dependent, e.g. for Perceptron there could be a property which
defines the number of iterations over which the accuracy must be identical
in order to stop.

Such a file would make our code simpler in some places, e.g. command line
argument handling, writing of these attributes into the model packages,
simple APIs for training with all kinds of parameters, etc.
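
A minimal sketch of what reading such a training attribute file could look like, assuming a java.util.Properties format. The class names and the keys Algorithm, Cutoff, Iterations and Perceptron.StableIterations are hypothetical, chosen only to illustrate the idea; the defaults of 5 and 100 simply mirror the Training-Cutoff and Training-Iterations values in the manifest quoted elsewhere in this thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class TrainingParamsSketch {

    /** Settings read from a (hypothetical) training attribute file. */
    static final class TrainingParams {
        final String algorithm;
        final int cutoff;
        final int iterations;
        final Properties all; // keeps algorithm-dependent attributes as well

        TrainingParams(Properties props) {
            this.all = props;
            this.algorithm = props.getProperty("Algorithm", "MAXENT");
            this.cutoff = Integer.parseInt(props.getProperty("Cutoff", "5"));
            this.iterations = Integer.parseInt(props.getProperty("Iterations", "100"));
        }
    }

    /** Parses the attribute file; missing keys fall back to the defaults above. */
    static TrainingParams load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return new TrainingParams(props);
    }

    public static void main(String[] args) throws IOException {
        String file = "Algorithm=PERCEPTRON\n"
                + "Cutoff=0\n"
                + "Iterations=300\n"
                + "Perceptron.StableIterations=4\n"; // algorithm-dependent attribute
        TrainingParams params = load(new StringReader(file));
        System.out.println(params.algorithm + " cutoff=" + params.cutoff
                + " iterations=" + params.iterations);
    }
}
```

Keeping the raw Properties alongside the typed fields would let the same object be written into the model package unchanged, and lets each algorithm look up its own attributes without a new method per option.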

Any opinions?

Jörn

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jörn Kottmann <ko...@gmail.com>.

When we change the model loading API we will also break backward
compatibility of the old OpenNLP Tools APIs. These APIs have been deprecated
for quite some time now, so it might be a good idea to remove them with the
next release. Then we will be able to replace maxent without breaking the
OpenNLP Tools API.

Are there any special rules at Apache regarding the removal of
deprecated APIs?

Jörn

On 5/3/11 5:05 PM, Jason Baldridge wrote:
> Sure. But that proposal will involve blasting things apart. ;)
>
> On Tue, May 3, 2011 at 10:03 AM, Jörn Kottmann<ko...@gmail.com>  wrote:
>
>> On 5/3/11 4:49 PM, Jason Baldridge wrote:
>>
>>> Okay. I guess I'd like to more or less go in and blast things apart, so
>>> anything like that should go in before. Feel free to make progress on
>>> those
>>> for now, and I'll get to coding later this month, probably.
>>>
>>>   I think we really should take some time to create a proposal which
>> describes
>> the new API and other changes, before you just start coding and changing
>> everything.
>>
>> Jörn
>>
>
>

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jason Baldridge <ja...@gmail.com>.

Sure. But that proposal will involve blasting things apart. ;)

On Tue, May 3, 2011 at 10:03 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 5/3/11 4:49 PM, Jason Baldridge wrote:
>
>> Okay. I guess I'd like to more or less go in and blast things apart, so
>> anything like that should go in before. Feel free to make progress on
>> those
>> for now, and I'll get to coding later this month, probably.
>>
>>  I think we really should take some time to create a proposal which
> describes
> the new API and other changes, before you just start coding and changing
> everything.
>
> Jörn
>




Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/3/11 4:49 PM, Jason Baldridge wrote:
> Okay. I guess I'd like to more or less go in and blast things apart, so
> anything like that should go in before. Feel free to make progress on those
> for now, and I'll get to coding later this month, probably.
>
I think we really should take some time to create a proposal which describes
the new API and other changes, before you just start coding and changing 
everything.

Jörn

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jason Baldridge <ja...@gmail.com>.

Okay. I guess I'd like to more or less go in and blast things apart, so
anything like that should go in before. Feel free to make progress on those
for now, and I'll get to coding later this month, probably.

I also agree that we can break with the old model format. In that regard, we
should look at this:

http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language

(I haven't had a chance to review that yet, myself.)

Jason

On Tue, May 3, 2011 at 3:16 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 5/3/11 4:25 AM, Jason Baldridge wrote:
>
>> Sounds good. I'd like to start working on this after our spring semester
>> has
>> ended (two more weeks). -Jason
>>
>>
> There are also a few changes we could just do on the current implementation
> without the API redesign, for example I would like to contribute my
> multi-threaded GIS training changes, refactor the code to inline instance
> variables, introduce progress monitoring, etc.
>
> An advantage of doing these things before we release a new API is that we
> then have one more chance to fix things which might turn out not to work as
> well as we expected.
>
> Anyway I created a new wiki space, so we can go ahead and sketch down the
> new API:
> https://cwiki.apache.org/confluence/display/OPENNLP/Maxent+Refactoring
>
> Jörn
>
>



Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/3/11 4:25 AM, Jason Baldridge wrote:
> Sounds good. I'd like to start working on this after our spring semester has
> ended (two more weeks). -Jason
>

There are also a few changes we could just do on the current
implementation without the API redesign, for example I would like to
contribute my multi-threaded GIS training changes, refactor the code to
inline instance variables, introduce progress monitoring, etc.

An advantage of doing these things before we release a new API is that
we then have one more chance to fix things which might turn out not to
work as well as we expected.

Anyway, I created a new wiki space, so we can go ahead and sketch out
the new API:
https://cwiki.apache.org/confluence/display/OPENNLP/Maxent+Refactoring

Jörn

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jason Baldridge <ja...@gmail.com>.

Sounds good. I'd like to start working on this after our spring semester has
ended (two more weeks). -Jason

On Mon, May 2, 2011 at 4:29 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 5/2/11 11:13 PM, Jason Baldridge wrote:
>
>> I think the redesign of opennlp.maxent into opennlp.ml should not be
>> pinned
>> down by the previous API. I say this mainly because the current design has
>> a
>> lot of obvious problems, including poor encapsulation and a proliferation
>> of
>> methods for different options that could be handled much more cleanly. So,
>> I
>> propose a fairly clean break with the past API. Thoughts?
>>
>
> Yes, I agree, but there should be a very strong focus on breaking backward
> compatibility only once. In this case I suggest that we also use the renaming
> to align the version with the opennlp-tools and opennlp-uima projects.
>
> Before we actually change anything I suggest that we work on a proposal
> on how the new API could look, what do you think?
>
> Jörn
>




Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/2/11 11:13 PM, Jason Baldridge wrote:
> I think the redesign of opennlp.maxent into opennlp.ml should not be pinned
> down by the previous API. I say this mainly because the current design has a
> lot of obvious problems, including poor encapsulation and a proliferation of
> methods for different options that could be handled much more cleanly. So, I
> propose a fairly clean break with the past API. Thoughts?

Yes, I agree, but there should be a very strong focus on breaking backward
compatibility only once. In this case I suggest that we also use the renaming
to align the version with the opennlp-tools and opennlp-uima projects.

Before we actually change anything I suggest that we work on a proposal
for how the new API could look. What do you think?

Jörn

Re: proposal: let's ignore backward compatibility for the opennlp.maxent (opennlp.ml) redesign

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/2/11 11:13 PM, Jason Baldridge wrote:
> I think the redesign of opennlp.maxent into opennlp.ml should not be pinned
> down by the previous API. I say this mainly because the current design has a
> lot of obvious problems, including poor encapsulation and a proliferation of
> methods for different options that could be handled much more cleanly. So, I
> propose a fairly clean break with the past API. Thoughts?


This switch would also be a good chance to break backward compatibility with
our model format.

Jörn