You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@opennlp.apache.org by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> on 2016/11/04 12:58:18 UTC

Re: new tool training

Hi Jörn,
   Currently, GIS is a wrapper for the GISTrainer, which ironically is NOT an EventTrainer.  There are several ways to handle this.

First, don’t touch GIS.  This is the class that people know best.  It is easy to use if you use the parameter hard coded in the class.  We can make GISTrainer an Event trainer and refactor it appropriately to handle parameters in the map.

Second, refactor GIS, but keep the static methods in the code. This allows user to keep their code working.

Third, refactor GIS and deprecate the static methods. 

I don’t see any value to the GIS class other than an Adapter.  My vote would be the fix the original problem of GISTrainer NOT being an event trainer. 

What is your opinion?
Daniel



On 10/31/16, 10:18 AM, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote:

    Ok, I will send you patch with a refactored GIS (can you find the jira number).  If we refactor DataIndexer, I have no problem using a factory. I am happy to extend OnePassDataIndexer. I won’t have to copy code.  I will have to look at DataIndexer again to before I comment on specifics.
    Daniel
    
    
    On 10/29/16, 8:45 AM, "Joern Kottmann" <ko...@gmail.com> wrote:
    
        On Fri, 2016-10-28 at 14:16 +0000, Russ, Daniel (NIH/CIT) [E] wrote:
        > Hi Jörn,
        > 1) I agree that the field values should be set in the init method for
        > the QNTrainer.  Other minor changes I would make include adding a
        > getNumberOfThreads() method to AbstractTrainer, and a default of
        > 1.  I would modify GIS to use the value set in the
        > TrainingParameters.  I also think this needs to be documented more
        > clearly.  I can take a stab at some text and send it out to the
        > community.
        
        Sounds good.
        
        I had a look at the code and it looks like it was never refactored to
        fit properly in the new structure we created. I think the multi-
        threaded GIS training is actually working, but couldn't see how the
        parameter is hooked up to the training code.
        
        Suggestion: We open a jira issue to refactor the GIS training to make
        use of the init method, have all variables externalized (e.g. smoothing
        factor, print debug output to the console, etc.) and other clean up we
        think make sense. 
        
        The problem with GIS is that people take it as an example on how things
        should be done when they implement a new trainer.
        
        Another issue is that we should have some validation of parameters,
        e.g. threads can only be one for perceptron.
        
        > 2) The One/TwoPassDataIndexers are very rigid classes because they do
        > the indexing in the constructor.  The trainers deal with the rigidity
        > by overloading methods that take a cutoff/sort/#Threads.  In my
        > opinion, that was not the best way to do it.  There should have been
        > a constructor and an index method.  At this point it may be a large
        > refactoring effort for little gain.
        
        Yes, I agree with the first part. I don't think it will be so much
        effort to add a new consructor and the index method. The old
        constructors can stay and be deprecated until we remove them.
        
        Should we open a jira issue for this task?
        
        > As of 1.6.0, I see that there are 4 EventTrainers (GIS,
        > NaiveBayesTrainer, PerceptronTrainer, QNTrainer).  All of these are
        > AbstractEventTrainers.  So if we add a public MaxentModel
        > train(DataIndexer di) to both the interface and the abstract class,
        > nothing should break.  Now, if you are concerned, the
        > train(ObjectStream<Event>) method calls doTrain(DataIndexer), So the
        > behavior is exactly the same for the two methods, I just call the
        > doTrain(DataIndexer).  This would not be used in the 1.5.x versions.
        > 
        > I don’t like the idea of using a factory because the DataIndexer
        > requires you pass parameters into the constructor.  It may be
        > possible to use reflection, but why make like so difficult.  Let the
        > client give the trainer the dataindexer.
        
        We use this kind of factory in various places already, I think it would
        be a good solution, because then people can switch the Data Indexer
        without writing training code and it will work with all components we
        have out of the box. I agree it would only work properly if we do the
        refactoring.
        
        In OPENNLP-830 someone said the TwoPassDataIndexer can be implemented
        better performing these days with non-blocking IO. I don't think it
        make sense to change the current implementation. But it would be nice
        to add an improved version as a third option. If we have plugable Data
        Indexer support this idea could be easily explored.
        
        This could be done exactly like it is done for the trainer in:
        TrainerFactory.getEventTrainer. 
        
        And we also need a jira for this.
        
        > In my not-so-expert opinion, I think adding a train(Dataindexer)
        > method to EventTrainer is the best way forward.
        
        We have might have the case that certain trainers don't support the
        Data Indexer, for example deeplearning libraries, but for them we
        probably have to add a new trainer type anyway.
        
        Would this work also when we refactor the Data Indexer? Then a user a
        would create a Data Indexer instance, calls the index method and passes
        it in. Should be fine. 
        
        I have time I can dedicate to help out with refactoring things.
        
        Jörn

Re: new tool training

Posted by Cohan Sujay Carlos <co...@aiaioo.com>.

Joern,

Regarding the deprecation of GIS, I noticed that OpenNLP had been
refactored between 1.5.2 and 1.6.0 to move the constant *MAXENT_VALUE
<https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/ml/maxent/GIS.html#MAXENT_VALUE>*
(that
is used to select a MaxEnt classifier) to the GIS class from TrainUtil (
*opennlp.model.TrainUtil*).

Since MAXENT_VALUE had been moved to GIS, I put the corresponding constant
for the Naive Bayes classifier (*NAIVE_BAYES_VALUE*) into the
NaiveBayesTrainer class.

So, if you're moving constants back to something like the TrainUtil, you
may want to move the NAIVE_BAYES_VALUE over too.

Cohan Sujay Carlos

On Mon, Nov 7, 2016 at 2:44 PM, Joern Kottmann <ko...@gmail.com> wrote:

> I also opened a Jira now for the refactoring effort:
> https://issues.apache.org/jira/browse/OPENNLP-880
>
> Jörn
>
> On Sun, Nov 6, 2016 at 4:41 PM, Joern Kottmann <ko...@gmail.com> wrote:
>
> > On Fri, 2016-11-04 at 12:58 +0000, Russ, Daniel (NIH/CIT) [E] wrote:
> > > Hi Jörn,
> > >    Currently, GIS is a wrapper for the GISTrainer, which ironically
> > > is NOT an EventTrainer.  There are several ways to handle this.
> > >
> > > First, don’t touch GIS.  This is the class that people know best.  It
> > > is easy to use if you use the parameter hard coded in the class.  We
> > > can make GISTrainer an Event trainer and refactor it appropriately to
> > > handle parameters in the map.
> > >
> > > Second, refactor GIS, but keep the static methods in the code. This
> > > allows user to keep their code working.
> > >
> > > Third, refactor GIS and deprecate the static methods.
> > >
> > > I don’t see any value to the GIS class other than an Adapter.  My
> > > vote would be the fix the original problem of GISTrainer NOT being an
> > > event trainer.
> >
> >
> > I think we are lucky that GISTrainer is not a public class, so that
> > allows us to modify it as we need. I agree with you, we should make it
> > an EventTrainer.
> >
> > The GIS class should probably just be deprecated and then removed
> > later.
> >
> > Regards,
> > Jörn
> >
>

Re: new tool training

Posted by Joern Kottmann <ko...@gmail.com>.

I also opened a Jira now for the refactoring effort:
https://issues.apache.org/jira/browse/OPENNLP-880

Jörn

On Sun, Nov 6, 2016 at 4:41 PM, Joern Kottmann <ko...@gmail.com> wrote:

> On Fri, 2016-11-04 at 12:58 +0000, Russ, Daniel (NIH/CIT) [E] wrote:
> > Hi Jörn,
> >    Currently, GIS is a wrapper for the GISTrainer, which ironically
> > is NOT an EventTrainer.  There are several ways to handle this.
> >
> > First, don’t touch GIS.  This is the class that people know best.  It
> > is easy to use if you use the parameter hard coded in the class.  We
> > can make GISTrainer an Event trainer and refactor it appropriately to
> > handle parameters in the map.
> >
> > Second, refactor GIS, but keep the static methods in the code. This
> > allows user to keep their code working.
> >
> > Third, refactor GIS and deprecate the static methods.
> >
> > I don’t see any value to the GIS class other than an Adapter.  My
> > vote would be the fix the original problem of GISTrainer NOT being an
> > event trainer.
>
>
> I think we are lucky that GISTrainer is not a public class, so that
> allows us to modify it as we need. I agree with you, we should make it
> an EventTrainer.
>
> The GIS class should probably just be deprecated and then removed
> later.
>
> Regards,
> Jörn
>

Re: new tool training

Posted by Joern Kottmann <ko...@gmail.com>.

On Fri, 2016-11-04 at 12:58 +0000, Russ, Daniel (NIH/CIT) [E] wrote:
> Hi J�rn,
> ���Currently, GIS is a wrapper for the GISTrainer, which ironically
> is NOT an EventTrainer.��There are several ways to handle this.
> 
> First, don\u2019t touch GIS.��This is the class that people know best.��It
> is easy to use if you use the parameter hard coded in the class.��We
> can make GISTrainer an Event trainer and refactor it appropriately to
> handle parameters in the map.
> 
> Second, refactor GIS, but keep the static methods in the code. This
> allows user to keep their code working.
> 
> Third, refactor GIS and deprecate the static methods.�
> 
> I don\u2019t see any value to the GIS class other than an Adapter.��My
> vote would be the fix the original problem of GISTrainer NOT being an
> event trainer.�


I think we are lucky that GISTrainer is not a public class, so that
allows us to modify it as we need. I agree with you, we should make it
an EventTrainer.

The GIS class should probably just be deprecated and then removed
later.

Regards,
J�rn