Posted to users@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2017/01/11 22:51:44 UTC

Anyone using the UIMA trainer AEs?

Hello all,

the UIMA integration contains AEs which can be used to train models if
a UIMA pipeline is set up to process some kind of corpus.

I have the impression that this is kind of dead/unused code.

I opened an issue [1] to deprecate it and would like to know whether
there is any interest in keeping that code. Is anyone here using it?

Please share your opinion with us so we can make a good decision!

Jörn

[1] https://issues.apache.org/jira/browse/OPENNLP-928

Re: Anyone using the UIMA trainer AEs?

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 13.01.2017, at 23:33, Peter Klügl <pk...@gmail.com> wrote:
> 
> I do not recall the exact licenses and their implications right now but Genia [1] or English Universal Dependencies [2], for example, should do the trick (with some converting). Genia contains inline xml tags for words/tokens and the English UD contains information about the spaces.

English UD is actually a good idea :) Now we also have a
test for training the OpenNLP tokenizer in DKPro Core.

Thanks!

-- Richard

Re: Anyone using the UIMA trainer AEs?

Posted by Peter Klügl <pk...@gmail.com>.

On 13.01.2017 at 21:12, Richard Eckart de Castilho wrote:
> ...
> I think the problem was that the data I had easily available was in a CoNLL format - you cannot train a tokenizer from most CoNLL formats because there is no information whether two tokens are directly adjacent or not.
>
> Do you have a suggestion for a publicly available corpus that contains offset information and which would be suitable?


I do not recall the exact licenses and their implications right now but 
Genia [1] or English Universal Dependencies [2], for example, should do 
the trick (with some converting). Genia contains inline xml tags for 
words/tokens and the English UD contains information about the spaces.
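
To illustrate the "some converting" step for UD: the CoNLL-U format records a missing space after a token as SpaceAfter=No in the MISC column, and the OpenNLP tokenizer training format marks such token boundaries with <SPLIT>. The following is only a minimal sketch under those assumptions. The class and method names are made up for illustration, the sample rows are hand-written rather than taken from the corpus, and multiword-token ranges are simply skipped.

```java
import java.util.Arrays;
import java.util.List;

final class ConlluToTokenizerTraining {

    /** Marker placed between two tokens that are not separated by whitespace. */
    static final String SPLIT = "<SPLIT>";

    /** Builds one tokenizer training line from the rows of a single CoNLL-U sentence. */
    static String toTrainingLine(List<String> conlluLines) {
        StringBuilder line = new StringBuilder();
        boolean noSpaceAfterPrevious = false;
        for (String row : conlluLines) {
            if (row.isEmpty() || row.startsWith("#")) {
                continue; // blank line or sentence-level comment
            }
            String[] cols = row.split("\t");
            // Skip malformed rows, multiword-token ranges ("1-2") and empty nodes ("1.1").
            if (cols.length < 10 || cols[0].contains("-") || cols[0].contains(".")) {
                continue;
            }
            if (line.length() > 0) {
                line.append(noSpaceAfterPrevious ? SPLIT : " ");
            }
            line.append(cols[1]); // FORM column
            // MISC (10th column) is a '|'-separated list of attributes.
            noSpaceAfterPrevious =
                Arrays.asList(cols[9].split("\\|")).contains("SpaceAfter=No");
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Hand-written sample rows (hypothetical annotations).
        List<String> sentence = Arrays.asList(
            "# text = Hello, world!",
            "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\tSpaceAfter=No",
            "2\t,\t,\tPUNCT\t,\t_\t1\tpunct\t_\t_",
            "3\tworld\tworld\tNOUN\tNN\t_\t1\tvocative\t_\tSpaceAfter=No",
            "4\t!\t!\tPUNCT\t.\t_\t1\tpunct\t_\t_");
        System.out.println(toTrainingLine(sentence)); // Hello<SPLIT>, world<SPLIT>!
    }
}
```

A real converter would also have to decide how to treat multiword tokens, since the space information for them sits on the range row that this sketch ignores.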

Best,

Peter

[1] http://www.geniaproject.org/genia-corpus/pos-annotation
[2] https://github.com/UniversalDependencies/UD_English

Re: Anyone using the UIMA trainer AEs?

Posted by Joern Kottmann <ko...@gmail.com>.

No, we don't have any.

It should be possible to take the tokenizer training data as it is and
evaluate on it.

+1 to add detokenizer evaluation
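
One way such an evaluation could look, as a rough sketch only: each line of tokenizer training data (with <SPLIT> markers between adjacent tokens) is reduced to its tokens, passed to a detokenizer, and the output is compared with the original line. The class name, the exact-match metric and the Function-based detokenizer hook are assumptions made for this illustration; they are not the OpenNLP API and not necessarily what the issue will implement.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

final class DetokenizerEvaluationSketch {

    static final String SPLIT = "<SPLIT>";

    /** Tokens of a training line: split on whitespace and on the split marker. */
    static String[] tokens(String trainingLine) {
        return trainingLine.trim().split("\\s+|" + Pattern.quote(SPLIT));
    }

    /**
     * Fraction of lines that the detokenizer reproduces exactly.
     * The detokenizer is expected to return a line in the same format,
     * i.e. with the split marker between tokens it attaches to each other.
     */
    static double accuracy(List<String> trainingLines,
                           Function<String[], String> detokenizer) {
        if (trainingLines.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (String gold : trainingLines) {
            String predicted = detokenizer.apply(tokens(gold));
            if (predicted.equals(gold)) {
                correct++;
            }
        }
        return (double) correct / trainingLines.size();
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList(
            "Hello<SPLIT>, world<SPLIT>!",
            "No punctuation here");
        // Trivial baseline that never attaches tokens; a real detokenizer goes here.
        Function<String[], String> baseline = toks -> String.join(" ", toks);
        System.out.println(accuracy(data, baseline)); // 0.5
    }
}
```

Exact line matching is strict; counting correct attach/separate decisions per token boundary would give a finer-grained score.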

Jörn

On Sun, 2017-01-15 at 02:42 +0100, Richard Eckart de Castilho wrote:
> On 14.01.2017, at 20:54, Joern Kottmann <ko...@gmail.com> wrote:
> > 
> > You can do that, we have a rule based detokenizer which can be used
> > to produce training data from tokenized input.
> > 
> > Have a look at the detokenizer in the tokenizer package.
> 
> However, do you have any evaluation of the detokenizer?
> 
> Cheers,
> 
> -- Richard

Re: Anyone using the UIMA trainer AEs?

Posted by Joern Kottmann <ko...@gmail.com>.

On Sun, 2017-01-15 at 02:42 +0100, Richard Eckart de Castilho wrote:
> On 14.01.2017, at 20:54, Joern Kottmann <ko...@gmail.com> wrote:
> > 
> > You can do that, we have a rule based detokenizer which can be used
> > to produce training data from tokenized input.
> > 
> > Have a look at the detokenizer in the tokenizer package.
> 
> However, do you have any evaluation of the detokenizer?
> 

I opened an issue to add that:
https://issues.apache.org/jira/browse/OPENNLP-941

Jörn

Re: Anyone using the UIMA trainer AEs?

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 14.01.2017, at 20:54, Joern Kottmann <ko...@gmail.com> wrote:
> 
> You can do that, we have a rule based detokenizer which can be used to
> produce training data from tokenized input.
> 
> Have a look at the detokenizer in the tokenizer package.

However, do you have any evaluation of the detokenizer?

Cheers,

-- Richard

Re: Anyone using the UIMA trainer AEs?

Posted by Joern Kottmann <ko...@gmail.com>.

On Fri, 2017-01-13 at 21:12 +0100, Richard Eckart de Castilho wrote:
> On 13.01.2017, at 11:15, Peter Klügl <pe...@averbis.com> wrote:
> >
> > On 13.01.2017 at 08:19, Richard Eckart de Castilho wrote:
> >> ...
> >>
> >> In theory there is also a trainer for the tokenizer, but I haven't
> >> been able yet to set up a working unit test for it. I think that was
> >> due to an immediate lack of suitable training data. So it remains on
> >> the todo list.
> >>
> >
> > we have several OpenNLP tokenizer models. Aren't most corpora, e.g.,
> > annotated with POS tags, suitable?
> 
> I think the problem was that the data I had easily available was in a
> CoNLL format - you cannot train a tokenizer from most CoNLL formats
> because there is no information whether two tokens are directly
> adjacent or not.


You can do that: we have a rule-based detokenizer which can be used to
produce training data from tokenized input.

Have a look at the detokenizer in the tokenizer package.
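
To make the idea concrete, here is a deliberately simplified, stand-alone sketch of rule-based detokenization that emits tokenizer training lines. It is not the OpenNLP detokenizer or its API; the class name and the two small rule sets are assumptions chosen for the example. Tokens that attach to their neighbour are joined with a marker (<SPLIT> for training data) instead of a space.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class RuleBasedDetokenizerSketch {

    /** Tokens that attach to the token on their left (illustrative rule set). */
    static final Set<String> ATTACH_LEFT =
        new HashSet<>(Arrays.asList(".", ",", "!", "?", ";", ":", ")"));

    /** Tokens that attach to the token on their right (illustrative rule set). */
    static final Set<String> ATTACH_RIGHT =
        new HashSet<>(Arrays.asList("("));

    /**
     * Joins tokens into one line. Attached tokens are separated by splitMarker,
     * all other tokens by a single space.
     */
    static String detokenize(String[] tokens, String splitMarker) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                boolean attach = ATTACH_LEFT.contains(tokens[i])
                    || ATTACH_RIGHT.contains(tokens[i - 1]);
                out.append(attach ? splitMarker : " ");
            }
            out.append(tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "said", "(", "quietly", ")", ":", "no", "."};
        // Training line with explicit split markers.
        System.out.println(detokenize(tokens, "<SPLIT>"));
        // Plain text: an empty marker simply glues attached tokens together.
        System.out.println(detokenize(tokens, ""));
    }
}
```

Rules like these cannot resolve ambiguous tokens such as straight quotes, which open or close depending on context; that is one reason an evaluation of the detokenizer output is worth having.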

Jörn

Re: Anyone using the UIMA trainer AEs?

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 13.01.2017, at 11:15, Peter Klügl <pe...@averbis.com> wrote:
> 
> On 13.01.2017 at 08:19, Richard Eckart de Castilho wrote:
>> ...
>> 
>> In theory there is also a trainer for the tokenizer, but I haven't been able yet to set up a working unit test for it. I think that was due to an immediate lack of suitable training data. So it remains on the todo list.
>> 
> 
> we have several OpenNLP tokenizer models. Aren't most corpora, e.g.,
> annotated with POS tags, suitable?

I think the problem was that the data I had easily available was in a CoNLL format - you cannot train a tokenizer from most CoNLL formats because there is no information whether two tokens are directly adjacent or not.

Do you have a suggestion for a publicly available corpus that contains offset information and which would be suitable?
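
A small sketch of why offsets matter: the token list "Stop", ",", "now", "." alone does not say which tokens touch, but character offsets do. Assuming tokens are given as [begin, end) spans over the original text (the class and method names are hypothetical), a tokenizer training line with <SPLIT> markers could be derived like this:

```java
final class OffsetsToTokenizerTraining {

    /**
     * Builds a tokenizer training line from the text and its token spans.
     * Each span is {begin, end} with an exclusive end; spans must be sorted
     * and non-overlapping.
     */
    static String toTrainingLine(String text, int[][] spans) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < spans.length; i++) {
            if (i > 0) {
                // Two tokens are directly adjacent when one ends where the next begins.
                boolean adjacent = spans[i - 1][1] == spans[i][0];
                line.append(adjacent ? "<SPLIT>" : " ");
            }
            line.append(text, spans[i][0], spans[i][1]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String text = "Stop, now.";
        int[][] spans = {{0, 4}, {4, 5}, {6, 9}, {9, 10}};
        System.out.println(toTrainingLine(text, spans)); // Stop<SPLIT>, now<SPLIT>.
    }
}
```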

Cheers,

-- Richard

Re: Anyone using the UIMA trainer AEs?

Posted by Peter Klügl <pe...@averbis.com>.

Hi,


On 13.01.2017 at 08:19, Richard Eckart de Castilho wrote:
> ...
>
> In theory there is also a trainer for the tokenizer, but I haven't been able yet to set up a working unit test for it. I think that was due to an immediate lack of suitable training data. So it remains on the todo list.
>

we have several OpenNLP tokenizer models. Aren't most corpora, e.g.,
annotated with POS tags, suitable?

Best,

Peter



> Cheers,
>
> -- Richard
>
>> On 11.01.2017, at 23:51, Joern Kottmann <ko...@gmail.com> wrote:
>>
>> Hello all,
>>
>> the UIMA integration contains AEs which can be used to train models if
>> a UIMA pipeline is set up to process some kind of corpus.
>>
>> I have the impression that this is kind of dead/unused code.
>>
>> I opened an issue [1] to deprecate it and would like to know if there
>> is any interest in keeping that code? Do we have someone here using
>> that?
>>
>> Please share your opinion with us so we can make a good decision!
>>
>> Jörn
>>
>> [1] https://issues.apache.org/jira/browse/OPENNLP-928

Re: Anyone using the UIMA trainer AEs?

Posted by Richard Eckart de Castilho <re...@apache.org>.

Hi Jörn,

DKPro Core also started adding trainer UIMA components for OpenNLP. We now have:

- OpenNlpChunkerTrainer
- OpenNlpLemmatizerTrainer
- OpenNlpNamedEntityRecognizerTrainer
- OpenNlpPosTaggerTrainer
- OpenNlpSentenceTrainer

In theory there is also a trainer for the tokenizer, but I haven't yet been able to set up a working unit test for it. I think that was due to an immediate lack of suitable training data. So it remains on the todo list.

Cheers,

-- Richard

> On 11.01.2017, at 23:51, Joern Kottmann <ko...@gmail.com> wrote:
> 
> Hello all,
> 
> the UIMA integration contains AEs which can be used to train models if
> a UIMA pipeline is set up to process some kind of corpus.
> 
> I have the impression that this is kind of dead/unused code.
> 
> I opened an issue [1] to deprecate it and would like to know if there
> is any interest in keeping that code? Do we have someone here using
> that?
> 
> Please share your opinion with us so we can make a good decision!
> 
> Jörn
> 
> [1] https://issues.apache.org/jira/browse/OPENNLP-928

Re: Anyone using the UIMA trainer AEs?

Posted by Joern Kottmann <ko...@gmail.com>.

Thanks for sharing this.

I believe it would be better to re-build this anyway.

If there is general interest, we could hook UIMA up to the formats
package. A user would then provide us with the necessary parts (a Collection
Reader, maybe a type system converter AE) in order to process CASes. OpenNLP
would just take care of starting up the pipeline, running it, and being
smart about using the output to train a model.

Jörn

On Thu, Jan 12, 2017 at 2:10 PM, Peter Klügl <pe...@averbis.com> wrote:

> Hi,
>
>
> we have our own wrappers for applying and training. The AEs can be
> removed from our point of view.
>
>
> Best,
>
>
> Peter
>
>
> On 11.01.2017 at 23:51, Joern Kottmann wrote:
> > Hello all,
> >
> > the UIMA integration contains AEs which can be used to train models if
> > a UIMA pipeline is set up to process some kind of corpus.
> >
> > I have the impression that this is kind of dead/unused code.
> >
> > I opened an issue [1] to deprecate it and would like to know if there
> > is any interest in keeping that code? Do we have someone here using
> > that?
> >
> > Please share your opinion with us so we can make a good decision!
> >
> > Jörn
> >
> > [1] https://issues.apache.org/jira/browse/OPENNLP-928
>
>

Re: Anyone using the UIMA trainer AEs?

Posted by Peter Klügl <pe...@averbis.com>.

Hi,


we have our own wrappers for applying and training. The AEs can be
removed from our point of view.


Best,


Peter


On 11.01.2017 at 23:51, Joern Kottmann wrote:
> Hello all,
>
> the UIMA integration contains AEs which can be used to train models if
> a UIMA pipeline is set up to process some kind of corpus.
>
> I have the impression that this is kind of dead/unused code.
>
> I opened an issue [1] to deprecate it and would like to know if there
> is any interest in keeping that code? Do we have someone here using
> that?
>
> Please share your opinion with us so we can make a good decision!
>
> Jörn
>
> [1] https://issues.apache.org/jira/browse/OPENNLP-928

Re: Anyone using the UIMA trainer AEs?

Posted by Joern Kottmann <ko...@gmail.com>.

We have now marked them as deprecated. If you still need them, don't
hesitate to let us know.

Jörn

On Wed, 2017-01-11 at 23:51 +0100, Joern Kottmann wrote:
> Hello all,
> 
> the UIMA integration contains AEs which can be used to train models if
> a UIMA pipeline is set up to process some kind of corpus.
> 
> I have the impression that this is kind of dead/unused code.
> 
> I opened an issue [1] to deprecate it and would like to know if there
> is any interest in keeping that code? Do we have someone here using
> that?
> 
> Please share your opinion with us so we can make a good decision!
> 
> Jörn
> 
> [1] https://issues.apache.org/jira/browse/OPENNLP-928