Posted to users@opennlp.apache.org by Markus Kreuzthaler <ma...@gmail.com> on 2017/09/27 08:13:34 UTC

Tokenizer in NameFinderME

Hello!

Does anyone know which tokenizer is used when training a custom named
entity recognition model with NameFinderME? I searched but could not
find this information.

I have to apply the same tokenizer when using the trained model, but I
don't know which one was used.

Therefore at the moment I just tokenize via:
String[] tokens = sentence.getCoveredText().split("\\s+");

Thank you for your feedback!

Best regards, Markus
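
A note on the split("\\s+") call above: in Java, a plain String.split on a whitespace pattern returns an empty first token when the text starts with whitespace, which would then be passed on to the name finder as a token. The sketch below uses only the standard library; the class name, the helper name and the sample sentence are made up for illustration and are not part of OpenNLP.

```java
import java.util.Arrays;

public class WhitespaceSplit {
    // Split on runs of whitespace, trimming first so that leading
    // whitespace does not produce an empty first token.
    public static String[] tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // "".split(...) would return one empty token, so return none.
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical input with leading whitespace.
        String sentence = "  Anna Muster works in Graz .";

        // Plain split keeps an empty string as the first element.
        System.out.println(Arrays.toString(sentence.split("\\s+")));

        // The helper returns only the non-empty tokens.
        System.out.println(Arrays.toString(tokenize(sentence)));
    }
}
```

If the covered text can never start with whitespace, the plain split behaves the same as the helper; the trim only guards against that edge case.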

Re: Tokenizer in NameFinderME

Posted by Markus Kreuzthaler <ma...@gmail.com>.

Hi Jörn!

Thank you!
This issue is solved for me...

Best regards, Markus


2017-09-28 14:19 GMT+02:00 Joern Kottmann <ko...@gmail.com>:

> Use the same tokenizer that you used to tokenize the training data. The
> default format assumes the input text is whitespace-tokenized and uses
> the whitespace tokenizer to detect the tokens. When applying the model,
> you need to use the tokenizer that was used for the training data.
>
> Jörn

Re: Tokenizer in NameFinderME

Posted by Joern Kottmann <ko...@gmail.com>.

Use the same tokenizer that you used to tokenize the training data. The
default format assumes the input text is whitespace-tokenized and uses
the whitespace tokenizer to detect the tokens. When applying the model,
you need to use the tokenizer that was used for the training data.

Jörn
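
The point about the default format can be illustrated without OpenNLP itself. The sketch below is not OpenNLP's parser; it is a simplified stand-in that assumes the <START:type> ... <END> annotation markers of the default name finder training format and a made-up sentence. It only shows that the tokens in a training line are whatever the whitespace separates, so the tokenizer used at runtime has to produce the same boundaries. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TrainingLineTokens {
    // Simplified illustration: drop the <START:type> and <END> markers and
    // keep every other whitespace-separated chunk as one token.
    public static List<String> tokens(String annotatedLine) {
        List<String> result = new ArrayList<>();
        for (String part : annotatedLine.trim().split("\\s+")) {
            boolean isMarker = part.equals("<END>")
                    || (part.startsWith("<START") && part.endsWith(">"));
            if (!part.isEmpty() && !isMarker) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical training line; the final period is its own token
        // only because it is separated by whitespace.
        String line = "<START:person> Anna Muster <END> works in "
                + "<START:location> Graz <END> .";
        System.out.println(tokens(line));
    }
}
```

If the training data was prepared with whitespace-separated tokens like this, OpenNLP's WhitespaceTokenizer (WhitespaceTokenizer.INSTANCE.tokenize(text)) is the natural choice at runtime. If the data was produced with a different tokenizer and then written out with spaces between its tokens, that same tokenizer should be applied to new text instead.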

On Thu, Sep 28, 2017 at 8:45 AM, Markus Kreuzthaler
<ma...@gmail.com> wrote:
> Hi Jeff!
>
> Thank you for this hint!
> Yes, looks like the WhitespaceTokenizer is used in this case...
>
> All the best!
>
> Markus

Re: Tokenizer in NameFinderME

Posted by Markus Kreuzthaler <ma...@gmail.com>.

Hi Jeff!

Thank you for this hint!
Yes, looks like the WhitespaceTokenizer is used in this case...

All the best!

Markus


2017-09-27 13:03 GMT+02:00 Jeff Zemerick <jz...@apache.org>:

> Markus,
>
> I believe the WhitespaceTokenizer is used [1].
>
> Jeff
>
> [1]
> https://github.com/apache/opennlp/blob/4362e02ed0404d12ca75ee3476d4a32f9f671811/opennlp-tools/src/main/java/opennlp/tools/namefind/NameSample.java#L220

Re: Tokenizer in NameFinderME

Posted by Jeff Zemerick <jz...@apache.org>.

Markus,

I believe the WhitespaceTokenizer is used [1].

Jeff

[1]
https://github.com/apache/opennlp/blob/4362e02ed0404d12ca75ee3476d4a32f9f671811/opennlp-tools/src/main/java/opennlp/tools/namefind/NameSample.java#L220
