
Posted to users@opennlp.apache.org by Thomas Zastrow <po...@thomas-zastrow.de> on 2013/10/14 21:27:56 UTC

Training NEs?

Hello,

I have a question: when creating training material, does it make a
difference if there are " " (blanks) around the NE? In other words, is
it the same to have:

<START:loc>Hamburg<END>

or:

<START:loc> Hamburg <END>

The example in the documentation uses the version with the " " ... ?

Best,

Tom

P.S.: ca. 1300 sentences for a free German NE model are done :-)

Re: Training NEs?

Posted by Richard Eckart de Castilho <re...@apache.org>.

Do you have a need for tokens that contain spaces? Otherwise, space-separated
tokens appear to be a pretty good approach.

-- Richard

On 14.10.2013, at 21:59, Thomas Zastrow <po...@thomas-zastrow.de> wrote:

> Hello,
> 
> In any case, I think it's a little bit old-school to identify tokens and
> additional annotations just with spaces between them ... what about a
> nice XML format (no, not that ISO crap .. what about TCF [1])? Or maybe
> NEGRA?
> 
> Best,
> 
> Tom
> 
> [1]
> http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format

Re: Training NEs?

Posted by Thomas Zastrow <po...@thomas-zastrow.de>.

Hello Jörn,

Yes, I have a lot of code around TCF; I will see how it can be
integrated. At least, I'll need importers/exporters for OpenNLP/TCF
anyway :-)
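
An importer of the kind mentioned here could look roughly like the
following Python sketch. It is only an illustration under stated
assumptions: the namespace, the element names (tokens/token,
sentences/sentence, namedEntities/entity) and the attributes (ID,
tokenIDs, class) are assumed from memory of the TCF layout and should be
checked against the TCF specification. The function name tcf_to_opennlp
and the lower-casing of the entity class are choices made for this
example, not part of OpenNLP or TCF.

```python
import xml.etree.ElementTree as ET

# Assumed TCF text-corpus namespace; verify against the TCF specification.
NS = {"tc": "http://www.dspin.de/data/textcorpus"}


def tcf_to_opennlp(xml_text):
    """Convert a TCF-like document into OpenNLP name finder training lines."""
    root = ET.fromstring(xml_text)
    # The TextCorpus element may be the root or nested inside a wrapper.
    if root.tag.endswith("TextCorpus"):
        corpus = root
    else:
        corpus = root.find(".//tc:TextCorpus", NS)

    # Map token IDs to their surface strings.
    tokens = {t.get("ID"): t.text for t in corpus.iterfind("tc:tokens/tc:token", NS)}

    # Remember where each entity starts (with its type) and where it ends.
    start_tag = {}
    end_after = set()
    for ent in corpus.iterfind("tc:namedEntities/tc:entity", NS):
        ids = ent.get("tokenIDs").split()
        start_tag[ids[0]] = ent.get("class").lower()  # lower-casing is a choice
        end_after.add(ids[-1])

    # Emit one line per sentence, with a blank between all elements.
    lines = []
    for sent in corpus.iterfind("tc:sentences/tc:sentence", NS):
        parts = []
        for tid in sent.get("tokenIDs").split():
            if tid in start_tag:
                parts.append(f"<START:{start_tag[tid]}>")
            parts.append(tokens[tid])
            if tid in end_after:
                parts.append("<END>")
        lines.append(" ".join(parts))
    return lines


# Hypothetical sample document in the assumed layout.
SAMPLE = """<TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
  <tokens>
    <token ID="t1">Ich</token>
    <token ID="t2">wohne</token>
    <token ID="t3">in</token>
    <token ID="t4">Hamburg</token>
    <token ID="t5">.</token>
  </tokens>
  <sentences>
    <sentence ID="s1" tokenIDs="t1 t2 t3 t4 t5"/>
  </sentences>
  <namedEntities type="example">
    <entity class="LOC" tokenIDs="t4"/>
  </namedEntities>
</TextCorpus>"""

if __name__ == "__main__":
    for line in tcf_to_opennlp(SAMPLE):
        print(line)
```

An exporter would go the other way: split each training line on blanks,
assign token IDs, and write the entity spans back as token ID lists.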

Best,

Tom



On 15.10.2013 at 10:06, Jörn Kottmann wrote:
> OpenNLP is designed to support many formats for training, but we had to
> decide on one default format, and that is the one which was always
> supported.
> 
> We can support the proposed TCF format; are you interested in contributing
> parsing code for it?
> 
> Jörn

Re: Training NEs?

Posted by Jörn Kottmann <ko...@gmail.com>.

OpenNLP is designed to support many formats for training, but we had to
decide on one default format, and that is the one which was always
supported.

We can support the proposed TCF format; are you interested in contributing
parsing code for it?

Jörn

On 10/14/2013 09:59 PM, Thomas Zastrow wrote:
> Hello,
>
> In any case, I think it's a little bit old-school to identify tokens and
> additional annotations just with spaces between them ... what about a
> nice XML format (no, not that ISO crap .. what about TCF [1])? Or maybe
> NEGRA?
>
> Best,
>
> Tom
>
> [1]
> http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format

Re: Training NEs?

Posted by Thomas Zastrow <po...@thomas-zastrow.de>.

Hello,

In any case, I think it's a little bit old-school to identify tokens and
additional annotations just with spaces between them ... what about a
nice XML format (no, not that ISO crap .. what about TCF [1])? Or maybe
NEGRA?

Best,

Tom

[1]
http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format


On 14.10.2013 at 21:53, Charles Martin wrote:
> What happens if all the entity tokens are at the beginning of every line?
> I find that OpenNLP then thinks that any string near the beginning of a
> line is an entity, regardless of the content or word context.

Re: Training NEs?

Posted by Charles Martin <ch...@gmail.com>.

What happens if all the entity tokens are at the beginning of every line?
I find that OpenNLP then thinks that any string near the beginning of a
line is an entity, regardless of the content or word context.
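
One way to check whether a training file is skewed in this way is to
count where the entities start. The Python sketch below is a minimal
example, assuming the blank-separated <START:type> ... <END> markup
discussed in this thread; the function names are hypothetical, and the
thread does not confirm that position is the cause of the behaviour
described above.

```python
def entity_start_positions(lines):
    """Return the token index at which each annotated entity starts."""
    positions = []
    for line in lines:
        index = 0  # counts real tokens only, not the markup elements
        for part in line.split():
            if part.startswith("<START"):
                positions.append(index)
            elif part != "<END>":
                index += 1
    return positions


def share_at_line_start(lines):
    """Fraction of entities whose first token is the first token of a line."""
    positions = entity_start_positions(lines)
    if not positions:
        return 0.0
    return sum(1 for p in positions if p == 0) / len(positions)


if __name__ == "__main__":
    sample = [
        "<START:loc> Hamburg <END> liegt im Norden .",
        "Ich wohne in <START:loc> Berlin <END> .",
    ]
    print(entity_start_positions(sample))  # [0, 3]
    print(share_at_line_start(sample))     # 0.5
```

A share close to 1.0 would suggest adding sentences in which entities
appear in other positions.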



On Mon, Oct 14, 2013 at 12:48 PM, Thomas Zastrow <po...@thomas-zastrow.de> wrote:

> Thanks. That explains a lot ... :-)
>
> Does it matter whether it is one or two blanks?



Re: Training NEs?

Posted by Thomas Zastrow <po...@thomas-zastrow.de>.

Thanks. That explains a lot ... :-)

Does it matter whether it is one or two blanks?
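
The thread does not settle whether one and two blanks are treated the
same. One way to sidestep the question is to normalize the file before
training, so that every tag is surrounded by exactly one blank. The
following Python sketch assumes the <START:type> / <END> markup shown in
this thread; normalize_line is a hypothetical helper, and it does not
tokenize punctuation, which still has to be separated beforehand.

```python
import re

# Matches an opening tag such as <START:loc> or the closing tag <END>.
TAG = re.compile(r"(<START:[^>]+>|<END>)")


def normalize_line(line):
    """Put blanks around every tag and collapse runs of whitespace."""
    spaced = TAG.sub(r" \1 ", line)
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(normalize_line("<START:loc>Hamburg<END>  liegt im Norden ."))
    # <START:loc> Hamburg <END> liegt im Norden .
```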



On 14.10.2013 at 21:44, William Colen wrote:
> Yes, it does. Include a blank between all elements, including punctuation
> and annotations. The corpus must be tokenized.

Re: Training NEs?

Posted by William Colen <wi...@gmail.com>.

Yes, it does. Include a blank between all elements, including punctuation
and annotations. The corpus must be tokenized.
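
As a rough illustration of this spacing rule, the following Python
sketch builds one training line from an already tokenized sentence. The
helper name to_training_line and the (start, end, type) span layout are
made up for this example and are not part of OpenNLP; only the
<START:type> ... <END> markup comes from the thread.

```python
def to_training_line(tokens, spans):
    """Join tokens and entity tags with single blanks.

    tokens: list of token strings (punctuation already split off).
    spans:  list of (start, end, type) tuples, end exclusive,
            assumed not to overlap.
    """
    starts = {start: etype for start, _end, etype in spans}
    ends = {end for _start, end, _etype in spans}
    parts = []
    for i, token in enumerate(tokens):
        if i in ends:
            parts.append("<END>")  # close an entity that ended before token i
        if i in starts:
            parts.append(f"<START:{starts[i]}>")
        parts.append(token)
    if len(tokens) in ends:
        parts.append("<END>")  # entity reaching the end of the sentence
    return " ".join(parts)


if __name__ == "__main__":
    tokens = ["Ich", "wohne", "in", "Hamburg", "."]
    print(to_training_line(tokens, [(3, 4, "loc")]))
    # Ich wohne in <START:loc> Hamburg <END> .
```

Because every element is joined with a single blank, tags never end up
glued to a token as in <START:loc>Hamburg<END>.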


2013/10/14 Thomas Zastrow <po...@thomas-zastrow.de>

> Hello,
>
> I have a question: when creating training material, does it make a
> difference if there are " " (blanks) around the NE? In other words, is
> it the same to have:
>
> <START:loc>Hamburg<END>
>
> or:
>
> <START:loc> Hamburg <END>
>
> The example in the documentation shows up with the " " ... ?
>
> Best,
>
> Tom
>
> P.S.: ca. 1300 sentences for a free German NE model are done :-)
>

Re: Training NEs?

Posted by Thomas Zastrow <po...@thomas-zastrow.de>.

Dear Jörn,

Thanks for your answer. I know these tools, but I'm happy (and
effective) with my little self-written tool. If it becomes stable
enough, I will publish it sometime. word2vec sounds interesting; I will
take a look.

Best,

Tom


On 15.10.2013 at 11:02, Jörn Kottmann wrote:
> You can also use a tool like the Apache UIMA Cas Editor, Brat, WebAnno,
> etc. Usually the annotation speed is much higher if you don't need to
> edit a text file yourself.
> 
> The Tagging Server in the sandbox can be used to pre-label data for Brat
> or the Apache UIMA Cas Editor.
> 
> Another tool you should try is word2vec. It can create word clusters
> which can be used as part of the feature generation. In my tests that
> increased the recall by a few percent, but it is still a work in
> progress; it will take a few days until that works with the
> TokenNameFinderTrainer command line tool.
> 
> HTH,
> Jörn

Re: Training NEs?

Posted by Jörn Kottmann <ko...@gmail.com>.

You can also use a tool like the Apache UIMA Cas Editor, Brat, WebAnno,
etc. Usually the annotation speed is much higher if you don't need to
edit a text file yourself.

The Tagging Server in the sandbox can be used to pre-label data for Brat
or the Apache UIMA Cas Editor.

Another tool you should try is word2vec. It can create word clusters
which can be used as part of the feature generation. In my tests that
increased the recall by a few percent, but it is still a work in
progress; it will take a few days until that works with the
TokenNameFinderTrainer command line tool.

HTH,
Jörn

On 10/14/2013 09:27 PM, Thomas Zastrow wrote:
> Hello,
>
> I have a question: when creating training material, does it make a
> difference if there are " " (blanks) around the NE? In other words, is
> it the same to have:
>
> <START:loc>Hamburg<END>
>
> or:
>
> <START:loc> Hamburg <END>
>
> The example in the documentation shows up with the " " ... ?
>
> Best,
>
> Tom
>
> P.S.: ca. 1300 sentences for a free German NE model are done :-)