Posted to users@opennlp.apache.org by György Chityil <gy...@gmail.com> on 2011/08/15 14:00:33 UTC

POSTagger handling of punctuation

Hello,

I noticed that the POSTagger appends the tag after the punctuation when a word is directly followed by a punctuation mark, like this:

questions?_NN
specifics,_NN

I guess it should look like this:

questions_NN?
specifics_NN,

Cheers,
Gyuri

Re: POSTagger handling of punctuation

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/15/11 2:47 PM, György Chityil wrote:
> It seems Dr. and the first double quotes are not tokenized. I guess Dr.
> should not be tokenized, while the double quotes are missed in this case.

You are getting this as a token: "Dr.

It is not a bug in our code, but rather a problem with the statistical
model. Such mistakes are usually fixed by adding more training data.

Jörn

Re: POSTagger handling of punctuation

Posted by György Chityil <gy...@gmail.com>.
Hello Jörn,

While testing, I think I found some issues.
Here is a made-up sample sentence I tried just now to test punctuation:

"
Dr. George wrote this book; it's his second publication after publishing
tons of books such as "500 tips" and "kick by kick" on top of the list.
"


The tokenizer gives this:

"
Dr. George wrote this book ; it 's his second publication after publishing
tons of books such as "500 tips " and "kick by kick " on top of the list .
"

It seems that "Dr." and the opening double quotes are not split into
separate tokens. I guess "Dr." should stay as one token, but the opening
double quotes are missed in this case.
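
One way to spot such cases in tokenizer output is to scan the whitespace-separated tokens for a double quote that is still attached to a word. A minimal sketch in plain Java follows; it makes no OpenNLP calls, and the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class GluedQuoteCheck {

    // Returns every token longer than one character that still begins
    // or ends with a double quote, i.e. a quote that was not split off.
    static List<String> gluedQuotes(String tokenizedLine) {
        List<String> suspects = new ArrayList<>();
        for (String token : tokenizedLine.trim().split("\\s+")) {
            boolean glued = token.length() > 1
                    && (token.startsWith("\"") || token.endsWith("\""));
            if (glued) {
                suspects.add(token);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Second line of the tokenizer output quoted above
        String out = "tons of books such as \"500 tips \" and "
                + "\"kick by kick \" on top of the list .";
        System.out.println(gluedQuotes(out));
    }
}
```

A quote that stands alone as its own token is not reported, so only the opening quotes in the sample would be flagged.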


Cheers,
Gyuri





-- 
Gyuri
274 44 98
06 30 5888 744

Re: POSTagger handling of punctuation

Posted by György Chityil <gy...@gmail.com>.
Thanks Jörn, I was unaware that I was supposed to tokenize first :)

I just fed the essay straight to the POSTagger.

I will try to tokenize it now and report back.


Re: POSTagger handling of punctuation

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/15/11 2:00 PM, György Chityil wrote:
> I noticed the POSTagger adds info to words next to a punctuation like this
>
> questions?_NN
> specifics,_NN
>
> I guess it should be like
>
> questions_NN?
> specifics_NN,

It looks like you don't tokenize the input sentence correctly. Maybe you
can post a little more context; then I can give you a better answer.

Jörn
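
To illustrate what tokenizing first changes, here is a minimal sketch in plain Java. It uses a toy regular expression as a stand-in for a real tokenizer (it is not OpenNLP's tokenizer, and the class and method names are made up for illustration) and compares it with plain whitespace splitting, which leaves punctuation glued to words as in the questions?_NN example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctuationSplitDemo {

    // Toy rule (an assumption, not OpenNLP's behavior): a token is either
    // a run of letters, digits, or apostrophes, or one non-space symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}']+|\\S");

    // Splits a sentence so that punctuation marks become separate tokens.
    static List<String> splitPunctuation(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "Any questions? Send specifics, please.";
        // Whitespace-only splitting keeps "questions?" as a single token
        System.out.println(String.join(" | ", sentence.split("\\s+")));
        // Separating punctuation yields "questions" and "?" as two tokens
        System.out.println(String.join(" | ", splitPunctuation(sentence)));
    }
}
```

With real input, the tagger would then receive the token sequence produced by the tokenizer rather than the raw text, so each punctuation mark gets its own tag instead of being tagged together with the neighboring word.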