Posted to users@opennlp.apache.org by Rohana Rajapakse <Ro...@gossinteractive.com> on 2011/02/22 14:18:35 UTC

Tokenizer issue - Quotation marks

Hi,

 

I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
made "mistakes".  It gives me "mistakes as a token (note that the opening
quote is part of the token). But if I change the word mistakes to Mistakes
(i.e. capital M) in the input text, then I get the token Mistakes
(correctly).
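
For reference, a minimal sketch of how this tokenization is typically invoked (the model path and the surrounding setup are my assumptions, not from the original mail):

import java.io.FileInputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizeDemo {
  public static void main(String[] args) throws Exception {
    // Load a pre-trained English token model (path is an assumption).
    TokenizerModel model = new TokenizerModel(new FileInputStream("en-token.bin"));
    TokenizerME tokenizer = new TokenizerME(model);
    // As reported above, "mistakes comes back with the opening quote attached.
    for (String token : tokenizer.tokenize("The army had made \"mistakes\".")) {
      System.out.println(token);
    }
  }
}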

 

Is anyone aware of this issue? Any idea of how to get around it?

 

Thanks

 

Rohana

 

 

 

 







Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the opening
> quote is part of the token). But if I change the word mistakes to Mistakes
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
>  
>
> Is anyone aware of this issue? Any idea of how to get around it?
>
>  
From the comment on the download site, it looks like the model was
trained with OpenNLP data.  This usually means it doesn't contain very
many samples.  I was able to add some samples to my (own) model and
got a model that works well.

For those interested: the tokenizer usually gets passed data after the
sentence detector.  The tokenizer then splits on punctuation and other
token boundaries.  E.g.:
    This old house was painted white.    -- would have the training data --
    This old house was painted white<SPLIT>.
This indicates we want the period at the end of the sentence to be split
from the last word in the sentence.  Similar ideas hold for the comma
and quote characters.  Special handling is required for possessive nouns:
James' would be James<SPLIT>' and John's would be John<SPLIT>'s ...
Note that words in the sentence that are already separated by spaces
don't need a <SPLIT>.
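
For anyone retraining with extra samples, a minimal training sketch (the file name is an assumption, and the calls are the 1.5-era training API as I recall it, so treat this as a sketch rather than the canonical recipe):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainTokenizer {
  public static void main(String[] args) throws Exception {
    // One sentence per line; <SPLIT> marks token boundaries that are
    // not already whitespace.
    ObjectStream<TokenSample> samples = new TokenSampleStream(
        new PlainTextByLineStream(
            new InputStreamReader(new FileInputStream("en-tok.train"), "UTF-8")));
    // true = use the alphanumeric optimization.
    TokenizerModel model = TokenizerME.train("en", samples, true);
    model.serialize(new FileOutputStream("en-token.bin"));
  }
}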

James


Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the opening
> quote is part of the token). But if I change the word mistakes to Mistakes
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
>  
>
> Is anyone aware of this issue? Any idea of how to get around it?
>
>  
>
> Thanks
>
>  
>
> Rohana
I see the issue; unfortunately, I can't do much about fixing it without
the training data used for the tokenizer.  You can use the
SimpleTokenizer instead, and that appears to work with your sample.
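
A quick way to try that (a small sketch; SimpleTokenizer is rule-based and needs no model file):

import opennlp.tools.tokenize.SimpleTokenizer;

public class SimpleDemo {
  public static void main(String[] args) {
    // SimpleTokenizer splits on character-class changes, so the quotes
    // become their own tokens: The | army | had | made | " | mistakes | " | .
    for (String token : SimpleTokenizer.INSTANCE.tokenize("The army had made \"mistakes\".")) {
      System.out.println(token);
    }
  }
}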

I found a few more samples that don't work:
    This model is "bad."
    This is the "year" of the "pig."

It seems to be a problem when a " is followed by certain characters.

James

Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 9:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>

The mailing list does not allow mail attachments; I believe he attached
it to the mail and our list server removed it.

Jörn

Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/23/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. it fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download? 
>
> Regards
>
> Rohana
>
>
Rohana,

It doesn't take many samples to produce a good model.  I was only
handling simple cases as-is, since many of the models have no freely
available training data.  Even if you get the data, most of it will have
to be hand-parsed into the correct format and tokenized by hand.

The tokenizer expects the sentence detector to get the data first,
generating one sentence per line.  My model was created with about 75
sentences in total.  I only added two to help with the "" characters
around words.  The only time I'd expect single quotes would be quotes
within quotes... which doesn't happen very often.
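
The pipeline described here, sentence detector first and tokenizer second, looks roughly like this (a sketch; the model file names are assumptions):

import java.io.FileInputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class Pipeline {
  public static void main(String[] args) throws Exception {
    SentenceDetectorME sentDetector = new SentenceDetectorME(
        new SentenceModel(new FileInputStream("en-sent.bin")));
    TokenizerME tokenizer = new TokenizerME(
        new TokenizerModel(new FileInputStream("en-token.bin")));
    String text = "This old house was painted white. The army had made \"mistakes\".";
    // Detect sentences first, then tokenize each sentence separately.
    for (String sentence : sentDetector.sentDetect(text)) {
      System.out.println(String.join(" | ", tokenizer.tokenize(sentence)));
    }
  }
}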

I think the issue was caused by not having any single words with quotes
around them in the training data.  In the context you brought up, the
quotes around "mistakes" aren't a direct quote; they mark the word as
being used outside its normal sense, to represent a thought or idea.
I.e., the military doesn't make mistakes, since they are in the business
of war, which is messy by design.

I don't think the small sampling of text is really a problem, though I
have to train the model a bit differently; but even my model had issues
with the same sentences you had given, meaning it was at least as good
as the larger model.  There may even be some sentences the larger model
gets right that my model wouldn't.

As for the other model, I can't comment much, other than that most of
the models are trained on news stories and not general text documents.
It really depends on what you are looking to parse; however, a tokenizer
is a fairly simple thing to train.

James

Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/24/11 12:10 PM, Rohana Rajapakse wrote:
> Thanks a lot.
>
> I did create a name finder training data file using CoNLL some time ago. I will have a look at how I did it. I may be able to convert this training file to produce a tokenizer training file.
>
> Thanks a lot. Will let you know how I get on with this. I would contribute anything that might be useful to others.

English detokenizer rules will be helpful for many, so it would be nice
if you could contribute yours. There is a general one you can use to
start with inside
opennlp-tools/src/test/resources/opennlp/tools/latin-detokenizer.xml

The name finder training data you have should be good enough to start
with, depending on the way it's tokenized.

Jörn

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks Jörn. 

I can see in the code that, depending on the operation, a token will be merged with the token to its left or right. But I can't see where (in the code) it adds a <SPLIT> token. Can you please point me to the right place in the code?

Thanks

Rohana

-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 04 March 2011 15:05
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 3/4/11 3:46 PM, Rohana Rajapakse wrote:
> That works great. It was not clear to me where/how the detokenizer rules are used; I thought they were for combining a given token with the one before or after it.

Exactly. The converter detokenizes your input tokens with the
detokenizer; it then knows which tokens would be merged together, and is
able to add the <SPLIT> tag between those tokens instead of just
concatenating them.
> I can see in the training file that it has added <SPLIT> tags before "'s".
> Can I include spaces in the rule (e.g. "'s ", note the trailing space)? That would make it explicit.
Not sure I understand you correctly here. I mean we already have
tokenized data; the tokens do not contain any information about the
spaces between them.

Let's say we have these two strings:

1: "A    sample"
2: "A sample"

Now we use a white space tokenizer to tokenize them and
the result would be like this:
1: "A", "sample"
2: "A", "sample"

In the token representation we do not have any white spaces anymore.
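
A quick check of this behavior (sketch):

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class WhitespaceDemo {
  public static void main(String[] args) {
    // Both inputs yield the same tokens; the spacing is not preserved.
    String[] a = WhitespaceTokenizer.INSTANCE.tokenize("A    sample");
    String[] b = WhitespaceTokenizer.INSTANCE.tokenize("A sample");
    System.out.println(java.util.Arrays.equals(a, b));  // true: ["A", "sample"]
  }
}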

Jörn






Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 3/4/11 3:46 PM, Rohana Rajapakse wrote:
> That works great. It was not clear to me where/how the detokenizer rules are used; I thought they were for combining a given token with the one before or after it.

Exactly. The converter detokenizes your input tokens with the
detokenizer; it then knows which tokens would be merged together, and is
able to add the <SPLIT> tag between those tokens instead of just
concatenating them.
> I can see in the training file that it has added <SPLIT> tags before "'s".
> Can I include spaces in the rule (e.g. "'s ", note the trailing space)? That would make it explicit.
Not sure I understand you correctly here. I mean we already have
tokenized data; the tokens do not contain any information about the
spaces between them.

Let's say we have these two strings:

1: "A    sample"
2: "A sample"

Now we use a white space tokenizer to tokenize them and
the result would be like this:
1: "A", "sample"
2: "A", "sample"

In the token representation we do not have any white spaces anymore.

Jörn

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
That works great. It was not clear to me where/how the detokenizer rules are used; I thought they were for combining a given token with the one before or after it. I can see in the training file that it has added <SPLIT> tags before "'s".
Can I include spaces in the rule (e.g. "'s ", note the trailing space)? That would make it explicit.


Thanks

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 04 March 2011 12:34
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 3/4/11 1:06 PM, Rohana Rajapakse wrote:
> Hi Jorn,
>
>
>
> I have modified the toString() method in TokenSample.java as given below. This is to add a <SPLIT> token before the token 's.
>
> This helped me to train a tokenizer model that splits e.g. "it's" into the two tokens "it" and "'s", while at the same time a detokenizer rule (the same as the rule for double quotes) splits single quotes from expressions enclosed in a pair of single quotes.
>
> This does not handle other cases of single quotes (e.g. don't, can't, etc., and names like O'Conner).
>
I had a look at the change. The tokenization information must be
provided to TokenSample; this class just encapsulates that knowledge.
So it is not its responsibility to figure out how things should be
tokenized.

In your case I think you can just add "'s" to your detokenizer 
dictionary like this:
<entry operation="MOVE_LEFT">
<token>'s</token>
</entry>

Doesn't that fix your issue?

Jörn







Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 3/4/11 1:06 PM, Rohana Rajapakse wrote:
> Hi Jorn,
>
>
>
> I have modified the toString() method in TokenSample.java as given below. This is to add a <SPLIT> token before the token 's.
>
> This helped me to train a tokenizer model that splits e.g. "it's" into the two tokens "it" and "'s", while at the same time a detokenizer rule (the same as the rule for double quotes) splits single quotes from expressions enclosed in a pair of single quotes.
>
> This does not handle other cases of single quotes (e.g. don't, can't, etc., and names like O'Conner).
>
I had a look at the change. The tokenization information must be
provided to TokenSample; this class just encapsulates that knowledge.
So it is not its responsibility to figure out how things should be
tokenized.

In your case I think you can just add "'s" to your detokenizer 
dictionary like this:
<entry operation="MOVE_LEFT">
<token>'s</token>
</entry>

Doesn't that fix your issue?
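
For completeness, such a dictionary can then be loaded and applied like this (a sketch from memory of the 1.5.1 API; the file name and token list are assumptions):

import java.io.FileInputStream;
import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer.DetokenizationOperation;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class DetokenizeDemo {
  public static void main(String[] args) throws Exception {
    DetokenizationDictionary dict = new DetokenizationDictionary(
        new FileInputStream("en-detokenizer.xml"));
    DictionaryDetokenizer detokenizer = new DictionaryDetokenizer(dict);
    String[] tokens = {"It", "'s", "a", "test", "."};
    // One merge/no-merge operation per token; with the MOVE_LEFT entry
    // above, "'s" should come back as a merge-to-the-left operation.
    DetokenizationOperation[] ops = detokenizer.detokenize(tokens);
    for (int i = 0; i < tokens.length; i++) {
      System.out.println(tokens[i] + " -> " + ops[i]);
    }
  }
}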

Jörn


RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Hi Jorn,

 

I have modified the toString() method in TokenSample.java as given below. This is to add a <SPLIT> token before the token 's.

This helped me to train a tokenizer model that splits e.g. "it's" into the two tokens "it" and "'s", while at the same time a detokenizer rule (the same as the rule for double quotes) splits single quotes from expressions enclosed in a pair of single quotes.

This does not handle other cases of single quotes (e.g. don't, can't, etc., and names like O'Conner).

 

I am not sure if this code change affects other functionality of opennlp (where else is the TokenSample class used?) or whether this was the right place to make it.

 

Please let me know what you think!

 

Regards

 

Rohana

 

 

  public String toString() {

    StringBuilder sentence = new StringBuilder();

    int lastEndIndex = -1;
    for (Span token : tokenSpans) {

      if (lastEndIndex != -1) {

        // If there are no chars between the last token
        // and this token, insert the separator chars;
        // otherwise insert a space.
        String separator = "";
        if (lastEndIndex == token.getStart()) {
          separator = separatorChars;
        } else {
          separator = " ";
          // New condition for adding <SPLIT> before 's into the training
          // file when converting conll03 data into tokenizer training
          // data using the "TokenizerConverter".
          if (token.getCoveredText(text).equals("'s")) {
            separator = separatorChars;
          }
        }
        sentence.append(separator);
      }

      sentence.append(token.getCoveredText(text));

      lastEndIndex = token.getEnd();
    }

    return sentence.toString();
  }

 

-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 03 March 2011 16:01
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

 

On 3/3/11 4:33 PM, Rohana Rajapakse wrote:

> Thanks. I have got the training files created (conll03 + Reuters) and models trained. Used the latin-detokenizer that came with the download. The trained model solves the double quotation problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").
>
> I have tried adding the same detokenizer rules for the single quote. However, it seems to conflict with the different usages of the single quote (e.g. possession, as in Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which gets priority, the rules in the detokenizer or the <SPLIT> tags...

Yes, you need to add all the tokens which should be attached to the
previous one, like "'s", "'t", etc.

It would be nice to have such a file as part of the project.

Jörn

 







Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 3/3/11 4:33 PM, Rohana Rajapakse wrote:
> Thanks. I have got the training files created (conll03 + Reuters) and models trained. Used the latin-detokenizer that came with the download. The trained model solves the double quotation problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").
>
> I have tried adding the same detokenizer rules for the single quote. However, it seems to conflict with the different usages of the single quote (e.g. possession, as in Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which gets priority, the rules in the detokenizer or the <SPLIT> tags...

Yes, you need to add all the tokens which should be attached to the
previous one, like "'s", "'t", etc.
It would be nice to have such a file as part of the project.

Jörn


RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks. I have got the training files created (conll03 + Reuters) and models trained. I used the latin-detokenizer that came with the download. The trained model solves the double quotation problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").

I have tried adding the same detokenizer rules for the single quote. However, it seems to conflict with the different usages of the single quote (e.g. possession, as in Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which gets priority, the rules in the detokenizer or the <SPLIT> tags...

Rohana

-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 02 March 2011 13:08
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 3/2/11 1:47 PM, Rohana Rajapakse wrote:
> My NameFinder training model (created from CONLL + Reuters) has <START> and <END> markups for person names. It doesn't have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model using my training data file. I had to remove the <START> and <END> tags and add a few <SPLIT> tags to get the test to work (to get the "Number of Outcomes" to match). It learns a model now, but it is not perfect. I need to add <SPLIT> markups for all single and double quotes etc.
>
> By the way, where is the "TokenizerConverter" that you had mentioned? My download (from sourceforge) doesn't have it. Also, where is the name finder converter that you created to convert CONLL03? Am I missing some code in my download?
>
> Also, please point me to your "docbook". I would like to know more about the detokenizer. I can't find a "release candidate" on the download site.
>
The release candidate can be found here:
http://people.apache.org/~joern/releases/opennlp-1.5.1-incubating/rc1/

Just use your name finder training file with the TokenizerConverter.
Pieces of the work are in 1.5.0, and all the things you are missing are
in 1.5.1. The docbook is also included in the 1.5.1 distribution.

I suggest that you just re-try with the rc1.

Jörn






Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 3/2/11 1:47 PM, Rohana Rajapakse wrote:
> My NameFinder training model (created from CONLL + Reuters) has <START> and <END> markups for person names. It doesn't have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model using my training data file. I had to remove the <START> and <END> tags and add a few <SPLIT> tags to get the test to work (to get the "Number of Outcomes" to match). It learns a model now, but it is not perfect. I need to add <SPLIT> markups for all single and double quotes etc.
>
> By the way, where is the "TokenizerConverter" that you had mentioned? My download (from sourceforge) doesn't have it. Also, where is the name finder converter that you created to convert CONLL03? Am I missing some code in my download?
>
> Also, please point me to your "docbook". I would like to know more about the detokenizer. I can't find a "release candidate" on the download site.
>
The release candidate can be found here:
http://people.apache.org/~joern/releases/opennlp-1.5.1-incubating/rc1/

Just use your name finder training file with the TokenizerConverter.
Pieces of the work are in 1.5.0, and all the things you are missing are
in 1.5.1. The docbook is also included in the 1.5.1 distribution.

I suggest that you just re-try with the rc1.

Jörn

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
My NameFinder training model (created from CONLL + Reuters) has <START> and <END> markups for person names. It doesn't have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model using my training data file. I had to remove the <START> and <END> tags and add a few <SPLIT> tags to get the test to work (to get the "Number of Outcomes" to match). It learns a model now, but it is not perfect. I need to add <SPLIT> markups for all single and double quotes etc.

By the way, where is the "TokenizerConverter" that you had mentioned? My download (from sourceforge) doesn't have it. Also, where is the name finder converter that you created to convert CONLL03? Am I missing some code in my download?

Also, please point me to your "docbook". I would like to know more about the detokenizer. I can't find a "release candidate" on the download site.

Thanks


Rohana Rajapakse
Senior Software Developer
GOSS Interactive
 
t:  +44 (0)844 880 3637
f:  +44 (0)844 880 3638
e: rohana.rajapakse@gossinteractive.com
w: www.gossinteractive.com 

-----Original Message-----
From: Rohana Rajapakse [mailto:Rohana.Rajapakse@gossinteractive.com] 
Sent: 24 February 2011 11:10
To: opennlp-users@incubator.apache.org
Subject: RE: Tokenizer issue - Quotation marks

Thanks a lot.

I did create a name finder training data file using CoNLL some time ago. I will have a look at how I did it. I may be able to convert this training file to produce a tokenizer training file.

Thanks a lot. Will let you know how I get on with this. I would contribute anything that might be useful to others.

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 24 February 2011 10:59
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. The process for doing that is described
in our docbook (just build it or download the release candidate).

After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
There is a sample detokenizer file which may need to be extended a
little for English; maybe we should create a new folder within the
tools project to collect all the non-statistical model files. I guess a
good detokenizer for the Reuters corpus could be useful for others too.

That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train

Hope that helps to get you started; contributions about this to the
documentation are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%3C4D659A52.1090806@gmail.com%3E






RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks a lot.

I did create a name finder training data file using CoNLL some time ago. I will have a look at how I did it. I may be able to convert this training file to produce a tokenizer training file.

Thanks a lot. Will let you know how I get on with this. I would contribute anything that might be useful to others.

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 24 February 2011 10:59
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. The process for doing that is described
in our docbook (just build it or download the release candidate).

After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
There is a sample detokenizer file which may need to be extended a
little for English; maybe we should create a new folder within the
tools project to collect all the non-statistical model files. I guess a
good detokenizer for the Reuters corpus could be useful for others too.

That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train

Hope that helps to get you started; contributions about this to the
documentation are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%3C4D659A52.1090806@gmail.com%3E






Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. The process for doing that is described
in our docbook (just build it or download the release candidate).

After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
There is a sample detokenizer file which may need to be extended a
little for English; maybe we should create a new folder within the
tools project to collect all the non-statistical model files. I guess a
good detokenizer for the Reuters corpus could be useful for others too.

That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train

Hope that helps to get you started; contributions about this to the
documentation are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%3C4D659A52.1090806@gmail.com%3E

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.

Thanks

Rohana
-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 23 February 2011 13:28
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:18 PM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. it fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?

Yes and no; you need a file with tokenization information to train the
tokenizer.
In OpenNLP we use a sentence-per-line format, and tokens that are not
whitespace-separated are separated by a special tag.
See our documentation for information about the format.

We observed that it is easy to use rules to detokenize correctly
tokenized text.
For that reason I implemented a rule-based detokenizer.

Now you just need some kind of tokenized text to produce a training
file for our tokenizer.
You might want to use the Reuters corpus, or other freely available
English language corpora.
If you have access to the Reuters corpus, I suggest that we go through
the steps to train the tokenizer with it.
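
As a concrete illustration (my own example, not from the official docs), a couple of lines of such a training file would look like:

    The army had made "<SPLIT>mistakes<SPLIT>"<SPLIT>.
    It<SPLIT>'s a test<SPLIT>.

and the 1.5 command line trainer could then be run on it (flags from memory; treat as a sketch):

    bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -data en-tok.train -model en-token.bin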

Jörn






Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 2:18 PM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. it fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?

Yes and no; you need a file with tokenization information to train the
tokenizer.
In OpenNLP we use a sentence-per-line format, and tokens that are not
whitespace-separated are separated by a special tag.
See our documentation for information about the format.

We observed that it is easy to use rules to detokenize correctly
tokenized text.
For that reason I implemented a rule-based detokenizer.

Now you just need some kind of tokenized text to produce a training
file for our tokenizer.
You might want to use the Reuters corpus, or other freely available
English language corpora.
If you have access to the Reuters corpus, I suggest that we go through
the steps to train the tokenizer with it.

Jörn

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Hi James,

It works for double quotes, but not for single quotes (i.e. it fails for
'mistakes'). Is it a training issue then (not having cases with words
enclosed within single/double quotes)?

I have noticed that your model file is much smaller than the model file
available to download. Is it because your training data set is smaller?
How does it affect tokenizing overall?

Are there training sets available to download? 

Regards

Rohana


-----Original Message-----
From: James Kosin [mailto:james.kosin@gmail.com] 
Sent: 23 February 2011 11:26
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/2011 3:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>
> Thanks
>
> Rohana
>
Sorry, I thought I had responded directly to your email with the attachment.
I have sent it again off-list.
James






Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/23/2011 3:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>
> Thanks
>
> Rohana
>
Sorry, I thought I had responded directly to your email with the attachment.
I have sent it again off-list.
James

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks for the replies.

Which model are you referring to? Where can I find it?

Thanks

Rohana

-----Original Message-----
From: James Kosin [mailto:james.kosin@gmail.com] 
Sent: 23 February 2011 03:53
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the
> opening quote is part of the token). But if I change the word mistakes
> to Mistakes (i.e. capital M) in the input text, then I get the token
> Mistakes (correctly).
>
>  
>
> Is anyone aware of this issue? Any idea of how to get around it?
>
>  
>

Can you see if this model will work for you?

Thanks,
James






Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the opening
> quote is part of the token). But if I change the word mistakes to Mistakes
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
>  
>
> Is anyone aware of this issue? Any idea of how to get around it?
>
>  
>

Can you see if this model will work for you?

Thanks,
James