Posted to users@opennlp.apache.org by Rohana Rajapakse <Ro...@gossinteractive.com> on 2011/02/22 14:18:35 UTC

Tokenizer issue - Quotation marks

Hi,

 

I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
made "mistakes".  It gives me "mistakes as a token (note that the opening
quote is part of the token). But if I change the word mistakes to Mistakes
(i.e. capital M) in the input text, then I get the token Mistakes
(correctly).
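
For reference, a minimal sketch of how this tokenization is typically invoked (the model path and the surrounding setup are my assumptions, not from the original mail):

import java.io.FileInputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizeDemo {
  public static void main(String[] args) throws Exception {
    // Load a pre-trained English token model (path is an assumption).
    TokenizerModel model = new TokenizerModel(new FileInputStream("en-token.bin"));
    TokenizerME tokenizer = new TokenizerME(model);
    // As reported above, "mistakes comes back with the opening quote attached.
    for (String token : tokenizer.tokenize("The army had made \"mistakes\".")) {
      System.out.println(token);
    }
  }
}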

 

Is anyone aware of this issue? Any idea of how to get around it?

 

Thanks

 

Rohana

 

 

 

 







Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the opening
> quote is part of the token). But if I change the word mistakes to Mistakes
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
>  
>
> Is anyone aware of this issue? Any idea of how to get around it?
>
>  
From the comment on the download site, it looks like the model was
trained with OpenNLP data.  This usually means it doesn't contain very
many samples.  I was able to add some samples to my (own) model and
got a model that works well.

For those interested: the tokenizer usually gets passed data after the
sentence detector.  The tokenizer then splits on punctuation and other
token boundaries.  E.g.:
    This old house was painted white.    -- would have the training data --
    This old house was painted white<SPLIT>.
This indicates we want the period at the end of the sentence to be split
from the last word in the sentence.  Similar ideas hold for the comma
and quote characters.  Special handling is required for possessive nouns:
James' would be James<SPLIT>' and John's would be John<SPLIT>'s ...
Note that words in the sentence that are already separated by spaces
don't need a <SPLIT>.
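
For anyone retraining with extra samples, a minimal training sketch (the file name is an assumption, and the calls are the 1.5-era training API as I recall it, so treat this as a sketch rather than the canonical recipe):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainTokenizer {
  public static void main(String[] args) throws Exception {
    // One sentence per line; <SPLIT> marks token boundaries that are
    // not already whitespace.
    ObjectStream<TokenSample> samples = new TokenSampleStream(
        new PlainTextByLineStream(
            new InputStreamReader(new FileInputStream("en-tok.train"), "UTF-8")));
    // true = use the alphanumeric optimization.
    TokenizerModel model = TokenizerME.train("en", samples, true);
    model.serialize(new FileOutputStream("en-token.bin"));
  }
}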

James


Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the opening
> quote is part of the token). But if I change the word mistakes to Mistakes
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
>  
>
> Is anyone aware of this issue? Any idea of how to get around it?
>
>  
>
> Thanks
>
>  
>
> Rohana
I see the issue; unfortunately, I can't do much about fixing it without
the training data used for the tokenizer.  You can use the
SimpleTokenizer instead, and that appears to work with your sample.
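
A quick way to try that (a small sketch; SimpleTokenizer is rule-based and needs no model file):

import opennlp.tools.tokenize.SimpleTokenizer;

public class SimpleDemo {
  public static void main(String[] args) {
    // SimpleTokenizer splits on character-class changes, so the quotes
    // become their own tokens: The | army | had | made | " | mistakes | " | .
    for (String token : SimpleTokenizer.INSTANCE.tokenize("The army had made \"mistakes\".")) {
      System.out.println(token);
    }
  }
}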

I found a few more samples that don't work:
    This model is "bad."
    This is the "year" of the "pig."

It seems to be a problem when a " is followed by certain characters.

James

Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 9:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>

The mailing list does not allow mail attachments; I believe he attached
it to the mail and our list server removed it.

Jörn

Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/23/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. it fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download? 
>
> Regards
>
> Rohana
>
>
Rohana,

It doesn't take many samples to produce a good model.  I was only
handling simple cases as-is, since many of the models have no freely
available training data.  Even if you get the data, most of it will have
to be hand-parsed into the correct format and tokenized by hand.

The tokenizer expects the sentence detector to get the data first,
generating one sentence per line.  My model was created with about 75
sentences in total.  I only added two to help with the "" characters
around words.  The only time I'd expect single quotes would be quotes
within quotes... which doesn't happen very often.
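
The pipeline described here, sentence detector first and tokenizer second, looks roughly like this (a sketch; the model file names are assumptions):

import java.io.FileInputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class Pipeline {
  public static void main(String[] args) throws Exception {
    SentenceDetectorME sentDetector = new SentenceDetectorME(
        new SentenceModel(new FileInputStream("en-sent.bin")));
    TokenizerME tokenizer = new TokenizerME(
        new TokenizerModel(new FileInputStream("en-token.bin")));
    String text = "This old house was painted white. The army had made \"mistakes\".";
    // Detect sentences first, then tokenize each sentence separately.
    for (String sentence : sentDetector.sentDetect(text)) {
      System.out.println(String.join(" | ", tokenizer.tokenize(sentence)));
    }
  }
}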

I think the issue was caused by not having any single words with quotes
around them in the training data.  In the context you brought up, the
quotes around "mistakes" aren't a direct quote; they mark the word as
being used outside its normal sense, to represent a thought or idea.
I.e., the military doesn't make mistakes, since they are in the business
of war, which is messy by design.

I don't think the small sampling of text is really a problem, though I
have to train the model a bit differently; but even my model had issues
with the same sentences you had given, meaning it was at least as good
as the larger model.  There may even be some sentences the larger model
gets right that my model wouldn't.

As for the other model, I can't comment much, other than that most of
the models are trained on news stories and not general text documents.
It really depends on what you are looking to parse; however, a tokenizer
is a fairly simple thing to train.

James

Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/24/11 12:10 PM, Rohana Rajapakse wrote:
> Thanks a lot.
>
> I did create a name finder training data file using CoNLL some time ago. I will have a look at how I did it. I may be able to convert this training file to produce a tokenizer training file.
>
> Thanks a lot. Will let you know how I get on with this. I would contribute anything that might be useful to others.

English detokenizer rules will be helpful for many, so it would be nice
if you could contribute yours. There is a general one you can use to
start with inside
opennlp-tools/src/test/resources/opennlp/tools/latin-detokenizer.xml

The name finder training data you have should be good enough to start
with, depending on the way it's tokenized.

Jörn

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks Jörn. 

I can see in the code that, depending on the operation, a token will be merged with the token to its left or right. But I can't see where (in the code) it adds a <SPLIT> token. Can you please point me to the right place in the code?

Thanks

Rohana

-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 04 March 2011 15:05
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 3/4/11 3:46 PM, Rohana Rajapakse wrote:
> That works great. It was not clear to me where/how the detokenizer rules are used; I thought they were for combining a given token with the one before or after it.

Exactly. The converter detokenizes your input tokens with the
detokenizer; it then knows which tokens would be merged together, and is
able to add the <SPLIT> tag between those tokens instead of just
concatenating them.
> I can see in the training file that it has added <SPLIT> tags before "'s".
> Can I include spaces in the rule (e.g. "'s ", note the trailing space)? That would make it explicit.
Not sure I understand you correctly here. I mean we already have
tokenized data; the tokens do not contain any information about the
spaces between them.

Let's say we have these two strings:

1: "A    sample"
2: "A sample"

Now we use a white space tokenizer to tokenize them and
the result would be like this:
1: "A", "sample"
2: "A", "sample"

In the token representation we do not have any white spaces anymore.
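
A quick check of this behavior (sketch):

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class WhitespaceDemo {
  public static void main(String[] args) {
    // Both inputs yield the same tokens; the spacing is not preserved.
    String[] a = WhitespaceTokenizer.INSTANCE.tokenize("A    sample");
    String[] b = WhitespaceTokenizer.INSTANCE.tokenize("A sample");
    System.out.println(java.util.Arrays.equals(a, b));  // true: ["A", "sample"]
  }
}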

Jörn






Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 3/4/11 3:46 PM, Rohana Rajapakse wrote:
> That works great. It was not clear to me where/how the detokenizer rules are used; I thought they were for combining a given token with the one before or after it.

Exactly. The converter detokenizes your input tokens with the
detokenizer; it then knows which tokens would be merged together, and is
able to add the <SPLIT> tag between those tokens instead of just
concatenating them.
> I can see in the training file that it has added <SPLIT> tags before "'s".
> Can I include spaces in the rule (e.g. "'s ", note the trailing space)? That would make it explicit.
Not sure I understand you correctly here. I mean we already have
tokenized data; the tokens do not contain any information about the
spaces between them.

Let's say we have these two strings:

1: "A    sample"
2: "A sample"

Now we use a white space tokenizer to tokenize them and
the result would be like this:
1: "A", "sample"
2: "A", "sample"

In the token representation we do not have any white spaces anymore.

Jörn

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
That works great. It was not clear to me where/how the detokenizer rules are used; I thought they were for combining a given token with the one before or after it. I can see in the training file that it has added <SPLIT> tags before "'s".
Can I include spaces in the rule (e.g. "'s ", note the trailing space)? That would make it explicit.


Thanks

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 04 March 2011 12:34
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 3/4/11 1:06 PM, Rohana Rajapakse wrote:
> Hi Jorn,
>
>
>
> I have modified the toString() method in TokenSample.java as given below. This is to add a <SPLIT> token before the token 's.
>
> This helped me to train a tokenizer model that splits e.g. "it's" into the two tokens "it" and "'s", while at the same time a detokenizer rule (the same as the rule for double quotes) splits single quotes from expressions enclosed in a pair of single quotes.
>
> This does not handle other cases of single quotes (e.g. don't, can't, etc., and names like O'Conner).
>
I had a look at the change. The tokenization information must be
provided to TokenSample; this class just encapsulates that knowledge.
So it is not its responsibility to figure out how things should be
tokenized.

In your case I think you can just add "'s" to your detokenizer 
dictionary like this:
<entry operation="MOVE_LEFT">
<token>'s</token>
</entry>

Doesn't that fix your issue?

Jörn







Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 3/4/11 1:06 PM, Rohana Rajapakse wrote:
> Hi Jorn,
>
>
>
> I have modified the toString() method in TokenSample.java as given below. This is to add a <SPLIT> token before the token 's.
>
> This helped me to train a tokenizer model that splits e.g. "it's" into the two tokens "it" and "'s", while at the same time a detokenizer rule (the same as the rule for double quotes) splits single quotes from expressions enclosed in a pair of single quotes.
>
> This does not handle other cases of single quotes (e.g. don't, can't, etc., and names like O'Conner).
>
I had a look at the change. The tokenization information must be
provided to TokenSample; this class just encapsulates that knowledge.
So it is not its responsibility to figure out how things should be
tokenized.

In your case I think you can just add "'s" to your detokenizer 
dictionary like this:
<entry operation="MOVE_LEFT">
<token>'s</token>
</entry>

Doesn't that fix your issue?
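
For completeness, such a dictionary can then be loaded and applied like this (a sketch from memory of the 1.5.1 API; the file name and token list are assumptions):

import java.io.FileInputStream;
import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer.DetokenizationOperation;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class DetokenizeDemo {
  public static void main(String[] args) throws Exception {
    DetokenizationDictionary dict = new DetokenizationDictionary(
        new FileInputStream("en-detokenizer.xml"));
    DictionaryDetokenizer detokenizer = new DictionaryDetokenizer(dict);
    String[] tokens = {"It", "'s", "a", "test", "."};
    // One merge/no-merge operation per token; with the MOVE_LEFT entry
    // above, "'s" should come back as a merge-to-the-left operation.
    DetokenizationOperation[] ops = detokenizer.detokenize(tokens);
    for (int i = 0; i < tokens.length; i++) {
      System.out.println(tokens[i] + " -> " + ops[i]);
    }
  }
}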

Jörn


RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Hi Jorn,

 

I have modified the toString() method in TokenSample.java as given below. This is to add a <SPLIT> token before the token 's.

This helped me to train a tokenizer model that splits e.g. "it's" into the two tokens "it" and "'s", while at the same time a detokenizer rule (the same as the rule for double quotes) splits single quotes from expressions enclosed in a pair of single quotes.

This does not handle other cases of single quotes (e.g. don't, can't, etc., and names like O'Conner).

 

I am not sure if this code change affects other functionality of opennlp (where else is the TokenSample class used?) or whether this was the right place to make it.

 

Please let me know what you think!

 

Regards

 

Rohana

 

 

  public String toString() {

    StringBuilder sentence = new StringBuilder();

    int lastEndIndex = -1;
    for (Span token : tokenSpans) {

      if (lastEndIndex != -1) {

        // If there are no chars between the last token
        // and this token, insert the separator chars;
        // otherwise insert a space.
        String separator = "";
        if (lastEndIndex == token.getStart()) {
          separator = separatorChars;
        } else {
          separator = " ";
          // New condition for adding <SPLIT> before 's into the training
          // file when converting conll03 data into tokenizer training
          // data using the "TokenizerConverter".
          if (token.getCoveredText(text).equals("'s")) {
            separator = separatorChars;
          }
        }
        sentence.append(separator);
      }

      sentence.append(token.getCoveredText(text));

      lastEndIndex = token.getEnd();
    }

    return sentence.toString();
  }

 

-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 03 March 2011 16:01
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

 

On 3/3/11 4:33 PM, Rohana Rajapakse wrote:

> Thanks. I have got the training files created (conll03 + Reuters) and models trained. Used the latin-detokenizer that came with the download. The trained model solves the double quotation problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").
>
> I have tried adding the same detokenizer rules for the single quote. However, it seems to conflict with the different usages of the single quote (e.g. possession, as in Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which gets priority, the rules in the detokenizer or the <SPLIT> tags...

Yes, you need to add all the tokens which should be attached to the
previous one, like "'s", "'t", etc.

It would be nice to have such a file as part of the project.

Jörn

 







Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 3/3/11 4:33 PM, Rohana Rajapakse wrote:
> Thanks. I have got the training files created (conll03 + Reuters) and models trained. Used the latin-detokenizer that came with the download. The trained model solves the double quotation problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").
>
> I have tried adding the same detokenizer rules for the single quote. However, it seems to conflict with the different usages of the single quote (e.g. possession, as in Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which gets priority, the rules in the detokenizer or the <SPLIT> tags...

Yes, you need to add all the tokens which should be attached to the
previous one, like "'s", "'t", etc.
It would be nice to have such a file as part of the project.

Jörn


RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks. I have got the training files created (conll03 + Reuters) and models trained. I used the latin-detokenizer that came with the download. The trained model solves the double quotation problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").

I have tried adding the same detokenizer rules for the single quote. However, it seems to conflict with the different usages of the single quote (e.g. possession, as in Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which gets priority, the rules in the detokenizer or the <SPLIT> tags...

Rohana

-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 02 March 2011 13:08
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 3/2/11 1:47 PM, Rohana Rajapakse wrote:
> My NameFinder training model (created from CONLL + Reuters) has <START> and <END> markups for person names. It doesn't have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model using my training data file. I had to remove the <START> and <END> tags and add a few <SPLIT> tags to get the test to work (to get the "Number of Outcomes" to match). It learns a model now, but it is not perfect. I need to add <SPLIT> markups for all single and double quotes etc.
>
> By the way, where is the "TokenizerConverter" that you had mentioned? My download (from sourceforge) doesn't have it. Also, where is the name finder converter that you created to convert CONLL03? Am I missing some code in my download?
>
> Also, please point me to your "docbook". I would like to know more about the detokenizer. I can't find a "release candidate" on the download site.
>
The release candidate can be found here:
http://people.apache.org/~joern/releases/opennlp-1.5.1-incubating/rc1/

Just use your name finder training file with the TokenizerConverter.
Pieces of the work are in 1.5.0, and all the things you are missing are
in 1.5.1. The docbook is also included in the 1.5.1 distribution.

I suggest that you just re-try with the rc1.

Jörn






Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 3/2/11 1:47 PM, Rohana Rajapakse wrote:
> My NameFinder training model (created from CONLL + Reuters) has <START> and <END> markups for person names. It doesn't have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model using my training data file. I had to remove the <START> and <END> tags and add a few <SPLIT> tags to get the test to work (to get the "Number of Outcomes" to match). It learns a model now, but it is not perfect. I need to add <SPLIT> markups for all single and double quotes etc.
>
> By the way, where is the "TokenizerConverter" that you had mentioned? My download (from sourceforge) doesn't have it. Also, where is the name finder converter that you created to convert CONLL03? Am I missing some code in my download?
>
> Also, please point me to your "docbook". I would like to know more about the detokenizer. I can't find a "release candidate" on the download site.
>
The release candidate can be found here:
http://people.apache.org/~joern/releases/opennlp-1.5.1-incubating/rc1/

Just use your name finder training file with the TokenizerConverter.
Pieces of the work are in 1.5.0, and all the things you are missing are
in 1.5.1. The docbook is also included in the 1.5.1 distribution.

I suggest that you just re-try with the rc1.

Jörn

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
My NameFinder training model (created from CONLL + Reuters) has <START> and <END> markups for person names. It doesn't have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model using my training data file. I had to remove the <START> and <END> tags and add a few <SPLIT> tags to get the test to work (to get the "Number of Outcomes" to match). It learns a model now, but it is not perfect. I need to add <SPLIT> markups for all single and double quotes etc.

By the way, where is the "TokenizerConverter" that you had mentioned? My download (from sourceforge) doesn't have it. Also, where is the name finder converter that you created to convert CONLL03? Am I missing some code in my download?

Also, please point me to your "docbook". I would like to know more about the detokenizer. I can't find a "release candidate" on the download site.

Thanks


Rohana Rajapakse
Senior Software Developer
GOSS Interactive
 
t:  +44 (0)844 880 3637
f:  +44 (0)844 880 3638
e: rohana.rajapakse@gossinteractive.com
w: www.gossinteractive.com 

-----Original Message-----
From: Rohana Rajapakse [mailto:Rohana.Rajapakse@gossinteractive.com] 
Sent: 24 February 2011 11:10
To: opennlp-users@incubator.apache.org
Subject: RE: Tokenizer issue - Quotation marks

Thanks a lot.

I did create a name finder training data file using CoNLL some time ago. I will have a look at how I did it. I may be able to convert this training file to produce a tokenizer training file.

Thanks a lot. Will let you know how I get on with this. I would contribute anything that might be useful to others.

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 24 February 2011 10:59
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. The process for doing that is described
in our docbook (just build it or download the release candidate).

After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
There is a sample detokenizer file which may need to be extended a
little for English; maybe we should create a new folder within the
tools project to collect all the non-statistical model files. I guess a
good detokenizer for the Reuters corpus could be useful for others too.

That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train

Hope that helps to get you started; contributions about this to the
documentation are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%3C4D659A52.1090806@gmail.com%3E






RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks a lot.

I did create a name finder training data file using CoNLL some time ago. I will have a look at how I did it. I may be able to convert this training file to produce a tokenizer training file.

Thanks a lot. Will let you know how I get on with this. I would contribute anything that might be useful to others.

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 24 February 2011 10:59
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. The process for doing that is described
in our docbook (just build it or download the release candidate).

After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
There is a sample detokenizer file which may need to be extended a
little for English; maybe we should create a new folder within the
tools project to collect all the non-statistical model files. I guess a
good detokenizer for the Reuters corpus could be useful for others too.

That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train

Hope that helps to get you started; contributions about this to the
documentation are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%3C4D659A52.1090806@gmail.com%3E






Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. The process for doing that is described
in our docbook (just build it or download the release candidate).

After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
There is a sample detokenizer file which may need to be extended a
little for English; maybe we should create a new folder within the
tools project to collect all the non-statistical model files. I guess a
good detokenizer for the Reuters corpus could be useful for others too.

That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train

Hope that helps to get you started; contributions about this to the
documentation are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%3C4D659A52.1090806@gmail.com%3E

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.

Thanks

Rohana
-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com] 
Sent: 23 February 2011 13:28
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:18 PM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. it fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?

Yes and no; you need a file with tokenization information to train the
tokenizer.
In OpenNLP we use a sentence-per-line format, and tokens that are not
whitespace-separated are separated by a special tag.
See our documentation for information about the format.

We observed that it is easy to use rules to detokenize correctly
tokenized text.
For that reason I implemented a rule-based detokenizer.

Now you just need some kind of tokenized text to produce a training
file for our tokenizer.
You might want to use the Reuters corpus, or other freely available
English language corpora.
If you have access to the Reuters corpus, I suggest that we go through
the steps to train the tokenizer with it.
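
As a concrete illustration (my own example, not from the official docs), a couple of lines of such a training file would look like:

    The army had made "<SPLIT>mistakes<SPLIT>"<SPLIT>.
    It<SPLIT>'s a test<SPLIT>.

and the 1.5 command line trainer could then be run on it (flags from memory; treat as a sketch):

    bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -data en-tok.train -model en-token.bin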

Jörn






Re: Tokenizer issue - Quotation marks

Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 2:18 PM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. it fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?

Yes and no; you need a file with tokenization information to train the
tokenizer.
In OpenNLP we use a sentence-per-line format, and tokens that are not
whitespace-separated are separated by a special tag.
See our documentation for information about the format.

We observed that it is easy to use rules to detokenize correctly
tokenized text.
For that reason I implemented a rule-based detokenizer.

Now you just need some kind of tokenized text to produce a training
file for our tokenizer.
You might want to use the Reuters corpus, or other freely available
English language corpora.
If you have access to the Reuters corpus, I suggest that we go through
the steps to train the tokenizer with it.

Jörn

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Hi James,

It works for double quotes, but not for single quotes (i.e. it fails for
'mistakes'). Is it a training issue then (not having cases with words
enclosed within single/double quotes)?

I have noticed that your model file is much smaller than the model file
available to download. Is it because your training data set is smaller?
How does it affect tokenizing overall?

Are there training sets available to download? 

Regards

Rohana


-----Original Message-----
From: James Kosin [mailto:james.kosin@gmail.com] 
Sent: 23 February 2011 11:26
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/2011 3:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>
> Thanks
>
> Rohana
>
Sorry, I thought I had responded directly to your email with the attachment.
I have sent it again off-list.
James






Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/23/2011 3:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>
> Thanks
>
> Rohana
>
Sorry, I thought I had responded directly to your email with the attachment.
I have sent it again off-list.
James

RE: Tokenizer issue - Quotation marks

Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks for the replies.

Which model are you referring to? Where can I find it?

Thanks

Rohana

-----Original Message-----
From: James Kosin [mailto:james.kosin@gmail.com] 
Sent: 23 February 2011 03:53
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks

On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the
> opening quote is part of the token). But if I change the word mistakes
> to Mistakes (i.e. capital M) in the input text, then I get the token
> Mistakes (correctly).
>
>  
>
> Is anyone aware of this issue? Any idea of how to get around it?
>
>  
>

Can you see if this model will work for you?

Thanks,
James






Re: Tokenizer issue - Quotation marks

Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the opening
> quote is part of the token). But if I change the word mistakes to Mistakes
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
>  
>
> Is anyone aware of this issue? Any idea of how to get around it?
>
>  
>

Can you see if this model will work for you?

Thanks,
James