Posted to users@opennlp.apache.org by Rohana Rajapakse <Ro...@gossinteractive.com> on 2011/02/22 14:18:35 UTC
Tokenizer issue - Quotation marks
Hi,
I am using OpenNLP-1.5 to tokenize text. I tried the sentence: The army had
made "mistakes". It gives me "mistakes as a token (note that the opening quote
is part of the token). But if I change the word mistakes to Mistakes
(i.e. capital M) in the input text, then I get the token Mistakes
(correctly).
Anyone aware of this issue, and any idea of how to work around it?
Thanks
Rohana
GOSS community User Group for clients. Sign-up here: www.gossinteractive.com/usergroup
Have you registered for our e-Newsletter? www.gossinteractive.com/newsletter
Registered Office: c/o Bishop Fleming, Cobourg House, Mayflower Street, Plymouth, PL1 1LG. Company Registration No: 3553908
This email contains proprietary information, some or all of which may be legally privileged. It is for the intended recipient only. If an addressing or transmission error has misdirected this email, please notify the author by replying to this email. If you are not the intended recipient you may not use, disclose, distribute, copy, print or rely on this email.
Email transmission cannot be guaranteed to be secure or error free, as information may be intercepted, corrupted, lost, destroyed, arrive late or incomplete or contain viruses. This email and any files attached to it have been checked with virus detection software before transmission. You should nonetheless carry out your own virus check before opening any attachment. GOSS Interactive Ltd accepts no liability for any loss or damage that may be caused by software viruses.
Re: Tokenizer issue - Quotation marks
Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>
>
> I am using OpenNLP-1.5 to tokenize text. Tried the text The army had
> made "mistakes". It gives me "mistakes as a token (note starting quote
> is part of the token). But, if I change the word mistake to Mistake
> (i.e. capitol M) in the input text, then I get the token Mistakes
> (correctly).
>
>
>
> Anyone aware of this issue and any idea of how to get-around this?
>
>
It looks like, from the comment on the download site, that the model was
trained with OpenNLP data. This usually means it doesn't contain very
many samples. I was able to add some samples to my own model and got a
model that works well.
For those interested, the tokenizer is usually passed data after the
sentence detector. The tokenizer then splits on tokens or other
punctuation, e.g.:
This old house was painted white. -- would have training data --
This old house was painted white<SPLIT>.
This indicates we want the period at the end of the sentence to be split
from the last word in the sentence. Similar ideas hold for the comma
and quote characters. Special handling is required for possessive nouns:
James' would be James<SPLIT>' and John's would be John<SPLIT>'s.
Note that words in the sentence that are already separated by spaces don't
need a <SPLIT>.
James
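James's scheme above can be sketched in a few lines of self-contained Java. This is a hypothetical helper, not OpenNLP code: the attachesLeft list is an assumption covering only the left-attaching tokens mentioned in this thread (an opening quote would attach to the right instead, which this sketch ignores).

```java
// Rebuild a <SPLIT>-annotated training line from an already tokenized
// sentence, in the style of the examples above (hypothetical helper).
public final class SplitAnnotator {

    // Tokens assumed to attach to the preceding word: instead of a space,
    // a <SPLIT> marker is emitted before them in the training line.
    private static boolean attachesLeft(String tok) {
        return tok.equals(".") || tok.equals(",")
            || tok.equals("'") || tok.equals("'s");
    }

    public static String annotate(String[] tokens) {
        StringBuilder line = new StringBuilder(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            line.append(attachesLeft(tokens[i]) ? "<SPLIT>" : " ");
            line.append(tokens[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Prints: This old house was painted white<SPLIT>.
        System.out.println(annotate(new String[] {
            "This", "old", "house", "was", "painted", "white", "."}));
    }
}
```

OpenNLP itself derives these markers from detokenizer rules rather than a hard-coded token list; the list here is only for illustration.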
Re: Tokenizer issue - Quotation marks
Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>
>
> I am using OpenNLP-1.5 to tokenize text. Tried the text The army had
> made "mistakes". It gives me "mistakes as a token (note starting quote
> is part of the token). But, if I change the word mistake to Mistake
> (i.e. capitol M) in the input text, then I get the token Mistakes
> (correctly).
>
>
>
> Anyone aware of this issue and any idea of how to get-around this?
>
>
>
> Thanks
>
>
>
> Rohana
I see the issue; unfortunately, I can't do much about fixing it without
the training data used for the tokenizer. You can use the
SimpleTokenizer, which appears to work with your sample.
I found a few more samples that don't work:
This model is "bad."
This is the "year" of the "pig."
It seems to be a problem when a " is followed by certain characters.
James
Re: Tokenizer issue - Quotation marks
Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 9:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>
The mailing list does not allow mail attachments; I believe he attached
it to his mail and our list server removed it.
Jörn
Re: Tokenizer issue - Quotation marks
Posted by James Kosin <ja...@gmail.com>.
On 2/23/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. it fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?
>
> Regards
>
> Rohana
>
>
Rohana,
It doesn't take many samples to produce a good model. I was only covering
simple cases, since many of the models have no freely available training
data. Even if you obtain some, most of it will have to be hand-parsed
into the correct format and tokenized by hand.
The tokenizer expects the sentence detector to process the data first,
producing one sentence per line. My model was created with about 75
sentences in total. I only added two to help with the "" characters
around words. The only time I'd expect single quotes would be quotes
within quotes... which doesn't happen very often.
I think the issue was not having any single words with quotes around
them in the training data. In the context you brought up, the quotes
around "mistakes" don't mark a direct quote; they flag the word as being
used outside its normal sense, i.e. the military doesn't make mistakes,
since it is in the business of war, which is messy by design.
I don't think the small text sample is really bad. I have to train
the model a bit differently; but even my model had issues with the same
sentences you gave, meaning it was at least as good as the larger
model. There may even be some cases the larger model gets right that my
model doesn't.
As for the other model, I can't comment much, other than that most of
the models are trained on news stories rather than text documents. It
really depends on what you are looking to parse; however, a tokenizer is
a fairly simple thing to train.
James
Re: Tokenizer issue - Quotation marks
Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/24/11 12:10 PM, Rohana Rajapakse wrote:
> Thanks a lot.
>
> I did create a name finder training data file using CoNLL some time ago. I will have a look at how I did it; I may be able to convert this training file to produce a tokenizer training file.
>
> Thanks a lot. I will let you know how I get on with this, and would contribute anything that might be useful to others.
English detokenizer rules will be helpful for many, so it would be nice
if you could contribute yours. There is a general one you can use to
start with inside
opennlp-tools/src/test/resources/opennlp/tools/latin-detokenizer.xml
The name finder training data you have should be good enough to start
with, depending on how it is tokenized.
Jörn
RE: Tokenizer issue - Quotation marks
Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks Jörn.
I can see in the code that, depending on the operation, a token will be merged with the left or right token. But I can't see where in the code it adds a <SPLIT> token. Can you please point me to the right place?
Thanks
Rohana
-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com]
Sent: 04 March 2011 15:05
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks
On 3/4/11 3:46 PM, Rohana Rajapakse wrote:
> That works great. It was not clear to me where/how the detokenizer rules are used. I thought it's for combining a given token to the one before or after it.
Exactly: the converter detokenizes your input tokens with the
detokenizer; it then knows which tokens will be merged together and is
able to add the <SPLIT> tag between those tokens instead of just
concatenating them.
> I can see in the training file that it has added <SPLIT> tags before "'s".
> Can I include spaces in the rule (e.g. "'s ", note the trailing space)? This would make it explicit.
I'm not sure I understand you correctly here. We already have
tokenized data; the tokens do not contain any information about the
spaces between them.
Let's say we have these two strings:
1: "A sample"
2: "A  sample"
Now we use a whitespace tokenizer to tokenize them, and
the result would be:
1: "A", "sample"
2: "A", "sample"
In the token representation we do not have any whitespace anymore.
Jörn
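The point about lost whitespace can be shown with plain Java (no OpenNLP classes involved); `tokenize` here is a stand-in whitespace tokenizer, not the OpenNLP one:

```java
import java.util.Arrays;

public final class WhitespaceDemo {

    // A stand-in whitespace tokenizer: split on runs of whitespace.
    public static String[] tokenize(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] one = tokenize("A sample");
        String[] two = tokenize("A  sample"); // note the double space
        // Both yield ["A", "sample"]; the spacing difference is gone.
        System.out.println(Arrays.equals(one, two)); // prints: true
    }
}
```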
RE: Tokenizer issue - Quotation marks
Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
That works great. It was not clear to me where/how the detokenizer rules are used; I thought they were for combining a given token with the one before or after it. I can see in the training file that it has added <SPLIT> tags before "'s".
Can I include spaces in the rule (e.g. "'s ", note the trailing space)? This would make it explicit.
Thanks
Rohana
-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com]
Sent: 04 March 2011 12:34
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks
On 3/4/11 1:06 PM, Rohana Rajapakse wrote:
> Hi Jorn,
>
>
>
> I have modified the toString() method in TokenSample.java as given below. This is to add a <SPLIT> token before the token 's.
>
> This helped me to train a tokenizer model that splits e.g. "it's" into two tokens, "it" and "'s", while at the same time a detokenizer rule (same as the rule for double quotes) splits single quotes from expressions enclosed between a pair of single quotes.
>
> This does not handle other cases of single quotes (e.g. don't, can't, and names like O'Conner).
>
I had a look at the change. The tokenization information must be
provided to TokenSample; this class just encapsulates that knowledge.
So it is not its responsibility to figure out how things should be
tokenized.
In your case I think you can just add "'s" to your detokenizer
dictionary like this:
<entry operation="MOVE_LEFT">
    <token>'s</token>
</entry>
Doesn't that fix your issue?
Jörn
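For reference, a complete detokenizer dictionary is just a list of such entries. The sketch below is inferred from the latin-detokenizer.xml sample mentioned earlier in this thread; the exact root element and operation names should be checked against your OpenNLP version before use.

```xml
<dictionary>
  <!-- Matching double quotes: attach right when opening, left when closing -->
  <entry operation="RIGHT_LEFT_MATCHING">
    <token>"</token>
  </entry>
  <!-- Sentence-final punctuation attaches to the preceding token -->
  <entry operation="MOVE_LEFT">
    <token>.</token>
  </entry>
  <entry operation="MOVE_LEFT">
    <token>,</token>
  </entry>
  <!-- Possessive clitic, as suggested above -->
  <entry operation="MOVE_LEFT">
    <token>'s</token>
  </entry>
</dictionary>
```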
RE: Tokenizer issue - Quotation marks
Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Hi Jorn,
I have modified the toString() method in TokenSample.java as given below. This is to add a <SPLIT> token before the token 's.
This helped me to train a tokenizer model that splits e.g. "it's" into two tokens, "it" and "'s", while at the same time a detokenizer rule (same as the rule for double quotes) splits single quotes from expressions enclosed between a pair of single quotes.
This does not handle other cases of single quotes (e.g. don't, can't, and names like O'Conner).
I am not sure if this change affects other functionality of OpenNLP (where else is the TokenSample class used?) or whether it was the right place to make it.
Please let me know what you think!
Regards
Rohana
public String toString() {
    StringBuilder sentence = new StringBuilder();
    int lastEndIndex = -1;
    for (Span token : tokenSpans) {
        if (lastEndIndex != -1) {
            // If there are no chars between the last token and this token,
            // insert the separator chars; otherwise insert a space
            String separator = "";
            if (lastEndIndex == token.getStart())
                separator = separatorChars;
            else {
                separator = " ";
                // New condition: add <SPLIT> before 's when converting
                // CoNLL03 data to tokenizer training data with "TokenizerConverter"
                if (token.getCoveredText(text).equals("'s")) {
                    separator = separatorChars;
                }
            }
            sentence.append(separator);
        }
        sentence.append(token.getCoveredText(text));
        lastEndIndex = token.getEnd();
    }
    return sentence.toString();
}
-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com]
Sent: 03 March 2011 16:01
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks
On 3/3/11 4:33 PM, Rohana Rajapakse wrote:
> Thanks. I have got the training files created (CoNLL03 + Reuters) and models trained. I used the latin-detokenizer that came with the download. The trained model solves the double quotation problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").
>
> I have tried adding the same detokenizer rules for the single quote. However, it seems to conflict with the different usages of the single quote (e.g. possession, as in Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which gets priority, rules in the detokenizer or <SPLIT> tags...
Yes, you need to add all the tokens which should be attached to the
previous one, like "'s", "'t", etc.
It would be nice to have such a file as part of the project.
Jörn
RE: Tokenizer issue - Quotation marks
Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks. I have got the training files created (CoNLL03 + Reuters) and models trained. I used the latin-detokenizer that came with the download. The trained model solves the double quotation problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").
I have tried adding the same detokenizer rules for the single quote. However, it seems to conflict with the different usages of the single quote (e.g. possession, as in Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which gets priority, rules in the detokenizer or <SPLIT> tags...
Rohana
-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com]
Sent: 02 March 2011 13:08
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks
On 3/2/11 1:47 PM, Rohana Rajapakse wrote:
> My NameFinder training data (created from CoNLL + Reuters) has <START> and <END> markups for person names. It doesn't have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model using my training data file. I had to remove <START> and <END> tags and add a few <SPLIT> tags to get the test to work (to get "Number of Outcomes" to match). It learns a model now, but it is not perfect. I need to add <SPLIT> markups for all single and double quotes etc.
>
> By the way, where is the "TokenizerConverter" that you mentioned? My download (from SourceForge) doesn't have it. Also, where is the converter to produce name finder training data that you created to convert CONLL03? Am I missing some code in my download?
>
> Also, please point me to your "docbook". I would like to know more about the detokenizer. I can't find a "release candidate" on the download site.
>
The release candidate can be found here:
http://people.apache.org/~joern/releases/opennlp-1.5.1-incubating/rc1/
Just use your name finder training file with the TokenizerConverter.
Pieces of the work are in 1.5.0, and all the things you are missing are
in 1.5.1. The docbook is also included in the 1.5.1 distribution.
I suggest that you just re-try with the rc1.
Jörn
RE: Tokenizer issue - Quotation marks
Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
My NameFinder training data (created from CoNLL + Reuters) has <START> and <END> markups for person names. It doesn't have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model using my training data file. I had to remove <START> and <END> tags and add a few <SPLIT> tags to get the test to work (to get "Number of Outcomes" to match). It learns a model now, but it is not perfect. I need to add <SPLIT> markups for all single and double quotes etc.
By the way, where is the "TokenizerConverter" that you mentioned? My download (from SourceForge) doesn't have it. Also, where is the converter to produce name finder training data that you created to convert CONLL03? Am I missing some code in my download?
Also, please point me to your "docbook". I would like to know more about the detokenizer. I can't find a "release candidate" on the download site.
Thanks
Rohana Rajapakse
Senior Software Developer
GOSS Interactive
t: +44 (0)844 880 3637
f: +44 (0)844 880 3638
e: rohana.rajapakse@gossinteractive.com
w: www.gossinteractive.com
-----Original Message-----
From: Rohana Rajapakse [mailto:Rohana.Rajapakse@gossinteractive.com]
Sent: 24 February 2011 11:10
To: opennlp-users@incubator.apache.org
Subject: RE: Tokenizer issue - Quotation marks
Thanks a lot.
I did create a name finder training data file using CoNLL some time ago. I will have a look at how I did it; I may be able to convert this training file to produce a tokenizer training file.
Thanks a lot. I will let you know how I get on with this, and would contribute anything that might be useful to others.
Rohana
-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com]
Sent: 24 February 2011 10:59
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks
On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago, so I should have that as well. I would like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. The process is described in our docbook
(just build it or download the release candidate).
After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
There is a sample detokenizer file which may have to be extended a
little for English; maybe we should create a new folder within the
tools project to collect all the non-statistical model files. I guess a
good detokenizer for the Reuters corpus could be useful for others too.
That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train
Hope that helps to get you started; contributions about this to the
documentation are very welcome.
Jörn
Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%3C4D659A52.1090806@gmail.com%3E
Re: Tokenizer issue - Quotation marks
Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes I do have Reuters corpus with me. Also used Browns corpus (subset of Reuters?) some times ago. So, I should have that as well. Would like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. The process for doing that is described
in our docbook (just build it or download the release candidate).
After you have produced the name finder training file, you can use the
tokenizer converter together with a detokenizer file to produce
training data for the tokenizer.
There is a sample detokenizer file which may need to be extended a
little for English;
maybe we should create a new folder within the tools project to collect
all the non-statistical model files. I guess a good detokenizer for the
Reuters corpus could be useful for others too.
That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train
Hope that helps you get started; contributions to the documentation
about this are very welcome.
Jörn
Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%3C4D659A52.1090806@gmail.com%3E
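[Editor's note: a sketch of the full pipeline described above. The TokenizerConverter line is the command from the mail; the TokenizerTrainer step is an assumption about how the 1.5 CLI is invoked afterwards, so check `bin/opennlp TokenizerTrainer` (run with no arguments) for the exact flags on your version.]

```shell
# 1. Convert name finder training data into tokenizer training data
#    (command from the mail; latin-detokenizer.xml ships with OpenNLP):
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 \
    -data en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train

# 2. Train a tokenizer model from it (flags assumed from the 1.5 CLI;
#    verify against the tool's usage message):
bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en \
    -data en-tok.train -model en-tok.bin
```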
RE: Tokenizer issue - Quotation marks
Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of Reuters?) some time ago. So I should have that as well. I would like to know the steps to create a training set.
Thanks
Rohana
-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com]
Sent: 23 February 2011 13:28
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks
On 2/23/11 2:18 PM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?
Yes and no; you need a file with tokenization information to train the
tokenizer.
In OpenNLP we use a one-sentence-per-line format, and tokens that are
not separated by whitespace are marked with a special tag.
See our documentation for information about the format.
We observed that it is easy to use rules to detokenize correctly
tokenized text.
For that reason I implemented a rule-based detokenizer.
Now you just need some kind of tokenized text to produce a training file
for our tokenizer.
You might want to use the Reuters corpus, or other freely available
English-language corpora.
If you have access to the Reuters corpus, I suggest that we go through
the steps to train the tokenizer with it.
Jörn
Re: Tokenizer issue - Quotation marks
Posted by Jörn Kottmann <ko...@gmail.com>.
On 2/23/11 2:18 PM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?
Yes and no; you need a file with tokenization information to train the
tokenizer.
In OpenNLP we use a one-sentence-per-line format, and tokens that are
not separated by whitespace are marked with a special tag.
See our documentation for information about the format.
We observed that it is easy to use rules to detokenize correctly
tokenized text.
For that reason I implemented a rule-based detokenizer.
Now you just need some kind of tokenized text to produce a training file
for our tokenizer.
You might want to use the Reuters corpus, or other freely available
English-language corpora.
If you have access to the Reuters corpus, I suggest that we go through
the steps to train the tokenizer with it.
Jörn
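[Editor's note: to make the format concrete, here is what one line of tokenizer training data looks like for the sentence from the start of this thread. As the OpenNLP documentation describes, each sentence is on its own line, and a `<SPLIT>` tag marks adjacent tokens that are not separated by whitespace.]

```
The army had made "<SPLIT>mistakes<SPLIT>"<SPLIT>.
```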
RE: Tokenizer issue - Quotation marks
Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Hi James,
It works for double quotes, but not for single quotes (i.e. fails for
'mistakes'). Is it a training issue then (not having cases with words
enclosed within single/double quotes)?
I have noticed that your model file is much smaller than the model file
available to download. Is it because your training data set is smaller?
How does it affect tokenizing overall?
Are there training sets available to download?
Regards
Rohana
-----Original Message-----
From: James Kosin [mailto:james.kosin@gmail.com]
Sent: 23 February 2011 11:26
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks
On 2/23/2011 3:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>
> Thanks
>
> Rohana
>
Sorry, I thought I responded directly to your email with the attachment.
I sent it again offline.
James
Re: Tokenizer issue - Quotation marks
Posted by James Kosin <ja...@gmail.com>.
On 2/23/2011 3:23 AM, Rohana Rajapakse wrote:
> Thanks for the replies.
>
> Which model are you referring to? Where can I find it?
>
> Thanks
>
> Rohana
>
Sorry, I thought I responded directly to your email with the attachment.
I sent it again offline.
James
RE: Tokenizer issue - Quotation marks
Posted by Rohana Rajapakse <Ro...@gossinteractive.com>.
Thanks for the replies.
Which model are you referring to? Where can I find it?
Thanks
Rohana
-----Original Message-----
From: James Kosin [mailto:james.kosin@gmail.com]
Sent: 23 February 2011 03:53
To: opennlp-users@incubator.apache.org
Subject: Re: Tokenizer issue - Quotation marks
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>
>
> I am using OpenNLP-1.5 to tokenize text. Tried the text The army had
> made "mistakes". It gives me "mistakes as a token (note the starting
> quote is part of the token). But, if I change the word mistakes to
> Mistakes (i.e. capital M) in the input text, then I get the token
> Mistakes (correctly).
>
>
>
> Anyone aware of this issue and any idea of how to get-around this?
>
>
>
Can you see if this model will work for you?
Thanks,
James
Re: Tokenizer issue - Quotation marks
Posted by James Kosin <ja...@gmail.com>.
On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>
>
> I am using OpenNLP-1.5 to tokenize text. Tried the text The army had
> made "mistakes". It gives me "mistakes as a token (note starting quote
> is part of the token). But, if I change the word mistake to Mistake
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
>
>
> Anyone aware of this issue and any idea of how to get-around this?
>
>
>
Can you see if this model will work for you?
Thanks,
James
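[Editor's note: the rule-based detokenizer idea discussed in this thread can be sketched in a few lines. The following is a from-scratch toy, NOT OpenNLP's implementation; the rule names only mimic the spirit of its detokenizer dictionary (merge to left, merge to right, and a matching rule for quote characters).]

```python
# Toy rule-based detokenizer illustrating the idea discussed above.
# NOT OpenNLP's implementation; rule names only echo the concept of a
# detokenizer dictionary with merge-left/merge-right/matching rules.

MERGE_TO_LEFT = "left"        # token attaches to the previous token (".", ",")
MERGE_TO_RIGHT = "right"      # token attaches to the following token ("(")
RIGHT_LEFT_MATCHING = "pair"  # alternates: opening quote, then closing quote

RULES = {
    ".": MERGE_TO_LEFT,
    ",": MERGE_TO_LEFT,
    "!": MERGE_TO_LEFT,
    "?": MERGE_TO_LEFT,
    ")": MERGE_TO_LEFT,
    "(": MERGE_TO_RIGHT,
    '"': RIGHT_LEFT_MATCHING,
    "'": RIGHT_LEFT_MATCHING,
}

def detokenize(tokens):
    """Join tokens into text, inserting spaces according to the rules."""
    out = []
    prev_rule = None
    is_open = {}  # whether a RIGHT_LEFT_MATCHING token is currently open
    for i, tok in enumerate(tokens):
        rule = RULES.get(tok)
        if rule == RIGHT_LEFT_MATCHING:
            # First occurrence opens (merge right), second closes (merge left).
            rule = MERGE_TO_LEFT if is_open.get(tok) else MERGE_TO_RIGHT
            is_open[tok] = not is_open.get(tok, False)
        if i == 0 or rule == MERGE_TO_LEFT or prev_rule == MERGE_TO_RIGHT:
            out.append(tok)           # no space before this token
        else:
            out.append(" " + tok)
        prev_rule = rule
    return "".join(out)

print(detokenize(["The", "army", "had", "made", '"', "mistakes", '"', "."]))
# → The army had made "mistakes".
```

Running the rules in the opposite direction (emitting `<SPLIT>` wherever they say no space belongs) is how such a detokenizer can turn already-tokenized corpora into tokenizer training data, as described earlier in the thread.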