Posted to users@opennlp.apache.org by Joan Codina <Jo...@upf.edu> on 2012/03/30 16:47:23 UTC

tokenizer abbreviation dictionary

Hello,
I want to train an English tokenizer, but to do so I need the 
abbreviations dictionary and the tokenized sample data. I could not find 
either of them in the OpenNLP repositories. Even the format of the 
abbreviations dictionary is not explained, but I'm sure there must be a 
basic sample one.


thanks

Joan Codina


Re: abbreviation dictionary format

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Yes Jim, you need to train, and that is the right format. Thank you. The
abbreviation dictionary can increase the effectiveness while dealing with
abbreviations, but you still need the model.

Just a note: often you don't need to convert to the OpenNLP format
yourself, you can use the formatters instead. I will explain how to use
them in 1.5.2-incubating. This process was improved in trunk, and it will
be a lot easier in the next release.

The tool to use is the *SentenceDetectorConverter*:

$ bin/opennlp *SentenceDetectorConverter*
Usage: opennlp SentenceDetectorConverter format ...

For now you need to know the available formats yourself. They are *conllx*,
*pos*, and *namefinder* (this has already been improved, and a future
release will list them for you).

For example, to create the Sentence Detector training data from conllx:

$ bin/opennlp *SentenceDetectorConverter conllx*
Usage: opennlp SentenceDetectorConverter conllx -encoding charsetName -data
sampleData -detokenizer dictionary

Arguments description:
-encoding charsetName
-data sampleData
-detokenizer dictionary

You will need a detokenizer dictionary. There is one for English here:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/lang/en/tokenizer/en-detokenizer.xml?view=co
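
For example, converting a CoNLL-X corpus in one go might look like this
(a minimal sketch; the corpus and output file names are illustrative,
and it assumes the converted samples are written to stdout):

$ bin/opennlp SentenceDetectorConverter conllx -encoding UTF-8 \
    -data corpus.conllx -detokenizer en-detokenizer.xml > en-sent.train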


William

On Tue, Apr 10, 2012 at 10:05 AM, william.colen@gmail.com <
william.colen@gmail.com> wrote:

> I checked the English models from the download page. They were not trained
> using an abbreviation dictionary. If they were, you would be able to see it
> by extracting the model like a zip file. So we don't have a basic English
> abbreviation dictionary for you to start with; you will need to create
> yours from scratch.
>
> To create your own abbreviation dictionary, use the *DictionaryBuilder* tool:
>
> $ bin/opennlp *DictionaryBuilder*
> Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding
> charsetName]
>
> Arguments description:
> -inputFile in
> Plain file with one entry per line
>  -outputFile out
> The dictionary file.
> -encoding charsetName
>  specifies the encoding which should be used for reading and writing
> text. If not specified the system default will be used.
>
> The output looks like this:
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=markup
>
On Tue, Apr 10, 2012 at 6:31 AM, Jim - FooBar(); <ji...@gmail.com> wrote:
>
>> To train models of any type you need training data... The pretrained
>> English tokenizer was trained on the CoNLL shared task if I remember
>> correctly... Maybe one of the developers can shed some light on
>> this... Anyway, I don't think you need a dictionary but training data
>> of the following form:
>>
>> Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a
>> nonexecutive director Nov. 29<SPLIT>.
>> Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing
>> group<SPLIT>.
>> Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated
>> Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British
>> industrial conglomerate<SPLIT>.
>>
>> Hope that helps,
>>
>> Jim
>>
>> P.S.: Did you mean an abbreviation dictionary? Well, you can't really
>> train a model using an abbreviation dictionary...
>>
>>
>> On 10/04/12 09:02, Joan Codina wrote:
>>
>>>
>>> I sent this some days ago, but I got no answer :-((  :
>>>
>>> To train a tokenizer I can use a dictionary, but
>>> where is the dictionary used to train the current English model? And
>>> where can I find information about the dictionary format, so I can,
>>> at least, generate my own?
>>>
>>> thanks
>>> Joan Codina
>>>
>>>
>>
>

Re: abbreviation dictionary format

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/11/2012 09:16 AM, Joan Codina wrote:
> OK,
> I will try it,
> but doesn't this introduce a bias, since the de-tokenizer has only a
> few rules?
>
> Is there no way to do incremental training of an existing model, or to
> just add a dictionary of abbreviations to an existing model?

No, we cannot complement an existing model with additional training data.
You need to re-train the whole thing with all the data.

Well, you can add a dictionary to the model, but the model would not
know about the new features you can produce via the dictionary.

You assume that the text was tokenized correctly; undoing it with the
rule-based de-tokenizer usually produces something which is very close
to the original text. In some cases you even want to de-tokenize a bit
too much to get a better tokenizer.

Jörn

Re: abbreviation dictionary format

Posted by Joan Codina <Jo...@upf.edu>.
OK,
I will try it,
but doesn't this introduce a bias, since the de-tokenizer has only a
few rules?

Is there no way to do incremental training of an existing model, or to
just add a dictionary of abbreviations to an existing model?

Joan

On 10/04/12 16:51, Jörn Kottmann wrote:
> On 04/10/2012 04:44 PM, Joan Codina wrote:
>> But to train the system I only found that file... which is small.
>> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup 
>>
>> which only contains 121 sentences. I don't know if this is enough or
>> whether there are other annotated training sets.
>
> No, that is not enough. Get a training data set for the language you
> need. Most of the data sets referenced in the Corpora section can be
> used to train the tokenizer. These corpora are already tokenized and
> can be de-tokenized into training data for the tokenizer.
>
> Jörn

-- 

Joan Codina Filbà
Departament de Tecnologia
Universitat Pompeu Fabra


Re: abbreviation dictionary format

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/22/2012 10:35 PM, Joan Codina wrote:
> thanks Jörn
> What is the DictionaryDetokenizerTool?

It's our command line tool; it reads the input from
stdin and writes the detokenized output to stdout.
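
For example (a sketch; the file names are illustrative):

$ bin/opennlp DictionaryDetokenizer en-detokenizer.xml \
    < tokenized.txt > detokenized.txt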

Jörn

Re: abbreviation dictionary format

Posted by Joan Codina <Jo...@upf.edu>.
thanks Jörn
What is the DictionaryDetokenizerTool?


On 04/20/2012 09:35 AM, Jörn Kottmann wrote:
> On 04/20/2012 08:33 AM, Joan Codina wrote:
>>
>> So, the processing is correct, but the <SPLIT>'s are missing for, e.g.,
>> "Haag." or "Chicago's".
>> I wonder if there is a missing parameter or whether I need another
>> dictionary.
>
> Just checked the code; it looks like it cannot output the <SPLIT> markers.
> We should fix that.
>
> There is also a nice method inside the cmd line tool
> (DictionaryDetokenizerTool)
> which can produce a detokenized string. We should move that one to the
> DictionaryDetokenizer.
>
> Jörn
>
>

Re: abbreviation dictionary format

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/20/2012 08:33 AM, Joan Codina wrote:
>
> So, the processing is correct, but the <SPLIT>'s are missing for, e.g.,
> "Haag." or "Chicago's".
> I wonder if there is a missing parameter or whether I need another
> dictionary.

Just checked the code; it looks like it cannot output the <SPLIT> markers.
We should fix that.

There is also a nice method inside the cmd line tool
(DictionaryDetokenizerTool)
which can produce a detokenized string. We should move that one to the
DictionaryDetokenizer.

Jörn



Re: abbreviation dictionary format

Posted by Joan Codina <Jo...@upf.edu>.
From this text:
"
in an Oct. 19 review of `` The Misanthrope '' at Chicago 's Goodman 
Theatre ( `` Revitalized Classics Take the Stage in Windy City , '' 
Leisure & Arts ) , the role of Celimene , played by Kim Cattrall , was 
mistakenly attributed to Christina Haag .
"

I get:
"
in an Oct. 19 review of ``The Misanthrope'' at Chicago's Goodman Theatre 
(``Revitalized Classics Take the Stage in Windy City,'' Leisure & Arts), 
the role of Celimene, played by Kim Cattrall, was mistakenly attributed 
to Christina Haag.
"

So, the processing is correct, but the <SPLIT>'s are missing for, e.g.,
"Haag." or "Chicago's".
I wonder if there is a missing parameter or whether I need another dictionary.


On 04/19/2012 07:11 PM, Jörn Kottmann wrote:
> On 04/19/2012 06:20 PM, Joan Codina wrote:
>>
>>
>> Then, with the sentences with all tokens separated by spaces, I need
>> to merge the words, adding the <SPLIT> marks, but I don't know how to
>> do it with the DictionaryDetokenizer:
>> ./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml
>> <../models/CoNLL2009-ST-English-train.sent
>>
>> as it merges the sentences but does not add the <SPLIT> marks.
>
> It should insert <SPLIT> tags for certain spaces, so the tokenizer can 
> learn
> that there is something to split. Input should be one sentence per line.
>
> What output do you get?
>
> Jörn

Re: abbreviation dictionary format

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/19/2012 06:20 PM, Joan Codina wrote:
>
>
> Then, with the sentences with all tokens separated by spaces, I need
> to merge the words, adding the <SPLIT> marks, but I don't know how to
> do it with the DictionaryDetokenizer:
> ./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml
> <../models/CoNLL2009-ST-English-train.sent
>
> as it merges the sentences but does not add the <SPLIT> marks.

It should insert <SPLIT> tags for certain spaces, so the tokenizer can learn
that there is something to split. Input should be one sentence per line.
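
For illustration, with one tokenized sentence per line the intended
round trip looks like this (a hypothetical example; as noted elsewhere
in the thread, the current tool does not yet emit the markers):

tokenized input:  Chicago 's Goodman Theatre .
marked output:    Chicago<SPLIT>'s Goodman Theatre<SPLIT>.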

What output do you get?

Jörn

Re: abbreviation dictionary format

Posted by Joan Codina <Jo...@upf.edu>.
How can I de-tokenize a CoNLL training set?
I have tried some commands, but none seems to work.
I did

./detokenizer.sh models/CoNLL2009-ST-English-train.txt \
    > models/CoNLL2009-ST-English-train.sent

where detokenizer.sh is:

#!/bin/bash
# Extract the form (second tab-separated column) of a CoNLL file and
# print each sentence as a single space-separated line.

SEP="\t";
TAG="[^${SEP}]*";
SENTENCESEP="<SENTENCE123456789SEP>";
# Keep only column 2, replace sentence-boundary blank lines with a
# unique marker, join all tokens with spaces, then split at the marker
# again so each sentence ends up on its own line.
exec cat $1 | perl -pe "s/^${TAG}${SEP}(${TAG}).*$/\1/g" | perl -pe "s/^\s*$/\n/g" | perl -pe "s/^$/${SENTENCESEP}/g" | perl -pe "s/\n/ /g" | perl -pe "s/ ${SENTENCESEP} /\n/g"


Then, with the sentences with all tokens separated by spaces, I need
to merge the words, adding the <SPLIT> marks, but I don't know how to
do it with the DictionaryDetokenizer:

./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml \
    <../models/CoNLL2009-ST-English-train.sent

as it merges the sentences but does not add the <SPLIT> marks.


thanks in advance

Joan.



On 04/10/2012 04:51 PM, Jörn Kottmann wrote:
> On 04/10/2012 04:44 PM, Joan Codina wrote:
>> But to train the system I only found that file... which is small.
>> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup 
>>
>> which only contains 121 sentences. I don't know if this is enough or
>> whether there are other annotated training sets.
>
> No, that is not enough. Get a training data set for the language you
> need. Most of the data sets referenced in the Corpora section can be
> used to train the tokenizer. These corpora are already tokenized and
> can be de-tokenized into training data for the tokenizer.
>
> Jörn

Re: abbreviation dictionary format

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/10/2012 04:44 PM, Joan Codina wrote:
> But to train the system I only found that file... which is small.
> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup 
>
> which only contains 121 sentences. I don't know if this is enough or
> whether there are other annotated training sets.

No, that is not enough. Get a training data set for the language you
need. Most of the data sets referenced in the Corpora section can be
used to train the tokenizer. These corpora are already tokenized and
can be de-tokenized into training data for the tokenizer.

Jörn

Re: abbreviation dictionary format

Posted by Joan Codina <Jo...@upf.edu>.
Thanks.
I know I need training data with the <SPLIT> marks, but if I can add a
list of domain abbreviations I hope I will be able to solve some
problems I have with tokenization.
I will also expand the training set a bit with some sentences I find
problematic.
But to train the system I only found this file, which is small:
http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
It only contains 121 sentences. I don't know if this is enough or
whether there are other annotated training sets.


Joan



On 10/04/12 15:20, Jim - FooBar(); wrote:
> On 10/04/12 14:18, Jörn Kottmann wrote:
>> On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:
>>>
>>> But you still cannot "train" anything (maxent/perceptron) on the
>>> dictionary, can you???
>>> One needs training data for that, yes?
>>
>> The dictionary is used to produce additional features on top of our
>> standard feature set.
>> Therefore you need training data to train our statistical tokenizer,
>> even though the feature generation can use a dictionary to produce
>> features.
>>
>> Jörn
>
> aha ok, that makes sense...
>
> Jim

-- 

Joan Codina Filbà
Departament de Tecnologia
Universitat Pompeu Fabra


Re: abbreviation dictionary format

Posted by "Jim - FooBar();" <ji...@gmail.com>.
On 10/04/12 14:18, Jörn Kottmann wrote:
> On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:
>>
>> But you still cannot "train" anything (maxent/perceptron) on the
>> dictionary, can you???
>> One needs training data for that, yes?
>
> The dictionary is used to produce additional features on top of our
> standard feature set.
> Therefore you need training data to train our statistical tokenizer,
> even though the feature generation can use a dictionary to produce
> features.
>
> Jörn

aha ok, that makes sense...

Jim

Re: abbreviation dictionary format

Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:
>
> But you still cannot "train" anything (maxent/perceptron) on the
> dictionary, can you???
> One needs training data for that, yes?

The dictionary is used to produce additional features on top of our
standard feature set.
Therefore you need training data to train our statistical tokenizer,
even though the feature generation can use a dictionary to produce
features.
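
For example, a training call that passes an abbreviation dictionary as
an extra feature source could look like this (a minimal sketch; the
file names are illustrative, and the exact flags should be checked
against the usage text of your version):

$ bin/opennlp TokenizerTrainer -lang en -encoding UTF-8 \
    -data en-token.train -abbDict en-abb.xml -alphaNumOpt \
    -model en-token.bin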

Jörn

Re: abbreviation dictionary format

Posted by "Jim - FooBar();" <ji...@gmail.com>.
On 10/04/12 14:05, william.colen@gmail.com wrote:
> I checked the English models from the download page. They were not trained
> using an abbreviation dictionary. If they were, you would be able to see it
> by extracting the model like a zip file. So we don't have a basic English
> abbreviation dictionary for you to start with; you will need to create
> yours from scratch.

But you still cannot "train" anything (maxent/perceptron) on the
dictionary, can you???
One needs training data for that, yes?

Jim

Re: abbreviation dictionary format

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
I checked the English models from the download page. They were not trained
using an abbreviation dictionary. If they were, you would be able to see it
by extracting the model like a zip file. So we don't have a basic English
abbreviation dictionary for you to start with; you will need to create
yours from scratch.
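
For instance, since a model file is just a zip archive, listing its
entries shows whether an abbreviation dictionary was packed in (the
model file name here is illustrative):

$ unzip -l en-token.bin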

To create your own abbreviation dictionary, use the *DictionaryBuilder* tool:

$ bin/opennlp *DictionaryBuilder*
Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding
charsetName]

Arguments description:
-inputFile in
Plain file with one entry per line
 -outputFile out
The dictionary file.
-encoding charsetName
 specifies the encoding which should be used for reading and writing text.
If not specified the system default will be used.

The output looks like this:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=markup
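
For example (a minimal sketch; the word list and file names are
illustrative):

$ printf 'Mr.\nDr.\nNov.\ne.g.\n' > abbreviations.txt
$ bin/opennlp DictionaryBuilder -inputFile abbreviations.txt \
    -outputFile en-abb.xml -encoding UTF-8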

On Tue, Apr 10, 2012 at 6:31 AM, Jim - FooBar(); <ji...@gmail.com> wrote:

> To train models of any type you need training data... The pretrained
> English tokenizer was trained on the CoNLL shared task if I remember
> correctly... Maybe one of the developers can shed some light on
> this... Anyway, I don't think you need a dictionary but training data
> of the following form:
>
> Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a
> nonexecutive director Nov. 29<SPLIT>.
> Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing
> group<SPLIT>.
> Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated
> Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British
> industrial conglomerate<SPLIT>.
>
> Hope that helps,
>
> Jim
>
> P.S.: Did you mean an abbreviation dictionary? Well, you can't really
> train a model using an abbreviation dictionary...
>
>
> On 10/04/12 09:02, Joan Codina wrote:
>
>>
>> I sent this some days ago, but I got no answer :-((  :
>>
>> To train a tokenizer I can use a dictionary, but
>> where is the dictionary used to train the current English model? And
>> where can I find information about the dictionary format, so I can, at
>> least, generate my own?
>>
>> thanks
>> Joan Codina
>>
>>
>

Re: abbreviation dictionary format

Posted by "Jim - FooBar();" <ji...@gmail.com>.
To train models of any type you need training data... The pretrained
English tokenizer was trained on the CoNLL shared task if I remember
correctly... Maybe one of the developers can shed some light on
this... Anyway, I don't think you need a dictionary but training data
of the following form:

Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a 
nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing 
group<SPLIT>.
Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated 
Gold Fields PLC<SPLIT>, was named a nonexecutive director of this 
British industrial conglomerate<SPLIT>.
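
Once a model has been trained on data in this format, it can be applied
from the command line (a sketch; the file names are illustrative):

$ bin/opennlp TokenizerME en-token.bin < input.txt > tokenized.txt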

Hope that helps,

Jim

P.S.: Did you mean an abbreviation dictionary? Well, you can't really
train a model using an abbreviation dictionary...

On 10/04/12 09:02, Joan Codina wrote:
>
> I sent this some days ago, but I got no answer :-((  :
>
> To train a tokenizer I can use a dictionary, but
> where is the dictionary used to train the current English model? And
> where can I find information about the dictionary format, so I can,
> at least, generate my own?
>
> thanks
> Joan Codina
>


abbreviation dictionary format

Posted by Joan Codina <Jo...@upf.edu>.
I sent this some days ago, but I got no answer :-((  :

To train a tokenizer I can use a dictionary, but
where is the dictionary used to train the current English model? And
where can I find information about the dictionary format, so I can,
at least, generate my own?

thanks
Joan Codina