Posted to users@opennlp.apache.org by Joan Codina <Jo...@upf.edu> on 2012/03/30 16:47:23 UTC
tokenizer abbreviation dictionary
Hello,
I want to train an English tokenizer, but to do so I need the
abbreviations dictionary and the tokenized sample data. I could not find
either of them in the OpenNLP repositories. Even the format of the
abbreviations dictionary is not explained, but I'm sure there must be a
basic sample one.
thanks
Joan Codina
Re: abbreviation dictionary format
Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Yes Jim, you need to train, and that is the right format. Thank you. The
abbreviation dictionary can increase the effectiveness while dealing with
abbreviations, but you still need the model.
Just a note: often you don't need to convert to the OpenNLP format
yourself; you can use the formatters instead. I will explain how to use
them in 1.5.2-incubating. This process was improved in trunk, and it will
be a lot easier in the next release.
The tool to use is the *SentenceDetectorConverter*:
$ bin/opennlp SentenceDetectorConverter
Usage: opennlp SentenceDetectorConverter format ...
For now you need to know the available formats yourself. They are
*conllx*, *pos*, and *namefinder* (this has already been improved; the
next release will list them for you).
For example, to create the Sentence Detector training data from conllx:
$ bin/opennlp SentenceDetectorConverter conllx
Usage: opennlp SentenceDetectorConverter conllx -encoding charsetName -data
sampleData -detokenizer dictionary
Arguments description:
-encoding charsetName
-data sampleData
-detokenizer dictionary
You will need a detokenizer dictionary. There is one for English here:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/lang/en/tokenizer/en-detokenizer.xml?view=co
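As a rough illustration of what a rule-based detokenizer does (this is a conceptual sketch, not OpenNLP's implementation; the attachment rules below are invented for the example):

```python
# Hypothetical merge rules: a real detokenizer dictionary encodes,
# per token, whether it attaches to its left or right neighbour.
ATTACH_LEFT = {".", ",", ")", "''", "'s", "%"}   # no space before these
ATTACH_RIGHT = {"(", "``"}                       # no space after these

def detokenize(tokens):
    """Rebuild running text from a token list using the merge rules."""
    out = ""
    for i, tok in enumerate(tokens):
        if i == 0 or tok in ATTACH_LEFT or tokens[i - 1] in ATTACH_RIGHT:
            out += tok          # merge: no space at this boundary
        else:
            out += " " + tok    # normal boundary: keep a space
    return out
```

For sentence-detector training only the merged text is needed; for tokenizer training the detokenizer additionally marks each merged boundary with <SPLIT>.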
William
On Tue, Apr 10, 2012 at 10:05 AM, william.colen@gmail.com <
william.colen@gmail.com> wrote:
> I checked the English models from the download page. They were not trained
> using an abbreviation dictionary; if they were, you would be able to see it
> by extracting the model like a zip file. So we don't have a basic English
> abbreviation dictionary for you to start with; you will need to create
> yours from scratch.
>
> To create your own abbreviation dictionary use *DictionaryBuilder* tool:
>
> $ bin/opennlp DictionaryBuilder
> Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding
> charsetName]
>
> Arguments description:
> -inputFile in
> Plain file with one entry per line
> -outputFile out
> The dictionary file.
> -encoding charsetName
> specifies the encoding which should be used for reading and writing
> text. If not specified the system default will be used.
>
> The output looks like this:
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=markup
>
> On Tue, Apr 10, 2012 at 6:31 AM, Jim - FooBar(); <ji...@gmail.com> wrote:
>
>> To train models of any type you need training data... The pretrained
>> English tokenizer was trained on the CoNLL shared task, if I remember
>> correctly... Maybe one of the developers can shed some light on
>> this... Anyway, I don't think you need a dictionary, but training data
>> of the following form:
>>
>> Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a
>> nonexecutive director Nov. 29<SPLIT>.
>> Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing
>> group<SPLIT>.
>> Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated
>> Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British
>> industrial conglomerate<SPLIT>.
>>
>> Hope that helps,
>>
>> Jim
>>
>> P.S.: Did you mean an abbreviation dictionary? Well, you can't really
>> train a model using an abbreviation dictionary...
>>
>>
>> On 10/04/12 09:02, Joan Codina wrote:
>>
>>>
>>> I sent this some days ago, but I got no answer :-((
>>>
>>> To train a tokenizer I can use a dictionary, but
>>> where is the dictionary used to train the current English model? And
>>> where can I find information about the dictionary format, so I can,
>>> at least, generate my own?
>>>
>>> thanks
>>> Joan Codina
>>>
>>>
>>
>
Re: abbreviation dictionary format
Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/11/2012 09:16 AM, Joan Codina wrote:
> Ok,
> I will try it,
> but doesn't this introduce a bias, since the de-tokenizer has only a few
> rules?
>
> Is there no way to do incremental training of an existing model, or to
> just add a dictionary of abbreviations to an existing model?
No, we cannot complement an existing model with additional training data.
You need to re-train the whole thing with all the data.
Well, you can add a dictionary to the model, but the model would not know
about the new features you can produce via the dictionary.
You assume that the text was tokenized correctly; undoing it with the
rule-based de-tokenizer usually produces something very close to the
original text. In some cases you even want to de-tokenize a bit too much
to get a better tokenizer.
Jörn
Re: abbreviation dictionary format
Posted by Joan Codina <Jo...@upf.edu>.
Ok,
I will try it,
but doesn't this introduce a bias, since the de-tokenizer has only a few
rules?
Is there no way to do incremental training of an existing model, or to
just add a dictionary of abbreviations to an existing model?
Joan
On 10/04/12 16:51, Jörn Kottmann wrote:
> On 04/10/2012 04:44 PM, Joan Codina wrote:
>> But to train the system I only found that file... which is small.
>> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
>>
>> which only contains 121 sentences. I don't know if this is enough, or
>> whether there are other annotated training sets.
>
> No, that is not enough. Get some training data set for the language
> you need. Most of the data sets
> referenced in the Corpora section can be used to train the tokenizer.
> These corpora are already tokenized
> and can be de-tokenized into training data for the tokenizer.
>
> Jörn
--
Joan Codina Filbà
Departament de Tecnologia
Universitat Pompeu Fabra
_______________________________________________________________________________
Before printing this e-mail, think about whether it is really necessary;
if it is, remember that printing double-sided saves 25% of the paper, and
the trees will thank you.
_______________________________________________________________________________
/The information in this electronic message is confidential, personal and
non-transferable, and is addressed only to the address(es) indicated
above. If you are reading this message by mistake, be advised that its
disclosure, use or distribution, in whole or in part, is prohibited, and
we ask you to delete the original message together with its attachments
without reading or saving it./
/Thank you./
Re: abbreviation dictionary format
Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/22/2012 10:35 PM, Joan Codina wrote:
> Thanks, Jörn.
> What is the DictionaryDetokenizerTool?
It's our command-line tool, which reads the input from
stdin and writes the detokenized output to stdout.
Jörn
Re: abbreviation dictionary format
Posted by Joan Codina <Jo...@upf.edu>.
Thanks, Jörn.
What is the DictionaryDetokenizerTool?
On 04/20/2012 09:35 AM, Jörn Kottmann wrote:
> On 04/20/2012 08:33 AM, Joan Codina wrote:
>>
>> So the processing is correct, but the <SPLIT>s are missing at, for
>> example, "Haag." or "Chicago's",
>> and I wonder if there is a missing parameter or whether I need another
>> dictionary.
>
> Just checked the code, looks like it cannot output the <SPLIT> markers.
> We should fix that.
>
> There is also a nice method inside the cmd line tool
> (DictionaryDetokenizerTool)
> which can produce a detokenized string. We should move that one to the
> DictionaryDetokenizer.
>
> Jörn
>
>
Re: abbreviation dictionary format
Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/20/2012 08:33 AM, Joan Codina wrote:
>
> So the processing is correct, but the <SPLIT>s are missing at, for
> example, "Haag." or "Chicago's",
> and I wonder if there is a missing parameter or whether I need another
> dictionary.
Just checked the code, looks like it cannot output the <SPLIT> markers.
We should fix that.
There is also a nice method inside the cmd line tool
(DictionaryDetokenizerTool)
which can produce a detokenized string. We should move that one to the
DictionaryDetokenizer.
Jörn
Re: abbreviation dictionary format
Posted by Joan Codina <Jo...@upf.edu>.
from this text
"
in an Oct. 19 review of `` The Misanthrope '' at Chicago 's Goodman
Theatre ( `` Revitalized Classics Take the Stage in Windy City , ''
Leisure & Arts ) , the role of Celimene , played by Kim Cattrall , was
mistakenly attributed to Christina Haag .
"
I get
"
in an Oct. 19 review of ``The Misanthrope'' at Chicago's Goodman Theatre
(``Revitalized Classics Take the Stage in Windy City,'' Leisure & Arts),
the role of Celimene, played by Kim Cattrall, was mistakenly attributed
to Christina Haag.
"
So the processing is correct, but the <SPLIT>s are missing at, for
example, "Haag." or "Chicago's",
and I wonder if there is a missing parameter or whether I need another dictionary.
On 04/19/2012 07:11 PM, Jörn Kottmann wrote:
> On 04/19/2012 06:20 PM, Joan Codina wrote:
>>
>>
>> then, with the sentences having all tokens separated by spaces, I need
>> to merge the words while adding the <SPLIT> markers, but I don't know
>> how to do it with the DictionaryDetokenizer:
>> ./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml
>> <../models/CoNLL2009-ST-English-train.sent
>>
>> as it merges the sentences but does not add the <SPLIT> markers
>
> It should insert <SPLIT> tags for certain spaces, so the tokenizer can
> learn
> that there is something to split. Input should be one sentence per line.
>
> What output do you get?
>
> Jörn
Re: abbreviation dictionary format
Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/19/2012 06:20 PM, Joan Codina wrote:
>
>
> then, with the sentences having all tokens separated by spaces, I need
> to merge the words while adding the <SPLIT> markers, but I don't know
> how to do it with the DictionaryDetokenizer:
> ./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml
> <../models/CoNLL2009-ST-English-train.sent
>
> as it merges the sentences but does not add the <SPLIT> markers
It should insert <SPLIT> tags for certain spaces, so the tokenizer can learn
that there is something to split. Input should be one sentence per line.
What output do you get?
Jörn
Re: abbreviation dictionary format
Posted by Joan Codina <Jo...@upf.edu>.
How can I de-tokenize a CoNLL training set?
I have tried some commands but none seems to work.
I did:
./detokenizer.sh models/CoNLL2009-ST-English-train.txt
>models/CoNLL2009-ST-English-train.sent
where detokenizer.sh is:
#!/bin/bash
# Extract the second (FORM) column of a CoNLL file and join each
# sentence's tokens onto one space-separated line.
SEP="\t"
TAG="[^${SEP}]*"
SENTENCESEP="<SENTENCE123456789SEP>"
cat "$1" | perl -pe "s/^${TAG}${SEP}(${TAG}).*$/\1/g" | perl -pe "s/^\s*$/\n/g" | perl -pe "s/^$/${SENTENCESEP}/g" | perl -pe "s/\n/ /g" | perl -pe "s/ ${SENTENCESEP} /\n/g"
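The same column extraction can be sketched in Python (a hypothetical stand-in for the shell pipeline above, assuming the FORM is the second tab-separated field and sentences are separated by blank lines):

```python
import sys

def conll_to_sentences(lines):
    """Collect the FORM column (2nd tab-separated field) of a CoNLL
    file into one space-separated line per sentence."""
    sentences, tokens = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():              # blank line ends a sentence
            if tokens:
                sentences.append(" ".join(tokens))
                tokens = []
        else:
            tokens.append(line.split("\t")[1])
    if tokens:                            # file may not end with a blank line
        sentences.append(" ".join(tokens))
    return sentences

if __name__ == "__main__":
    for sent in conll_to_sentences(sys.stdin):
        print(sent)
```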
then, with the sentences having all tokens separated by spaces, I need
to merge the words while adding the <SPLIT> markers, but I don't know
how to do it with the DictionaryDetokenizer:
./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml
<../models/CoNLL2009-ST-English-train.sent
as it merges the sentences but does not add the <SPLIT> markers
thanks in advance
Joan.
On 04/10/2012 04:51 PM, Jörn Kottmann wrote:
> On 04/10/2012 04:44 PM, Joan Codina wrote:
>> But to train the system I only found that file... which is small.
>> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
>>
>> which only contains 121 sentences. I don't know if this is enough, or
>> whether there are other annotated training sets.
>
> No, that is not enough. Get some training data set for the language
> you need. Most of the data sets
> referenced in the Corpora section can be used to train the tokenizer.
> These corpora are already tokenized
> and can be de-tokenized into training data for the tokenizer.
>
> Jörn
Re: abbreviation dictionary format
Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/10/2012 04:44 PM, Joan Codina wrote:
> But to train the system I only found that file... which is small.
> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
>
> which only contains 121 sentences. I don't know if this is enough, or
> whether there are other annotated training sets.
No, that is not enough. Get some training data set for the language you
need. Most of the data sets
referenced in the Corpora section can be used to train the tokenizer.
These corpora are already tokenized
and can be de-tokenized into training data for the tokenizer.
Jörn
Re: abbreviation dictionary format
Posted by Joan Codina <Jo...@upf.edu>.
Thanks.
I know I need a training set with the <SPLIT> markers, but if I can add a
list of domain abbreviations, I hope I will be able to solve some
problems I have with tokenization.
I will also expand the training set a bit, with some sentences I find
conflictive.
But to train the system I only found this file... which is small:
http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
which only contains 121 sentences. I don't know if this is enough, or
whether there are other annotated training sets.
Joan
On 10/04/12 15:20, Jim - FooBar(); wrote:
> On 10/04/12 14:18, Jörn Kottmann wrote:
>> On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:
>>>
>>> But you still cannot "train" anything (maxent/perceptron) on the
>>> dictionary, can you?
>>> One needs training data for that, yes?
>>
>> The dictionary is used to produce additional features on top of our
>> standard feature set.
>> Therefore you need training data to train our statistical tokenizer,
>> even though the feature generation can use a dictionary to produce
>> features.
>>
>> Jörn
>
> aha ok, that makes sense...
>
> Jim
--
Joan Codina Filbà
Departament de Tecnologia
Universitat Pompeu Fabra
Re: abbreviation dictionary format
Posted by "Jim - FooBar();" <ji...@gmail.com>.
On 10/04/12 14:18, Jörn Kottmann wrote:
> On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:
>>
>> But you still cannot "train" anything (maxent/perceptron) on the
>> dictionary, can you?
>> One needs training data for that, yes?
>
> The dictionary is used to produce additional features on top of our
> standard feature set.
> Therefore you need training data to train our statistical tokenizer,
> even though the feature generation can use a dictionary to produce
> features.
>
> Jörn
aha ok, that makes sense...
Jim
Re: abbreviation dictionary format
Posted by Jörn Kottmann <ko...@gmail.com>.
On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:
>
> But you still cannot "train" anything (maxent/perceptron) on the
> dictionary, can you?
> One needs training data for that, yes?
The dictionary is used to produce additional features on top of our
standard feature set.
Therefore you need training data to train our statistical tokenizer, even
though the feature generation can use a dictionary to produce features.
Jörn
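Jörn's point can be illustrated with a sketch of dictionary-backed feature generation (the feature names here are hypothetical; this is not OpenNLP's actual feature generator):

```python
def token_features(tokens, i, abbreviations):
    """Features for deciding whether to split after tokens[i].
    The dictionary contributes only the extra 'in_abb_dict' predicate;
    the classifier still has to be trained on labelled data to learn
    what that predicate means."""
    tok = tokens[i]
    return {
        "token": tok,
        "ends_with_period": tok.endswith("."),
        "is_capitalized": tok[:1].isupper(),
        "in_abb_dict": tok in abbreviations,   # the dictionary feature
    }
```

A maxent or perceptron model trained with such features can then exploit the abbreviation list, but the list alone trains nothing.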
Re: abbreviation dictionary format
Posted by "Jim - FooBar();" <ji...@gmail.com>.
On 10/04/12 14:05, william.colen@gmail.com wrote:
> I checked the English models from the download page. They were not trained
> using an abbreviation dictionary; if they were, you would be able to see it
> by extracting the model like a zip file. So we don't have a basic English
> abbreviation dictionary for you to start with; you will need to create
> yours from scratch.
But you still cannot "train" anything (maxent/perceptron) on the
dictionary, can you?
One needs training data for that, yes?
Jim
Re: abbreviation dictionary format
Posted by "william.colen@gmail.com" <wi...@gmail.com>.
I checked the English models from the download page. They were not trained
using an abbreviation dictionary; if they were, you would be able to see it
by extracting the model like a zip file. So we don't have a basic English
abbreviation dictionary for you to start with; you will need to create
yours from scratch.
To create your own abbreviation dictionary use *DictionaryBuilder* tool:
$ bin/opennlp DictionaryBuilder
Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding
charsetName]
Arguments description:
-inputFile in
Plain file with one entry per line
-outputFile out
The dictionary file.
-encoding charsetName
specifies the encoding which should be used for reading and writing text.
If not specified the system default will be used.
The output looks like this:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=markup
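For illustration, a dictionary file of roughly this shape can also be written by hand; the schema below is approximated from the linked abb.xml and should be checked against it before use:

```python
from xml.etree import ElementTree as ET

def build_dictionary_xml(entries):
    """Serialize a list of abbreviations into an OpenNLP-style
    dictionary XML document (schema approximated from abb.xml:
    a <dictionary> root holding one <entry>/<token> per word)."""
    root = ET.Element("dictionary", attrib={"case_sensitive": "false"})
    for word in entries:
        entry = ET.SubElement(root, "entry")
        token = ET.SubElement(entry, "token")
        token.text = word
    return ET.tostring(root, encoding="unicode")
```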
On Tue, Apr 10, 2012 at 6:31 AM, Jim - FooBar(); <ji...@gmail.com> wrote:
> To train models of any type you need training data... The pretrained
> English tokenizer was trained on the CoNLL shared task, if I remember
> correctly... Maybe one of the developers can shed some light on
> this... Anyway, I don't think you need a dictionary, but training data
> of the following form:
>
> Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a
> nonexecutive director Nov. 29<SPLIT>.
> Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing
> group<SPLIT>.
> Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated
> Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British
> industrial conglomerate<SPLIT>.
>
> Hope that helps,
>
> Jim
>
> P.S.: Did you mean an abbreviation dictionary? Well, you can't really
> train a model using an abbreviation dictionary...
>
>
> On 10/04/12 09:02, Joan Codina wrote:
>
>>
>> I sent this some days ago, but I got no answer :-((
>>
>> To train a tokenizer I can use a dictionary, but
>> where is the dictionary used to train the current English model? And
>> where can I find information about the dictionary format, so I can, at
>> least, generate my own?
>>
>> thanks
>> Joan Codina
>>
>>
>
Re: abbreviation dictionary format
Posted by "Jim - FooBar();" <ji...@gmail.com>.
To train models of any type you need training data... The pretrained
English tokenizer was trained on the CoNLL shared task, if I remember
correctly... Maybe one of the developers can shed some light on
this... Anyway, I don't think you need a dictionary, but training data of
the following form:
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a
nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing
group<SPLIT>.
Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated
Gold Fields PLC<SPLIT>, was named a nonexecutive director of this
British industrial conglomerate<SPLIT>.
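Such training lines can be derived mechanically from a tokenized sentence plus its raw text: wherever two adjacent tokens touch without whitespace in the raw text, a <SPLIT> marker is inserted. A rough sketch (not an OpenNLP tool; token lookup here is naive substring matching):

```python
def to_training_line(raw, tokens):
    """Mark token boundaries that carry no whitespace in `raw`
    with <SPLIT>, as the tokenizer training format expects."""
    out = tokens[0]
    pos = raw.index(tokens[0]) + len(tokens[0])
    for tok in tokens[1:]:
        nxt = raw.index(tok, pos)              # naive: first match after pos
        out += ("<SPLIT>" if nxt == pos else " ") + tok
        pos = nxt + len(tok)
    return out
```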
Hope that helps,
Jim
P.S.: Did you mean an abbreviation dictionary? Well, you can't really
train a model using an abbreviation dictionary...
On 10/04/12 09:02, Joan Codina wrote:
>
> I sent this some days ago, but I got no answer :-((
>
> To train a tokenizer I can use a dictionary, but
> where is the dictionary used to train the current English model? And
> where can I find information about the dictionary format, so I can,
> at least, generate my own?
>
> thanks
> Joan Codina
>
abbreviation dictionary format
Posted by Joan Codina <Jo...@upf.edu>.
I sent this some days ago, but I got no answer :-((
To train a tokenizer I can use a dictionary, but
where is the dictionary used to train the current English model? And
where can I find information about the dictionary format, so I can,
at least, generate my own?
thanks
Joan Codina