Posted to users@opennlp.apache.org by "Ryan L. Sun" <li...@gmail.com> on 2011/11/16 19:44:36 UTC

English word splitting with opennlp?

Hi all,

I'm facing a problem splitting concatenated English text, more
specifically, domain names.
For example:
boysandgirls.com -> boy(s)|and|girl(s)|.com
haveaniceday.net -> have|a|nice|day|.net

Can I use OpenNLP to do this? I checked the OpenNLP documentation and
it looks like the "Learnable Tokenizer" is promising, but I couldn't
get it to work.
Any help is appreciated.

Re: English word splitting with opennlp?

Posted by Alexandre Patry <al...@keatext.com>.
Peter Norvig gave an excellent presentation that covers one solution to
this problem. You can watch it
(http://videolectures.net/cikm08_norvig_slatuad/) starting from the slide
"Text Data".

Hope this helps,

Alexandre

On 11-11-16 01:44 PM, Ryan L. Sun wrote:
> Hi all,
>
> I'm facing a problem splitting concatenated English text, more
> specifically, domain names.
> For example:
> boysandgirls.com -> boy(s)|and|girl(s)|.com
> haveaniceday.net -> have|a|nice|day|.net
>
> Can I use OpenNLP to do this? I checked the OpenNLP documentation and
> it looks like the "Learnable Tokenizer" is promising, but I couldn't
> get it to work.
> Any help is appreciated.


-- 
Alexandre Patry
Research Engineer (Ingénieur-Chercheur)
http://KeaText.com

>>  Transformez vos documents en outils de décision
<<  Turn your documents into decision tools


Re: Re: English word splitting with opennlp?

Posted by li...@gmail.com.
Thanks a lot everyone, it's working for me now.

On , Jörn Kottmann <ko...@gmail.com> wrote:
> On 11/16/11 8:26 PM, lishengs@gmail.com wrote:
>> That's how I thought the "Learnable Tokenizer" works, but it
>> doesn't work for some reason.
>> What I did:
>> 1) edit a test.train file with the following content:
>> boys<SPLIT>and<SPLIT>girls.
>> boys<SPLIT>and<SPLIT>girls.
>> boys<SPLIT>and<SPLIT>girls.
>> ... repeat 30 times ...
>>
>> 2) train a model by:
>> bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt
>> -data test.train -model test.bin
>>
>> 3) evaluate the model by:
>> echo "boysandgirls" | bin/opennlp TokenizerME test.bin
>>
>> The result I got:
>> ------------------------------------------------------------------------
>> Loading Tokenizer model ... done (0.019s)
>> boysandgirls
>>
>> Average: 500.0 sent/s
>> Total: 1 sent
>> Runtime: 0.0020s
>> ------------------------------------------------------------------------
>>
>> So the text is still not segmented into words.
>> Any thoughts?
>
> You shouldn't repeat your training data, since you don't
> add any information by doing that. Instead you should either manually
> label such data for at least a few hundred domains, or construct
> it out of tokenized text, or try an approach as suggested by Aliaksandr.
>
> The reason it doesn't split boysandgirls is that you enabled the
> alphanumeric optimization, which is a performance optimization. It skips
> the processing of whitespace-separated strings that contain only letters.
>
> If you disable it, the model will decide for each character in your test
> string whether it is a valid split point or not (except the last one).
>
> Jörn


Re: English word splitting with opennlp?

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/16/11 8:26 PM, lishengs@gmail.com wrote:
> That's how I thought the "Learnable Tokenizer" works, but it
> doesn't work for some reason.
> What I did:
> 1) edit a test.train file with the following content:
> boys<SPLIT>and<SPLIT>girls.
> boys<SPLIT>and<SPLIT>girls.
> boys<SPLIT>and<SPLIT>girls.
> ... repeat 30 times ...
>
> 2) train a model by:
> bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt
> -data test.train -model test.bin
>
> 3) evaluate the model by:
> echo "boysandgirls" | bin/opennlp TokenizerME test.bin
>
> The result I got:
> ------------------------------------------------------------------------
> Loading Tokenizer model ... done (0.019s)
> boysandgirls
>
>
> Average: 500.0 sent/s
> Total: 1 sent
> Runtime: 0.0020s
> ------------------------------------------------------------------------
>
> So the text is still not segmented into words.
> Any thoughts?

You shouldn't repeat your training data, since you don't
add any information by doing that. Instead you should either manually
label such data for at least a few hundred domains, or construct
it out of tokenized text, or try an approach as suggested by Aliaksandr.

The reason it doesn't split boysandgirls is that you enabled the
alphanumeric optimization, which is a performance optimization. It skips
the processing of whitespace-separated strings that contain only letters.

If you disable it, the model will decide for each character in your test
string whether it is a valid split point or not (except the last one).
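
If -alphaNumOpt acts as a plain on/off switch in this release, so that
omitting it leaves the optimization disabled (run bin/opennlp
TokenizerTrainer with no arguments to confirm the exact usage of your
version), the retry would simply be:

bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -data test.train -model test.bin
echo "boysandgirls" | bin/opennlp TokenizerME test.bin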

Jörn


Re: English word splitting with opennlp?

Posted by Jörn Kottmann <ko...@gmail.com>.
It will train a model that does not split off the trailing dot.
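
One quick sanity check along these lines is to test with the same trailing
dot as in the training data; note that the dot also makes the input no
longer letters-only, so the alphanumeric optimization discussed elsewhere
in this thread would not skip it:

echo "boysandgirls." | bin/opennlp TokenizerME test.bin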

Jörn

On 11/16/11 8:31 PM, Alexandre Patry wrote:
> I do not know if it might affect your results, but there is a trailing 
> dot in your training instance that is not in your test instance.
>
> On 11-11-16 02:26 PM, lishengs@gmail.com wrote:
>> That's how I thought the "Learnable Tokenizer" works, but it
>> doesn't work for some reason.
>> What I did:
>> 1) edit a test.train file with the following content:
>> boys<SPLIT>and<SPLIT>girls.
>> boys<SPLIT>and<SPLIT>girls.
>> boys<SPLIT>and<SPLIT>girls.
>> ... repeat 30 times ...
>>
>> 2) train a model by:
>> bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt
>> -data test.train -model test.bin
>>
>> 3) evaluate the model by:
>> echo "boysandgirls" | bin/opennlp TokenizerME test.bin
>>
>> The result I got:
>> ------------------------------------------------------------------------
>> Loading Tokenizer model ... done (0.019s)
>> boysandgirls
>>
>>
>> Average: 500.0 sent/s
>> Total: 1 sent
>> Runtime: 0.0020s
>> ------------------------------------------------------------------------
>>
>> So the text is still not segmented into words.
>> Any thoughts?
>>
>> On , Jörn Kottmann <ko...@gmail.com> wrote:
>>> The spaces are only used to cut words. This he has already done
>>> by just looking at one domain at a time.
>>>
>>> To go back to his sample:
>>> boysandgirls.com
>>>
>>> This you can easily turn into training data like this:
>>> boys<SPLIT>and<SPLIT>girls<SPLIT>.<SPLIT>com
>>>
>>> I would try to get a good amount of English text,
>>> perform tokenization on it, and then just assume every
>>> token is written together without a space in between.
>>> Then you should be able to generate training strings like
>>> the one above. The TLD can easily be attached randomly.
>>>
>>> I guess that might already work well.
>>>
>>> To evaluate it, you should make a file with real domains
>>> and split them manually. The tokenizer has an evaluator
>>> which can calculate for you how accurate it is.
>>>
>>> Hope this helps,
>>> Jörn
>>>
>>> On 11/16/11 8:02 PM, Aliaksandr Autayeu wrote:
>>>> Hi Ryan,
>>>>
>>>> The learnable tokenizer is trained on standard text, where words are
>>>> separated by a fair number of spaces. Your data looks different, and
>>>> one way to tackle it is to tag a fair number of samples, creating your
>>>> own corpus, and then train a model on it. Tagging might take some time,
>>>> though. Another approach might be to use a dictionary, like WordNet,
>>>> and look up potential tokens there. A fairly simple approach might be
>>>> to start from an empty string, add to it character by character, and
>>>> look it up in WordNet. If the lookup returns something, make that
>>>> string a token and start again from an empty string. The suffixes
>>>> (.com, .net, etc.) are well known and can be cut off. With this
>>>> approach you'll encounter difficulties with something like
>>>> "hotelchain": "hot" is a word and is present in WordNet. Well, these
>>>> might not be the only approaches out there; this is just what came to
>>>> mind quickly.
>>>>
>>>> Aliaksandr
>>>>
>>>> On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. Sun <lishengs@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm facing a problem splitting concatenated English text, more
>>>>> specifically, domain names.
>>>>> For example:
>>>>> boysandgirls.com -> boy(s)|and|girl(s)|.com
>>>>> haveaniceday.net -> have|a|nice|day|.net
>>>>>
>>>>> Can I use OpenNLP to do this? I checked the OpenNLP documentation and
>>>>> it looks like the "Learnable Tokenizer" is promising, but I couldn't
>>>>> get it to work.
>>>>> Any help is appreciated.


Re: English word splitting with opennlp?

Posted by Alexandre Patry <al...@keatext.com>.
I do not know if it might affect your results, but there is a trailing 
dot in your training instance that is not in your test instance.

On 11-11-16 02:26 PM, lishengs@gmail.com wrote:
> That's how I thought the "Learnable Tokenizer" works, but it
> doesn't work for some reason.
> What I did:
> 1) edit a test.train file with the following content:
> boys<SPLIT>and<SPLIT>girls.
> boys<SPLIT>and<SPLIT>girls.
> boys<SPLIT>and<SPLIT>girls.
> ... repeat 30 times ...
>
> 2) train a model by:
> bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt
> -data test.train -model test.bin
>
> 3) evaluate the model by:
> echo "boysandgirls" | bin/opennlp TokenizerME test.bin
>
> The result I got:
> ------------------------------------------------------------------------
> Loading Tokenizer model ... done (0.019s)
> boysandgirls
>
>
> Average: 500.0 sent/s
> Total: 1 sent
> Runtime: 0.0020s
> ------------------------------------------------------------------------
>
> So the text is still not segmented into words.
> Any thoughts?
>
> On , Jörn Kottmann <ko...@gmail.com> wrote:
>> The spaces are only used to cut words. This he has already done
>> by just looking at one domain at a time.
>>
>> To go back to his sample:
>> boysandgirls.com
>>
>> This you can easily turn into training data like this:
>> boys<SPLIT>and<SPLIT>girls<SPLIT>.<SPLIT>com
>>
>> I would try to get a good amount of English text,
>> perform tokenization on it, and then just assume every
>> token is written together without a space in between.
>> Then you should be able to generate training strings like
>> the one above. The TLD can easily be attached randomly.
>>
>> I guess that might already work well.
>>
>> To evaluate it, you should make a file with real domains
>> and split them manually. The tokenizer has an evaluator
>> which can calculate for you how accurate it is.
>>
>> Hope this helps,
>> Jörn
>>
>> On 11/16/11 8:02 PM, Aliaksandr Autayeu wrote:
>>> Hi Ryan,
>>>
>>> The learnable tokenizer is trained on standard text, where words are
>>> separated by a fair number of spaces. Your data looks different, and
>>> one way to tackle it is to tag a fair number of samples, creating your
>>> own corpus, and then train a model on it. Tagging might take some time,
>>> though. Another approach might be to use a dictionary, like WordNet,
>>> and look up potential tokens there. A fairly simple approach might be
>>> to start from an empty string, add to it character by character, and
>>> look it up in WordNet. If the lookup returns something, make that
>>> string a token and start again from an empty string. The suffixes
>>> (.com, .net, etc.) are well known and can be cut off. With this
>>> approach you'll encounter difficulties with something like
>>> "hotelchain": "hot" is a word and is present in WordNet. Well, these
>>> might not be the only approaches out there; this is just what came to
>>> mind quickly.
>>>
>>> Aliaksandr
>>>
>>> On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. Sun <lishengs@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm facing a problem splitting concatenated English text, more
>>>> specifically, domain names.
>>>> For example:
>>>> boysandgirls.com -> boy(s)|and|girl(s)|.com
>>>> haveaniceday.net -> have|a|nice|day|.net
>>>>
>>>> Can I use OpenNLP to do this? I checked the OpenNLP documentation and
>>>> it looks like the "Learnable Tokenizer" is promising, but I couldn't
>>>> get it to work.
>>>> Any help is appreciated.


-- 
Alexandre Patry
Research Engineer (Ingénieur-Chercheur)
http://KeaText.com

>>  Transformez vos documents en outils de décision
<<  Turn your documents into decision tools


Re: Re: English word splitting with opennlp?

Posted by li...@gmail.com.
That's how I thought the "Learnable Tokenizer" works, but it doesn't
work for some reason.
What I did:
1) edit a test.train file with the following content:
boys<SPLIT>and<SPLIT>girls.
boys<SPLIT>and<SPLIT>girls.
boys<SPLIT>and<SPLIT>girls.
... repeat 30 times ...

2) train a model by:
bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt -data  
test.train -model test.bin

3) evaluate the model by:
echo "boysandgirls" | bin/opennlp TokenizerME test.bin

The result I got:
------------------------------------------------------------------------
Loading Tokenizer model ... done (0.019s)
boysandgirls


Average: 500.0 sent/s
Total: 1 sent
Runtime: 0.0020s
------------------------------------------------------------------------

So the text is still not segmented into words.
Any thoughts?

On , Jörn Kottmann <ko...@gmail.com> wrote:
> The spaces are only used to cut words. This he has already done
> by just looking at one domain at a time.
>
> To go back to his sample:
> boysandgirls.com
>
> This you can easily turn into training data like this:
> boys<SPLIT>and<SPLIT>girls<SPLIT>.<SPLIT>com
>
> I would try to get a good amount of English text,
> perform tokenization on it, and then just assume every
> token is written together without a space in between.
> Then you should be able to generate training strings like
> the one above. The TLD can easily be attached randomly.
>
> I guess that might already work well.
>
> To evaluate it, you should make a file with real domains
> and split them manually. The tokenizer has an evaluator
> which can calculate for you how accurate it is.
>
> Hope this helps,
> Jörn
>
> On 11/16/11 8:02 PM, Aliaksandr Autayeu wrote:
>> Hi Ryan,
>>
>> The learnable tokenizer is trained on standard text, where words are
>> separated by a fair number of spaces. Your data looks different, and
>> one way to tackle it is to tag a fair number of samples, creating your
>> own corpus, and then train a model on it. Tagging might take some time,
>> though. Another approach might be to use a dictionary, like WordNet,
>> and look up potential tokens there. A fairly simple approach might be
>> to start from an empty string, add to it character by character, and
>> look it up in WordNet. If the lookup returns something, make that
>> string a token and start again from an empty string. The suffixes
>> (.com, .net, etc.) are well known and can be cut off. With this
>> approach you'll encounter difficulties with something like
>> "hotelchain": "hot" is a word and is present in WordNet. Well, these
>> might not be the only approaches out there; this is just what came to
>> mind quickly.
>>
>> Aliaksandr
>>
>> On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. Sun <lishengs@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm facing a problem splitting concatenated English text, more
>>> specifically, domain names.
>>> For example:
>>> boysandgirls.com -> boy(s)|and|girl(s)|.com
>>> haveaniceday.net -> have|a|nice|day|.net
>>>
>>> Can I use OpenNLP to do this? I checked the OpenNLP documentation and
>>> it looks like the "Learnable Tokenizer" is promising, but I couldn't
>>> get it to work.
>>> Any help is appreciated.


Re: English word splitting with opennlp?

Posted by Jörn Kottmann <ko...@gmail.com>.
The spaces are only used to cut words. This he has already done
by just looking at one domain at a time.

To go back to his sample:
boysandgirls.com

This you can easily turn into training data like this:
boys<SPLIT>and<SPLIT>girls<SPLIT>.<SPLIT>com

I would try to get a good amount of English text,
perform tokenization on it, and then just assume every
token is written together without a space in between.
Then you should be able to generate training strings like
the one above. The TLD can easily be attached randomly.
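
A minimal sketch of that generation step, assuming a file tokenized.txt
with one tokenized sentence per line and whitespace-separated tokens (the
file names and the class are illustrative, not OpenNLP API):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

public class DomainTrainingDataGenerator {

    private static final String[] TLDS = {"com", "net", "org"};

    public static void main(String[] args) throws IOException {
        Random rnd = new Random();
        try (BufferedReader in = Files.newBufferedReader(
                 Paths.get("tokenized.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(
                 Paths.get("domains.train"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] tokens = line.trim().toLowerCase().split("\\s+");
                // Only letters-only tokens make plausible domain parts.
                if (tokens.length < 2 || !allAlphabetic(tokens)) continue;
                // Join the tokens without spaces, marking every removed
                // space with <SPLIT>, and attach a random TLD at the end,
                // in the format of the sample above.
                StringBuilder sample = new StringBuilder();
                for (int i = 0; i < tokens.length; i++) {
                    if (i > 0) sample.append("<SPLIT>");
                    sample.append(tokens[i]);
                }
                sample.append("<SPLIT>.<SPLIT>").append(TLDS[rnd.nextInt(TLDS.length)]);
                out.write(sample.toString());
                out.newLine();
            }
        }
    }

    private static boolean allAlphabetic(String[] tokens) {
        for (String t : tokens) {
            if (!t.chars().allMatch(Character::isLetter)) return false;
        }
        return true;
    }
}

The resulting domains.train can then be passed to TokenizerTrainer via
-data, just like test.train in the commands earlier in this thread.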

I guess that might already work well.

To evaluate it, you should make a file with real domains
and split them manually. The tokenizer has an evaluator
which can calculate for you how accurate it is.
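
Assuming a manually split held-out file in the same <SPLIT> format (here
called domains.eval), that check could look like the line below; the tool
name TokenizerMEEvaluator matches the 1.5.x CLI, but run bin/opennlp
without arguments to see what your release actually ships:

bin/opennlp TokenizerMEEvaluator -encoding UTF-8 -model test.bin -data domains.eval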

Hope this helps,
Jörn


On 11/16/11 8:02 PM, Aliaksandr Autayeu wrote:
> Hi Ryan,
>
> The learnable tokenizer is trained on standard text, where words are
> separated by a fair number of spaces. Your data looks different, and
> one way to tackle it is to tag a fair number of samples, creating your
> own corpus, and then train a model on it. Tagging might take some time,
> though. Another approach might be to use a dictionary, like WordNet,
> and look up potential tokens there. A fairly simple approach might be
> to start from an empty string, add to it character by character, and
> look it up in WordNet. If the lookup returns something, make that
> string a token and start again from an empty string. The suffixes
> (.com, .net, etc.) are well known and can be cut off. With this
> approach you'll encounter difficulties with something like
> "hotelchain": "hot" is a word and is present in WordNet. Well, these
> might not be the only approaches out there; this is just what came to
> mind quickly.
>
> Aliaksandr
>
> On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. Sun <li...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm facing a problem splitting concatenated English text, more
>> specifically, domain names.
>> For example:
>> boysandgirls.com -> boy(s)|and|girl(s)|.com
>> haveaniceday.net -> have|a|nice|day|.net
>>
>> Can I use OpenNLP to do this? I checked the OpenNLP documentation and
>> it looks like the "Learnable Tokenizer" is promising, but I couldn't
>> get it to work.
>> Any help is appreciated.
>>


Re: English word splitting with opennlp?

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
Hi Ryan,

The learnable tokenizer is trained on standard text, where words are
separated by a fair number of spaces. Your data looks different, and one
way to tackle it is to tag a fair number of samples, creating your own
corpus, and then train a model on it. Tagging might take some time,
though. Another approach might be to use a dictionary, like WordNet, and
look up potential tokens there. A fairly simple approach might be to
start from an empty string, add to it character by character, and look
it up in WordNet. If the lookup returns something, make that string a
token and start again from an empty string. The suffixes (.com, .net,
etc.) are well known and can be cut off. With this approach you'll
encounter difficulties with something like "hotelchain": "hot" is a word
and is present in WordNet. Well, these might not be the only approaches
out there; this is just what came to mind quickly.
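
A minimal sketch of that greedy lookup, with a plain in-memory word set
standing in for real WordNet queries (the class name and the tiny
dictionary are illustrative only):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreedySegmenter {

    private final Set<String> dictionary; // stand-in for WordNet lookups

    public GreedySegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Grow a candidate character by character; emit a token as soon as
    // the candidate is found in the dictionary, then start over.
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder candidate = new StringBuilder();
        for (char c : text.toCharArray()) {
            candidate.append(c);
            if (dictionary.contains(candidate.toString())) {
                tokens.add(candidate.toString());
                candidate.setLength(0);
            }
        }
        if (candidate.length() > 0) tokens.add(candidate.toString()); // leftover
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "have", "a", "nice", "day", "hot", "hotel", "chain"));
        GreedySegmenter s = new GreedySegmenter(dict);
        System.out.println(s.segment("haveaniceday")); // [have, a, nice, day]
        System.out.println(s.segment("hotelchain"));   // [hot, elchain]
    }
}

The second call shows exactly the failure mode described above: the greedy
match commits to "hot" before it can ever see "hotel". A probabilistic
segmenter like the Norvig-style one sketched earlier in this thread avoids
that by comparing whole segmentations instead of committing word by word.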

Aliaksandr

On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. Sun <li...@gmail.com> wrote:

> Hi all,
>
> I'm facing a problem splitting concatenated English text, more
> specifically, domain names.
> For example:
> boysandgirls.com -> boy(s)|and|girl(s)|.com
> haveaniceday.net -> have|a|nice|day|.net
>
> Can I use OpenNLP to do this? I checked the OpenNLP documentation and
> it looks like the "Learnable Tokenizer" is promising, but I couldn't
> get it to work.
> Any help is appreciated.
>