Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2006/09/19 18:21:55 UTC

Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?  I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input one character at a time, looking for a word match in the dictionary after each processed character.  Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as a set of tokens at the same position.  However, somehow this doesn't strike me as a very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to hear about it.
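
For concreteness, a minimal sketch of that dictionary-scan idea in plain Java (the class and method names are made up for illustration; this is not an existing Lucene analyzer):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch: walk the compound one character at a time and collect
// every dictionary word that starts at the current offset. The sub-words
// would then be emitted as extra tokens at the same position as the compound.
public class NaiveCompoundSplitter {

    private final Set<String> dictionary;

    public NaiveCompoundSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    public List<String> split(String compound) {
        List<String> parts = new ArrayList<String>();
        String lower = compound.toLowerCase();
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }
}

With "compound", "word", "like" and "this" in the dictionary, split("CompoundWordLikeThis") returns exactly those four parts; with a real German dictionary it would also emit spurious sub-words (e.g. "und" inside "compound"), and it costs O(n^2) substring lookups per token, which is why it feels crude.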

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote:

> How do people typically analyze/tokenize text with compounds (e.g.  
> German)?  I took a look at GermanAnalyzer hoping to see how one can  
> deal with that, but it turns out GermanAnalyzer doesn't treat  
> compounds in any special way at all.
>
> One way to go about this is to have a word dictionary and a  
> tokenizer that processes input one character at a time, looking for  
> a word match in the dictionary after each processed character.
> Then, CompoundWordLikeThis could be broken down into multiple
> tokens/words and returned as a set of tokens at the same position.
> However, somehow this doesn't strike me as a very smart and fast  
> approach.

This came up on the KinoSearch list a few weeks ago, and the best
solution I could think of used essentially the same algorithm you
describe.

During the discussion, we found this:

http://www.glue.umd.edu/~oard/courses/708a/fall01/838/P2/

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by karl wettin <ka...@gmail.com>.
On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote:
> 
> How do people typically analyze/tokenize text with compounds (e.g.
> German)?  I took a look at GermanAnalyzer hoping to see how one can
> deal with that, but it turns out GermanAnalyzer doesn't treat
> compounds in any special way at all.

I've been looking closely at this, but for Swedish. The major problem in
that case is that a compound word generally has totally different
semantics compared to its parts. Here is a classic school example:

"En brun hårig sjuk sköterska" 
A brown, hairy, sick caretaker

"En brunhårig sjuksköterska"
A brunette nurse


Thus it is not very helpful to index the compound parts by themselves.
It is really a problem to be handled by a spell checker. So I wrote and
posted Jira issue 626, an adaptive, session-analysing spell checker.
It makes recommendations based on how previous users changed their
queries: davinci -> da vinci, heroes iii -> heroes 3, and so on.

This strategy does, however, require quite a lot of user traffic.
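
A minimal sketch of the recommendation idea, assuming an in-memory log of query rewrites (the names are illustrative; this is not the code attached to the Jira issue):

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: remember how users rewrote their own queries within a
// session (e.g. "davinci" -> "da vinci") and recommend the most frequently
// observed rewrite to later users.
public class QueryRewriteRecommender {

    // original query -> (rewritten query -> number of times observed)
    private final Map<String, Map<String, Integer>> rewrites =
            new HashMap<String, Map<String, Integer>>();

    public void recordRewrite(String original, String rewritten) {
        Map<String, Integer> counts = rewrites.get(original);
        if (counts == null) {
            counts = new HashMap<String, Integer>();
            rewrites.put(original, counts);
        }
        Integer seen = counts.get(rewritten);
        counts.put(rewritten, seen == null ? 1 : seen + 1);
    }

    // Returns the most frequently observed rewrite, or null if none is known.
    public String recommend(String query) {
        Map<String, Integer> counts = rewrites.get(query);
        if (counts == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}

recordRewrite would be fed from session logs; once enough users have made the same change, recommend("davinci") returns "da vinci".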


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by Pasquale Imbemba <p....@gmail.com>.
Otis,

I forgot to mention that I make use of Lucene for noun retrieval from 
the lexicon.

Pasquale

Pasquale Imbemba wrote:
> Hi Otis,
>
> I am completing my bachelor thesis at the Free University of Bolzano 
> (www.unibz.it). My project is exactly about what you need: a word 
> splitter for German compound words. Raffaella Bernardi, who is reading 
> in CC, is my supervisor.
> As someone on the Lucene mailing list has already suggested, I have 
> used the lexicon of German nouns extracted from Morphy 
> (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for 
> the splitting algorithm, I have used the one Maarten de Rijke and 
> Christof Monz published in "Shallow Morphological Analysis in 
> Monolingual Information Retrieval for Dutch, German and Italian" 
> (website here <http://www.dcs.qmul.ac.uk/%7Echristof/>, document here 
> <http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>). 
> I did some testing and made minor improvements to it (as I needed to 
> "adjust" it for the solution I was working on) and could send you my 
> thesis paper (still in draft form), which contains statistical data on 
> correctness.
>
> Let me know
> Pasquale
>
> Otis Gospodnetic wrote:
>> Hi,
>>
>> How do people typically analyze/tokenize text with compounds (e.g. 
>> German)?  I took a look at GermanAnalyzer hoping to see how one can 
>> deal with that, but it turns out GermanAnalyzer doesn't treat 
>> compounds in any special way at all.
>>
>> One way to go about this is to have a word dictionary and a tokenizer 
>> that processes input one character at a time, looking for a word 
>> match in the dictionary after each processed character.  Then, 
>> CompoundWordLikeThis could be broken down into multiple tokens/words 
>> and returned as a set of tokens at the same position.  However, 
>> somehow this doesn't strike me as a very smart and fast approach.
>> What are some better approaches?
>> If anyone has implemented anything that deals with this problem, I'd 
>> love to hear about it.
>>
>> Thanks,
>> Otis
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>   
>

-- 
"As far as the laws of mathematics refer to reality, they are not certain, as far as they are certain, they do not refer to reality."

(Albert Einstein)


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by Pasquale Imbemba <p....@gmail.com>.
Hi Otis,

I am completing my bachelor thesis at the Free University of Bolzano 
(www.unibz.it). My project is exactly about what you need: a word 
splitter for German compound words. Raffaella Bernardi, who is reading in 
CC, is my supervisor.
As someone on the Lucene mailing list has already suggested, I have used 
the lexicon of German nouns extracted from Morphy 
(http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for the 
splitting algorithm, I have used the one Maarten de Rijke and Christof 
Monz published in "Shallow Morphological Analysis in Monolingual 
Information Retrieval for Dutch, German and Italian" (website here 
<http://www.dcs.qmul.ac.uk/%7Echristof/>, document here 
<http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>). 
I did some testing and made minor improvements to it (as I needed to 
"adjust" it for the solution I was working on) and could send you my 
thesis paper (still in draft form), which contains statistical data on 
correctness.

Let me know
Pasquale

Otis Gospodnetic wrote:
> Hi,
>
> How do people typically analyze/tokenize text with compounds (e.g. German)?  I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all.
>
> One way to go about this is to have a word dictionary and a tokenizer that processes input one character at a time, looking for a word match in the dictionary after each processed character.  Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as a set of tokens at the same position.  However, somehow this doesn't strike me as a very smart and fast approach.
> What are some better approaches?
> If anyone has implemented anything that deals with this problem, I'd love to hear about it.
>
> Thanks,
> Otis
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>   

-- 
"As far as the laws of mathematics refer to reality, they are not certain, as far as they are certain, they do not refer to reality."

(Albert Einstein)


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote:

> Writing a decomposer is difficult as you need both a large dictionary
> *without* compounds and a set of rules to avoid splitting at too many
> positions.

Conceptually, how different is the problem of decompounding German  
from tokenizing languages such as Thai and Japanese, where "words"  
are not separated by spaces and may consist of multiple characters?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by Daniel Naber <lu...@danielnaber.de>.
On Tuesday 19 September 2006 22:15, eks dev wrote:

> Daniel Naber did some work with German dictionaries as well, if I
> recall correctly; maybe he has something that helps

The company I work for offers a commercial Java component for decomposing 
and lemmatizing German words, see http://demo.intrafind.org/LiSa/ for an 
online demo (sorry, page is in German only).

Writing a decomposer is difficult as you need both a large dictionary 
*without* compounds and a set of rules to avoid splitting at too many 
positions. For those who speak German: write a decomposer and use 
"Kotflügel" to test it :-)

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words (German, Chinese, etc.)

Posted by Bob Carpenter <ca...@alias-i.com>.
eks dev wrote:

> Depends on what you need to do with it: if you only need this as a "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting, then it gets complicated.

This is a very good point.  Stemming for
high recall is much easier than fine-grained
linguistic morphology.

Often the best solution is a combination of a
best guess based on linguistic rules / statistical
models / heuristics and weaker substring
measures.

> For better solutions that would also cover fuzzy errors, contact Bob Carpenter from Alias-I; his SpellChecker can do this rather easily, unfortunately (for us) for money (warning: I am in no relation to Bob or Alias-I at all)...

The implementation we have is a simple character-level
noisy channel model.  We even have a tutorial for
how to do this in Chinese:

http://www.alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html

As pointed out in another thread, this requires a set of
training data consisting of the parts of the German
words.  And you may need to allow things other than
spaces to be dropped in cases of epenthesis (adding
a vowel between words).

It's also possible to bootstrap directly from
raw data, though only for the stemming-for-high-recall
case -- you won't get close to the
true morphology this way.

Just to clarify, our LingPipe license is a dual
royalty-free/commercial license.  Our source is
downloadable online. The royalty free license
is very much like GPL with the added restriction that you
have to make public the data over which you run LingPipe.

- Bob Carpenter
   Alias-i

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by Daniel Naber <lu...@danielnaber.de>.
On Tuesday 19 September 2006 22:41, eks dev wrote:

> Ah, another one: when you strip a suffix, check if the last char of the
> remaining "stem" is "s" (a magic thing in German) and delete it if it is
> not the only letter... Do not ask why, a long unexplained mystery of the
> German language.

This is called "Fugenelement" and there are more characters than just the 
"s", although it might be enough to remove the "s" when trying to detect 
compounds. There are also cases where characters are removed (Wolle + 
Decke => Wolldecke).

Also see http://de.wikipedia.org/wiki/Fugenelement and 
http://en.wikipedia.org/wiki/Epenthesis (which emphasises pronunciation, 
but that's not a good explanation for the existence of these characters in 
German).

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by eks dev <ek...@yahoo.co.uk>.
I just remembered one minor thing that made our life easier: the recursive loop has a primitive
stripEndings() method that removes most of the variable endings (all these -ungs/-ungen/...) before looking up in the SuffixTree. This reduces your dictionary needs dramatically. I think this is partially done in GermanStemmer in Lucene...

Ah, another one: when you strip a suffix, check if the last char of the remaining "stem" is "s" (a magic thing in German) and delete it if it is not the only letter... Do not ask why, a long unexplained mystery of the German language.
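
A minimal sketch of those two normalization tricks (the ending list is a tiny, incomplete example and the names are illustrative, not the original code):

// Illustrative sketch: strip common variable endings and a trailing linking
// "s" (Fugen-s) from a candidate segment before looking it up in the
// dictionary / SuffixTree.
public final class GermanSegmentNormalizer {

    private static final String[] ENDINGS = { "ungen", "ungs", "ung", "en", "e" };

    public static String normalize(String segment) {
        String stem = stripEndings(segment);
        // Drop a trailing "s" unless it is the only remaining letter.
        if (stem.length() > 1 && stem.endsWith("s")) {
            stem = stem.substring(0, stem.length() - 1);
        }
        return stem;
    }

    private static String stripEndings(String word) {
        for (String ending : ENDINGS) {
            if (word.length() > ending.length() && word.endsWith(ending)) {
                return word.substring(0, word.length() - ending.length());
            }
        }
        return word;
    }
}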

This approach works in 99% of cases, and special linguistic tricks are not that relevant for most search situations anyhow. A regular stemmer causes much greater distortion than this.

I must find this code somewhere; I have probably left something out in these emails.


----- Original Message ----
From: eks dev <ek...@yahoo.co.uk>
To: java-user@lucene.apache.org
Sent: Tuesday, 19 September, 2006 10:15:04 PM
Subject: Re: Analysis/tokenization of compound words

Hi Otis,
Depends on what you need to do with it: if you only need this as a "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting, then it gets complicated.

For the first case:
Build a SuffixTree from your dictionary (and hope you have many inflections of German words in your dictionary... feminine, masculine, plural, n-endings, the 4 cases..., e.g. Tanzerin, Tanzer). Find the longest suffix that is in your dictionary, strip it from the end of the original word, and recurse on the remainder... It is fast.

If I remember correctly, there is a SuffixTree implementation somewhere in Lucene's util (not really good for large dictionaries).

Things to be aware of: your recall will drop in the simple fuzzy cases that would normally be found.

- "Balletttänzerin" -> "Ballett" "tänzerin", so if your request does not get split due to typos no chance to find it, e.g. "Ballettänzerim"->"Ballettänzerim"

- You need good dictionary with all inflections (google morphy or something like this to help you generate all forms )

- try to be carefull with short prefix in this case as this leads to totally wrong splitting "umbau"->"um" "bau" (changes emning, and if you have preposition "um" as stopword...)

For better solutions that would also cover fuzzy errors, contact Bob Carpenter from Alias-I; his SpellChecker can do this rather easily, unfortunately (for us) for money (warning: I am in no relation to Bob or Alias-I at all)...

Daniel Naber did some work with German dictionaries as well, if I recall correctly; maybe he has something that helps.

Anyhow, if you opt for the first option, I will try to dig something out of our archives; we did something similar ages ago ("stemming-like" splitting of German words).

Have fun, e.

----- Original Message ----
From: Otis Gospodnetic <ot...@yahoo.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 19 September, 2006 6:21:55 PM
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?  I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input one character at a time, looking for a word match in the dictionary after each processed character.  Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as a set of tokens at the same position.  However, somehow this doesn't strike me as a very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to hear about it.

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by eks dev <ek...@yahoo.co.uk>.
Hi Otis,
Depends on what you need to do with it: if you only need this as a "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting, then it gets complicated.

For the first case:
Build a SuffixTree from your dictionary (and hope you have many inflections of German words in your dictionary... feminine, masculine, plural, n-endings, the 4 cases..., e.g. Tanzerin, Tanzer). Find the longest suffix that is in your dictionary, strip it from the end of the original word, and recurse on the remainder... It is fast.

If I remember correctly, there is a SuffixTree implementation somewhere in Lucene's util (not really good for large dictionaries).
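
A minimal sketch of that recursive longest-suffix idea, using a plain word set where a real SuffixTree would go (the names are illustrative, not the code from our archives):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch: repeatedly take the longest dictionary word that ends
// the remaining string, strip it, and recurse on what is left; fall back to
// the unsplit input when no full decomposition is found.
public class SuffixDecompounder {

    private final Set<String> dictionary;
    private final int minPartLength = 3; // guard against "um"/"bau"-style splits

    public SuffixDecompounder(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    public List<String> decompound(String word) {
        List<String> parts = new ArrayList<String>();
        if (split(word.toLowerCase(), parts)) {
            return parts;
        }
        parts.clear();
        parts.add(word);
        return parts;
    }

    private boolean split(String rest, List<String> parts) {
        if (dictionary.contains(rest)) {
            parts.add(rest);
            return true;
        }
        // Longest suffix first: a small cut leaves a long suffix.
        for (int cut = minPartLength; cut <= rest.length() - minPartLength; cut++) {
            String suffix = rest.substring(cut);
            if (!dictionary.contains(suffix)) {
                continue;
            }
            String remainder = rest.substring(0, cut);
            // Fugen-s: tolerate a linking "s" between the parts.
            if (remainder.length() > 1 && remainder.endsWith("s")) {
                remainder = remainder.substring(0, remainder.length() - 1);
            }
            List<String> head = new ArrayList<String>();
            if (split(remainder, head)) {
                parts.addAll(head);
                parts.add(suffix);
                return true;
            }
        }
        return false;
    }
}

With "ballett" and "tänzerin" in the set, decompound("Balletttänzerin") gives [ballett, tänzerin]; anything that cannot be fully decomposed comes back unchanged.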

Things to be aware of: your recall will drop in the simple fuzzy cases that would normally be found.

- "Balletttänzerin" -> "Ballett" "tänzerin", so if your request does not get split due to typos no chance to find it, e.g. "Ballettänzerim"->"Ballettänzerim"

- You need good dictionary with all inflections (google morphy or something like this to help you generate all forms )

- try to be carefull with short prefix in this case as this leads to totally wrong splitting "umbau"->"um" "bau" (changes emning, and if you have preposition "um" as stopword...)

For better solutions that would also cover fuzzy errors, contact Bob Carpenter from Alias-I; his SpellChecker can do this rather easily, unfortunately (for us) for money (warning: I am in no relation to Bob or Alias-I at all)...

Daniel Naber did some work with German dictionaries as well, if I recall correctly; maybe he has something that helps.

Anyhow, if you opt for the first option, I will try to dig something out of our archives; we did something similar ages ago ("stemming-like" splitting of German words).

Have fun, e.

----- Original Message ----
From: Otis Gospodnetic <ot...@yahoo.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 19 September, 2006 6:21:55 PM
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?  I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input one character at a time, looking for a word match in the dictionary after each processed character.  Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as a set of tokens at the same position.  However, somehow this doesn't strike me as a very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to hear about it.

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analysis/tokenization of compound words

Posted by Jonathan O'Connor <jo...@xcom.de>.
Otis,
I can't offer you any practical advice, but as a student of German, I can
tell you that beginners find it difficult to read German words and split
them properly. The larger your vocabulary, the easier it is. The whole topic
sounds like an AI problem:
A possible algorithm for German (no idea if this would also work for
English or agglutinative languages like Turkish) might be:
1. Search for the whole word in the dictionary. If found, end.
2. Split the word into syllables (this might be another AI project too).
3. Join the syllables together and see if they make words in the
dictionary.
4. If all the syllables are used in known words, then you have success.
5. A heuristic to use is to create words that are as long as possible (see the sketch below the example).

E.g. "Balletttänzerin" (Balletttaenzerin if you can't read umlauts).
Syllables: "Ball", "ett", "taenz", "er", "in"
Joining the syllables, we see that "Ball" is in our dictionary, but
"etttaenzerin", "etttaenzer" , "etttaenz" and "ett" are not. So on we go:
"Ballett" is in our dictionary, and "taenzerin" is also. Note if we went
for the short words first, then we could split it into: Ballett | taenzer |
in.
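
A minimal sketch of that longest-match heuristic, with a plain dictionary lookup standing in for the syllable step and without the backtracking shown in the example (the names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative sketch of steps 1-5: accept the whole word if it is known,
// otherwise repeatedly take the longest dictionary prefix of what remains
// ("create words as long as possible").
public final class LongestPrefixSplitter {

    public static List<String> split(String word, Set<String> dictionary) {
        if (dictionary.contains(word)) {
            return Collections.singletonList(word); // step 1: whole word known
        }
        List<String> parts = new ArrayList<String>();
        String rest = word;
        while (rest.length() > 0) {
            String longest = null;
            for (int end = rest.length(); end >= 1; end--) {
                String prefix = rest.substring(0, end);
                if (dictionary.contains(prefix)) {
                    longest = prefix; // step 5: prefer the longest word
                    break;
                }
            }
            if (longest == null) {
                return Collections.singletonList(word); // give up, keep the word
            }
            parts.add(longest);
            rest = rest.substring(longest.length());
        }
        return parts;
    }
}

With "ball", "ballett" and "taenzerin" in the dictionary, split("balletttaenzerin", dict) picks "ballett" before "ball" and returns [ballett, taenzerin].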

As usual, it's an interesting project with no 100% perfect solution. Best of
luck,
Jonathan O'Connor
XCOM Dublin


                                                                           
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
To: java-user@lucene.apache.org
Date: 19/09/2006 17:21
Subject: Analysis/tokenization of compound words
Please respond to: java-user@lucene.apache.org




Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?
I took a look at GermanAnalyzer hoping to see how one can deal with that,
but it turns out GermanAnalyzer doesn't treat compounds in any special way
at all.

One way to go about this is to have a word dictionary and a tokenizer that
processes input one character at a time, looking for a word match in the
dictionary after each processed character.  Then, CompoundWordLikeThis
could be broken down into multiple tokens/words and returned as a set of
tokens at the same position.  However, somehow this doesn't strike me as a
very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love
to hear about it.

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org







RE: Analysis/tokenization of compound words

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
Aspell has some support for compound words that might be useful to look
at:

http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: peter.binkley@ualberta.ca




 

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Tuesday, September 19, 2006 10:22 AM
To: java-user@lucene.apache.org
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g.
German)?  I took a look at GermanAnalyzer hoping to see how one can deal
with that, but it turns out GermanAnalyzer doesn't treat compounds in
any special way at all.

One way to go about this is to have a word dictionary and a tokenizer
that processes input one character at a time, looking for a word match
in the dictionary after each processed character.  Then,
CompoundWordLikeThis could be broken down into multiple tokens/words and
returned as a set of tokens at the same position.  However, somehow this
doesn't strike me as a very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd
love to hear about it.

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org