Posted to solr-user@lucene.apache.org by Michael Ludwig <m....@epages.com> on 2012/04/12 11:52:48 UTC

Lexical analysis tools for German language data

Given an input of "Windjacke" (probably "wind jacket" in English), I'd
like the code that prepares the data for the index (tokenizer etc) to
understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
would include the "Windjacke" document in its result set.

It appears to me that such an analysis requires a dictionary-backed
approach, which doesn't have to be perfect at all; a list of the most
common 2000 words would probably do the job and fulfil a criterion of
reasonable usefulness.

Do you know of any implementation techniques or working implementations
to do this kind of lexical analysis for German language data? (Or other
languages, for that matter?) What are they, where can I find them?

I'm sure there is something out there (commercial or free) because I've seen
lots of engines grokking German and the way it builds words.

Failing that, what are the proper terms to refer to these techniques so
you can search more successfully?

Michael

AW: Lexical analysis tools for German language data

Posted by Michael Ludwig <m....@epages.com>.
> From: Valeriy Felberg

> If you want that query "jacke" matches a document containing the word
> "windjacke" or "kinderjacke", you could use a custom update processor.
> This processor could search the indexed text for words matching the
> pattern ".*jacke" and inject the word "jacke" into an additional field
> which you can search against. You would need a whole list of possible
> suffixes, of course.

Merci, Valeriy - I agree on the feasibility of such an approach. The
list would likely have to be composed of the most frequently used terms
for your specific domain.
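
For the record, a minimal sketch of what such an update processor could look
like. The class name, the source field "name", the target field "base_terms"
and the three-word suffix list are all made up for illustration; a real list
would come from the shop domain:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SuffixInjectionProcessorFactory extends UpdateRequestProcessorFactory {

  // Domain-specific base words to look for at the end of compounds.
  private static final List<String> BASE_WORDS = Arrays.asList("jacke", "hose", "schuh");

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object name = doc.getFieldValue("name");             // hypothetical source field
        if (name != null) {
          // Split on anything that is not a letter, so umlauts survive.
          for (String token : name.toString().toLowerCase(Locale.GERMAN).split("[^\\p{L}]+")) {
            for (String base : BASE_WORDS) {
              // "windjacke" ends with "jacke" but is not "jacke" itself.
              if (token.endsWith(base) && token.length() > base.length()) {
                doc.addField("base_terms", base);            // hypothetical extra field
              }
            }
          }
        }
        super.processAdd(cmd);
      }
    };
  }
}

The factory would then be registered in an updateRequestProcessorChain in
solrconfig.xml, and the "base_terms" field added to the fields searched.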

In our case, it's things people would buy in shops. Reducing overly
complicated and convoluted product descriptions to proper basic terms -
that would do the job. It's like going to a restaurant boasting fancy
and unintelligible names for the dishes you may order when they are
really just ordinary stuff like pork and potatoes.

Thinking some more about it, giving sufficient boost to the attached
category data might also do the job. That would shift the burden of
supplying proper semantics to the guys doing the categorization.
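
Just to illustrate that boosting idea (field names invented, any dismax-style
parser will do): weighting the curated category field above the free-form
product text means a query for "jacke" can still score documents whose
category says "Jacke" even if the product name only says "Windjacke".

import org.apache.solr.client.solrj.SolrQuery;

public class CategoryBoostSketch {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("jacke");
    q.set("defType", "edismax");
    // Hypothetical fields; the manually curated category gets the biggest weight.
    q.set("qf", "category^4 name^2 description");
    System.out.println(q);   // the parameters that would be sent to Solr
  }
}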

> It would slow down the update process but you don't need to split
> words during search.

> > On 12 Apr 2012, at 11:52, Michael Ludwig wrote:
> >
> >> Given an input of "Windjacke" (probably "wind jacket" in English),
> >> I'd like the code that prepares the data for the index (tokenizer
> >> etc) to understand that this is a "Jacke" ("jacket") so that a
> >> query for "Jacke" would include the "Windjacke" document in its
> >> result set.

A query for "Windjacke" or "Kinderjacke" would probably not have to be
de-specialized to "Jacke" because, well, that's the user input and users
looking for specific things are probably doing so for a reason. If no
matches are found you can still tell them to just broaden their search.

Michael

Re: Lexical analysis tools for German language data

Posted by Valeriy Felberg <va...@gmail.com>.
If you want that query "jacke" matches a document containing the word
"windjacke" or "kinderjacke", you could use a custom update processor.
This processor could search the indexed text for words matching the
pattern ".*jacke" and inject the word "jacke" into an additional field
which you can search against. You would need a whole list of possible
suffixes, of course. It would slow down the update process but you
don't need to split words during search.

Best,
Valeriy

On Thu, Apr 12, 2012 at 12:39 PM, Paul Libbrecht <pa...@hoplahup.net> wrote:
>
> Michael,
>
> I've been on this list and the lucene list for several years and have not found this yet.
> It's been one of the "neglected topics" to my taste.
>
> There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate.
>
> I am convinced there's a way to build the de-compounding word list efficiently from a broad corpus but I have never seen it (and the experts at DFKI I asked also told me they didn't know of one).
>
> paul
>
> On 12 Apr 2012, at 11:52, Michael Ludwig wrote:
>
>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>> like the code that prepares the data for the index (tokenizer etc) to
>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>> would include the "Windjacke" document in its result set.
>>
>> It appears to me that such an analysis requires a dictionary-backed
>> approach, which doesn't have to be perfect at all; a list of the most
>> common 2000 words would probably do the job and fulfil a criterion of
>> reasonable usefulness.
>>
>> Do you know of any implementation techniques or working implementations
>> to do this kind of lexical analysis for German language data? (Or other
>> languages, for that matter?) What are they, where can I find them?
>>
>> I'm sure there is something out there (commercial or free) because I've seen
>> lots of engines grokking German and the way it builds words.
>>
>> Failing that, what are the proper terms to refer to these techniques so
>> you can search more successfully?
>>
>> Michael
>

AW: Lexical analysis tools for German language data

Posted by Michael Ludwig <m....@epages.com>.
> From: Markus Jelsma

> We've done a lot of tests with the HyphenationCompoundWordTokenFilter
> using a FOP XML file generated from TeX for the Dutch language and
> have seen decent results. A bonus was that now some tokens can be
> stemmed properly because not all compounds are listed in the
> dictionary for the HunspellStemFilter.

Thank you for pointing me to these two filter classes.

> It does introduce a recall/precision problem but it at least returns
> results for the many users who do not properly use compounds in
> their search query.

Could you define what the term "recall" should be taken to mean in this
context? I've also encountered it on the BASIStech website. Okay, I
found a definition:

http://en.wikipedia.org/wiki/Precision_and_recall
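
Boiled down to this thread's context:

  precision = relevant documents returned / all documents returned
  recall    = relevant documents returned / all relevant documents in the index

So decompounding raises recall (a query for "Jacke" now also finds the
"Windjacke" documents) at the risk of lowering precision (wrongly split
compounds produce spurious matches).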

Dank je wel!

Michael

Re: Lexical analysis tools for German language data

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

We've done a lot of tests with the HyphenationCompoundWordTokenFilter using a
FOP XML file generated from TeX for the Dutch language and have seen decent
results. A bonus was that now some tokens can be stemmed properly because not 
all compounds are listed in the dictionary for the HunspellStemFilter.
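
Roughly, the wiring looks like this in plain Lucene (rather than the
schema.xml factory); "de.xml" is just a placeholder for whatever TeX-derived
FOP hyphenation grammar you use, and constructors vary a bit between Lucene
versions:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class HyphenationDecompoundSketch {
  public static void main(String[] args) throws Exception {
    // Load the TeX/FOP hyphenation grammar (placeholder file name).
    HyphenationTree hyphenator =
        HyphenationCompoundWordTokenFilter.getHyphenationTree("de.xml");

    Tokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("Windjacke"));

    // Without a dictionary every plausible hyphenation point yields a subword,
    // which is where the recall/precision trade-off below comes from.
    TokenStream stream = new HyphenationCompoundWordTokenFilter(tokenizer, hyphenator);

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term);   // the original token plus subwords such as "jacke"
    }
    stream.end();
    stream.close();
  }
}

In a Solr schema the same thing is normally configured with
solr.HyphenationCompoundWordTokenFilterFactory and its hyphenator attribute
pointing at that XML file.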

It does introduce a recall/precision problem but it at least returns results
for the many users who do not properly use compounds in their search query.

There seems to be a small issue with the filter where minSubwordSize=N yields
subwords of size N-1.

Cheers,

On Thursday 12 April 2012 12:39:44 Paul Libbrecht wrote:
> Michael,
> 
> I've been on this list and the lucene list for several years and have not found
> this yet. It's been one of the "neglected topics" to my taste.
> 
> There is a CompoundAnalyzer but it requires the compounds to be dictionary
> based, as you indicate.
> 
> I am convinced there's a way to build the de-compounding word list efficiently
> from a broad corpus but I have never seen it (and the experts at DFKI I
> asked also told me they didn't know of one).
> 
> paul
> 
> On 12 Apr 2012, at 11:52, Michael Ludwig wrote:
> > Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> > like the code that prepares the data for the index (tokenizer etc) to
> > understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
> > would include the "Windjacke" document in its result set.
> > 
> > It appears to me that such an analysis requires a dictionary-backed
> > approach, which doesn't have to be perfect at all; a list of the most
> > common 2000 words would probably do the job and fulfil a criterion of
> > reasonable usefulness.
> > 
> > Do you know of any implementation techniques or working implementations
> > to do this kind of lexical analysis for German language data? (Or other
> > languages, for that matter?) What are they, where can I find them?
> > 
> > I'm sure there is something out there (commercial or free) because I've seen
> > lots of engines grokking German and the way it builds words.
> > 
> > Failing that, what are the proper terms to refer to these techniques so
> > you can search more successfully?
> > 
> > Michael

-- 
Markus Jelsma - CTO - Openindex

Re: Lexical analysis tools for German language data

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Michael,

I've been on this list and the lucene list for several years and have not found this yet.
It's been one of the "neglected topics" to my taste.

There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate.

I am convinced there's a way to build the de-compounding word list efficiently from a broad corpus but I have never seen it (and the experts at DFKI I asked also told me they didn't know of one).

paul

On 12 Apr 2012, at 11:52, Michael Ludwig wrote:

> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> like the code that prepares the data for the index (tokenizer etc) to
> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
> would include the "Windjacke" document in its result set.
> 
> It appears to me that such an analysis requires a dictionary-backed
> approach, which doesn't have to be perfect at all; a list of the most
> common 2000 words would probably do the job and fulfil a criterion of
> reasonable usefulness.
> 
> Do you know of any implementation techniques or working implementations
> to do this kind of lexical analysis for German language data? (Or other
> languages, for that matter?) What are they, where can I find them?
> 
> I'm sure there is something out there (commercial or free) because I've seen
> lots of engines grokking German and the way it builds words.
> 
> Failing that, what are the proper terms to refer to these techniques so
> you can search more successfully?
> 
> Michael


AW: Lexical analysis tools for German language data

Posted by Michael Ludwig <m....@epages.com>.
> Given an input of "Windjacke" (probably "wind jacket" in English),
> I'd like the code that prepares the data for the index (tokenizer
> etc) to understand that this is a "Jacke" ("jacket") so that a
> query for "Jacke" would include the "Windjacke" document in its
> result set.
> 
> It appears to me that such an analysis requires a dictionary-
> backed approach, which doesn't have to be perfect at all; a list
> of the most common 2000 words would probably do the job and fulfil
> a criterion of reasonable usefulness.

A simple approach would obviously be a word list and a regular
expression. There will, however, be nuts and bolts to take care of.
A more sophisticated and tested approach might be known to you.
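
To make that concrete, a first cut could be nothing more than an alternation
built from the word list (the four words here are a stub); the reluctant,
non-empty prefix is one of those nuts and bolts - it keeps a bare "Jacke" from
being reported as a compound of itself:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixRegexSketch {
  // Case-insensitive alternation of known base words, preceded by at least one letter.
  private static final Pattern COMPOUND =
      Pattern.compile("(?iu)^(\\p{L}+?)(jacke|hose|schuh|baum)$");

  public static void main(String[] args) {
    Matcher m = COMPOUND.matcher("Windjacke");
    if (m.matches()) {
      System.out.println(m.group(2));   // prints "jacke" - the extra term to index
    }
  }
}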

Michael

Re: AW: Lexical analysis tools for German language data

Posted by Markus Jelsma <ma...@openindex.io>.
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote:
> On 12 Apr 2012, at 17:46, Michael Ludwig wrote:
> >> Some compounds probably should not be decompounded, like "Fahrrad"
> >> (fahren/Rad). With a dictionary-based stemmer, you might decide to
> >> avoid decompounding for words in the dictionary.
> > 
> > Good point.
> 
> More or less, Fahrrad is generally abbreviated as Rad.
> (even though Rad can mean wheel and bike)
> 
> >> Note that highlighting gets pretty weird when you are matching only
> >> part of a word.
> > 
> > Guess it'll be weird when you get it wrong, like "Noten" in
> > "Notentriegelung".
> 
> This decomposition should not happen because Noten-triegelung does not have
> a correct second term.
> 
> >> The Basis Technology linguistic analyzers aren't cheap or small, but
> >> they work well.
> > 
> > We will consider our needs and options. Thanks for your thoughts.
> 
> My question remains as to which domain it aims at covering.
> We had such a need for mathematics texts... I would be pleasantly surprised
> if, for example, Differenzen-quotient would be decompounded.

The HyphenationCompoundWordTokenFilter can do those things but those words 
must be listed in the dictionary or you'll get strange results. It still 
yields strange results when it emits tokens that are subwords of a subword.

> 
> paul

-- 
Markus Jelsma - CTO - Openindex

Re: AW: Lexical analysis tools for German language data

Posted by Walter Underwood <wu...@wunderwood.org>.
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote:

> I remember from my linguistics studies that the terminus technicus for
> these is "Fugenmorphem" (interstitial or joint morpheme). 

That is some excellent linguistic jargon. I'll file that with "hapax legomenon".

If you don't highlight, you can get good results with pretty rough analyzers, but highlighting exposes their mistakes, even when those don't affect relevance. For example, you can get good relevance just by indexing bigrams in Chinese, but it looks awful when you highlight them. As soon as you highlight, you need a dictionary-based segmenter.

wunder
--
Walter Underwood
wunder@wunderwood.org




Re: AW: Lexical analysis tools for German language data

Posted by Walter Underwood <wu...@wunderwood.org>.
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote:

> More or less, Fahrrad is generally abbreviated as Rad.
> (even though Rad can mean wheel and bike)

A synonym could handle this, since "fahren" would not be a good match. It is a judgement call, but this seems more like an equivalence "Fahrrad = Rad" than decompounding.
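
In Solr that kind of equivalence is typically a single line in the synonyms.txt
consumed by the synonym filter in the field's analysis chain, e.g.:

Fahrrad, Rad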

wunder
--
Walter Underwood
wunder@wunderwood.org




Re: AW: Lexical analysis tools for German language data

Posted by Paul Libbrecht <pa...@hoplahup.net>.
On 12 Apr 2012, at 17:46, Michael Ludwig wrote:
>> Some compounds probably should not be decompounded, like "Fahrrad"
>> (fahren/Rad). With a dictionary-based stemmer, you might decide to
>> avoid decompounding for words in the dictionary.
> 
> Good point.

More or less, Fahrrad is generally abbreviated as Rad.
(even though Rad can mean wheel and bike)

>> Note that highlighting gets pretty weird when you are matching only
>> part of a word.
> 
> Guess it'll be weird when you get it wrong, like "Noten" in
> "Notentriegelung".

This decomposition should not happen because Noten-triegelung does not have a correct second term.

>> The Basis Technology linguistic analyzers aren't cheap or small, but
>> they work well.
> 
> We will consider our needs and options. Thanks for your thoughts.

My question remains as to which domain it aims at covering.
We had such a need for mathematics texts... I would be pleasantly surprised if, for example, Differenzen-quotient would be decompounded.

paul

AW: Lexical analysis tools for German language data

Posted by Michael Ludwig <m....@epages.com>.
> From: Tomas Zerolo

> > > There can be transformations or inflections, like the "s" in
> > > "Weinachtsbaum" (Weinachten/Baum).
> >
> > I remember from my linguistics studies that the terminus technicus
> > for these is "Fugenmorphem" (interstitial or joint morpheme) [...]
> 
> IANAL (I am not a linguist -- pun intended ;) but I've always read
> that as a genitive. Any pointers?

Admittedly, that's what you'd think, and despite linguistics telling me
otherwise I'd maintain there's some truth in it. For this case, however,
consider: "die Weihnacht" declines like "die Nacht", so:

nom. die Weihnacht
gen. der Weihnacht
dat. der Weihnacht
akk. die Weihnacht

As you can see, there's no "s" to be found anywhere, not even in the
genitive. But my gut feeling, like yours, is that this should indicate
genitive, and I would make a point of well-argued gut feeling being at
least as relevant as formalist analysis.

Michael

Re: Lexical analysis tools for German language data

Posted by Tomas Zerolo <to...@axelspringer.de>.
On Thu, Apr 12, 2012 at 03:46:56PM +0000, Michael Ludwig wrote:
> > From: Walter Underwood
> 
> > German noun decompounding is a little more complicated than it might
> > seem.
> > 
> > There can be transformations or inflections, like the "s" in
> > "Weinachtsbaum" (Weinachten/Baum).
> 
> I remember from my linguistics studies that the terminus technicus for
> these is "Fugenmorphem" (interstitial or joint morpheme) [...]

IANAL (I am not a linguist -- pun intended ;) but I've always read that
as a genitive. Any pointers?

Regards
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zerolo@axelspringer.de
www.axelspringer.de


AW: Lexical analysis tools for German language data

Posted by Michael Ludwig <m....@epages.com>.
> From: Walter Underwood

> German noun decompounding is a little more complicated than it might
> seem.
> 
> There can be transformations or inflections, like the "s" in
> "Weinachtsbaum" (Weinachten/Baum).

I remember from my linguistics studies that the terminus technicus for
these is "Fugenmorphem" (interstitial or joint morpheme). But there's
not many of them - phrased in a regex, it's /e?[ns]/. The Weinachtsbaum
in the example above is from the singular (die Weihnacht), then "s",
then Baum. Still, it's much more complex then, say, English or Italian.

> Internal nouns should be recapitalized, like "Baum" above.

Casing won't matter for indexing, I think. The way I would go about
obtaining stems from compound words is by using a dictionary of stems
and a regex. We'll see how far that'll take us.
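
A first cut at those two ingredients, reusing the joint-morpheme regex from
above (the stem set is a stub, and real data will need more care):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class FugenmorphemSketch {

  // Stub stem dictionary; a real one would hold the couple of thousand most
  // common nouns mentioned earlier in the thread.
  private static final Set<String> STEMS =
      new HashSet<>(Arrays.asList("weihnacht", "baum", "wind", "jacke"));

  // Recover the first component once the final base word is known:
  // "Weihnachtsbaum" minus "baum" -> "weihnachts" -> strip /e?[ns]/ -> "weihnacht".
  static String firstComponent(String compound, String knownBase) {
    String rest = compound.toLowerCase(Locale.GERMAN)
                          .substring(0, compound.length() - knownBase.length());
    if (STEMS.contains(rest)) {
      return rest;                                  // no joint morpheme involved
    }
    String stripped = rest.replaceFirst("e?[ns]$", "");
    return STEMS.contains(stripped) ? stripped : null;
  }

  public static void main(String[] args) {
    System.out.println(firstComponent("Weihnachtsbaum", "baum"));  // weihnacht
    System.out.println(firstComponent("Windjacke", "jacke"));      // wind
  }
}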

> Some compounds probably should not be decompounded, like "Fahrrad"
> (fahren/Rad). With a dictionary-based stemmer, you might decide to
> avoid decompounding for words in the dictionary.

Good point.

> Note that highlighting gets pretty weird when you are matching only
> part of a word.

Guess it'll be weird when you get it wrong, like "Noten" in
"Notentriegelung".

> Luckily, a lot of compounds are simple, and you could well get a
> measurable improvement with a very simple algorithm. There isn't
> anything complicated about compounds like Orgelmusik or
> Netzwerkbetreuer.

Exactly.

> The Basis Technology linguistic analyzers aren't cheap or small, but
> they work well.

We will consider our needs and options. Thanks for your thoughts.

Michael

Re: Lexical analysis tools for German language data

Posted by Walter Underwood <wu...@wunderwood.org>.
German noun decompounding is a little more complicated than it might seem.

There can be transformations or inflections, like the "s" in "Weihnachtsbaum" (Weihnachten/Baum).

Internal nouns should be recapitalized, like "Baum" above.

Some compounds probably should not be decompounded, like "Fahrrad" (fahren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary.

Verbs get more complicated inflections, and might need to be decapitalized, like "fahren" above.

Und so weiter.

Note that highlighting gets pretty weird when you are matching only part of a word.

Luckily, a lot of compounds are simple, and you could well get a measurable improvement with a very simple algorithm. There isn't anything complicated about compounds like Orgelmusik or Netzwerkbetreuer.
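
For those simple cases Lucene already ships a dictionary-based variant; a
rough sketch (the four-word dictionary is only for illustration, and package
names differ slightly across Lucene versions):

import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DictionaryDecompoundSketch {
  public static void main(String[] args) throws Exception {
    CharArraySet dictionary = new CharArraySet(
        Arrays.asList("orgel", "musik", "netzwerk", "betreuer"), true /* ignore case */);

    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("Orgelmusik Netzwerkbetreuer"));

    TokenStream stream = new DictionaryCompoundWordTokenFilter(tokenizer, dictionary);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term);   // each original token followed by its dictionary subwords
    }
    stream.end();
    stream.close();
  }
}

The schema.xml counterpart is solr.DictionaryCompoundWordTokenFilterFactory
with its dictionary attribute pointing at a plain word list.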

The Basis Technology linguistic analyzers aren't cheap or small, but they work well. 

wunder

On Apr 12, 2012, at 3:58 AM, Paul Libbrecht wrote:

> Bernd,
> 
> can you please say a little more?
> I think it is fine for this list to contain some description of commercial solutions that satisfy a request formulated on the list.
> 
> Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes, for which domain? The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's a fair amount of work done in this direction (e.g. Gärten to match Garten) but being precise on this question would be more helpful!
> 
> paul
> 
> 
> On 12 Apr 2012, at 12:46, Bernd Fehling wrote:
> 
>> 
>> You might have a look at:
>> http://www.basistech.com/lucene/
>> 
>> 
>> On 12.04.2012 11:52, Michael Ludwig wrote:
>>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>>> like the code that prepares the data for the index (tokenizer etc) to
>>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>>> would include the "Windjacke" document in its result set.
>>> 
>>> It appears to me that such an analysis requires a dictionary-backed
>>> approach, which doesn't have to be perfect at all; a list of the most
>>> common 2000 words would probably do the job and fulfil a criterion of
>>> reasonable usefulness.
>>> 
>>> Do you know of any implementation techniques or working implementations
>>> to do this kind of lexical analysis for German language data? (Or other
>>> languages, for that matter?) What are they, where can I find them?
>>> 
>>> I'm sure there is something out there (commercial or free) because I've seen
>>> lots of engines grokking German and the way it builds words.
>>> 
>>> Failing that, what are the proper terms to refer to these techniques so
>>> you can search more successfully?
>>> 
>>> Michael





Re: Lexical analysis tools for German language data

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Paul,

Nearly two years ago I requested an evaluation license and tested BASIS Tech
Rosette for Lucene & Solr. It worked excellently, but the price was much, much too high.

Yes, they also have compound analysis for several languages including German.
Just configure your pipeline in Solr, set up the processing pipeline in
Rosette Language Processing (RLP), and that's it.

Example from my very old schema.xml config:

<fieldtype name="text_rlp" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                rlpContext="solr/conf/rlp-index-context.xml"
                postPartOfSpeech="false"
                postLemma="true"
                postStem="true"
                postCompoundComponents="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                rlpContext="solr/conf/rlp-query-context.xml"
                postPartOfSpeech="false"
                postLemma="true"
                postCompoundComponents="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

So you just point the tokenizer to RLP and have two RLP pipelines configured,
one for indexing (rlp-index-context.xml) and one for querying (rlp-query-context.xml).

Example from my rlp-index-context.xml config:

<contextconfig>
  <properties>
    <property name="com.basistech.rex.optimize" value="false"/>
    <property name="com.basistech.ela.retokenize_for_rex" value="true"/>
  </properties>
  <languageprocessors>
    <languageprocessor>Unicode Converter</languageprocessor>
    <languageprocessor>Language Identifier</languageprocessor>
    <languageprocessor>Encoding and Character Normalizer</languageprocessor>
    <languageprocessor>European Language Analyzer</languageprocessor>
<!--    <languageprocessor>Script Region Locator</languageprocessor>
    <languageprocessor>Japanese Language Analyzer</languageprocessor>
    <languageprocessor>Chinese Language Analyzer</languageprocessor>
    <languageprocessor>Korean Language Analyzer</languageprocessor>
    <languageprocessor>Sentence Breaker</languageprocessor>
    <languageprocessor>Word Breaker</languageprocessor>
    <languageprocessor>Arabic Language Analyzer</languageprocessor>
    <languageprocessor>Persian Language Analyzer</languageprocessor>
    <languageprocessor>Urdu Language Analyzer</languageprocessor> -->
    <languageprocessor>Stopword Locator</languageprocessor>
    <languageprocessor>Base Noun Phrase Locator</languageprocessor>
<!--    <languageprocessor>Statistical Entity Extractor</languageprocessor> -->
    <languageprocessor>Exact Match Entity Extractor</languageprocessor>
    <languageprocessor>Pattern Match Entity Extractor</languageprocessor>
    <languageprocessor>Entity Redactor</languageprocessor>
    <languageprocessor>REXML Writer</languageprocessor>
  </languageprocessors>
</contextconfig>

As you can see I used the "European Language Analyzer".

Bernd



On 12.04.2012 12:58, Paul Libbrecht wrote:
> Bernd,
> 
> can you please say a little more?
> I think it is fine for this list to contain some description of commercial solutions that satisfy a request formulated on the list.
> 
> Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? 
> If yes, for which domain? 
> The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's a fair amount
> of work done in this direction (e.g. Gärten to match Garten) but being precise on this question would be more helpful!
> 
> paul
> 
> 

Re: Lexical analysis tools for German language data

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Bernd,

can you please say a little more?
I think it is fine for this list to contain some description of commercial solutions that satisfy a request formulated on the list.

Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes, for which domain? The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's a fair amount of work done in this direction (e.g. Gärten to match Garten) but being precise on this question would be more helpful!

paul


On 12 Apr 2012, at 12:46, Bernd Fehling wrote:

> 
> You might have a look at:
> http://www.basistech.com/lucene/
> 
> 
> On 12.04.2012 11:52, Michael Ludwig wrote:
>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>> like the code that prepares the data for the index (tokenizer etc) to
>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>> would include the "Windjacke" document in its result set.
>> 
>> It appears to me that such an analysis requires a dictionary-backed
>> approach, which doesn't have to be perfect at all; a list of the most
>> common 2000 words would probably do the job and fulfil a criterion of
>> reasonable usefulness.
>> 
>> Do you know of any implementation techniques or working implementations
>> to do this kind of lexical analysis for German language data? (Or other
>> languages, for that matter?) What are they, where can I find them?
>> 
>> I'm sure there is something out there (commercial or free) because I've seen
>> lots of engines grokking German and the way it builds words.
>> 
>> Failing that, what are the proper terms to refer to these techniques so
>> you can search more successfully?
>> 
>> Michael


Re: Lexical analysis tools for German language data

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
You might have a look at:
http://www.basistech.com/lucene/


On 12.04.2012 11:52, Michael Ludwig wrote:
> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> like the code that prepares the data for the index (tokenizer etc) to
> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
> would include the "Windjacke" document in its result set.
> 
> It appears to me that such an analysis requires a dictionary-backed
> approach, which doesn't have to be perfect at all; a list of the most
> common 2000 words would probably do the job and fulfil a criterion of
> reasonable usefulness.
> 
> Do you know of any implementation techniques or working implementations
> to do this kind of lexical analysis for German language data? (Or other
> languages, for that matter?) What are they, where can I find them?
> 
> I'm sure there is something out there (commercial or free) because I've seen
> lots of engines grokking German and the way it builds words.
> 
> Failing that, what are the proper terms to refer to these techniques so
> you can search more successfully?
> 
> Michael