Posted to java-user@lucene.apache.org by Benjamin Douglas <bb...@basistech.com> on 2009/10/21 02:35:19 UTC

Using org.apache.lucene.analysis.compound

Hello,

I've found a number of posts in different places talking about how to perform decompounding, but not many discussing how to use the results of decompounding. If anyone can answer this question or point me to an existing discussion, it would be very helpful.

The description of the org.apache.lucene.analysis.compound package gives the following example:

	Rindfleischüberwachungsgesetz, 0, 29
	Rind, 0, 4, posIncr=0
	fleisch, 4, 11, posIncr=0
	überwachung, 11, 22, posIncr=0
	gesetz, 23, 29, posIncr=0

And I see how this allows me to find single components such as "gesetz" or "Rind". But what if I want to find combinations of components such as "Rindfleisch" or "überwachungsgesetz"? It seems that the pattern of using posIncr=0 for all components rules out finding substrings that are made up of multiple components.
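To make the token stream concrete, here is a small stdlib-only Python sketch. This is not the Lucene API (a real filter wraps a TokenStream); the dictionary contents and the naive substring matching are illustrative assumptions. It emits the same (term, startOffset, endOffset, posIncr) tuples the package example shows:

```python
# Simplified illustration of a dictionary-based compound token filter.
# NOT the Lucene implementation; it only shows the shape of the output.

def decompound(token, dictionary):
    # The original token is emitted first, with the default posIncr=1 ...
    out = [(token, 0, len(token), 1)]
    low = token.lower()
    # ... then each dictionary word found inside it is stacked at the
    # same position, i.e. with posIncr=0.
    for word in dictionary:
        start = low.find(word.lower())
        if start != -1:
            out.append((word.lower(), start, start + len(word), 0))
    return out

dictionary = ["Rind", "Fleisch", "Überwachung", "Gesetz"]
for t in decompound("Rindfleischüberwachungsgesetz", dictionary):
    print(t)
```

Note that with this stacking, a query term like "rindfleisch" matches none of the emitted tokens, which is exactly the limitation described above.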

Any comments or thoughts would be appreciated.

Ben Douglas

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Using org.apache.lucene.analysis.compound

Posted by Robert Muir <rc...@gmail.com>.
there is some information on this topic in the package summary:

http://lucene.apache.org/java/2_9_0/api/contrib-analyzers/org/apache/lucene/analysis/compound/package-summary.html

in short, for a large word list (there is no limit in the code), you will want
to use a hyphenation grammar as well: HyphenationCompoundWordTokenFilter
instead of the brute-force dictionary approach, for better speed.

there is also a pointer there to some dictionaries at OpenOffice; i'd also look
around at spellcheckers and the like elsewhere if you can't find one that fits
your needs.
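The speed argument can be illustrated with a rough back-of-the-envelope sketch in stdlib-only Python. The hyphenation points below are hypothetical, and the real HyphenationCompoundWordTokenFilter differs in detail; the point is only that brute force must test every substring of the token (up to some maximum subword length) against the dictionary, while a hyphenation grammar first narrows the candidate split points:

```python
# Rough cost comparison, NOT the Lucene implementation.

def brute_force_candidates(token, max_subword_len=15):
    """Every (start, end) substring, up to max_subword_len, that a
    brute-force dictionary filter would have to look up."""
    n = len(token)
    return [(i, j) for i in range(n)
            for j in range(i + 1, min(i + max_subword_len, n) + 1)]

def hyphenation_candidates(token, hyphen_points):
    """Only substrings whose endpoints fall on hyphenation points
    (or the token edges) need a dictionary lookup."""
    points = [0] + sorted(hyphen_points) + [len(token)]
    return [(points[a], points[b])
            for a in range(len(points)) for b in range(a + 1, len(points))]

token = "rindfleischüberwachungsgesetz"
hypothetical_points = [4, 11, 22, 23]  # rind|fleisch|überwachung|s|gesetz
print(len(brute_force_candidates(token)))                       # 330 lookups
print(len(hyphenation_candidates(token, hypothetical_points)))  # 15 lookups
```

With a large dictionary backed by fast lookups the constant factors differ, but the gap in candidate counts per token is the intuition behind Robert's advice.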

On Wed, Oct 21, 2009 at 4:19 PM, Paul Libbrecht <pa...@activemath.org> wrote:

> Great,
>
> now the next question: which dictionary do you guys use? How big can it be?
> Is 50000 words acceptable?
>
> paul
>
>
> On 21 Oct 2009 at 21:23, Robert Muir wrote:
>
>
>> Paul, I think in general scoring should take care of this too; it's all about
>> your dictionary, same as the previous example.
>> This is because überwachungsgesetz matches 3 tokens: überwachungsgesetz,
>> überwachung, gesetz,
>> but überwachung gesetz matches only 2.
>>
>> überwachungsgesetz
>> 0.37040412 = (MATCH) sum of:
>>  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>   0.5 = queryWeight(field:überwachungsgesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>     1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.5 = fieldNorm(field=field, doc=0)
>>  0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>>   0.5 = queryWeight(field:überwachung), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>     1.0 = tf(termFreq(field:überwachung)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.5 = fieldNorm(field=field, doc=0)
>>  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>   0.5 = queryWeight(field:überwachungsgesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>     1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.5 = fieldNorm(field=field, doc=0)
>>  0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>>   0.5 = queryWeight(field:gesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>     1.0 = tf(termFreq(field:gesetz)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.5 = fieldNorm(field=field, doc=0)
>>
>> überwachung gesetz
>> 0.30685282 = (MATCH) sum of:
>>  0.15342641 = (MATCH) sum of:
>>   0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>>     0.5 = queryWeight(field:überwachung), product of:
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       1.6294457 = queryNorm
>>     0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>       1.0 = tf(termFreq(field:überwachung)=1)
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       0.5 = fieldNorm(field=field, doc=0)
>>   0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>>     0.5 = queryWeight(field:überwachung), product of:
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       1.6294457 = queryNorm
>>     0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>       1.0 = tf(termFreq(field:überwachung)=1)
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       0.5 = fieldNorm(field=field, doc=0)
>>  0.15342641 = (MATCH) sum of:
>>   0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>>     0.5 = queryWeight(field:gesetz), product of:
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       1.6294457 = queryNorm
>>     0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>       1.0 = tf(termFreq(field:gesetz)=1)
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       0.5 = fieldNorm(field=field, doc=0)
>>   0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>>     0.5 = queryWeight(field:gesetz), product of:
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       1.6294457 = queryNorm
>>     0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>       1.0 = tf(termFreq(field:gesetz)=1)
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       0.5 = fieldNorm(field=field, doc=0)
>>
>> On Wed, Oct 21, 2009 at 3:16 PM, Paul Libbrecht <pa...@activemath.org> wrote:
>>
>>> Can the dictionary have weights?
>>>
>>> überwachungsgesetz alone probably needs a higher rank than überwachung and
>>> gesetz, no?
>>>
>>> paul
>>>
>>>
>>> On 21 Oct 2009 at 21:09, Benjamin Douglas wrote:
>>>
>>>
>>>> OK, that makes sense. So I just need to add all of the sub-compounds that
>>>> are real words at posIncr=0, even if they are combinations of other
>>>> sub-compounds.
>>>>
>>>> Thanks!
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>>> Sent: Wednesday, October 21, 2009 11:49 AM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: Using org.apache.lucene.analysis.compound
>>>>
>>>> yes, your dictionary :)
>>>>
>>>> if überwachungsgesetz is a real word, add it to your dictionary.
>>>>
>>>> for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
>>>> "Gesetz", "Aufgabe", "Überwachung" }, and you index
>>>> Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
>>>> but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
>>>> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
>>>> a big difference.
>>>>
>>>> all 3 queries will still match, but überwachungsgesetz will have a higher
>>>> score. this is because now things are analyzed differently:
>>>> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
>>>> additional token: Überwachungsgesetz.
>>>> so back to your original question: yes, compounding will produce these
>>>> 'concatenations' of multiple components, if they are real words, but it
>>>> won't just make them up.
>>>>
>>>> "überwachungsgesetz"
>>>> 0.23013961 = (MATCH) sum of:
>>>> 0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>>>  0.5 = queryWeight(field:überwachungsgesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   1.6294457 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>>>>  0.5 = queryWeight(field:überwachung), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   1.6294457 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>>>   1.0 = tf(termFreq(field:überwachung)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>>>  0.5 = queryWeight(field:überwachungsgesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   1.6294457 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>>>>  0.5 = queryWeight(field:gesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   1.6294457 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:gesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>>
>>>> "gesetzüberwachung"
>>>> 0.064782135 = (MATCH) sum of:
>>>> 0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>>>>  0.2814906 = queryWeight(field:gesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.9173473 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:gesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>>>>  0.2814906 = queryWeight(field:überwachung), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.9173473 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>>>   1.0 = tf(termFreq(field:überwachung)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>>
>>>> "fleischgesetz"
>>>> 0.064782135 = (MATCH) sum of:
>>>> 0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>>>>  0.2814906 = queryWeight(field:fleisch), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.9173473 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>>>>   1.0 = tf(termFreq(field:fleisch)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>>>>  0.2814906 = queryWeight(field:gesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.9173473 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:gesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas <bb...@basistech.com> wrote:
>>>>
>>>>> Thanks for all of the answers so far!
>>>>>
>>>>> Paul's question is similar to another aspect I am curious about:
>>>>>
>>>>> Given the way the sample word is analyzed, is there anything in the scoring
>>>>> mechanism that would rank "überwachungsgesetz" higher than
>>>>> "gesetzüberwachung" or "fleischgesetz"?
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>> Robert Muir
>>>> rcmuir@gmail.com
>>>>
>>>>
>>>
>>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>


-- 
Robert Muir
rcmuir@gmail.com
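The dictionary-expansion effect Robert describes in the quoted exchange above can be simulated without Lucene. This stdlib-only Python sketch is a simplified assumption, not Lucene's analysis or scoring: it just counts how many analyzed terms a query shares with the indexed compound, which is the quantity that drives the score difference in his explain output (3 overlapping terms versus 2):

```python
# Illustrative simulation: adding the compound itself to the dictionary
# gives the query "überwachungsgesetz" one more indexed term to match.

def analyze(text, dictionary):
    """Very rough stand-in for decompounding analysis: the full token
    plus every dictionary word contained in it, as a set of terms."""
    terms = {text.lower()}
    for word in dictionary:
        if word.lower() in text.lower():
            terms.add(word.lower())
    return terms

small = ["Rind", "Fleisch", "Gesetz", "Überwachung"]
expanded = small + ["Überwachungsgesetz"]

doc = "Rindfleischüberwachungsgesetz"
for dictionary in (small, expanded):
    index_terms = analyze(doc, dictionary)
    overlap = analyze("überwachungsgesetz", dictionary) & index_terms
    print(len(dictionary), "dictionary words ->", len(overlap), "matching query terms")
```

With the small dictionary the query overlaps only on überwachung and gesetz; with Überwachungsgesetz added, the compound itself also matches, giving the third overlapping term and hence the higher score.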

> product
> of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>    0.5 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "gesetzüberwachung"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>    0.2814906 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product  
> of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "fleischgesetz"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>    0.2814906 = queryWeight(field:fleisch), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>      1.0 = tf(termFreq(field:fleisch)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
>
>
>
> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
> <bb...@basistech.com>wrote:
>
>> Thanks for all of the answers so far!
>>
>> Paul's question is similar to another aspect I am curious about:
>>
>> Given the way the sample word is analyzed, is there anything in the  
>> scoring
>> mechanism that would rank "überwachungsgesetz" higher than
>> "gesetzüberwachung" or "fleischgesetz"?
>>
>>
>
> -- 
> Robert Muir
> rcmuir@gmail.com


Re: Using org.apache.lucene.analysis.compound

Posted by Robert Muir <rc...@gmail.com>.
just add them to the dictionary, the compound filter will do this
automatically.

if you want to tweak it even further, you can also tell the compound filter
NOT to emit the subwords if they form a bigger compound, using the
onlyLongestMatch parameter I spoke of earlier.
I haven't played with this option much but I think this is what it's supposed
to do:

if the dictionary is
soft
ball
softball

then "softball" (or compounds containing it) won't emit "soft" and "ball",
because "softball" is in the dictionary and its a longest match.
with the option off, you'd get softball, ball, soft
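The behavior described above can be sketched with a small toy decompounder. This is an illustration only, not Lucene's actual DictionaryCompoundWordTokenFilter (which also honors min/max subword sizes and keeps the original token); it just shows the onlyLongestMatch idea:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy dictionary decompounder illustrating the onlyLongestMatch idea.
// With the flag off, every dictionary word found inside the compound is
// emitted; with it on, only the longest match at each position is kept.
public class ToyLongestMatch {
    public static List<String> decompound(String word, Set<String> dict,
                                          boolean onlyLongestMatch) {
        List<String> out = new ArrayList<>();
        String lower = word.toLowerCase();
        int start = 0;
        while (start < lower.length()) {
            String longest = null;
            for (int end = start + 1; end <= lower.length(); end++) {
                String cand = lower.substring(start, end);
                if (dict.contains(cand)) {
                    if (onlyLongestMatch) {
                        longest = cand;      // remember only the longest match
                    } else {
                        out.add(cand);       // emit every match
                    }
                }
            }
            if (onlyLongestMatch && longest != null) {
                out.add(longest);
                start += longest.length();   // skip past the longest match
            } else {
                start++;
            }
        }
        return out;
    }
}
```

With the dictionary {soft, ball, softball}, decompounding "softball" with the flag on yields only softball; with it off you get soft, softball, and ball, matching the behavior described above.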

On Wed, Oct 21, 2009 at 3:09 PM, Benjamin Douglas
<bb...@basistech.com>wrote:

> OK, that makes sense. So I just need to add all of the sub-compounds that
> are real words at posIncr=0, even if they are combinations of other
> sub-compounds.
>
> Thanks!
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, October 21, 2009 11:49 AM
> To: java-user@lucene.apache.org
> Subject: Re: Using org.apache.lucene.analysis.compound
>
> yes, your dictionary :)
>
> if überwachungsgesetz is a real word, add it to your dictionary.
>
> for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung" }, and you index
> Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
> but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
> a big difference.
>
> all 3 queries will still match, but überwachungsgesetz will have a higher
> score. this is because now things are analyzed differently:
> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
> additional token: Überwachungsgesetz.
> so back to your original question, these 'concatenations' of multiple
> components, yes compounds will do that, if they are real words. but it
> won't
> just make them up.
>
> "überwachungsgesetz"
> 0.23013961 = (MATCH) sum of:
>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
> of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>    0.5 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
> of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>    0.5 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "gesetzüberwachung"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>    0.2814906 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "fleischgesetz"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>    0.2814906 = queryWeight(field:fleisch), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>      1.0 = tf(termFreq(field:fleisch)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
>
>
>
> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
> <bb...@basistech.com>wrote:
>
> > Thanks for all of the answers so far!
> >
> > Paul's question is similar to another aspect I am curious about:
> >
> > Given the way the sample word is analyzed, is there anything in the
> scoring
> > mechanism that would rank "überwachungsgesetz" higher than
> > "gesetzüberwachung" or "fleischgesetz"?
> >
> >
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com

RE: Using org.apache.lucene.analysis.compound

Posted by Benjamin Douglas <bb...@basistech.com>.
OK, that makes sense. So I just need to add all of the sub-compounds that are real words at posIncr=0, even if they are combinations of other sub-compounds.

Thanks!

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Wednesday, October 21, 2009 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: Using org.apache.lucene.analysis.compound

yes, your dictionary :)

if überwachungsgesetz is a real word, add it to your dictionary.

for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Überwachung" }, and you index
Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
a big difference.

all 3 queries will still match, but überwachungsgesetz will have a higher
score. this is because now things are analyzed differently:
Rindfleischüberwachungsgesetz will be decompounded as before, but with an
additional token: Überwachungsgesetz.
so back to your original question, these 'concatenations' of multiple
components, yes compounds will do that, if they are real words. but it won't
just make them up.

"überwachungsgesetz"
0.23013961 = (MATCH) sum of:
  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
of:
      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
    0.5 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
of:
      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
    0.5 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"gesetzüberwachung"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
    0.2814906 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"fleischgesetz"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
    0.2814906 = queryWeight(field:fleisch), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
      1.0 = tf(termFreq(field:fleisch)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)




On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
<bb...@basistech.com>wrote:

> Thanks for all of the answers so far!
>
> Paul's question is similar to another aspect I am curious about:
>
> Given the way the sample word is analyzed, is there anything in the scoring
> mechanism that would rank "überwachungsgesetz" higher than
> "gesetzüberwachung" or "fleischgesetz"?
>
>

-- 
Robert Muir
rcmuir@gmail.com

Re: Using org.apache.lucene.analysis.compound

Posted by Robert Muir <rc...@gmail.com>.
yes, your dictionary :)

if überwachungsgesetz is a real word, add it to your dictionary.

for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Überwachung" }, and you index
Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
a big difference.

all 3 queries will still match, but überwachungsgesetz will have a higher
score. this is because now things are analyzed differently:
Rindfleischüberwachungsgesetz will be decompounded as before, but with an
additional token: Überwachungsgesetz.
so back to your original question, these 'concatenations' of multiple
components, yes compounds will do that, if they are real words. but it won't
just make them up.
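The effect of expanding the dictionary can be sketched with a toy decompounder (an illustration, not Lucene's code): every dictionary word found inside the compound becomes a token, so überwachungsgesetz only shows up once it is a dictionary entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy dictionary decompounder: emit every dictionary word that occurs
// anywhere inside the (lowercased) compound, roughly mimicking how the
// compound filter adds subword tokens.
public class ToyDecompounder {
    public static List<String> decompound(String word, Set<String> dict) {
        List<String> subwords = new ArrayList<>();
        String lower = word.toLowerCase();
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String cand = lower.substring(start, end);
                if (dict.contains(cand)) {
                    subwords.add(cand);   // found a dictionary word inside
                }
            }
        }
        return subwords;
    }
}
```

With {rind, fleisch, überwachung, gesetz} the compound yields exactly those four subwords; add überwachungsgesetz to the dictionary and a fifth token appears as well, which is what gives the exact-compound query its higher score in the explain output below.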

"überwachungsgesetz"
0.23013961 = (MATCH) sum of:
  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
of:
      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
    0.5 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
of:
      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
    0.5 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"gesetzüberwachung"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
    0.2814906 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"fleischgesetz"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
    0.2814906 = queryWeight(field:fleisch), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
      1.0 = tf(termFreq(field:fleisch)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)




On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
<bb...@basistech.com>wrote:

> Thanks for all of the answers so far!
>
> Paul's question is similar to another aspect I am curious about:
>
> Given the way the sample word is analyzed, is there anything in the scoring
> mechanism that would rank "überwachungsgesetz" higher than
> "gesetzüberwachung" or "fleischgesetz"?
>
>

-- 
Robert Muir
rcmuir@gmail.com

RE: Using org.apache.lucene.analysis.compound

Posted by Benjamin Douglas <bb...@basistech.com>.
Thanks for all of the answers so far!

Paul's question is similar to another aspect I am curious about:

Given the way the sample word is analyzed, is there anything in the scoring mechanism that would rank "überwachungsgesetz" higher than "gesetzüberwachung" or "fleischgesetz"?

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Wednesday, October 21, 2009 5:12 AM
To: java-user@lucene.apache.org
Subject: Re: Using org.apache.lucene.analysis.compound

Paul, there are two implementations in compounds, one is dictionary-based,
the other is hyphenation-grammar + dictionary (it restricts the
decompounding based on hyphenation rules). You could also subclass the
compound base class and implement your own.

I haven't seen any user-measures (relevance, etc), would be a cool thing to
see though.

I'm not sure I understand your last question, can you elaborate?
it might be that to improve some cases, you want to use the onlyLongestMatch
parameter:
@param onlyLongestMatch Add only the longest matching subword to the stream

for scoring, I think lucene's scoring might help too, because the original
word, without decompounding, is left as a token so if you search on an exact
match it should be ranked higher. (not sure if this is answering your
question)

On Wed, Oct 21, 2009 at 5:27 AM, Paul Libbrecht <pa...@activemath.org> wrote:

>
> I'm interested in this analyzer.. it had escaped me and solves an old
> problem!
> Could you report on its usage:
> - did you have to feed words into a dictionary?
> - does anyone have user-measures already?
> ... and the last question for the research fun: is there any approach
> towards preferring Überwachungsgesetz as a concept rather than, say,
> Fleischüberwachung? (again, that could be based on a dictionary probably).
>
> thanks in advance
>
> paul
>
>
> On 21 Oct 2009 at 04:00, Robert Muir wrote:
>
>
>  hi, it will work because it will also decompound "Rindfleisch" into Rind
>> and
>> fleisch, with posIncr=0
>>
>> so if you index Rindfleischüberwachungsgesetz, then query with
>> "Rindfleisch",
>> it matches because Rindfleisch also gets decompounded into Rind and
>> fleisch.
>>
>> On Tue, Oct 20, 2009 at 8:35 PM, Benjamin Douglas
>> <bb...@basistech.com>wrote:
>>
>>  Hello,
>>>
>>> I've found a number of posts in different places talking about how to
>>> perform decompounding, but I haven't found too many discussing how to use
>>> the results of decompounding. If anyone can answer this question or point
>>> me
>>> to an existing discussion it would be very helpful.
>>>
>>> In the description of the org.apache.lucene.analysis.compound package, it
>>> gives the following example:
>>>
>>>      Rindfleischüberwachungsgesetz, 0, 29
>>>      Rind, 0, 4, posIncr=0
>>>      fleisch, 4, 11, posIncr=0
>>>      überwachung, 11, 22, posIncr=0
>>>      gesetz, 23, 29, posIncr=0
>>>
>>> And I see how this allows me to find single components such as "gesetz"
>>> or
>>> "Rind". But what if I want to find combinations of components such as
>>> "Rindfleisch" or "überwachungsgesetz"? It seems that the pattern of using
>>> posIncr=0 for all components excludes the possibility of finding
>>> sub-strings
>>> that are made up of multiple components.
>>>
>>> Any comments or thoughts would be appreciated.
>>>
>>> Ben Douglas
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: Using org.apache.lucene.analysis.compound

Posted by Robert Muir <rc...@gmail.com>.
Paul, there are two implementations in compounds, one is dictionary-based,
the other is hyphenation-grammar + dictionary (it restricts the
decompounding based on hyphenation rules). You could also subclass the
compound base class and implement your own.

I haven't seen any user-measures (relevance, etc), would be a cool thing to
see though.

I'm not sure I understand your last question, can you elaborate?
it might be that to improve some cases, you want to use the onlyLongestMatch
parameter:
@param onlyLongestMatch Add only the longest matching subword to the stream

for scoring, I think lucene's scoring might help too, because the original
word, without decompounding, is left as a token so if you search on an exact
match it should be ranked higher. (not sure if this is answering your
question)

On Wed, Oct 21, 2009 at 5:27 AM, Paul Libbrecht <pa...@activemath.org> wrote:

>
> I'm interested in this analyzer.. it had escaped me and solves an old
> problem!
> Could you report on its usage:
> - did you have to feed words into a dictionary?
> - does anyone have user-measures already?
> ... and the last question for the research fun: is there any approach
> towards preferring Überwachungsgesetz as a concept rather than, say,
> Fleischüberwachung? (again, that could be based on a dictionary probably).
>
> thanks in advance
>
> paul
>
>
> On 21 Oct 2009 at 04:00, Robert Muir wrote:
>
>
>  hi, it will work because it will also decompound "Rindfleisch" into Rind
>> and
>> fleisch, with posIncr=0
>>
>> so if you index Rindfleischüberwachungsgesetz, then query with
>> "Rindfleisch",
>> it matches because Rindfleisch also gets decompounded into Rind and
>> fleisch.
>>
>> On Tue, Oct 20, 2009 at 8:35 PM, Benjamin Douglas
>> <bb...@basistech.com>wrote:
>>
>>  Hello,
>>>
>>> I've found a number of posts in different places talking about how to
>>> perform decompounding, but I haven't found too many discussing how to use
>>> the results of decompounding. If anyone can answer this question or point
>>> me
>>> to an existing discussion it would be very helpful.
>>>
>>> In the description of the org.apache.lucene.analysis.compound package, it
>>> gives the following example:
>>>
>>>      Rindfleischüberwachungsgesetz, 0, 29
>>>      Rind, 0, 4, posIncr=0
>>>      fleisch, 4, 11, posIncr=0
>>>      überwachung, 11, 22, posIncr=0
>>>      gesetz, 23, 29, posIncr=0
>>>
>>> And I see how this allows me to find single components such as "gesetz"
>>> or
>>> "Rind". But what if I want to find combinations of components such as
>>> "Rindfleisch" or "überwachungsgesetz"? It seems that the pattern of using
>>> posIncr=0 for all components excludes the possibility of finding
>>> sub-strings
>>> that are made up of multiple components.
>>>
>>> Any comments or thoughts would be appreciated.
>>>
>>> Ben Douglas
>>>
>>>
>>>
>>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: Using org.apache.lucene.analysis.compound

Posted by Paul Libbrecht <pa...@activemath.org>.
I'm interested in this analyzer.. it had escaped me and solves an old
problem!
Could you report on its usage:
- did you have to feed words into a dictionary?
- does anyone have user-measures already?
... and the last question for the research fun: is there any approach
towards preferring Überwachungsgesetz as a concept rather than, say,
Fleischüberwachung? (again, that could be based on a dictionary
probably).

thanks in advance

paul


On 21 Oct 2009 at 04:00, Robert Muir wrote:

> hi, it will work because it will also decompound "Rindfleisch" into
> Rind and
> fleisch, with posIncr=0
>
> so if you index Rindfleischüberwachungsgesetz, then query with
> "Rindfleisch",
> it matches because Rindfleisch also gets decompounded into Rind and
> fleisch.
>
> On Tue, Oct 20, 2009 at 8:35 PM, Benjamin Douglas
> <bb...@basistech.com>wrote:
>
>> Hello,
>>
>> I've found a number of posts in different places talking about how to
>> perform decompounding, but I haven't found too many discussing how  
>> to use
>> the results of decompounding. If anyone can answer this question or  
>> point me
>> to an existing discussion it would be very helpful.
>>
>> In the description of the org.apache.lucene.analysis.compound  
>> package, it
>> gives the following example:
>>
>>       Rindfleischüberwachungsgesetz, 0, 29
>>       Rind, 0, 4, posIncr=0
>>       fleisch, 4, 11, posIncr=0
>>       überwachung, 11, 22, posIncr=0
>>       gesetz, 23, 29, posIncr=0
>>
>> And I see how this allows me to find single components such as  
>> "gesetz" or
>> "Rind". But what if I want to find combinations of components such as
>> "Rindfleisch" or "überwachungsgesetz"? It seems that the pattern of  
>> using
>> posIncr=0 for all components excludes the possibility of finding  
>> sub-strings
>> that are made up of multiple components.
>>
>> Any comments or thoughts would be appreciated.
>>
>> Ben Douglas
>>
>>
>>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com


Re: Using org.apache.lucene.analysis.compound

Posted by Robert Muir <rc...@gmail.com>.
hi, it will work because it will also decompound "Rindfleisch" into Rind and
fleisch, with posIncr=0

so if you index Rindfleischüberwachungsgesetz, then query with "Rindfleisch",
it matches because Rindfleisch also gets decompounded into Rind and fleisch.
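The mechanism can be sketched with a toy illustration (not Lucene code): the original token is emitted with posIncr=1 and every dictionary subword is stacked at the same position with posIncr=0, so an indexed compound and a decompounded query overlap on their shared subword terms. Tokens are rendered as "term/posIncr" strings for readability.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of how the compound filter stacks tokens: the original token
// at posIncr=1, then each dictionary subword at posIncr=0.
public class PosIncrSketch {
    public static List<String> tokens(String word, Set<String> dict) {
        List<String> out = new ArrayList<>();
        String lower = word.toLowerCase();
        out.add(lower + "/1");                    // the original token
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String cand = lower.substring(start, end);
                if (dict.contains(cand)) {
                    out.add(cand + "/0");         // subword at posIncr=0
                }
            }
        }
        return out;
    }

    // True if the indexed word and the query share at least one token term,
    // which is why a decompounded query matches a decompounded compound.
    public static boolean matches(String indexed, String query, Set<String> dict) {
        Set<String> terms = new HashSet<>();
        for (String t : tokens(indexed, dict)) terms.add(t.substring(0, t.indexOf('/')));
        for (String t : tokens(query, dict)) {
            if (terms.contains(t.substring(0, t.indexOf('/')))) return true;
        }
        return false;
    }
}
```

So indexing Rindfleischüberwachungsgesetz and querying Rindfleisch match through the shared rind and fleisch tokens, exactly as described above.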

On Tue, Oct 20, 2009 at 8:35 PM, Benjamin Douglas
<bb...@basistech.com>wrote:

> Hello,
>
> I've found a number of posts in different places talking about how to
> perform decompounding, but I haven't found too many discussing how to use
> the results of decompounding. If anyone can answer this question or point me
> to an existing discussion it would be very helpful.
>
> In the description of the org.apache.lucene.analysis.compound package, it
> gives the following example:
>
>        Rindfleischüberwachungsgesetz, 0, 29
>        Rind, 0, 4, posIncr=0
>        fleisch, 4, 11, posIncr=0
>        überwachung, 11, 22, posIncr=0
>        gesetz, 23, 29, posIncr=0
>
> And I see how this allows me to find single components such as "gesetz" or
> "Rind". But what if I want to find combinations of components such as
> "Rindfleisch" or "überwachungsgesetz"? It seems that the pattern of using
> posIncr=0 for all components excludes the possibility of finding sub-strings
> that are made up of multiple components.
>
> Any comments or thoughts would be appreciated.
>
> Ben Douglas
>
>
>


-- 
Robert Muir
rcmuir@gmail.com