Posted to dev@lucene.apache.org by Berkes Adam <ad...@intland.com> on 2009/08/27 14:22:33 UTC

FuzzyLikeThis query and exact matches

Hi,

In our Java project we use a (slightly modified) version of the
FuzzyLikeThis query, whose documentation says:

"For each source term the fuzzy variants are held in a BooleanQuery with 
no coord factor (because
 we are not looking for matches on multiple variants in any one doc). 
Additionally, a specialized
 TermQuery is used for variants and does not use that variant term's IDF 
because this would favour rarer
 terms eg misspellings. Instead, all variants use the same IDF ranking 
(the one for the source query
 term) and this is factored into the variant's boost. If the source 
query term does not exist in the
 index the average IDF of the variants is used."

In most cases it performs well, but for a short query term with (as 
usual) a large number of variants, the exact matches end up scattered 
among the variant matches, which is not very useful. The results should 
be ordered so that exact matches come first (or are forced to be more 
relevant), followed by the variant matches ranked by relevance.
Is there a simple solution, or an existing contrib query class, for 
this problem?

Best regards,
Adam Berkes,
Intland Software

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: FuzzyLikeThis query and exact matches

Posted by Berkes Adam <ad...@intland.com>.
After searching for the term "desy", which has a lot of variants in our 
index, a rewritten (sub)query looks like this:

(text:dey^0.22828968 text:des^0.22828968 text:dest^1.1557184 
text:desk^1.1557184 text:desi^1.1557184 text:desf^1.1557184 
text:desc^1.1557184 text:deny^1.1557184 text:defy^1.1557184 
text:desy^8.218443)

However, what I would like to achieve is to have all exact matches on 
top (or as high as possible), even if the rankings "validly" send them 
to the end of the matches, with the variants following them in order of 
relevance.
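
To make the desired ordering concrete, here is a minimal sketch of a 
two-tier sort done outside Lucene: exact matches first, then everything 
else, each tier ordered by score. The Hit class, its fields and the 
exactFirst method are hypothetical names made up for this illustration 
and are not part of Lucene. How a hit learns which term it matched is 
left open here; it is simply assumed to be known.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExactFirstOrdering {

    /** Hypothetical holder for one search hit; not a Lucene class. */
    static final class Hit {
        final String matchedTerm; // term the document matched (assumed to be known)
        final float score;        // relevance score reported for the hit

        Hit(String matchedTerm, float score) {
            this.matchedTerm = matchedTerm;
            this.score = score;
        }
    }

    /** Returns a copy of hits with exact matches first, each group by descending score. */
    static List<Hit> exactFirst(List<Hit> hits, String queryTerm) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            boolean exactA = a.matchedTerm.equals(queryTerm);
            boolean exactB = b.matchedTerm.equals(queryTerm);
            if (exactA != exactB) {
                return exactA ? -1 : 1;              // exact match wins regardless of score
            }
            return Float.compare(b.score, a.score);  // otherwise higher score first
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("desk", 0.9f),
                new Hit("desy", 0.2f),
                new Hit("deny", 0.5f),
                new Hit("desy", 0.7f));
        for (Hit h : exactFirst(hits, "desy")) {
            System.out.println(h.matchedTerm + " " + h.score);
        }
    }
}
```

This only reorders hits that have already been collected, so it would 
not help if the exact matches fall outside the retrieved top N.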

Maybe I misunderstand, but as far as I can see the edit distance is not 
a factor in this query type: the index is searched for terms whose edit 
distance is within a certain limit, the IDF is eliminated (with the 
factors above), and then a coordination-less boolean query is created. 
I could try modifying the score of the exact-match subterm afterwards, 
but I'm not sure that is a workable solution.
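
As a cross-check, the boosts in the rewritten query above can be 
reproduced with a small stand-alone calculation. This is a sketch under 
assumptions, not the actual FuzzyLikeThis code: it assumes each boost is 
the shared IDF times the square of a scaled similarity, where similarity 
is 1 - editDistance / min(length) and the minimum similarity is 0.6. 
The 0.6 and the formula are inferred from the numbers; only the IDF 
value 8.218443 (the boost of text:desy) is taken from the query. All 
class and method names are made up for the example.

```java
class VariantBoostCheck {

    /** Classic Levenshtein edit distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed formula: idf * ((similarity - minSimilarity) / (1 - minSimilarity))^2. */
    static double boost(String source, String variant, double minSimilarity, double idf) {
        int shorter = Math.min(source.length(), variant.length());
        double similarity = 1.0 - (double) editDistance(source, variant) / shorter;
        double scaled = (similarity - minSimilarity) / (1.0 - minSimilarity);
        return scaled * scaled * idf;
    }

    public static void main(String[] args) {
        double idf = 8.218443;       // boost of the exact term in the rewritten query
        double minSimilarity = 0.6;  // inferred from the boosts, not stated in the thread
        for (String variant : new String[] {"desy", "dest", "deny", "dey", "des"}) {
            System.out.printf("text:%s^%.7f%n", variant, boost("desy", variant, minSimilarity, idf));
        }
    }
}
```

If these assumptions hold, the same-length variants come out at roughly 
1.1557 and the three-letter variants at roughly 0.2283, which agrees 
with the listed boosts. That would mean the edit distance is already 
reflected in the boosts, with the exact term weighted about seven times 
higher than its closest variants, so the scattering is more likely to 
come from other scoring factors than from the boosts themselves.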

Best regards,
Adam


Re: FuzzyLikeThis query and exact matches

Posted by Mark Harwood <ma...@yahoo.co.uk>.
Despite making IDF a constant, the edit distance should remain a factor
in the rankings, so I would have thought this would give you what you
need.

Can you supply a more detailed example? Either print the rewritten
query or use the explain function.

Cheers
Mark
