
Posted to user@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2017/12/04 15:20:12 UTC

Re: [lucy-user] C library - Phrase Searches

On 28/11/2017 18:55, serkanmulayim@gmail.com wrote:
> My question is how such queries are handled in the library. Is it by looking at consecutive term positions in documents?

Yes.

> What is the performance impact for such queries?

This depends on how you quantify "performance impact", but in general, 
performance should be similar to an ANDQuery of all terms in the phrase.

> Secondly, how are they scored? Is it still tf/idf? If so, what are the definitions of tf and idf for these queries?

It's still tf/idf. For idf, the sum of each term's idf is used. For tf, it's 
the number of phrases in a document.

For more details, see PhraseQuery.c and PhraseMatcher.c in core/Lucy/Search.
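
To make the scoring description above concrete, here is a minimal sketch in C. It is not taken from PhraseQuery.c or PhraseMatcher.c: the function names phrase_idf and phrase_weight are hypothetical, the per-term idf values are assumed to be computed elsewhere, and the plain product leaves out any dampening or normalization the library may apply.

```c
#include <stddef.h>

/* Hypothetical helper, not Lucy's API: the idf of a phrase is taken to be
 * the sum of the idf values of its terms, as described in this thread. */
double
phrase_idf(const double *term_idfs, size_t num_terms) {
    double sum = 0.0;
    for (size_t i = 0; i < num_terms; i++) {
        sum += term_idfs[i];
    }
    return sum;
}

/* Hypothetical helper: tf is the number of times the whole phrase occurs
 * in the document. The unnormalized product below is an assumption made
 * for illustration only. */
double
phrase_weight(size_t phrase_count, double idf) {
    return (double)phrase_count * idf;
}
```

For example, a three-term phrase whose terms have idf values 1.5, 2.0 and 0.5 gets a phrase idf of 4.0; if the phrase occurs twice in a document, this simplified weight is 8.0.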

Nick

Re: [lucy-user] C library - Phrase Searches

Posted by "serkanmulayim@gmail.com" <se...@gmail.com>.

Hi Nick,

Thank you very much for your response. I think you are right that the performance should be very similar to an ANDQuery. I looked at PhraseMatcher and saw that it has an additional for loop over the query terms, which checks that the positions of the terms are consecutive (to ensure that the match is a phrase). I was concerned about the cost of this loop, but on second thought it only multiplies the complexity of an ANDQuery by a small factor (the number of terms in the phrase query).
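
The position check discussed above can be sketched roughly as follows. This is an illustration in C, not the code from PhraseMatcher.c: the function count_phrase_matches and its parameters are hypothetical, and each term's positions within one document are assumed to be available as a sorted array.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not Lucy's API. positions[i] holds the ascending
 * positions of the i-th phrase term in one document, and counts[i] is the
 * length of that array. Returns how often the terms occur at consecutive
 * positions, i.e. the number of phrase occurrences. */
size_t
count_phrase_matches(const uint32_t *const *positions,
                     const size_t *counts, size_t num_terms) {
    if (num_terms == 0) { return 0; }
    /* One cursor per term; cursors only move forward. */
    size_t *cursors = calloc(num_terms, sizeof(size_t));
    if (cursors == NULL) { return 0; }

    size_t matches = 0;
    for (size_t a = 0; a < counts[0]; a++) {
        uint32_t start = positions[0][a];
        int ok = 1;
        /* The extra loop over the query terms: term i must sit at start + i. */
        for (size_t i = 1; i < num_terms && ok; i++) {
            uint32_t target = start + (uint32_t)i;
            while (cursors[i] < counts[i]
                   && positions[i][cursors[i]] < target) {
                cursors[i]++;
            }
            ok = cursors[i] < counts[i]
                 && positions[i][cursors[i]] == target;
        }
        matches += (size_t)ok;
    }
    free(cursors);
    return matches;
}
```

Because every cursor only advances, the work per document in this sketch is bounded by the total number of positions of all phrase terms, which is consistent with the cost being that of an ANDQuery times a small factor.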

Thanks again,
Serkan
