Posted to user@lucy.apache.org by "serkanmulayim@gmail.com" <se...@gmail.com> on 2017/11/21 01:09:02 UTC

[lucy-user] C library - Scoring mechanism

Hi guys,

I have a question regarding the scoring mechanism for relevancy. Is the scoring mechanism tf/idf when a field is indexed with the EasyAnalyzer in the schema? What happens when multiple terms are used? Are the tf/idf scores summed? How is the location of the words incorporated into the scoring mechanism for queries with multiple words?

How about fields which use a RegexTokenizer? Is it still the same mechanism? Does the type of the tokenizer affect the scoring? I believe what matters is the generated tokens (not the tokenizer itself), and maybe the order of the tokens in a document.

One more thing: if I wanted to change the scoring mechanism for different fields, how could I do it? Are there any predefined mechanisms, e.g. tf/idf, doc2vec, etc.? Or, if I want to go further and come up with my own, how can I do it?

Thanks,
Serkan



Re: [lucy-user] C library - Scoring mechanism

Posted by "serkanmulayim@gmail.com" <se...@gmail.com>.
Thank you very much Nick and Marvin. Your replies were really helpful.

On 2017-11-23 11:38, Marvin Humphrey <ma...@rectangular.com> wrote: 
> On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> > On 21/11/2017 18:42, serkanmulayim@gmail.com wrote:
> 
> >> 2- (same question but for multiple indexes and polysearcher) If I use
> >> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
> >> Or would they be calculated separately for each index?
> >
> > I don't know off the top of my head. It's possible that indexes are searched
> > separately and the results are simply merged by normalized score. I'd have
> > to look at the code to answer the question, but maybe Marvin can chime in.
> 
> The scores will be consistent.
> 
> To calculate IDF for a term accurately across a composite corpus
> formed from multiple indexes, you need to know two things:
> 
> 1. The total number of documents in the corpus. (Doc_Max())
> 2. The total number of documents which contain the term. (Doc_Freq(field, term))
> 
> Both PolySearcher and ClusterSearcher calculate their doc_max on
> construction by summing the doc_max totals of all subsearchers.
> Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
> responses for all subsearchers.
> 
> https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
> https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
> https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
> https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348
> 
> This approach trades away some performance for the sake of accuracy,
> particularly with Doc_Freq -- query normalization takes longer when
> you have to wait for a lot of subsearchers to report Doc_Freq numbers
> for N terms. However, the alternative is occasional bizarre search
> results.
> 
> The best anecdote I ever heard illustrating why it's important to
> calculate aggregate IDF consistently was an application searching a
> multi-shard index containing news articles split by year.  If you
> searched for "iphone", it would be a very common term after the first
> release of the Apple iPhone. However, in the years prior to the Apple
> iPhone's release, if "iphone" existed in a shard it was likely a typo,
> so it would be very rare **and thus heavily weighted**. So the top hit
> for "iphone", without consistent IDF calculation, would be a typo'd
> article.
> 
> (A performance improvement on this stratagem is to create a shared
> Doc_Freq source. So long as it contains all the common terms across
> all shards, it doesn't have to be updated often -- Doc_Freq values
> don't change very fast as indexes are updated.)
> 
> Marvin Humphrey
> 

Re: [lucy-user] C library - Scoring mechanism

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 21/11/2017 18:42, serkanmulayim@gmail.com wrote:

>> 2- (same question but for multiple indexes and polysearcher) If I use
>> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
>> Or would they be calculated separately for each index?
>
> I don't know off the top of my head. It's possible that indexes are searched
> separately and the results are simply merged by normalized score. I'd have
> to look at the code to answer the question, but maybe Marvin can chime in.

The scores will be consistent.

To calculate IDF for a term accurately across a composite corpus
formed from multiple indexes, you need to know two things:

1. The total number of documents in the corpus. (Doc_Max())
2. The total number of documents which contain the term. (Doc_Freq(field, term))

Both PolySearcher and ClusterSearcher calculate their doc_max on
construction by summing the doc_max totals of all subsearchers.
Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
responses for all subsearchers.

https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348
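
To make the arithmetic concrete, here is a minimal standalone sketch of that aggregation. It is not Lucy's actual code; it assumes the classic Lucene-style IDF formula, idf = 1 + log(doc_max / (doc_freq + 1)), and hypothetical per-shard counts:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-shard statistics for a single term. */
typedef struct {
    uint64_t doc_max;   /* total docs in this shard                 */
    uint64_t doc_freq;  /* docs in this shard that contain the term */
} ShardStats;

/* Sum doc_max and doc_freq across all subsearchers first, then apply
 * the IDF formula once, so every shard scores against the same weight. */
static double
aggregate_idf(const ShardStats *shards, size_t num_shards) {
    uint64_t doc_max  = 0;
    uint64_t doc_freq = 0;
    for (size_t i = 0; i < num_shards; i++) {
        doc_max  += shards[i].doc_max;
        doc_freq += shards[i].doc_freq;
    }
    return 1.0 + log((double)doc_max / (double)(doc_freq + 1));
}

int main(void) {
    /* e.g. "iphone" in a pre-2007 news shard vs. a post-2007 shard */
    ShardStats shards[] = {
        { 100000,     3 },   /* old shard: the term is a rare typo */
        { 100000, 20000 },   /* new shard: the term is very common */
    };
    printf("consistent idf = %f\n", aggregate_idf(shards, 2));
    return 0;
}

Computing the IDF per shard instead would give the old shard a huge weight for the term, which is exactly the failure mode described below.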

This approach trades away some performance for the sake of accuracy,
particularly with Doc_Freq -- query normalization takes longer when
you have to wait for a lot of subsearchers to report Doc_Freq numbers
for N terms. However, the alternative is occasional bizarre search
results.

The best anecdote I ever heard illustrating why it's important to
calculate aggregate IDF consistently was an application searching a
multi-shard index containing news articles split by year.  If you
searched for "iphone", it would be a very common term after the first
release of the Apple iPhone. However, in the years prior to the Apple
iPhone's release, if "iphone" existed in a shard it was likely a typo,
so it would be very rare **and thus heavily weighted**. So the top hit
for "iphone", without consistent IDF calculation, would be a typo'd
article.

(A performance improvement on this stratagem is to create a shared
Doc_Freq source. So long as it contains all the common terms across
all shards, it doesn't have to be updated often -- Doc_Freq values
don't change very fast as indexes are updated.)

Marvin Humphrey

Re: [lucy-user] C library - Scoring mechanism

Posted by Nick Wellnhofer <we...@aevum.de>.
On 21/11/2017 18:42, serkanmulayim@gmail.com wrote:
> 1- Are the tf/idf scores consistent across all segments in a non-optimized index? Or are they calculated separately for each segment (tf would not change, but idf might be different)?

tf/idf is computed for the whole index.

> 2- (same question but for multiple indexes and polysearcher) If I use polysearcher with 2 or more indexes, will the tf/idf scores be consistent? Or would they be calculated separately for each index?

I don't know off the top of my head. It's possible that indexes are searched
separately and the results are simply merged by normalized score. I'd have to 
look at the code to answer the question, but maybe Marvin can chime in.

Nick

Re: [lucy-user] C library - Scoring mechanism

Posted by "serkanmulayim@gmail.com" <se...@gmail.com>.
Thank you very much Nick for your response.

I would like to ask two more questions:
1- Are the tf/idf scores consistent across all segments in a non-optimized index? Or are they calculated separately for each segment (tf would not change, but idf might be different)?
2- (same question but for multiple indexes and polysearcher) If I use polysearcher with 2 or more indexes, will the tf/idf scores be consistent? Or would they be calculated separately for each index?

Regards,
Serkan

On 2017-11-21 01:49, Nick Wellnhofer <we...@aevum.de> wrote: 
> 
> On Nov 21, 2017, at 02:09, serkanmulayim@gmail.com wrote:
> > I have a question regarding the scoring mechanism for relevancy. Is the scoring mechanism tf/idf when a field is indexed with the EasyAnalyzer in the schema? What happens when multiple terms are used? Are the tf/idf scores summed?
> 
> Lucy uses Lucene's Practical Scoring Function by default:
> 
> https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html
> 
> Essentially, tf/idf values are summed after being multiplied by each term's boost and normalization factor.
> 
> > How is the location of the words incorporated into the scoring mechanism for queries with multiple words?
> 
> > How about fields which use a RegexTokenizer? Is it still the same mechanism? Does the type of the tokenizer affect the scoring? I believe what matters is the generated tokens (not the tokenizer itself), and maybe the order of the tokens in a document.
> 
> If you use the core Tokenizers, the type of Tokenizer or the location of terms in a document doesn't affect scoring. But you can write a custom Tokenizer that sets different boost values for each Token, for example depending on the location within the document.
> 
> > One more thing: if I wanted to change the scoring mechanism for different fields, how could I do it? Are there any predefined mechanisms, e.g. tf/idf, doc2vec, etc.? Or, if I want to go further and come up with my own, how can I do it?
> 
> You can tweak the scoring formula by supplying your own Similarity subclass for each FieldType, possibly in conjunction with your own Query/Compiler/Matcher subclasses:
> 
> https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html
> 
> The public documentation for Similarity is incomplete, unfortunately. But the class is similar to Lucene’s. The .cfh file contains more details:
> 
> https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD
> 
> You’d typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.
> 
> Nick
> 
> 

Re: [lucy-user] C library - Scoring mechanism

Posted by Nick Wellnhofer <we...@aevum.de>.
On Nov 21, 2017, at 02:09, serkanmulayim@gmail.com wrote:
> I have a question regarding the scoring mechanism for relevancy. Is the scoring mechanism tf/idf when a field is indexed with the EasyAnalyzer in the schema? What happens when multiple terms are used? Are the tf/idf scores summed?

Lucy uses Lucene's Practical Scoring Function by default:

https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

Essentially, tf/idf values are summed after being multiplied by each term's boost and normalization factor.
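
As documented on that page, the default formula is roughly, for a query q and document d:

    score(q,d) = coord(q,d) * queryNorm(q)
                 * sum over each term t in q of
                   ( tf(t in d) * idf(t)^2 * boost(t) * norm(t,d) )

where coord(q,d) rewards documents that match more of the query's terms, queryNorm(q) makes scores from different queries comparable, and norm(t,d) folds in index-time boosts and field length normalization.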

> How is the location of the words incorporated into the scoring mechanism for queries with multiple words?

> How about fields which use a RegexTokenizer? Is it still the same mechanism? Does the type of the tokenizer affect the scoring? I believe what matters is the generated tokens (not the tokenizer itself), and maybe the order of the tokens in a document.

If you use the core Tokenizers, the type of Tokenizer or the location of terms in a document doesn't affect scoring. But you can write a custom Tokenizer that sets different boost values for each Token, for example depending on the location within the document.
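
For illustration, a position-dependent boost amounts to something like the following (a hypothetical token struct, not Lucy's Token/Inversion API):

#include <stddef.h>

/* Purely illustrative: weight tokens that appear early in a document
 * more heavily by assigning them a larger per-token boost. */
typedef struct {
    const char *text;
    float       boost;
} MyToken;

static void
assign_positional_boosts(MyToken *tokens, size_t num_tokens) {
    for (size_t i = 0; i < num_tokens; i++) {
        tokens[i].boost = (i < 10) ? 2.0f : 1.0f;  /* first 10 tokens count double */
    }
}

In a real custom Tokenizer you would set the equivalent boost on each Token you emit.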

> One more thing: if I wanted to change the scoring mechanism for different fields, how could I do it? Are there any predefined mechanisms, e.g. tf/idf, doc2vec, etc.? Or, if I want to go further and come up with my own, how can I do it?

You can tweak the scoring formula by supplying your own Similarity subclass for each FieldType, possibly in conjunction with your own Query/Compiler/Matcher subclasses:

https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html

The public documentation for Similarity is incomplete, unfortunately. But the class is similar to Lucene’s. The .cfh file contains more details:

https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD

You’d typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.
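
For a concrete, if simplified, picture of what such an override changes, here are the default Lucene/Lucy-style term-frequency weight and one possible alternative as standalone C (this is not the Clownfish subclassing boilerplate, just the formulas involved):

#include <math.h>

/* Default Lucene/Lucy-style term-frequency weight: sqrt(freq). */
static float
default_tf(float freq) {
    return (float)sqrt((double)freq);
}

/* A hypothetical override: sublinear tf, so heavy repetition of a term
 * within a single document counts for much less. */
static float
sublinear_tf(float freq) {
    return freq > 0.0f ? 1.0f + (float)log((double)freq) : 0.0f;
}

Because the Similarity subclass is supplied per FieldType in the schema, a change like this would affect scoring only for that field.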

Nick