You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Saurabh Gokhale <sa...@gmail.com> on 2011/06/19 07:06:12 UTC

Index result percentile variation based on Index time Norm and No-Norm

Hi All,

I have a question regarding index time parameter Index.ANALYZED_NO_NORMS
and Index.ANALYZED

As per the definition, ANALYZED_NO_NORMS is no different than ANALYZED
option except that it does not store norm (boost) information in the index.

So if I create 2 indexes: 1 with ANALYZED option and another
with ANALYZED_NO_NORMS option, for the same indexed document and for the
same search query both should bring same result with same matching
percentile.

But the percentile for the index created using NO_NORM is higher than the
one created with NORM. Why is that so?

Attached is the java example with 2 indexes created and the same document
searched using BooleanQuery.

The result indicates that though the matched documents are same, their
percentiles differ? can some one pls help me understand norm better? is it
because in the index with norm, lucene adds some boost for index fields?



Result of running the program is:
--------------------------------------------------------------------------------------------
Adding Doc to NORM Index :: author : author1
---------------------------------------
Adding Doc to NO NORM Index :: author : author1
=======================================
Adding Doc to NORM Index :: author : author2
---------------------------------------
Adding Doc to NO NORM Index :: author : author2
=======================================
Adding Doc to NORM Index :: author : author3
---------------------------------------
Adding Doc to NO NORM Index :: author : author3
=======================================
Adding Doc to NORM Index :: author : author4
---------------------------------------
Adding Doc to NO NORM Index :: author : author4
=======================================

Search for the book with Author: author1
NORM:: WITH NORM  Query :: (author:author1) (subject:book subject:first
subject:my) -isbn:123
 Match: 19.738173%  || Doc Author: author2 || Doc subject: My next book ||
Doc ISBN: 333
 Match: 6.168179%  || Doc Author: author3 || Doc subject: this first text ||
Doc ISBN: 444

Search for the book with Author: author1
NORM:: WITHOUT NORM  Query :: (author:author1) (subject:book subject:first
subject:my) -isbn:123
 Match: 39.476345%  || Doc Author: author2 || Doc subject: My next book ||
Doc ISBN: 333
 Match: 9.869086%  || Doc Author: author3 || Doc subject: this first text ||
Doc ISBN: 444



Thanks

Saurabh

Re: Index result percentile variation based on Index time Norm and No-Norm

Posted by Simon Willnauer <si...@googlemail.com>.

On Sun, Jun 19, 2011 at 5:39 PM, Saurabh Gokhale
<sa...@gmail.com> wrote:
> Yes, it makes sense. So in the case of No_Norm I suppose, all the fields
> small or large, are treated the same instead of as per their sizes and
> implicit boost factor.

well yes they will use TF / IDF for scoring additional length
normalization and boost are omitted

simomn
>
> Thanks for the explanation
>
> Saurabh
>
> On Sun, Jun 19, 2011 at 9:26 AM, Simon Willnauer <
> simon.willnauer@googlemail.com> wrote:
>
>> if you use norms lucene uses the boost of the fields / document and
>> multiplies it with a length normalization factor -> 1.0 /
>> Math.sqrt(numTerms) so you scores should be different. Does that
>> explain what you are seeing?
>>
>> Simon
>>
>> On Sun, Jun 19, 2011 at 7:06 AM, Saurabh Gokhale
>> <sa...@gmail.com> wrote:
>> > Hi All,
>> > I have a question regarding index time parameter Index.ANALYZED_NO_NORMS
>> > and Index.ANALYZED
>> > As per the definition, ANALYZED_NO_NORMS is no different than ANALYZED
>> > option except that it does not store norm (boost) information in the
>> index.
>> > So if I create 2 indexes: 1 with ANALYZED option and another
>> > with ANALYZED_NO_NORMS option, for the same indexed document and for the
>> > same search query both should bring same result with same matching
>> > percentile.
>> > But the percentile for the index created using NO_NORM is higher than the
>> > one created with NORM. Why is that so?
>> > Attached is the java example with 2 indexes created and the same document
>> > searched using BooleanQuery.
>> > The result indicates that though the matched documents are same, their
>> > percentiles differ? can some one pls help me understand norm better? is
>> it
>> > because in the index with norm, lucene adds some boost for index fields?
>> >
>> >
>> > Result of running the program is:
>> >
>> --------------------------------------------------------------------------------------------
>> > Adding Doc to NORM Index :: author : author1
>> > ---------------------------------------
>> > Adding Doc to NO NORM Index :: author : author1
>> > =======================================
>> > Adding Doc to NORM Index :: author : author2
>> > ---------------------------------------
>> > Adding Doc to NO NORM Index :: author : author2
>> > =======================================
>> > Adding Doc to NORM Index :: author : author3
>> > ---------------------------------------
>> > Adding Doc to NO NORM Index :: author : author3
>> > =======================================
>> > Adding Doc to NORM Index :: author : author4
>> > ---------------------------------------
>> > Adding Doc to NO NORM Index :: author : author4
>> > =======================================
>> > Search for the book with Author: author1
>> > NORM:: WITH NORM  Query :: (author:author1) (subject:book subject:first
>> > subject:my) -isbn:123
>> >  Match: 19.738173%  || Doc Author: author2 || Doc subject: My next book
>> ||
>> > Doc ISBN: 333
>> >  Match: 6.168179%  || Doc Author: author3 || Doc subject: this first text
>> ||
>> > Doc ISBN: 444
>> > Search for the book with Author: author1
>> > NORM:: WITHOUT NORM  Query :: (author:author1) (subject:book
>> subject:first
>> > subject:my) -isbn:123
>> >  Match: 39.476345%  || Doc Author: author2 || Doc subject: My next book
>> ||
>> > Doc ISBN: 333
>> >  Match: 9.869086%  || Doc Author: author3 || Doc subject: this first text
>> ||
>> > Doc ISBN: 444
>> >
>> >
>> > Thanks
>> > Saurabh
>> >
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Index result percentile variation based on Index time Norm and No-Norm

Posted by Saurabh Gokhale <sa...@gmail.com>.

Yes, it makes sense. So in the case of No_Norm I suppose, all the fields
small or large, are treated the same instead of as per their sizes and
implicit boost factor.

Thanks for the explanation

Saurabh

On Sun, Jun 19, 2011 at 9:26 AM, Simon Willnauer <
simon.willnauer@googlemail.com> wrote:

> if you use norms lucene uses the boost of the fields / document and
> multiplies it with a length normalization factor -> 1.0 /
> Math.sqrt(numTerms) so you scores should be different. Does that
> explain what you are seeing?
>
> Simon
>
> On Sun, Jun 19, 2011 at 7:06 AM, Saurabh Gokhale
> <sa...@gmail.com> wrote:
> > Hi All,
> > I have a question regarding index time parameter Index.ANALYZED_NO_NORMS
> > and Index.ANALYZED
> > As per the definition, ANALYZED_NO_NORMS is no different than ANALYZED
> > option except that it does not store norm (boost) information in the
> index.
> > So if I create 2 indexes: 1 with ANALYZED option and another
> > with ANALYZED_NO_NORMS option, for the same indexed document and for the
> > same search query both should bring same result with same matching
> > percentile.
> > But the percentile for the index created using NO_NORM is higher than the
> > one created with NORM. Why is that so?
> > Attached is the java example with 2 indexes created and the same document
> > searched using BooleanQuery.
> > The result indicates that though the matched documents are same, their
> > percentiles differ? can some one pls help me understand norm better? is
> it
> > because in the index with norm, lucene adds some boost for index fields?
> >
> >
> > Result of running the program is:
> >
> --------------------------------------------------------------------------------------------
> > Adding Doc to NORM Index :: author : author1
> > ---------------------------------------
> > Adding Doc to NO NORM Index :: author : author1
> > =======================================
> > Adding Doc to NORM Index :: author : author2
> > ---------------------------------------
> > Adding Doc to NO NORM Index :: author : author2
> > =======================================
> > Adding Doc to NORM Index :: author : author3
> > ---------------------------------------
> > Adding Doc to NO NORM Index :: author : author3
> > =======================================
> > Adding Doc to NORM Index :: author : author4
> > ---------------------------------------
> > Adding Doc to NO NORM Index :: author : author4
> > =======================================
> > Search for the book with Author: author1
> > NORM:: WITH NORM  Query :: (author:author1) (subject:book subject:first
> > subject:my) -isbn:123
> >  Match: 19.738173%  || Doc Author: author2 || Doc subject: My next book
> ||
> > Doc ISBN: 333
> >  Match: 6.168179%  || Doc Author: author3 || Doc subject: this first text
> ||
> > Doc ISBN: 444
> > Search for the book with Author: author1
> > NORM:: WITHOUT NORM  Query :: (author:author1) (subject:book
> subject:first
> > subject:my) -isbn:123
> >  Match: 39.476345%  || Doc Author: author2 || Doc subject: My next book
> ||
> > Doc ISBN: 333
> >  Match: 9.869086%  || Doc Author: author3 || Doc subject: this first text
> ||
> > Doc ISBN: 444
> >
> >
> > Thanks
> > Saurabh
> >
>

Re: Index result percentile variation based on Index time Norm and No-Norm

Posted by Simon Willnauer <si...@googlemail.com>.

if you use norms lucene uses the boost of the fields / document and
multiplies it with a length normalization factor -> 1.0 /
Math.sqrt(numTerms) so you scores should be different. Does that
explain what you are seeing?

Simon

On Sun, Jun 19, 2011 at 7:06 AM, Saurabh Gokhale
<sa...@gmail.com> wrote:
> Hi All,
> I have a question regarding index time parameter Index.ANALYZED_NO_NORMS
> and Index.ANALYZED
> As per the definition, ANALYZED_NO_NORMS is no different than ANALYZED
> option except that it does not store norm (boost) information in the index.
> So if I create 2 indexes: 1 with ANALYZED option and another
> with ANALYZED_NO_NORMS option, for the same indexed document and for the
> same search query both should bring same result with same matching
> percentile.
> But the percentile for the index created using NO_NORM is higher than the
> one created with NORM. Why is that so?
> Attached is the java example with 2 indexes created and the same document
> searched using BooleanQuery.
> The result indicates that though the matched documents are same, their
> percentiles differ? can some one pls help me understand norm better? is it
> because in the index with norm, lucene adds some boost for index fields?
>
>
> Result of running the program is:
> --------------------------------------------------------------------------------------------
> Adding Doc to NORM Index :: author : author1
> ---------------------------------------
> Adding Doc to NO NORM Index :: author : author1
> =======================================
> Adding Doc to NORM Index :: author : author2
> ---------------------------------------
> Adding Doc to NO NORM Index :: author : author2
> =======================================
> Adding Doc to NORM Index :: author : author3
> ---------------------------------------
> Adding Doc to NO NORM Index :: author : author3
> =======================================
> Adding Doc to NORM Index :: author : author4
> ---------------------------------------
> Adding Doc to NO NORM Index :: author : author4
> =======================================
> Search for the book with Author: author1
> NORM:: WITH NORM  Query :: (author:author1) (subject:book subject:first
> subject:my) -isbn:123
>  Match: 19.738173%  || Doc Author: author2 || Doc subject: My next book ||
> Doc ISBN: 333
>  Match: 6.168179%  || Doc Author: author3 || Doc subject: this first text ||
> Doc ISBN: 444
> Search for the book with Author: author1
> NORM:: WITHOUT NORM  Query :: (author:author1) (subject:book subject:first
> subject:my) -isbn:123
>  Match: 39.476345%  || Doc Author: author2 || Doc subject: My next book ||
> Doc ISBN: 333
>  Match: 9.869086%  || Doc Author: author3 || Doc subject: this first text ||
> Doc ISBN: 444
>
>
> Thanks
> Saurabh
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org