Posted to java-user@lucene.apache.org by Saurabh Gokhale <sa...@gmail.com> on 2011/08/03 00:39:07 UTC

Multiple Query clauses impacting result

Hi All,

As I add new clauses to a BooleanQuery, my queryNorm value goes down,
which lowers the scores of the results.



For example (the complete standalone application is attached to this email;
I am using Lucene 3.1.0):

I indexed the following 6 documents (1st argument = author name,
2nd = subject, 3rd = ISBN):

addDoc("author1", "My first book", "123");
addDoc("author2", "My next book", "333");
addDoc("author2", "this first text", "444");
addDoc("author3", "test the knowledge", "456");
addDoc("author4", "knowledge is vertue", "789");
addDoc("author5", "saurabh", "222");

The BooleanQuery below generates the following result:

Query = (author:author1) (subject:book subject:first subject:my) -isbn:123
Match: 26.498592%  || Doc Author: author2 || Doc subject: My next book || Doc ISBN: 333
Match: 8.280809%  || Doc Author: author2 || Doc subject: this first text || Doc ISBN: 444

Now, if I add a new clause to this BooleanQuery, in this case a SpanNearQuery
whose search terms do not exist in the index, my result percentages go down.

Query = (author:author1) (subject:book subject:first subject:my) -isbn:123
spanNear([subject:not, subject:found], 3, true)
Match: 9.584372%  || Doc Author: author2 || Doc subject: My next book || Doc ISBN: 333
Match: 2.995116%  || Doc Author: author2 || Doc subject: this first text || Doc ISBN: 444

Now the problem is that the same documents which matched with 26 and 8
percent in the first query result now match with 9 and 2 percent. Ideally I
would not expect any change in the result percentage, as all my clauses are
combined with Boolean OR. But because the queryNorm factor changes when the
new clause is added, my result is impacted. (You can see the complete code
in the attached Java file.)
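
To make the effect concrete: in Lucene 3.x, DefaultSimilarity computes
queryNorm as 1/sqrt(sumOfSquaredWeights) and coord as overlap/maxOverlap, so
adding an optional clause that never matches lowers both factors for every
hit. The sketch below reproduces only that arithmetic in plain Java. The class
name and the clause weights are made up for illustration and are not taken
from the attached application.

```java
// Sketch of two query-level factors as computed by Lucene 3.x DefaultSimilarity.
// The clause weights used in main() are hypothetical illustration values.
class QueryNormSketch {

    // queryNorm = 1 / sqrt(sum of squared clause weights)
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // coord = matched optional clauses / total optional clauses
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Helper: sum of squares of the given clause weights.
    static double sumOfSquares(double[] weights) {
        double sum = 0.0;
        for (double w : weights) {
            sum += w * w;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] before = {1.0, 2.0};      // two optional clauses (assumed weights)
        double[] after = {1.0, 2.0, 2.0};  // same clauses plus one that never matches

        // A document that matches exactly one optional clause in both queries.
        double factorBefore = queryNorm(sumOfSquares(before)) * coord(1, 2);
        double factorAfter = queryNorm(sumOfSquares(after)) * coord(1, 3);

        System.out.println("query-level factor before: " + factorBefore);
        System.out.println("query-level factor after:  " + factorAfter);
    }
}
```

Both factors depend only on the query and on how many clauses a document
matched, which is why every hit is scaled down when a clause is added.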

Now consider a scenario where my job is to find out whether 100 special
words (either single words or combinations of multiple words) are present in
a document or not. My scores will drop sharply, because not all documents
will contain those words and my queryNorm will be very low due to the
addition of 99 more OR clauses.

Is there a way to get consistent results regardless of how many OR clauses I
add to my query? In other words, is there a way to control the queryNorm, if
that is the root cause?

Thanks

Saurabh

Re: Multiple Query clauses impacting result

Posted by Chris Hostetter <ho...@fucit.org>.
: So in a business scenario where we have to make a decision based on the
: "accepted" matching of a document (say, perform activity A only when a
: document matches more than 50%), we won't be able to rely on the match
: score, because the score changes with the query, and sometimes an 80%
: match may not be as close as a 5% match from a slightly different query.
: (I know I am going back to % again :)
: 
: So how do we handle such a scenario?

you have to redefine your criteria.  "50% match" is meaningless -- you have 
to decide what that means: does it mean matching half of the clauses in a 
boolean query?  what if a doc matches only 1/3 of the clauses, but it 
matches them 100 times each? what if it matches 1/2 the clauses, 100 times 
each, but that only makes up a tiny fraction of the total terms in that 
document (ie: it's got the entire contents of wikipedia in every field)?  
what if the query isn't a boolean query but a phrase query?

if you have a constrained set of possible queries, and you can define 
precisely what rules you care about, you can modify your similarity class 
so that, regardless of the index, it produces scores that you *can* use to 
make inferences given your rules.
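
One way to pin down a rule like "matches at least half of the clauses" is to
compute it directly instead of reading it off the score. The following is
only a sketch under simple assumptions (lowercased, whitespace-split tokens,
single-word terms); the class and method names are hypothetical and it does
not use any Lucene API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: what fraction of the wanted terms occurs in a text?
class ClauseCoverageSketch {

    static double coverage(List<String> wantedTerms, String text) {
        if (wantedTerms.isEmpty()) {
            return 0.0;
        }
        // Assumption: tokens are whitespace-separated and compared in lowercase.
        Set<String> tokens =
            new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
        int matched = 0;
        for (String term : wantedTerms) {
            if (tokens.contains(term.toLowerCase())) {
                matched++;
            }
        }
        return (double) matched / wantedTerms.size();
    }

    public static void main(String[] args) {
        List<String> wanted = Arrays.asList("book", "first", "my");
        // Subjects taken from the documents indexed in the original post.
        for (String subject : new String[] {"My next book", "this first text"}) {
            double c = coverage(wanted, subject);
            System.out.println(subject + " -> " + c
                + (c >= 0.5 ? " (accept)" : " (reject)"));
        }
    }
}
```

Unlike the raw score, this fraction does not depend on queryNorm, so adding
more wanted terms changes it only through the terms themselves.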

See Also...
	http://www.gossamer-threads.com/lists/lucene/java-user/61075
	http://markmail.org/thread/3svvskbay4hpqyms
	http://markmail.org/message/lztdm4xosmceup5t
And a real oldy but goodie...
	http://markmail.org/message/5eipstcu6lky2h2j


-Hoss



Re: Multiple Query clauses impacting result

Posted by Saurabh Gokhale <sa...@gmail.com>.
Hi Uwe,

Thanks for clarifying; the link you gave does have a satisfactory
explanation.

So in a business scenario where we have to make a decision based on the
"accepted" matching of a document (say, perform activity A only when a
document matches more than 50%), we won't be able to rely on the match
score, because the score changes with the query, and sometimes an 80%
match may not be as close as a 5% match from a slightly different query.
(I know I am going back to % again :)

So how do we handle such a scenario?


Thanks

Saurabh


On Wed, Aug 3, 2011 at 1:34 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi Saurabh,
>
> There is nothing wrong with Lucene; the problem is that you are trying to
> read scores as percentages, which they aren't. Scores are arbitrary values,
> only used for sorting search results, and never for comparing results
> between different queries. In fact, it is easily possible to get back
> values > 1.0.
>
> Your examples do the right thing: the sorting is the same in both cases.
> The actual score values are *arbitrary*!
>
> See http://wiki.apache.org/lucene-java/ScoresAsPercentages for an
> explanation.
>
> Uwe

RE: Multiple Query clauses impacting result

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Saurabh,

 

There is nothing wrong with Lucene; the problem is that you are trying to
read scores as percentages, which they aren't. Scores are arbitrary values,
only used for sorting search results, and never for comparing results
between different queries. In fact, it is easily possible to get back
values > 1.0.

Your examples do the right thing: the sorting is the same in both cases. The
actual score values are *arbitrary*!
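
A quick check of the numbers posted in this thread supports this: both hits
shrink by roughly the same factor when the spanNear clause is added, so the
ordering cannot change. A minimal sketch in plain Java (the class and method
names are just for illustration):

```java
// Compares the "Match" values reported in the original post for the two queries.
class RankPreservedSketch {

    // Factor by which a score shrank between the first and second query.
    static double ratio(double before, double after) {
        return before / after;
    }

    // True if both score lists put the first document ahead of the second.
    static boolean sameOrder(double[] before, double[] after) {
        return (before[0] > before[1]) == (after[0] > after[1]);
    }

    public static void main(String[] args) {
        double[] before = {26.498592, 8.280809}; // without the spanNear clause
        double[] after = {9.584372, 2.995116};   // with the spanNear clause

        System.out.println("ratio for 'My next book':    " + ratio(before[0], after[0]));
        System.out.println("ratio for 'this first text': " + ratio(before[1], after[1]));
        System.out.println("same ordering: " + sameOrder(before, after));
    }
}
```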

 

See http://wiki.apache.org/lucene-java/ScoresAsPercentages for an
explanation.

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 
