You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Luc Vanlerberghe (JIRA)" <ji...@apache.org> on 2005/11/21 18:37:41 UTC

[jira] Created: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

(Parallel-)MultiSearcher: using Sort object changes the scores
--------------------------------------------------------------

         Key: LUCENE-469
         URL: http://issues.apache.org/jira/browse/LUCENE-469
     Project: Lucene - Java
        Type: Bug
  Components: Search  
    Versions: CVS Nightly - Specify date in submission    
 Environment: 21 november 2005, revision 345901
    Reporter: Luc Vanlerberghe


Example: 
Hits hits=multiSearcher.search(query);
returns different scores for some documents than
Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
(both for MultiSearcher and ParallelMultiSearcher)

The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.

Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.

The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)

However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).

An example: if you use
Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
for a MultiSearcher with two subsearchers, the first document will have score 1.0.
The next documents from the same subsearcher will have decreasing scores.
The first document from the other subsearcher will however have score 1.0 again !

The same applies for other Sort objects, but it is less visible.

I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

Posted by "Luc Vanlerberghe (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-469?page=all ]

Luc Vanlerberghe updated LUCENE-469:
------------------------------------

    Attachment: TestMultiSearcher.patch

This adds a test to TestMultiSearcher (and ParallelMultiSearcher since TestParallelMultiSearcher 
runs this code too) demonstrating the problem.

Two document sets are created, both with ten documents, and a query that matches exactly one of each.
Since the documents in the second set have more terms, the scores for those document should be lower.

Putting all documents in one index demonstrates this, and the scores from that are used to check the ones
obtained by MultiSearcher when the document sets are put in two different indexes.

Using searcher.search(query), the results are ok,
using searcher.search(query, Sort.RELEVANCE), they are not (both scores are 1.0)


> (Parallel-)MultiSearcher: using Sort object changes the scores
> --------------------------------------------------------------
>
>          Key: LUCENE-469
>          URL: http://issues.apache.org/jira/browse/LUCENE-469
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>  Environment: 21 november 2005, revision 345901
>     Reporter: Luc Vanlerberghe
>  Attachments: TestMultiSearcher.patch
>
> Example: 
> Hits hits=multiSearcher.search(query);
> returns different scores for some documents than
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> (both for MultiSearcher and ParallelMultiSearcher)
> The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.
> Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.
> The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)
> However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).
> An example: if you use
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> for a MultiSearcher with two subsearchers, the first document will have score 1.0.
> The next documents from the same subsearcher will have decreasing scores.
> The first document from the other subsearcher will however have score 1.0 again !
> The same applies for other Sort objects, but it is less visible.
> I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

Posted by "Luc Vanlerberghe (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-469?page=all ]

Luc Vanlerberghe updated LUCENE-469:
------------------------------------

    Attachment: MultiSearcherSort.patch

Here's the updated patch (at last....)

- the redundant changes to FieldDocSortedHitQueue are reverted.
- maxScore is private with public accessors.
- TopFieldDocs search (Weight weight, Filter filter, int n, Sort sort) is corrected in MultiSearcher and ParallelMultiSearcher.
- I didn't add compatible but deprecated constructors for TopDocs and TopFieldDocs.  Those constructors where package private, so they couldn't be used by derived classes anyway.  So I think risk of breaking backwards compatibility for existing applications by changing these classes is fairly low.


> (Parallel-)MultiSearcher: using Sort object changes the scores
> --------------------------------------------------------------
>
>          Key: LUCENE-469
>          URL: http://issues.apache.org/jira/browse/LUCENE-469
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>  Environment: 21 november 2005, revision 345901
>     Reporter: Luc Vanlerberghe
>  Attachments: MultiSearcherSort.patch, MultiSearcherSort.patch, TestMultiSearcher.patch
>
> Example: 
> Hits hits=multiSearcher.search(query);
> returns different scores for some documents than
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> (both for MultiSearcher and ParallelMultiSearcher)
> The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.
> Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.
> The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)
> However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).
> An example: if you use
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> for a MultiSearcher with two subsearchers, the first document will have score 1.0.
> The next documents from the same subsearcher will have decreasing scores.
> The first document from the other subsearcher will however have score 1.0 again !
> The same applies for other Sort objects, but it is less visible.
> I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-469?page=comments#action_12358189 ] 

Doug Cutting commented on LUCENE-469:
-------------------------------------

TopDocs.maxScore should be private with public accessors.

I also agree that score normalization should only appear in scores from Hits.java.  (FYI, the score normalization "feature" was added in the early days of Lucene when lots of folks seemed to find scores greater than 1.0 disturbing.)


> (Parallel-)MultiSearcher: using Sort object changes the scores
> --------------------------------------------------------------
>
>          Key: LUCENE-469
>          URL: http://issues.apache.org/jira/browse/LUCENE-469
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>  Environment: 21 november 2005, revision 345901
>     Reporter: Luc Vanlerberghe
>  Attachments: MultiSearcherSort.patch, TestMultiSearcher.patch
>
> Example: 
> Hits hits=multiSearcher.search(query);
> returns different scores for some documents than
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> (both for MultiSearcher and ParallelMultiSearcher)
> The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.
> Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.
> The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)
> However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).
> An example: if you use
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> for a MultiSearcher with two subsearchers, the first document will have score 1.0.
> The next documents from the same subsearcher will have decreasing scores.
> The first document from the other subsearcher will however have score 1.0 again !
> The same applies for other Sort objects, but it is less visible.
> I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-469?page=comments#action_12358183 ] 

Yonik Seeley commented on LUCENE-469:
-------------------------------------

Thanks Luc!
a couple of things:

- changing expert level search functions to not normalize.  +1 from me, but I'd like to hear from some others on a change like this.

- TopDocs and TopFieldDocs are public... this patch changes the constructors.  Although this is a great way to test if we got all the cases within Lucene, if anyone created their own instances outside lucene, it would break backward compatibility.  This is beyond expert level though... so perhaps it shouldn't worry us.

- I'm not sure if the MultiSearcher implementation is correct for other Sorts.
  FieldDocSortedHitQueue.getMaxScore() is only the max score of docs inserted... I think you need to reference  docs.maxScore() rather than relying on the  FieldDocSortedHitQueue in this case, right?


> (Parallel-)MultiSearcher: using Sort object changes the scores
> --------------------------------------------------------------
>
>          Key: LUCENE-469
>          URL: http://issues.apache.org/jira/browse/LUCENE-469
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>  Environment: 21 november 2005, revision 345901
>     Reporter: Luc Vanlerberghe
>  Attachments: MultiSearcherSort.patch, TestMultiSearcher.patch
>
> Example: 
> Hits hits=multiSearcher.search(query);
> returns different scores for some documents than
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> (both for MultiSearcher and ParallelMultiSearcher)
> The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.
> Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.
> The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)
> However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).
> An example: if you use
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> for a MultiSearcher with two subsearchers, the first document will have score 1.0.
> The next documents from the same subsearcher will have decreasing scores.
> The first document from the other subsearcher will however have score 1.0 again !
> The same applies for other Sort objects, but it is less visible.
> I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-469?page=all ]
     
Yonik Seeley resolved LUCENE-469:
---------------------------------

    Fix Version: 1.9
     Resolution: Fixed
      Assign To: Yonik Seeley

Fixed a small bug in the test (passing a null analyzer which causes a NPE with the new getPositionIncrement code), and committed.

Thanks Luke!

> (Parallel-)MultiSearcher: using Sort object changes the scores
> --------------------------------------------------------------
>
>          Key: LUCENE-469
>          URL: http://issues.apache.org/jira/browse/LUCENE-469
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>  Environment: 21 november 2005, revision 345901
>     Reporter: Luc Vanlerberghe
>     Assignee: Yonik Seeley
>      Fix For: 1.9
>  Attachments: MultiSearcherSort.patch, MultiSearcherSort.patch, TestMultiSearcher.patch
>
> Example: 
> Hits hits=multiSearcher.search(query);
> returns different scores for some documents than
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> (both for MultiSearcher and ParallelMultiSearcher)
> The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.
> Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.
> The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)
> However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).
> An example: if you use
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> for a MultiSearcher with two subsearchers, the first document will have score 1.0.
> The next documents from the same subsearcher will have decreasing scores.
> The first document from the other subsearcher will however have score 1.0 again !
> The same applies for other Sort objects, but it is less visible.
> I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-469?page=comments#action_12358176 ] 

Yonik Seeley commented on LUCENE-469:
-------------------------------------

I sent an email to dev on this a month ago:
http://www.mail-archive.com/java-dev@lucene.apache.org/msg02443.html

And I took little steps to make it easier to do the right thing (keeping track of the real maxScore and adding getMaxScore() to FieldSortedHitQueue)

My personal feeling is that the expert level search routines should not normalize the score, and there should be a maxScore property in TopDocs to allow normalization by other search methods (Hits).

> (Parallel-)MultiSearcher: using Sort object changes the scores
> --------------------------------------------------------------
>
>          Key: LUCENE-469
>          URL: http://issues.apache.org/jira/browse/LUCENE-469
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>  Environment: 21 november 2005, revision 345901
>     Reporter: Luc Vanlerberghe
>  Attachments: TestMultiSearcher.patch
>
> Example: 
> Hits hits=multiSearcher.search(query);
> returns different scores for some documents than
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> (both for MultiSearcher and ParallelMultiSearcher)
> The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.
> Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.
> The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)
> However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).
> An example: if you use
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> for a MultiSearcher with two subsearchers, the first document will have score 1.0.
> The next documents from the same subsearcher will have decreasing scores.
> The first document from the other subsearcher will however have score 1.0 again !
> The same applies for other Sort objects, but it is less visible.
> I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

Posted by "Luc Vanlerberghe (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-469?page=comments#action_12358233 ] 

Luc Vanlerberghe commented on LUCENE-469:
-----------------------------------------

I discovered the problem in my production system a while ago, assumed it would have been fixed in 1.9 and noticed it wasn't.

I created test cases to reproduce it and used them to find and eliminate the problem.  I didn't pay much attention to compatibility issues yet.

I backported the patches to 1.4.3 now, tested it and put it in production.
I'll post those 1.4.3 version of the patches later

- TopDocs and TopFieldDocs are indeed public.  I could add the old constructor again with a @deprecated tag that sets maxScore to 1.0

- Other Sorts: You are right, I made a mistake by concentrating too much on the Sort.RELEVANCE case.  A similar problem exists for ParallelMultiSearcher.

- I am also in favour of making maxScore private with public accessors.  I only made it public because the other members where public...

I'll post corrected patches later today...


> (Parallel-)MultiSearcher: using Sort object changes the scores
> --------------------------------------------------------------
>
>          Key: LUCENE-469
>          URL: http://issues.apache.org/jira/browse/LUCENE-469
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>  Environment: 21 november 2005, revision 345901
>     Reporter: Luc Vanlerberghe
>  Attachments: MultiSearcherSort.patch, TestMultiSearcher.patch
>
> Example: 
> Hits hits=multiSearcher.search(query);
> returns different scores for some documents than
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> (both for MultiSearcher and ParallelMultiSearcher)
> The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.
> Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.
> The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)
> However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).
> An example: if you use
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> for a MultiSearcher with two subsearchers, the first document will have score 1.0.
> The next documents from the same subsearcher will have decreasing scores.
> The first document from the other subsearcher will however have score 1.0 again !
> The same applies for other Sort objects, but it is less visible.
> I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-469) (Parallel-)MultiSearcher: using Sort object changes the scores

Posted by "Luc Vanlerberghe (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-469?page=all ]

Luc Vanlerberghe updated LUCENE-469:
------------------------------------

    Attachment: MultiSearcherSort.patch

My thoughts exactly!

For the expert routines, there's no mention in the javadoc about the scoring being applied, so nobody should depend on it.  If they did, well, they're experts, right? :)

This is my complete patch to correct the problem and I do indeed propose to add a field maxScore to TopDocs that is used in Hits to normalize the results at as the very last step.

All the TestCases pass after applying this patch.

> (Parallel-)MultiSearcher: using Sort object changes the scores
> --------------------------------------------------------------
>
>          Key: LUCENE-469
>          URL: http://issues.apache.org/jira/browse/LUCENE-469
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>  Environment: 21 november 2005, revision 345901
>     Reporter: Luc Vanlerberghe
>  Attachments: MultiSearcherSort.patch, TestMultiSearcher.patch
>
> Example: 
> Hits hits=multiSearcher.search(query);
> returns different scores for some documents than
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> (both for MultiSearcher and ParallelMultiSearcher)
> The documents returned will be the same and in the same order, but the scores in the second case will seem out of order.
> Inspecting the Explanation objects shows that the scores themselves are ok, but there's a bug in the normalization of the scores.
> The document with the highest score should have score 1.0, so all document scores are divided by the highest score.  (Assuming the highest score was>1.0)
> However, for MultiSearcher and ParallelMultiSearcher, this normalization factor is applied *per index*, before merging the results together (the merge itself is ok though).
> An example: if you use
> Hits hits=multiSearcher.search(query, Sort.RELEVANCE);
> for a MultiSearcher with two subsearchers, the first document will have score 1.0.
> The next documents from the same subsearcher will have decreasing scores.
> The first document from the other subsearcher will however have score 1.0 again !
> The same applies for other Sort objects, but it is less visible.
> I will post a TestCase demonstrating the problem and suggested patches to solve it in a moment...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org