Posted to issues@lucene.apache.org by "Brian Coverstone (Jira)" <ji...@apache.org> on 2020/09/03 22:44:00 UTC

[jira] [Comment Edited] (LUCENE-9418) Ordered intervals can give inaccurate hits on interleaved terms

    [ https://issues.apache.org/jira/browse/LUCENE-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190420#comment-17190420 ] 

Brian Coverstone edited comment on LUCENE-9418 at 9/3/20, 10:43 PM:
--------------------------------------------------------------------

I believe this may still be an issue in 8.6.0: the last slot of the returned hits often contains an incorrect record.

I found a workaround, which is to always request one more record than needed.

Here is some pseudocode to demonstrate:
{quote}
ComplexPhraseQueryParser cpqp = new ComplexPhraseQueryParser("somefield", analyzer);
Query query = cpqp.parse("somevalue");

int pageSize = 10;
int pageNum = 1;
int requestedRecords = pageSize * pageNum + 1; // +1 workaround
int startOffset = (pageNum - 1) * pageSize;

FieldComparatorSource fsc = new FieldComparatorSource() {
    @Override
    public FieldComparator<String> newComparator(String fieldname, int numhits, int sortPos, boolean reversed) {
        return new StringValComparatorIgnoreCase(numhits, fieldname);
    }
};

Sort sort = new Sort(new SortField("firstname", fsc, false));
IndexSearcher searcher = new IndexSearcher(reader);
TopFieldCollector tfcollector = TopFieldCollector.create(sort, requestedRecords + 1, Integer.MAX_VALUE);
searcher.search(query, tfcollector);
ScoreDoc[] hits = tfcollector.topDocs(startOffset, pageSize).scoreDocs;
{quote}
At this point "hits" is correct. However, if I remove the "+1" from requestedRecords above, the last item in "hits" is often incorrect.
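
The paging arithmetic in the snippet above can be isolated from Lucene and checked on its own. The following is only a minimal sketch: the class name PageWindow and the extraHits parameter are hypothetical and do not appear in the original code. An extraHits value of 1 corresponds to the "select 1 more than needed" workaround; note that the quoted snippet adds 1 in two places, so it actually collects two extra hits.

```java
/** Hypothetical helper: how many hits to collect and where a page starts. */
final class PageWindow {
    final int hitsToCollect; // value that would be passed to the collector as numHits
    final int startOffset;   // index of the first hit on the requested page

    PageWindow(int pageNum, int pageSize, int extraHits) {
        if (pageNum < 1 || pageSize < 1 || extraHits < 0) {
            throw new IllegalArgumentException("pageNum and pageSize must be >= 1, extraHits >= 0");
        }
        // Pages are 1-based, so page 1 starts at offset 0.
        this.startOffset = (pageNum - 1) * pageSize;
        // Collect every hit up to the end of the page, plus the over-request.
        this.hitsToCollect = pageNum * pageSize + extraHits;
    }

    public static void main(String[] args) {
        PageWindow w = new PageWindow(1, 10, 1);
        System.out.println(w.hitsToCollect + " hits collected, page starts at " + w.startOffset);
    }
}
```

Under these assumptions, the first page of 10 collects 11 hits and then reads 10 of them starting at offset 0, so the possibly incorrect last slot is never shown.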

 



> Ordered intervals can give inaccurate hits on interleaved terms
> ---------------------------------------------------------------
>
>                 Key: LUCENE-9418
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9418
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 8.6
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given the text 'A B A C', an ordered interval over 'A B C' will return the inaccurate interval [2, 3], due to the way minimization is handled after matches are found.
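
To make the example in the description concrete, here is a brute-force sketch of what an ordered interval over 'A B C' should match in 'A B A C'. It is not Lucene's interval algorithm and uses no Lucene classes; the class and method names are hypothetical, and positions are assumed to be 0-based, which is consistent with the reported [2, 3].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical brute-force reference for minimal ordered intervals. */
final class OrderedIntervalSketch {

    /** Returns minimal [start, end] pairs in which the terms occur in the given order. */
    static List<int[]> minimalOrderedIntervals(String[] tokens, String[] terms) {
        List<int[]> candidates = new ArrayList<>();
        if (terms.length == 0) {
            return candidates;
        }
        for (int start = 0; start < tokens.length; start++) {
            if (!tokens[start].equals(terms[0])) {
                continue;
            }
            int end = start;
            int matched = 1;
            // Greedily take the earliest later occurrence of each following term.
            for (int i = start + 1; i < tokens.length && matched < terms.length; i++) {
                if (tokens[i].equals(terms[matched])) {
                    end = i;
                    matched++;
                }
            }
            if (matched == terms.length) {
                candidates.add(new int[] {start, end});
            }
        }
        // Minimization: drop any candidate that contains another candidate.
        List<int[]> minimal = new ArrayList<>();
        for (int[] a : candidates) {
            boolean containsOther = false;
            for (int[] b : candidates) {
                if (a != b && a[0] <= b[0] && b[1] <= a[1]) {
                    containsOther = true;
                }
            }
            if (!containsOther) {
                minimal.add(a);
            }
        }
        return minimal;
    }

    public static void main(String[] args) {
        String[] tokens = {"A", "B", "A", "C"};
        String[] terms = {"A", "B", "C"};
        for (int[] interval : minimalOrderedIntervals(tokens, terms)) {
            System.out.println(Arrays.toString(interval));
        }
    }
}
```

In this sketch the only match is [0, 3]: the 'A' at position 2 has no 'B' after it, so [2, 3] is not a valid ordered interval for 'A B C'.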


