Posted to dev@lucene.apache.org by "Hari Menon (JIRA)" <ji...@apache.org> on 2017/11/06 02:39:00 UTC

[jira] [Comment Edited] (LUCENE-8034) SpanNotWeight returns wrong results due to integer overflow

    [ https://issues.apache.org/jira/browse/LUCENE-8034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239836#comment-16239836 ] 

Hari Menon edited comment on LUCENE-8034 at 11/6/17 2:38 AM:
-------------------------------------------------------------

[~mikemccand] That's a good question, and actually something I could use help with. It would be awesome if you could let me know if there are any potential bottlenecks with the way I am trying to solve this problem. Let me know if I should instead post to the users@ mailing list. Here is the problem I am trying to solve:

I have to index documents of type A, which internally have sub-documents of type B. For example, A1 might contain sub-documents B11, B12, B13, etc., and A2 might contain B21, B22, B23, B24, and so on. In my search use case, I may want matches where all the search terms fall within a single B-document, or matches anywhere within a single A-document. In both cases I also need the IDs of the B-documents that matched. I know that my B-documents have a fixed maximum number of words (say 500). The way I am solving this right now is:
- Use A as the Lucene document to be indexed, with a field "text" containing text from the B sub-documents.
- The idea is to index B11 between positions 0 and 499, B12 from 1000 to 1499, B13 from 2000 to 2499, and so on. I am using PositionIncrementTokenStream to fix the positions.
- Then use SpanQueries with a slop of 500 if we want to search within B-documents, and a slop of Int.MAX_VALUE if we want to search in the entire A-document. Using SpanQuery also gives me easy access to the match position, which I can then divide by 1000 to get the index of the actual B-document. This is where I was trying to use max span of Int.MAX_VALUE and ran into this issue.
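
The position arithmetic in the list above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and not part of Lucene, the stride of 1000 is taken from the example positions, and the end position is assumed to be exclusive (one past the last matched token).

```java
// Illustrative only: SubDocPositions and its methods are hypothetical names,
// not Lucene APIs. The stride of 1000 mirrors the example positions above.
final class SubDocPositions {
    /** Distance between the first positions of consecutive B-documents. */
    static final int STRIDE = 1000;

    /** First token position of the B-document with the given zero-based index. */
    static int basePosition(int subDocIndex) {
        // multiplyExact throws ArithmeticException instead of silently wrapping.
        return Math.multiplyExact(subDocIndex, STRIDE);
    }

    /**
     * Position increment needed on the first token of the next B-document,
     * given the position of the last token emitted so far.
     */
    static int incrementToNext(int lastPosition, int nextSubDocIndex) {
        return basePosition(nextSubDocIndex) - lastPosition;
    }

    /** Zero-based index of the B-document that contains the given position. */
    static int subDocIndex(int position) {
        return position / STRIDE;
    }

    /** True if a match [startPosition, endPosition) lies inside one B-document. */
    static boolean withinOneSubDoc(int startPosition, int endPosition) {
        return subDocIndex(startPosition) == subDocIndex(endPosition - 1);
    }
}
```

For instance, if B11 ends up with 300 tokens (positions 0 to 299), the first token of B12 would need an increment of 701 to land on position 1000.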

Does this make sense? Let me know if you see any gaping holes or performance issues with this approach. I am still new to Lucene and haven't done a full performance benchmark of this approach, as I am still building a prototype.

[~rcmuir] Will it affect scores? I think it will just not select the given record, right?


was (Author: hshankar):
[~mikemccand] That's a good question, and actually something I could use help with. It would be awesome if you could let me know if there are any potential bottlenecks with the way I am trying to solve this problem. Let me know if I should instead post to the users@ mailing list. Here is the problem I am trying to solve:

I have to index documents of type A, which internally have sub-documents of type B. e.g A1 might contain sub-documents B11, B12, B13 etc. A2 can contain B21, B22, B23, B24 and so on. My search use case is such that I might want to have matches where all the search terms are within a particular B-document, or it could be within a particular A-document. Besides, I need the B-document Ids that matched in both cases. I know that my B-documents have a fixed max. number of words (say 500). The way I am solving this right now is:
- Use A as the lucene document to be indexed, with a field "text" containing text from the B sub-documents.
- The idea is to index B11 between position 0 and 499, B12 from 1000 to 1499, B13 from 2000 to 2499 and so on. I am using PositionIncrementTokenStream to fix the positions.
- Then use SpanQueries with max span of 500 if we want to search within B-documents, and max span of Int.MAX_VALUE if we want to search in the entire A-document. Using SpanQuery also gives me easy access to position, which I can then divide by 500 to get the index of the actual B-document. This is where I was trying to use max span of Int.MAX_VALUE and ran into this issue.

Does this make sense? Let me know if you see any gaping holes or perf issues with this approach. I am still new to lucene and haven't done a full perf benchmark with this approach as I am still building a prototype.

[~rcmuir] Will it affect scores? I think it will just not select the given record, right?

> SpanNotWeight returns wrong results due to integer overflow
> -----------------------------------------------------------
>
>                 Key: LUCENE-8034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8034
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring, core/search
>            Reporter: Hari Menon
>            Priority: Minor
>              Labels: newbie, patch
>         Attachments: LUCENE-8034.patch
>
>
> In SpanNotQuery, there is an acceptance condition:
> {code:java}
> if (candidate.endPosition() + post <= excludeSpans.startPosition()) {
>     return AcceptStatus.YES;
> }
> {code}
> This overflows when `candidate.endPosition() + post > Integer.MAX_VALUE`. I have a fix for this which I am working on. Basically I am flipping the add to a subtract.
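
The overflow described in the issue can be reproduced with plain `int` arithmetic, without Lucene. The sketch below is not the attached patch. It shows one possible way of "flipping the add to a subtract", assuming all three values are non-negative, and the class and method names are hypothetical.

```java
// Illustrative only: SpanNotOverflow and its methods are hypothetical names.
// The parameters stand in for candidate.endPosition(), post, and
// excludeSpans.startPosition() from the snippet in the issue.
final class SpanNotOverflow {
    /** Original form: candidateEnd + post wraps to a negative int when the sum exceeds Integer.MAX_VALUE. */
    static boolean acceptsWithAdd(int candidateEnd, int post, int excludeStart) {
        return candidateEnd + post <= excludeStart;
    }

    /**
     * Rearranged form: for non-negative inputs, excludeStart - post always
     * stays within the int range, so the comparison cannot overflow.
     */
    static boolean acceptsWithSubtract(int candidateEnd, int post, int excludeStart) {
        return candidateEnd <= excludeStart - post;
    }

    public static void main(String[] args) {
        int post = Integer.MAX_VALUE;
        // Mathematically 10 + post is far greater than 5, so the candidate must not be accepted.
        System.out.println(acceptsWithAdd(10, post, 5));      // true, because the sum wrapped around
        System.out.println(acceptsWithSubtract(10, post, 5)); // false
    }
}
```

For small values the two forms agree. They diverge only once the sum no longer fits in an `int`.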


