Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/12 13:33:31 UTC

[GitHub] [lucene] mmatela opened a new issue, #12080: SynonymGraphFilter: wrong output token position when input positions overlap

mmatela opened a new issue, #12080:
URL: https://github.com/apache/lucene/issues/12080

   ### Description
   
   In my example, the query is 'test polskie'.
   I use MorfologikFilter for Polish stemming; it turns 'polskie' into 'polski' + 'polskie'.
   I also use SynonymGraphFilter, which turns 'polski' into 'pol'. It's applied **only at query time**.
   Here's what I see in query analysis (token positions in parentheses):
   ```
   Tokenizer: test(1) polskie(2)
   MF: test(1) polskie(2) polski(2)
   SGF: test(1) polskie(2) pol(3) polski(3).
   ```
   When I search for "test polskie" with quotation marks, a document with the same text doesn't match, because SGF changes the positions of the query tokens relative to the positions stored in the index.
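
   To make the mismatch concrete, here is a minimal sketch of positional phrase matching. It is a simplified model, not Lucene code: the class `PhrasePositionSketch` and its helpers are hypothetical names, and it assumes the index-time positions are the MF output shown above (SGF being query-only). Terms that share a position are treated as alternatives at that position.

   ```java
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;
   import java.util.TreeMap;

   // Hypothetical, simplified model of positional phrase matching; not Lucene code.
   class PhrasePositionSketch {

       /** A term with its absolute position, as in the analysis output above. */
       static final class Token {
           final String term;
           final int position;
           Token(String term, int position) { this.term = term; this.position = position; }
       }

       static Token tok(String term, int position) { return new Token(term, position); }

       /** Groups terms by position; terms sharing a position are treated as alternatives. */
       static TreeMap<Integer, Set<String>> byPosition(List<Token> tokens) {
           TreeMap<Integer, Set<String>> map = new TreeMap<>();
           for (Token t : tokens) {
               map.computeIfAbsent(t.position, k -> new HashSet<>()).add(t.term);
           }
           return map;
       }

       /** True if, for some shift, every query position finds one of its terms in the document. */
       static boolean phraseMatches(List<Token> indexed, List<Token> query) {
           TreeMap<Integer, Set<String>> doc = byPosition(indexed);
           TreeMap<Integer, Set<String>> q = byPosition(query);
           if (q.isEmpty()) return false;
           int first = q.firstKey();
           for (int start : doc.keySet()) {
               boolean all = true;
               for (Map.Entry<Integer, Set<String>> e : q.entrySet()) {
                   int docPos = start + (e.getKey() - first);
                   Set<String> docTerms = doc.getOrDefault(docPos, Collections.<String>emptySet());
                   if (Collections.disjoint(docTerms, e.getValue())) { all = false; break; }
               }
               if (all) return true;
           }
           return false;
       }

       public static void main(String[] args) {
           // Assumed index-time analysis (MF only): test(1) polskie(2) polski(2)
           List<Token> indexed = Arrays.asList(tok("test", 1), tok("polskie", 2), tok("polski", 2));
           // Query-time output reported above: test(1) polskie(2) pol(3) polski(3)
           List<Token> reported = Arrays.asList(
               tok("test", 1), tok("polskie", 2), tok("pol", 3), tok("polski", 3));
           // Output with synonyms stacked on the original token: test(1) polskie(2) pol(2) polski(2)
           List<Token> stacked = Arrays.asList(
               tok("test", 1), tok("polskie", 2), tok("pol", 2), tok("polski", 2));

           System.out.println(phraseMatches(indexed, reported)); // prints "false"
           System.out.println(phraseMatches(indexed, stacked));  // prints "true"
       }
   }
   ```

   In this model the reported query needs a third position holding 'pol' or 'polski', which the two-position document cannot supply; with all three terms stacked at position 2, the phrase lines up.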
   
   In the documentation, the description of the old `SynonymFilter` says "_The position value of the new tokens are set such they all occur at the same position as the original token._" In `SynonymGraphFilter`, they are instead set to a position after the previous token. Is that an intentional change? It doesn't seem so, because it doesn't work as expected in my example.
   
   Looking at the code, it seems the problem is in https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymGraphFilter.java#L246:
   `nextNodeOut = lastNodeOut + posLenAtt.getPositionLength();`
   
   `nextNodeOut` is always set to the position after the current token, and that value is later used as the position of the output token.
   I tried removing this line and instead setting the field right after the call to `input.incrementToken()`, in line 340:
   `nextNodeOut = lastNodeOut + posIncrAtt.getPositionIncrement();`
   This sets it to the original token's position. This way the final positions are `SGF: test(1) polskie(2) pol(2) polski(2).` and my document does match. I didn't experience any unexpected side effects.
   
   Hope this helps. I'm not familiar enough with the project to easily submit a proper pull request, with tests and all.
   
   ### Version and environment details
   
   lucene 9.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mmatela commented on issue #12080: SynonymGraphFilter: wrong output token position when input positions overlap

Posted by "mmatela (via GitHub)" <gi...@apache.org>.
mmatela commented on issue #12080:
URL: https://github.com/apache/lucene/issues/12080#issuecomment-1434301348

   It turns out my initial solution led to exceptions when a synonym appears at the beginning of the query or when there are more tokens after the synonym. After some trial and error, it seems to work correctly with these changes: https://github.com/mmatela/lucene/commit/1d2df64e09cfbc89d42511274530248fe559befb

