Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/09/21 20:20:10 UTC

[GitHub] [lucene-solr] dsmiley commented on pull request #1740: LUCENE-9458: WDGF and WDF should tie-break by endOffset

dsmiley commented on pull request #1740:
URL: https://github.com/apache/lucene-solr/pull/1740#issuecomment-696351031

Sorry for the delay; I've had a solution locally for over a month but hadn't shared it. I'm pretty comfortable with what I just pushed.

I did some "fuzz testing" to determine whether the token ordering changed for the same input when only the `compare` logic varied. It revealed that I could simplify WDGF's `compare` logic to look only at the start and end offsets. However, WDF's `compare` failed this exercise: my change introduced a new position in some cases, although not in the ones I explicitly tested for. WDF is more complicated (to me anyway), and furthermore WDF is deprecated, so I'm not motivated to disturb the logic there.
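For readers following along, here is a minimal standalone sketch of an offset-only ordering of the kind described: ascending start offset, with ties broken by descending end offset so that the longer token comes first. It is not the WDGF source. The `Span` class, the class name `OffsetTieBreak`, and the choice of ascending start offset as the primary key are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OffsetTieBreak {
    /** Hypothetical stand-in for a buffered sub-token: only its offsets. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Ascending start offset; on a tie, descending end offset (longer token first). */
    static final Comparator<Span> BY_START_THEN_LONGEST = (a, b) -> {
        int c = Integer.compare(a.start, b.start);
        return c != 0 ? c : Integer.compare(b.end, a.end);
    };

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Span> sorted(List<Span> in) {
        List<Span> out = new ArrayList<>(in);
        out.sort(BY_START_THEN_LONGEST);
        return out;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(new Span(0, 3), new Span(4, 7), new Span(0, 7));
        System.out.println(sorted(spans)); // prints [[0,7), [0,3), [4,7)]
    }
}
```

The two spans starting at offset 0 tie on the primary key, so the one ending at 7 sorts ahead of the one ending at 3.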

I added the fuzz testing here as a comment. It's not really committable as a live test because it requires modifying WDGF's `compare` to actually do two compares and assert that they give the same result. Without that, there is no assertion being checked -- no comparison.
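The "two compares, assert the same result" idea can be sketched outside of Lucene roughly as follows. This is an illustration only, not the test from the PR. Both comparators, the `int[] {start, end}` span encoding, and the random span generator are hypothetical; the reference comparator breaks ties by descending length, which is equivalent to descending end offset when start offsets are equal.

```java
import java.util.Comparator;
import java.util.Random;

public class CompareFuzz {
    /** Simplified candidate: start ascending, then end descending. */
    static final Comparator<int[]> SIMPLE = (a, b) -> {
        int c = Integer.compare(a[0], b[0]);
        return c != 0 ? c : Integer.compare(b[1], a[1]);
    };

    /** Hypothetical reference: start ascending, then length descending. */
    static final Comparator<int[]> REFERENCE = (a, b) -> {
        int c = Integer.compare(a[0], b[0]);
        return c != 0 ? c : Integer.compare(b[1] - b[0], a[1] - a[0]);
    };

    /** Random non-empty span {start, end} drawn from a small range so ties are common. */
    static int[] randomSpan(Random random) {
        int start = random.nextInt(8);
        int end = start + 1 + random.nextInt(8);
        return new int[] {start, end};
    }

    /** Counts random pairs on which the two comparators disagree in sign. */
    static int countDisagreements(Comparator<int[]> x, Comparator<int[]> y,
                                  long seed, int iterations) {
        Random random = new Random(seed);
        int disagreements = 0;
        for (int i = 0; i < iterations; i++) {
            int[] a = randomSpan(random);
            int[] b = randomSpan(random);
            if (Integer.signum(x.compare(a, b)) != Integer.signum(y.compare(a, b))) {
                disagreements++;
            }
        }
        return disagreements;
    }

    public static void main(String[] args) {
        System.out.println(countDisagreements(SIMPLE, REFERENCE, 42L, 100_000)); // prints 0
    }
}
```

A nonzero count would signal that the simplification changes ordering for some inputs, which is the kind of discrepancy the exercise surfaced for WDF.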

Suggested CHANGES.txt entry, under Improvements:
"WordDelimiterGraphFilter should tie-break by endOffset to emit longer tokens first. The same graph is produced."
