Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2017/03/01 12:43:45 UTC

[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars

    [ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890090#comment-15890090 ] 

Michael McCandless commented on LUCENE-7719:
--------------------------------------------

Wow, this is a great catch [~dmitrymalinin]!  Thank you for opening the precursor issue.

{{AutomatonQuery.getAutomaton}} really must return a UTF8-oriented
automaton because that matches how the terms are indexed into Lucene,
and what the automaton will be intersected with, to run the query.

We should fix the javadocs to say this.

And it is sort of annoying that these differences are not strongly
typed, but the {{Automaton}} class is really agnostic to what ints you are
putting onto its transitions.

But, yeah, for highlighting, we are operating in UTF16 space, and so I
think we need some way to have the {{CharacterRunAutomaton}} interface
on top of a UTF8 automaton?  Maybe we should abstract out a separate
interface that {{MultiTermHighlighting}} would use?  It seems it only
uses the {{run}} method, to test if a given term is accepted?  And
then, as you suggested, we could easily convert the incoming char[] to
a UTF8 BytesRef and call {{ByteRunAutomaton.run}} on that.
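
To make the mismatch concrete, here is a rough sketch using only the JDK,
not Lucene's classes. {{BytePrefixMatcher}} is a hypothetical stand-in for
a UTF8-oriented automaton (it only accepts inputs that start with a fixed
prefix), and {{runOnChars}} / {{runOnUtf8}} are illustrative names, not
existing APIs. It shows why feeding UTF16 chars to byte-oriented
transitions works for ASCII but fails for a multi-byte char, and how
encoding the char[] to UTF8 first avoids that:

```java
import java.nio.charset.StandardCharsets;

final class Utf8RunSketch {

    /** Hypothetical stand-in for a byte-oriented automaton: accepts any input starting with a UTF-8 prefix. */
    static final class BytePrefixMatcher {
        private final int[] prefix; // unsigned UTF-8 byte values, 0..255

        BytePrefixMatcher(String prefix) {
            this.prefix = toUnsigned(prefix.getBytes(StandardCharsets.UTF_8));
        }

        /** Compares labels one at a time, the way an automaton walks its transitions. */
        boolean run(int[] labels) {
            if (labels.length < prefix.length) {
                return false;
            }
            for (int i = 0; i < prefix.length; i++) {
                if (labels[i] != prefix[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    /** Problematic path: UTF-16 code units are used directly as transition labels. */
    static boolean runOnChars(BytePrefixMatcher m, char[] term, int offset, int length) {
        int[] labels = new int[length];
        for (int i = 0; i < length; i++) {
            labels[i] = term[offset + i];
        }
        return m.run(labels);
    }

    /** Suggested path: encode the chars to UTF-8 first, then run the byte-oriented matcher. */
    static boolean runOnUtf8(BytePrefixMatcher m, char[] term, int offset, int length) {
        byte[] utf8 = new String(term, offset, length).getBytes(StandardCharsets.UTF_8);
        return m.run(toUnsigned(utf8));
    }

    public static void main(String[] args) {
        char[] ascii = "category".toCharArray();
        char[] accented = "caf\u00e9s".toCharArray(); // contains a two-byte UTF-8 char

        BytePrefixMatcher cat = new BytePrefixMatcher("cat");
        BytePrefixMatcher cafe = new BytePrefixMatcher("caf\u00e9");

        System.out.println(runOnChars(cat, ascii, 0, ascii.length));      // ASCII: both paths agree
        System.out.println(runOnUtf8(cat, ascii, 0, ascii.length));
        System.out.println(runOnChars(cafe, accented, 0, accented.length)); // char path misses the match
        System.out.println(runOnUtf8(cafe, accented, 0, accented.length));  // UTF-8 path finds it
    }
}
```

In Lucene itself the encoding step would presumably produce a BytesRef
that is handed to {{ByteRunAutomaton.run}}; the sketch only illustrates
the char-versus-byte label mismatch, not the actual highlighter code.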


> UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-7719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7719
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: David Smiley
>
> In MultiTermHighlighting, a CharacterRunAutomaton is being created from the result of AutomatonQuery.getAutomaton, which in turn is byte oriented, not character oriented.  For ASCII terms, this is safe but it's not for multi-byte characters.  This is most likely going to rear its head with a WildcardQuery, but due to special casing in MultiTermHighlighting, PrefixQuery isn't affected.  Nonetheless it'd be nice to get a general fix in so that MultiTermHighlighting can remove special cases for PrefixQuery and TermRangeQuery (both subclass AutomatonQuery).
> AFAICT, this bug was likely in the PostingsHighlighter since inception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)