Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2014/12/26 06:16:13 UTC

[jira] [Commented] (LUCENE-6139) TokenGroup.getStart|EndOffset should return matchStart|EndOffset not start|endOffset

    [ https://issues.apache.org/jira/browse/LUCENE-6139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258931#comment-14258931 ] 

David Smiley commented on LUCENE-6139:
--------------------------------------

I propose that TokenGroup's fields become private and that Highlighter access them via its getters -- the ones it already has, actually; no need for more.

This raises the question of whether the distinction between "matchStartOffset" and "startOffset" (and the "end" variants) serves any purpose.  That is, toss startOffset (& endOffset), then rename matchStartOffset (& matchEndOffset) to startOffset (& endOffset). They aren't used, and I doubt others use them, because I think the offset info, when needed, is accessed at the end via TextFragment (populated from TokenGroup.matchStartOffset & matchEndOffset).  FYI I didn't go that route because I want *all* matches, and from an efficiency standpoint I found the custom Formatter approach more appealing than passing a very large numFragments.
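
For illustration only, the proposal could be sketched roughly as follows. SketchTokenGroup and CollectingFormatter are hypothetical, simplified stand-ins and not Lucene's actual classes; in particular, the way the match span is tracked in addToken is an assumption made for the example. The sketch keeps a single pair of private offsets that follow the scoring tokens, exposes them only through getters, and uses a formatter-style callback to collect the offsets of every match.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for TokenGroup; not the Lucene class. */
final class SketchTokenGroup {
  // Private fields: code outside this class has to use the getters.
  private int startOffset;
  private int endOffset;
  private float totalScore;
  private int numTokens;

  /** Adds one token; the span follows scoring tokens once any token has scored. */
  void addToken(int termStart, int termEnd, float score) {
    if (numTokens == 0) {
      // The first token seeds the span, whether or not it scores.
      startOffset = termStart;
      endOffset = termEnd;
    } else if (score > 0) {
      if (totalScore == 0) {
        // First scoring token: the span becomes exactly this token.
        startOffset = termStart;
        endOffset = termEnd;
      } else {
        // Later scoring tokens widen the span.
        startOffset = Math.min(startOffset, termStart);
        endOffset = Math.max(endOffset, termEnd);
      }
    }
    totalScore += score;
    numTokens++;
  }

  int getStartOffset() { return startOffset; }
  int getEndOffset() { return endOffset; }
  float getTotalScore() { return totalScore; }
}

/** Hypothetical formatter-style callback that records the offsets of every match. */
final class CollectingFormatter {
  final List<int[]> matches = new ArrayList<>();

  String highlightTerm(String originalText, SketchTokenGroup group) {
    if (group.getTotalScore() > 0) {
      matches.add(new int[] {group.getStartOffset(), group.getEndOffset()});
    }
    return originalText; // the text itself is left unchanged in this sketch
  }
}
```

With only one pair of offsets there is nothing left for a getter to return by mistake, which is the point of dropping the startOffset vs. matchStartOffset distinction.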

h4. Unrelated questions about Highlighter
Not directly related to this are a couple of burning questions I have about Highlighter:
* Why oh why does Highlighter call formatter.highlightTerm for essentially *every* token?  If TokenGroup.getTotalScore() is 0, I argue it shouldn't. All the built-in Formatters (and one I just wrote) start with a zero score short-circuit.  
* Why does a 0-score fragment remain a part of the fragments priority queue; why isn't it tossed out when the fragment closes out?  One might argue it's needless when numFragments (which is the size of the PQ) is small, but it'd be nice to ask for 'all' fragments/matches without a huge PQ even if there is just one real match.
* Why is all text run through the encoder and appended to a "newText" StringBuilder, even when the fragment has no score?  If there's no point, then it's a waste to do it and then not use it, as it won't be a part of a returned fragment.  Again, I think 0-score fragments should be immediately dropped, and newText should only be for the current fragment.
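
The behavior these three questions argue for can be sketched as below. This is a hypothetical illustration, not Highlighter's actual code: ZeroScoreSketch, Frag, highlight, and the fixed tokens-per-fragment rule are all assumptions made to keep the example short, and the encoder step is omitted. It skips the formatter call for zero-score tokens, keeps a text buffer for the current fragment only, and drops zero-score fragments as soon as they close, so the priority queue holds only real matches.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch; none of these names are Lucene's. */
final class ZeroScoreSketch {
  static final class Frag {
    final String text;
    final float score;
    Frag(String text, float score) { this.text = text; this.score = score; }
  }

  int formatterCalls = 0;

  /** Stand-in for a formatter call; only reached for scoring tokens. */
  private String highlight(String token) {
    formatterCalls++;
    return "<B>" + token + "</B>";
  }

  /** Returns up to maxFragments scoring fragments, best first. */
  List<Frag> bestFragments(String[] tokens, float[] scores,
                           int tokensPerFragment, int maxFragments) {
    // Min-heap on score, so the weakest kept fragment is evicted first.
    PriorityQueue<Frag> pq =
        new PriorityQueue<>(Comparator.comparingDouble((Frag f) -> f.score));
    StringBuilder current = new StringBuilder(); // text of the current fragment only
    float currentScore = 0;
    for (int i = 0; i < tokens.length; i++) {
      if (current.length() > 0) current.append(' ');
      // Zero-score short circuit: non-matching tokens bypass the formatter.
      current.append(scores[i] > 0 ? highlight(tokens[i]) : tokens[i]);
      currentScore += scores[i];
      boolean closes = (i + 1) % tokensPerFragment == 0 || i == tokens.length - 1;
      if (closes) {
        if (currentScore > 0) { // zero-score fragments never enter the queue
          pq.add(new Frag(current.toString(), currentScore));
          if (pq.size() > maxFragments) pq.poll();
        }
        current.setLength(0); // start the next fragment with an empty buffer
        currentScore = 0;
      }
    }
    List<Frag> result = new ArrayList<>(pq);
    result.sort(Comparator.comparingDouble((Frag f) -> f.score).reversed());
    return result;
  }
}
```

Under these assumptions, a caller can pass a very large maxFragments to ask for all matches, and the queue still grows only with the number of scoring fragments.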

> TokenGroup.getStart|EndOffset should return matchStart|EndOffset not start|endOffset
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6139
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6139
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: David Smiley
>
> The default highlighter has a TokenGroup class that is passed to Formatter.highlightTerm().  TokenGroup also has getStartOffset() and getEndOffset() methods that ostensibly return the start and end offsets into the original text of the current term.  These getters aren't called by Lucene or Solr but they are made available and are useful to me.  _The problem is that they return the wrong offsets when there are tokens at the same position._  I believe this was an oversight of LUCENE-627 in which these getters should have been updated but weren't.  The fix is simple: return matchStartOffset and matchEndOffset from these getters, not startOffset and endOffset.  I think this oversight would not have occurred if Highlighter didn't have package-access to TokenGroup's fields.


