Posted to dev@lucene.apache.org by "Jim Ferenczi (JIRA)" <ji...@apache.org> on 2017/01/06 11:28:58 UTC

[jira] [Commented] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper

    [ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804338#comment-15804338 ] 

Jim Ferenczi commented on LUCENE-7620:
--------------------------------------

It looks good [~dsmiley]! I started working on something similar but got caught up in something else ;)
Though I wonder whether we should also break a sentence if it's too long? Maybe the wrapped BreakIterator could always be a sentence one, and we could use a word BreakIterator to cut sentences that are too long? This way it would produce snippets similar to those of the SimpleFragmenter.
This could also be done in another BreakIterator on top of this one, but that would overcomplicate things, I guess.
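
To make the "cut long sentences at word boundaries" idea concrete, here is a minimal sketch using only java.text.BreakIterator. The class name LongSentenceSplitter and the maxLength parameter are hypothetical and are not part of the attached patch; a real implementation would have to live inside a BreakIterator subclass rather than a static helper.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration only; not the class from the patch. */
class LongSentenceSplitter {

  /** Returns sentence boundaries, plus word-boundary cuts inside sentences longer than maxLength. */
  static List<Integer> boundaries(String text, int maxLength) {
    if (maxLength < 1) {
      throw new IllegalArgumentException("maxLength must be >= 1");
    }
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
    sentences.setText(text);
    words.setText(text);

    List<Integer> result = new ArrayList<>();
    int start = sentences.first();
    result.add(start);
    for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
      // While the remainder of this sentence is too long, cut at a word boundary.
      while (end - start > maxLength) {
        // Last word boundary at or before start + maxLength.
        int cut = words.preceding(start + maxLength + 1);
        if (cut == BreakIterator.DONE || cut <= start) {
          // No usable word boundary (one very long token): hard cut.
          // This simplification may split a surrogate pair.
          cut = start + maxLength;
        }
        result.add(cut);
        start = cut;
      }
      result.add(end);
      start = end;
    }
    return result;
  }
}
```

With this approach no passage exceeds maxLength characters, and the cuts fall on word boundaries whenever one is available.
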
For the implementation, can you throw an exception from the methods that should not be called? For instance, {noformat}next(n){noformat} cannot be implemented efficiently (you need to start from 0 to find the Nth boundary), but it currently returns the Nth boundary of the wrapped break iterator. I think it's better to throw an exception so that it is obvious that some methods should not be called.
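
As a rough sketch of what I mean, assuming a plain delegating wrapper (the class name StrictBreakIteratorWrapper is hypothetical, and the length-goal logic itself is left out):

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;

/** Hypothetical delegating wrapper for illustration; not the class from the patch. */
class StrictBreakIteratorWrapper extends BreakIterator {
  private final BreakIterator delegate;

  StrictBreakIteratorWrapper(BreakIterator delegate) {
    this.delegate = delegate;
  }

  // Finding the Nth boundary would require re-scanning from the start,
  // so fail loudly instead of silently returning the delegate's answer.
  @Override
  public int next(int n) {
    throw new UnsupportedOperationException("next(n) is not supported by this wrapper");
  }

  // Everything else simply delegates in this sketch.
  @Override public int first() { return delegate.first(); }
  @Override public int last() { return delegate.last(); }
  @Override public int next() { return delegate.next(); }
  @Override public int previous() { return delegate.previous(); }
  @Override public int following(int offset) { return delegate.following(offset); }
  @Override public int current() { return delegate.current(); }
  @Override public CharacterIterator getText() { return delegate.getText(); }
  @Override public void setText(CharacterIterator newText) { delegate.setText(newText); }
}
```

Any caller that relies on next(n) then fails immediately instead of getting boundaries that ignore the length goal.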

Additionally, I think we should have a way to change the start and end of a passage once we know all the matches it contains. This is what the FVH does, and it should be doable in the UH because passages are created on the fly in a forward manner. This is of course not the purpose of this issue and should be treated as a new feature, but I think it would be great to have the same output as the FVH when the max length of the passage is set.




> UnifiedHighlighter: add target character width BreakIterator wrapper
> --------------------------------------------------------------------
>
>                 Key: LUCENE-7620
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7620
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates fragments (aka Passages) by a character width.  The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.  It's useful in its own right and of course it helps users transition to the UH.  I'd like to do it as a wrapper to another BreakIterator -- perhaps a sentence one.  In this way you get back Passages that are a number of sentences so they will look nice instead of breaking mid-way through a sentence.  And you get some control by specifying a target number of characters.  This BreakIterator wouldn't be a general purpose java.text.BreakIterator since it would assume it's called in a manner exactly as the UnifiedHighlighter uses it.  It would probably be compatible with the PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your BreakIterator config.
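
The target-width idea described above can be sketched with java.text.BreakIterator alone: each passage is extended to the first sentence boundary at or beyond the target length. The class name LengthGoalSketch and the targetLength parameter are hypothetical. This is only an illustration of the grouping rule under that assumption, not the code in the attached patch, which wraps another BreakIterator instead of returning strings.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of grouping whole sentences up to a target width. */
class LengthGoalSketch {

  /** Splits text into passages of whole sentences, each at least targetLength chars (except the last). */
  static List<String> passages(String text, int targetLength) {
    if (targetLength < 1) {
      throw new IllegalArgumentException("targetLength must be >= 1");
    }
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    sentences.setText(text);

    List<String> result = new ArrayList<>();
    int start = 0;
    while (start < text.length()) {
      int end;
      if (text.length() - start <= targetLength) {
        // The remaining text does not exceed the target: take it all.
        end = text.length();
      } else {
        // First sentence boundary at or after start + targetLength.
        end = sentences.following(start + targetLength - 1);
        if (end == BreakIterator.DONE) {
          end = text.length();
        }
      }
      result.add(text.substring(start, end));
      start = end;
    }
    return result;
  }
}
```

Passages produced this way never break mid-sentence, at the cost of overshooting the target by up to one sentence.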


