Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2015/04/03 19:38:59 UTC

[jira] [Updated] (LUCENE-6392) Add offset limit to Highlighter's TokenStreamFromTermVector

     [ https://issues.apache.org/jira/browse/LUCENE-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-6392:
---------------------------------
    Attachment: LUCENE-6392_highlight_term_vector_maxStartOffset.patch

(Patch attached).
Elaborating on the description:

This patch also tweaks the TokenLL[] array size initialization to take the new limit into account when guessing a good size.
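As a rough sketch of how a size guess might be capped by the limit (the heuristic constant and method name here are hypothetical, not the patch's actual code):

```java
// Hypothetical sketch: guess a TokenLL[] size from the term-frequency sum,
// capped by the start-offset limit when one is configured (-1 = no limit).
public class TokenArraySizeSketch {
    static final int AVG_CHARS_PER_TOKEN = 7; // assumed average token + gap length

    static int guessArraySize(int totalTermFreq, int maxStartOffset) {
        int guess = Math.max(totalTermFreq, 16); // floor to avoid tiny arrays
        if (maxStartOffset >= 0) {
            // no token can start past the limit, so cap the guess accordingly
            guess = Math.min(guess, maxStartOffset / AVG_CHARS_PER_TOKEN + 1);
        }
        return guess;
    }
}
```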

This patch includes memory-saving optimizations to the information it accumulates.  Before the patch, each TokenLL held its own char[], so there were two objects per token (counting the token itself).  Now I use a shared CharsRefBuilder with a pointer & length into it, so there's just one object per token, plus byte savings from avoiding per-token char[] headers.  I also reduced the bytes needed for a TokenLL instance from 40 to 32.  *It does assume that the char offset delta (endOffset - startOffset) fits within a short*, which seems like a reasonable assumption to me. For safety I guard against overflow and substitute Short.MAX_VALUE.
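To illustrate the overflow guard, here's a minimal sketch (the class and method names are hypothetical, not the patch's actual code):

```java
// Hypothetical sketch of the short-delta packing described above.
// The offset delta (endOffset - startOffset) is stored as a short;
// if it doesn't fit, Short.MAX_VALUE is substituted rather than overflowing.
public class OffsetDeltaDemo {
    static short encodeDelta(int startOffset, int endOffset) {
        int delta = endOffset - startOffset;
        return delta > Short.MAX_VALUE ? Short.MAX_VALUE : (short) delta;
    }

    public static void main(String[] args) {
        System.out.println(encodeDelta(10, 15));    // small delta stored as-is
        System.out.println(encodeDelta(0, 100000)); // clamped to 32767
    }
}
```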

Finally, to encourage users to supply a limit (even if "-1" to mean no limit), I decided to deprecate many of the methods in TokenSources in favor of new ones that take a limit parameter.  But for those methods that fall back to a provided Analyzer, _I have to wonder now whether it makes sense for these methods to filter the analyzer's tokens too_.  I think it does -- if you want to limit the tokens, it shouldn't matter where they came from.  I haven't added that yet; I'm looking for feedback first.
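The deprecation pattern amounts to the old no-limit signatures delegating to new overloads that take a limit; a hypothetical illustration (names and return types simplified, not the real TokenSources API):

```java
// Hypothetical illustration of the deprecation pattern: the legacy method
// delegates to a new overload that takes maxStartOffset (-1 = no limit).
public class TokenSourcesSketch {
    /** @deprecated use {@link #getTokenStream(String, int)} and pass a limit. */
    @Deprecated
    public static String getTokenStream(String field) {
        return getTokenStream(field, -1);
    }

    public static String getTokenStream(String field, int maxStartOffset) {
        return field + " (maxStartOffset=" + maxStartOffset + ")";
    }
}
```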

> Add offset limit to Highlighter's TokenStreamFromTermVector
> -----------------------------------------------------------
>
>                 Key: LUCENE-6392
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6392
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.2
>
>         Attachments: LUCENE-6392_highlight_term_vector_maxStartOffset.patch
>
>
> The Highlighter's TokenStreamFromTermVector utility, typically accessed via TokenSources, should have the ability to filter out tokens beyond a configured offset. There is a TODO there already, and this issue addresses it.  New methods in TokenSources now propagate a limit.
> This patch also includes some memory saving optimizations, to be described shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org