Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2015/04/03 19:38:59 UTC
[jira] [Updated] (LUCENE-6392) Add offset limit to Highlighter's TokenStreamFromTermVector
[ https://issues.apache.org/jira/browse/LUCENE-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley updated LUCENE-6392:
---------------------------------
Attachment: LUCENE-6392_highlight_term_vector_maxStartOffset.patch
(Patch attached).
Elaborating on the description:
This patch includes a tweak to the TokenLL[] array size initialization to consider this new limit when guessing a good size.
This patch also includes memory-saving optimizations to the information it accumulates. Before the patch, each TokenLL held its own char[], so there were 2 objects per token (counting the token itself). Now I use a shared CharsRefBuilder with a pointer & length into it, so there's just 1 object per token, plus the bytes saved by avoiding a char[] object header. I also reduced the bytes needed for a TokenLL instance from 40 to 32. *It does assume that the char offset delta (endOffset - startOffset) fits within a short*, which seems like a reasonable assumption to me. For safety I guard against overflow and substitute Short.MAX_VALUE.
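The packing scheme above can be sketched in plain Java. This is illustrative only, not the patch itself: the field names (charsOffset, endOffsetDelta, etc.) are hypothetical stand-ins, and a StringBuilder stands in for the shared CharsRefBuilder.

```java
// Sketch: tokens share one char buffer instead of each holding its own
// char[], and the end offset is stored as a short delta from the start
// offset, clamped to Short.MAX_VALUE if it would overflow.
public class TokenPackingDemo {
    // stands in for the shared CharsRefBuilder
    static final StringBuilder sharedChars = new StringBuilder();

    static final class Token {
        int charsOffset;      // start of this token's text in sharedChars
        short charsLength;    // length of the text
        int startOffset;      // char offset in the original field value
        short endOffsetDelta; // endOffset - startOffset, clamped to a short
    }

    static Token addToken(String text, int startOffset, int endOffset) {
        Token t = new Token();
        t.charsOffset = sharedChars.length();
        sharedChars.append(text); // pointer & length into the shared buffer
        t.charsLength = (short) text.length();
        t.startOffset = startOffset;
        int delta = endOffset - startOffset;
        // overflow guard: substitute Short.MAX_VALUE when the delta won't fit
        t.endOffsetDelta = delta > Short.MAX_VALUE ? Short.MAX_VALUE : (short) delta;
        return t;
    }

    static int endOffset(Token t) {
        return t.startOffset + t.endOffsetDelta;
    }

    public static void main(String[] args) {
        Token a = addToken("hello", 0, 5);
        Token b = addToken("world", 6, 11);
        System.out.println(sharedChars);                       // helloworld
        System.out.println(endOffset(a) + " " + endOffset(b)); // 5 11
        Token huge = addToken("x", 0, 100_000); // delta too big for a short
        System.out.println(endOffset(huge));    // 32767 (clamped)
    }
}
```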
Finally, to encourage users to supply a limit (even if "-1", meaning no limit), I decided to deprecate many of the methods in TokenSources in favor of new ones that take a limit parameter. But for those methods that fall back to a provided Analyzer, _I now wonder whether it makes sense for these methods to filter the analyzer's tokens too_. I think it does -- if you want to limit the tokens, it shouldn't matter where they came from; you want to limit them. I haven't added that yet; I'm looking for feedback first.
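The proposed limiting behavior can be sketched in plain Java (the real change would presumably wrap the Analyzer's TokenStream in a TokenFilter; the record and method names here are hypothetical). A limit of -1 means "no limit", matching the convention above, and since start offsets from an analyzer are non-decreasing, filtering can stop at the first token past the limit:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetLimitDemo {
    // minimal stand-in for a token with character offsets
    record Token(String text, int startOffset, int endOffset) {}

    // Keep only tokens whose start offset is <= maxStartOffset; -1 = no limit.
    static List<Token> limitByStartOffset(List<Token> tokens, int maxStartOffset) {
        if (maxStartOffset < 0) return tokens; // -1: no limit
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (t.startOffset() <= maxStartOffset) {
                out.add(t);
            } else {
                break; // start offsets are non-decreasing, so we can stop early
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("quick", 4, 9),
            new Token("brown", 10, 15),
            new Token("fox", 16, 19));
        System.out.println(limitByStartOffset(tokens, 12).size()); // 2
        System.out.println(limitByStartOffset(tokens, -1).size()); // 3
    }
}
```

The point of filtering even the Analyzer fallback is that callers get the same truncation whether the tokens came from term vectors or from re-analysis.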
> Add offset limit to Highlighter's TokenStreamFromTermVector
> -----------------------------------------------------------
>
> Key: LUCENE-6392
> URL: https://issues.apache.org/jira/browse/LUCENE-6392
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Fix For: 5.2
>
> Attachments: LUCENE-6392_highlight_term_vector_maxStartOffset.patch
>
>
> The Highlighter's TokenStreamFromTermVector utility, typically accessed via TokenSources, should have the ability to filter out tokens beyond a configured offset. There is a TODO there already, and this issue addresses it. New methods in TokenSources now propagate a limit.
> This patch also includes some memory saving optimizations, to be described shortly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org