Posted to dev@lucene.apache.org by "Timothy M. Rodriguez (JIRA)" <ji...@apache.org> on 2016/10/27 19:28:58 UTC
[jira] [Updated] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timothy M. Rodriguez updated LUCENE-7526:
-----------------------------------------
Description:
This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by reducing reliance on creating or re-creating TokenStreams.
The primary changes are as follows:
* AnalysisOffsetStrategy - split into two offset strategies
** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a MemoryIndex for producing Offsets
** TokenStreamOffsetStrategy - an offset strategy that avoids creating a MemoryIndex. Can only be used if the query distills down to terms and automata.
* TokenStream removal
** MemoryIndexOffsetStrategy - previously, a TokenStream was created to fill the MemoryIndex. Once it was consumed, a second TokenStream was generated by uninverting the MemoryIndex if automata (wildcard/MultiTermQuery queries) were involved. This is now avoided, which should save memory and avoid a second pass over the data.
** TermVectorOffsetStrategy - this was refactored in a similar way to avoid generating a TokenStream if automata are involved.
** PostingsWithTermVectorsOffsetStrategy - similar refactoring
* CompositePostingsEnum - aggregates several underlying PostingsEnums for wildcard/MultiTermQuery queries. This should improve relevancy by providing unified metrics for a wildcard across all of its term matches.
* Added a HighlightFlag for enabling the newly separated TokenStreamOffsetStrategy since it can adversely affect passage relevancy
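To illustrate the CompositePostingsEnum idea, here is a minimal, self-contained sketch of merging the per-term offsets of every term a wildcard expands to into one ordered stream. It is not the code from the patch and does not use Lucene's PostingsEnum API; the class CompositeOffsetsSketch, the Offset type, and the merge method are hypothetical names chosen for this example. It assumes each term's offsets are already sorted by start offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration only; this is not Lucene's CompositePostingsEnum. */
final class CompositeOffsetsSketch {

  /** One match of a term, as character offsets into the field text. */
  static final class Offset {
    final int start;
    final int end;

    Offset(int start, int end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + "," + end + ")";
    }
  }

  /** Cursor over one term's offsets (assumed non-empty and sorted by start). */
  private static final class Cursor {
    private final Iterator<Offset> it;
    Offset current;

    Cursor(List<Offset> offsets) {
      this.it = offsets.iterator();
      this.current = it.next();
    }

    /** Moves to the next offset; returns false when this term is exhausted. */
    boolean advance() {
      if (!it.hasNext()) {
        return false;
      }
      current = it.next();
      return true;
    }
  }

  /**
   * Merges the offsets of all terms matched by one wildcard into a single
   * list ordered by start offset (ties broken by end offset). The size of
   * the result is the combined match count for the wildcard as a whole.
   */
  static List<Offset> merge(List<List<Offset>> perTermOffsets) {
    PriorityQueue<Cursor> queue = new PriorityQueue<>(
        Comparator.comparingInt((Cursor c) -> c.current.start)
            .thenComparingInt(c -> c.current.end));
    for (List<Offset> offsets : perTermOffsets) {
      if (!offsets.isEmpty()) {
        queue.add(new Cursor(offsets));
      }
    }
    List<Offset> merged = new ArrayList<>();
    while (!queue.isEmpty()) {
      Cursor top = queue.poll();   // term whose next match starts earliest
      merged.add(top.current);
      if (top.advance()) {
        queue.add(top);            // re-insert with its new head offset
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Suppose the wildcard "te*" expanded to two terms in this document.
    List<Offset> term = Arrays.asList(new Offset(0, 4), new Offset(20, 24));
    List<Offset> text = Arrays.asList(new Offset(10, 14));
    System.out.println(merge(Arrays.asList(term, text)));
  }
}
```

Treating the merged stream as a single logical term means a passage scorer sees one frequency for the wildcard rather than a separate frequency for each expanded term, which is the "unified metrics" effect described above.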
was:
This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by reducing reliance on creating or re-creating TokenStreams.
The primary changes are as follows:
* AnalysisOffsetStrategy - split into two offset strategies
* MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a MemoryIndex for producing Offsets
* TokenStreamOffsetStrategy - an offset strategy that avoids creating a MemoryIndex. Can only be used if the query distills down to terms and automata.
* TokenStream removal
* MemoryIndexOffsetStrategy - previously, a TokenStream was created to fill the MemoryIndex. Once it was consumed, a second TokenStream was generated by uninverting the MemoryIndex if automata (wildcard/MultiTermQuery queries) were involved. This is now avoided, which should save memory and avoid a second pass over the data.
* TermVectorOffsetStrategy - this was refactored in a similar way to avoid generating a TokenStream if automata are involved.
* PostingsWithTermVectorsOffsetStrategy - similar refactoring
* CompositePostingsEnum - aggregates several underlying PostingsEnums for wildcard/MultiTermQuery queries. This should improve relevancy by providing unified metrics for a wildcard across all of its term matches.
* Added a HighlightFlag for enabling the newly separated TokenStreamOffsetStrategy since it can adversely affect passage relevancy
> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Timothy M. Rodriguez
> Priority: Minor
> Labels: highlighter, unified-highlighter