Posted to dev@lucene.apache.org by "Michael Gibney (JIRA)" <ji...@apache.org> on 2019/07/05 17:04:00 UTC

[jira] [Commented] (LUCENE-4312) Index format to store position length per position

    [ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879425#comment-16879425 ] 

Michael Gibney commented on LUCENE-4312:
----------------------------------------

Following up on discussion at Berlin Buzzwords with [~mikemccand], [~sokolov], [~simonw], and [~romseygeek]:

A lot of useful context (e.g., for synonym generation) is available at index time but not at query time. Leveraging this context can result in index-time TokenStream manipulations that produce token graphs. Because position length is not indexed, the index-time TokenStream "graph" structure cannot be reconstructed at query time.

Indexed position length is a prerequisite for any use case that calls for:
1. index-time graph TokenStreams
2. precise/accurate proximity queries (via spans, intervals, etc.)
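
To make the proximity problem concrete, here is a minimal sketch in plain Java. It does not use Lucene: the {{Token}} class, the {{adjacent}} helper, and the "wifi" -> "wi fi" synonym are all hypothetical names chosen for illustration. It compares adjacency checks when every token is assumed to span one position (what the index can express today) against checks that use the position length available at analysis time.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of analyzed tokens; not a Lucene class. */
final class PositionLengthSketch {

    /** One token: its term, absolute start position, and position length. */
    static final class Token {
        final String term;
        final int position;
        final int positionLength;

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }
    }

    /** Token graph for "fast wifi network" with the index-time synonym "wifi" -> "wi fi". */
    static List<Token> graph() {
        return Arrays.asList(
            new Token("fast", 0, 1),
            new Token("wifi", 1, 2), // spans both positions covered by "wi fi"
            new Token("wi", 1, 1),
            new Token("fi", 2, 1),
            new Token("network", 3, 1));
    }

    /** True if some occurrence of first ends exactly where some occurrence of second starts. */
    static boolean adjacent(List<Token> tokens, String first, String second,
                            boolean usePositionLength) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) continue;
            // Without indexed position length, the end can only be assumed to be position + 1.
            int end = a.position + (usePositionLength ? a.positionLength : 1);
            for (Token b : tokens) {
                if (b.term.equals(second) && b.position == end) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> tokens = graph();
        System.out.println(adjacent(tokens, "wifi", "network", false)); // false: a missed match
        System.out.println(adjacent(tokens, "wifi", "network", true));  // true
        System.out.println(adjacent(tokens, "wifi", "fi", false));      // true: a false positive
        System.out.println(adjacent(tokens, "wifi", "fi", true));       // false
    }
}
```

In this toy graph, dropping position length produces both a missed match ("wifi network") and a false positive ("wifi fi"), which is the kind of imprecision described above.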

Could we discuss adding first-class support for this structural "position length" information?

Updating PostingsEnum to include endPosition() -- returning {{position+1}} by default -- would be a meaningful first step. This would facilitate the development of query implementations without requiring an API fork, and would signal an intention to move in the direction of supporting index-time token graphs.
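
A rough sketch of the shape this could take, using a standalone stand-in rather than Lucene's actual PostingsEnum. Everything here is an assumption made for illustration: the {{PositionsSketch}} base class, the {{readNextPosition()}} hook, and the two subclasses are hypothetical. In particular, tracking the current position in the base class is only one way to give {{endPosition()}} a default; a real change might look quite different.

```java
/** Hypothetical stand-in for the positions part of a postings iterator; not Lucene's PostingsEnum. */
abstract class PositionsSketch {
    private int current = -1;

    /** Implementations return the next start position for the current document. */
    protected abstract int readNextPosition();

    /** Advances to the next position and remembers it so endPosition() has a default. */
    public final int nextPosition() {
        current = readNextPosition();
        return current;
    }

    /** Default when position length is not indexed: every token spans exactly one position. */
    public int endPosition() {
        return current + 1;
    }
}

/** Existing-style implementation: knows nothing about position length, inherits the default. */
final class LinearPositions extends PositionsSketch {
    private final int[] positions;
    private int index = 0;

    LinearPositions(int... positions) {
        this.positions = positions;
    }

    @Override
    protected int readNextPosition() {
        return positions[index++];
    }
}

/** Graph-aware implementation: overrides endPosition() using stored position lengths. */
final class GraphPositions extends PositionsSketch {
    private final int[] positions;
    private final int[] lengths;
    private int index = -1;

    GraphPositions(int[] positions, int[] lengths) {
        this.positions = positions;
        this.lengths = lengths;
    }

    @Override
    protected int readNextPosition() {
        index++;
        return positions[index];
    }

    @Override
    public int endPosition() {
        return positions[index] + lengths[index];
    }
}

final class EndPositionDemo {
    public static void main(String[] args) {
        PositionsSketch linear = new LinearPositions(4, 7);
        System.out.println(linear.nextPosition() + " -> " + linear.endPosition()); // 4 -> 5

        PositionsSketch graph = new GraphPositions(new int[] {1}, new int[] {2});
        System.out.println(graph.nextPosition() + " -> " + graph.endPosition());   // 1 -> 3
    }
}
```

Query code written against {{endPosition()}} would behave as it does today on implementations that inherit the default, and would pick up real position lengths wherever an implementation overrides it.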

Beyond that, I'm optimistic that codecs could be enhanced to index position length without introducing much additional overhead (I'd guess that position length for the common case of linear/non-graph index-time token streams could compress quite well).
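
As a toy illustration of why the linear case might compress well (this is not a proposal for an actual codec format, and the class and method names are hypothetical): if a block of position lengths is all 1s, it can be represented by a single marker value, and only blocks containing graph tokens pay for per-position data.

```java
import java.util.Arrays;

/** Hypothetical block encoding for position lengths; not a Lucene codec. */
final class PositionLengthBlockSketch {

    /** Encodes a block as {0} if every length is 1, else {1, len0 - 1, len1 - 1, ...}. */
    static int[] encode(int[] lengths) {
        boolean allOnes = true;
        for (int len : lengths) {
            if (len != 1) {
                allOnes = false;
                break;
            }
        }
        if (allOnes) {
            return new int[] {0}; // linear block: one marker, no per-position data
        }
        int[] out = new int[lengths.length + 1];
        out[0] = 1; // marker: explicit lengths follow
        for (int i = 0; i < lengths.length; i++) {
            out[i + 1] = lengths[i] - 1; // store length - 1 so the common value is 0
        }
        return out;
    }

    /** Decodes a block produced by encode(); count is the number of positions in the block. */
    static int[] decode(int[] encoded, int count) {
        int[] lengths = new int[count];
        if (encoded[0] == 0) {
            Arrays.fill(lengths, 1);
            return lengths;
        }
        for (int i = 0; i < count; i++) {
            lengths[i] = encoded[i + 1] + 1;
        }
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println(encode(new int[] {1, 1, 1, 1}).length); // 1
        System.out.println(encode(new int[] {1, 2, 1, 1}).length); // 5
    }
}
```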

> Index format to store position length per position
> --------------------------------------------------
>
>                 Key: LUCENE-4312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4312
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 6.0
>            Reporter: Gang Luo
>            Priority: Minor
>              Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index format (and Codec APIs) to store an additional int position length per position.


