
Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2018/07/16 18:49:00 UTC

[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

    [ https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545609#comment-16545609 ] 

David Smiley commented on LUCENE-8403:
--------------------------------------

Ah, I remember this.  Here, the TVs are used only by the UnifiedHighlighter for MultiTermQueries, and some interesting analysis showed us categorically that certain terms would never be matched by our MTQs, so they are dead weight.  Possibly 40-50% dead weight for the app in question, if I recall.  Is it a real problem that CheckIndex complains?  I suppose that might come up in tests via the lucene-test-framework randomization that occasionally calls CheckIndex?  I can't seem to find those call sites right now, though.
CC [~rcmuir]

> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8403
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8403
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Braun
>            Priority: Minor
>
> The genesis of this was a conversation and idea from [~dsmiley] several years ago.
> In order to optimize term vector storage, we may not actually need all tokens to be present in the term vectors - and if so, ideally our codec could just opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and TermVectorsWriter to skip storing certain terms within a field. This worked; however, CheckIndex checks that the terms present in the standard postings are also present in the TVs, if TVs are enabled. So the resulting index is not 'valid' according to CheckIndex.
> Can the TermVectorsFormat be designed to support configuring tokens that should not be stored (benefits: less storage, faster retrieval per doc)? Is this valuable to the wider community? Is there a way we can design this so that it does not break CheckIndex's contract while still reducing storage for unneeded tokens?
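
To make the trade-off in the description concrete, here is a minimal, self-contained sketch in plain Java. It does not use any Lucene API: the class FilteredTermVectorSketch, its methods, and the "keep terms of length >= 3" rule are all hypothetical stand-ins. It models a term vector as a term-to-frequency map, applies a keep-filter while building it, and compares a strict check (every indexed term must appear in the term vector, as the description says CheckIndex requires) with one possible relaxed check (every stored term must agree with the full token stream). The relaxed check is only an illustration of the design question, not a proposal from the thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

public class FilteredTermVectorSketch {

    /** Builds a term -> frequency map from the tokens that pass the keep filter. */
    public static Map<String, Integer> buildTermVector(List<String> tokens, Predicate<String> keep) {
        Map<String, Integer> tv = new TreeMap<>();
        for (String token : tokens) {
            if (keep.test(token)) {
                tv.merge(token, 1, Integer::sum);
            }
        }
        return tv;
    }

    /** Strict rule (as described in the issue): every indexed term must be in the term vector. */
    public static boolean strictCheck(List<String> tokens, Map<String, Integer> tv) {
        return tv.keySet().containsAll(tokens);
    }

    /** Hypothetical relaxed rule: stored terms must match the unfiltered frequencies; missing terms are allowed. */
    public static boolean relaxedCheck(List<String> tokens, Map<String, Integer> tv) {
        Map<String, Integer> full = buildTermVector(tokens, t -> true);
        for (Map.Entry<String, Integer> entry : tv.entrySet()) {
            if (!entry.getValue().equals(full.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "lucene", "of", "lucene", "index");
        // Assumption for illustration: short terms can never match the app's multi-term queries.
        Predicate<String> keep = t -> t.length() >= 3;

        Map<String, Integer> tv = buildTermVector(tokens, keep);
        System.out.println(tv);                        // {index=1, lucene=2}
        System.out.println(strictCheck(tokens, tv));   // false
        System.out.println(relaxedCheck(tokens, tv));  // true
    }
}
```

In this toy model the filtered vector fails the strict rule because "a" and "of" are missing, but passes the relaxed rule because everything that was stored is consistent with the token stream. Whether CheckIndex could adopt something like the second rule is exactly the open question raised above.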


