Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2021/07/08 13:58:00 UTC

[jira] [Created] (LUCENE-10023) Multi-token post-analysis DocValues

Michael Gibney created LUCENE-10023:
---------------------------------------

             Summary: Multi-token post-analysis DocValues
                 Key: LUCENE-10023
                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/index
            Reporter: Michael Gibney


The single-token case for post-analysis DocValues is covered by {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}), but there are cases where post-analysis DocValues based on multi-token fields would be desirable.

The main use cases that I can think of are variants of faceting/terms aggregation. I understand that this could be viewed as "trappy" for the naive "Moby Dick word cloud" case; but:
# I think this can be supported fairly cleanly in Lucene
# Explicit user configuration of this option would help prevent people shooting themselves in the foot
# The current situation is arguably "trappy" as well; it just offloads the trappiness onto Lucene-external workarounds for systems/users that want to support this kind of behavior
# Integrating this functionality directly in Lucene would afford consistency guarantees that present opportunities for future optimizations (e.g., shared Terms dictionary between indexed terms and DocValues).

This issue proposes adding support for multi-token post-analysis DocValues directly to {{IndexingChain}}. The initial proposal involves extending the API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to the existing {{IndexableFieldType.docValuesType()}}).
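
To make the intended behavior concrete, here is a minimal stdlib-only sketch of the concept: each document's per-document values are the deduplicated output tokens of analysis, and facet counts are computed over those tokens. It does not use Lucene and does not model the proposed {{tokenDocValuesType()}} signature; the class {{TokenDocValuesSketch}}, its methods, and the toy "split and lowercase" analysis chain are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Stdlib-only model of the idea; none of these names are Lucene API. */
final class TokenDocValuesSketch {

    /** Hypothetical analysis chain: split on non-letter/non-digit runs, then lowercase. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) { // split() can yield a leading empty string
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Per-document value set: the deduplicated, sorted post-analysis tokens. */
    static SortedSet<String> tokenDocValues(String text) {
        return new TreeSet<>(analyze(text));
    }

    /** Facet counts: the number of documents containing each post-analysis token. */
    static Map<String, Integer> facetCounts(List<String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            for (String token : tokenDocValues(doc)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Call me Ishmael", "Ishmael, call ME back");
        // prints {back=1, call=2, ishmael=2, me=2}
        System.out.println(facetCounts(docs));
    }
}
```

Because the facet keys here are exactly the tokens an indexing pass would produce, the per-document values and the indexed terms cannot drift apart, which is the consistency property item 4 above refers to.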
