Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2021/07/08 14:00:05 UTC

[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

    [ https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377413#comment-17377413 ] 

Michael Gibney commented on LUCENE-10023:
-----------------------------------------

In contrast to "naive word cloud" faceting, the more compelling use cases for multi-token post-analysis DocValues tend to be specialized ones in which the number of tokens is inherently limited. A couple of comments on related issues mention {{path_tokenizer}} in Elasticsearch (see [this comment|https://github.com/elastic/elasticsearch/issues/12394#issuecomment-199555310], and [this comment|https://github.com/elastic/elasticsearch/issues/18064#issuecomment-232297988]). My own use case involves fields that are in a sense single-valued, but whose TokenFilters may produce expanded "synonym"-style mappings (really broader/narrower/related/hierarchical entities).
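
To make the "expanded mapping" case concrete, here is a minimal sketch in plain Java with no Lucene dependency. Every name in it ({{TokenDocValuesSketch}}, {{analyze}}, {{tokenDocValues}}, {{facetCounts}}, and the sample broader-term table) is hypothetical and chosen only for illustration; it is not the code in the PR. It assumes the analysis chain lowercases a single input value and appends its broader terms, and it models the resulting per-document values as a sorted, deduplicated set, roughly the shape that SORTED_SET DocValues expose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class TokenDocValuesSketch {

    // Hypothetical stand-in for an analysis chain: normalizes the single
    // input value, then appends any broader terms found in the lookup table.
    public static List<String> analyze(String value, Map<String, List<String>> broader) {
        String normalized = value.trim().toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        tokens.add(normalized);
        tokens.addAll(broader.getOrDefault(normalized, List.of()));
        return tokens;
    }

    // Post-analysis values for one document: the analyzed tokens, sorted and
    // deduplicated (an in-memory model, not a real DocValues writer).
    public static SortedSet<String> tokenDocValues(String value, Map<String, List<String>> broader) {
        return new TreeSet<>(analyze(value, broader));
    }

    // Facet counts: the number of documents in which each token occurs.
    public static Map<String, Integer> facetCounts(List<SortedSet<String>> perDocValues) {
        Map<String, Integer> counts = new TreeMap<>();
        for (SortedSet<String> docValues : perDocValues) {
            for (String token : docValues) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample data: each narrow term maps to its broader entities.
        Map<String, List<String>> broader = Map.of(
            "siamese", List.of("cat", "mammal"),
            "tabby", List.of("cat", "mammal"),
            "beagle", List.of("dog", "mammal"));

        List<SortedSet<String>> docs = new ArrayList<>();
        for (String raw : List.of("Siamese", "Tabby", "Beagle")) {
            docs.add(tokenDocValues(raw, broader));
        }
        System.out.println(facetCounts(docs));
    }
}
```

Each field value here expands to at most three tokens, which is the kind of inherent limit described above: faceting on the post-analysis values rolls documents up to "cat" or "mammal" without the unbounded cardinality of the word-cloud case.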

FWIW, I would also argue that even the "naive word cloud" approach has legitimate use cases, such as text corpus analytics.

I realize that this work could be done outside of Lucene, but to me it felt cleanest to add it here, if only to have something concrete to seed discussion. The PR initially includes only a trivial test demonstrating the new behavior; more tests can be added if we decide to pursue this approach further.

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>
> The single-token case for post-analysis DocValues is accounted for by {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but there are cases where it would be desirable to have post-analysis DocValues based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms aggregation. I understand that this could be viewed as "trappy" for the naive "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the trappiness onto Lucene-external workarounds for systems/users that want to support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency guarantees that present opportunities for future optimizations (e.g., shared Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues directly to {{IndexingChain}}. The initial proposal involves extending the API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to existing {{IndexableFieldType.docValuesType()}}).
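
The opt-in shape of the proposed API can be sketched in plain Java without depending on Lucene. This is an illustration under stated assumptions, not the PR's implementation: {{FieldTypeSketch}} is a hypothetical, reduced stand-in for {{IndexableFieldType}}, the two-constant {{DocValuesType}} enum is a simplification of the real one, and both the return type of {{tokenDocValuesType()}} and its default of {{NONE}} are assumptions made to show explicit user configuration.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class TokenDocValuesTypeSketch {

    // Simplified stand-in for Lucene's DocValuesType (the real enum has more constants).
    public enum DocValuesType { NONE, SORTED_SET }

    // Hypothetical reduction of IndexableFieldType to the two accessors under discussion.
    public interface FieldTypeSketch {
        // Existing accessor: DocValues type for the field's pre-analysis value.
        DocValuesType docValuesType();

        // Proposed accessor: DocValues type for post-analysis tokens.
        // Defaulting to NONE is an assumption; it makes the feature opt-in.
        default DocValuesType tokenDocValuesType() {
            return DocValuesType.NONE;
        }
    }

    // Returns the post-analysis values recorded for one document: empty unless
    // the field type explicitly opts in to token DocValues.
    public static SortedSet<String> tokenDocValues(FieldTypeSketch type, List<String> analyzedTokens) {
        SortedSet<String> values = new TreeSet<>();
        if (type.tokenDocValuesType() == DocValuesType.SORTED_SET) {
            values.addAll(analyzedTokens);
        }
        return values;
    }

    public static void main(String[] args) {
        // A field type that does not opt in (inherits the NONE default).
        FieldTypeSketch plain = () -> DocValuesType.NONE;

        // A field type that explicitly opts in to token DocValues.
        FieldTypeSketch optedIn = new FieldTypeSketch() {
            @Override
            public DocValuesType docValuesType() {
                return DocValuesType.NONE;
            }

            @Override
            public DocValuesType tokenDocValuesType() {
                return DocValuesType.SORTED_SET;
            }
        };

        // Assumed output of a path-style tokenizer for "usr/local/bin".
        List<String> tokens = List.of("usr", "usr/local", "usr/local/bin");
        System.out.println(tokenDocValues(plain, tokens));
        System.out.println(tokenDocValues(optedIn, tokens));
    }
}
```

Keeping the token-based type separate from {{docValuesType()}} means existing field types are unaffected unless they are configured otherwise, which is the foot-gun protection described in point 2 of the issue description.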


