Posted to dev@lucene.apache.org by "Ingomar Wesp (JIRA)" <ji...@apache.org> on 2018/03/31 20:39:00 UTC

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

    [ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421475#comment-16421475 ] 

Ingomar Wesp commented on LUCENE-7960:
--------------------------------------

I'd like to propose a patch (see attached pull request #349) that adds two options to the EdgeNGramFilter:
 * keepShortTerms: Causes the filter to pass through input terms that are shorter than the minimum gram size.
 * keepLongTerms: Causes the filter to pass through input terms that are longer than the maximum gram size.

I'm not entirely sure about the usefulness of keepLongTerms, but the ability to pass through short terms would certainly be neat for queries where you'd like to match ALL tokens as either prefixes or exact terms, but some query tokens are shorter than the minimum gram size. As far as I understand, a second field containing the exact terms isn't really a viable alternative there, because you can easily run into situations where only a subset of the query tokens matches in either field.
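
To make the intended semantics of the two options concrete, here is a minimal standalone sketch in plain Java. It is not the code from the pull request and does not use Lucene's TokenFilter API. The class name EdgeNGramSketch and the method edgeNGrams are made up for illustration. It also assumes that a passed-through long term is emitted after its grams, and it counts UTF-16 chars rather than code points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: models the proposed options on a single term,
// outside of Lucene's TokenStream machinery.
final class EdgeNGramSketch {

    // Returns the edge n-grams (prefixes) of `term` with lengths in [minGram, maxGram].
    // keepShortTerms: pass the term through unchanged if it is shorter than minGram.
    // keepLongTerms:  additionally pass the term through if it is longer than maxGram.
    static List<String> edgeNGrams(String term, int minGram, int maxGram,
                                   boolean keepShortTerms, boolean keepLongTerms) {
        List<String> out = new ArrayList<>();
        int len = term.length(); // simplification: UTF-16 chars, not code points

        if (len < minGram) {
            // Without the option, a short term yields no tokens at all.
            if (keepShortTerms) {
                out.add(term);
            }
            return out;
        }

        int upper = Math.min(len, maxGram);
        for (int n = minGram; n <= upper; n++) {
            out.add(term.substring(0, n));
        }

        // Assumed ordering: the original long term follows its grams.
        if (len > maxGram && keepLongTerms) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // A query token shorter than minGram survives only with keepShortTerms.
        System.out.println(edgeNGrams("ab", 3, 5, false, false));
        System.out.println(edgeNGrams("ab", 3, 5, true, false));
        // A term longer than maxGram is also kept whole only with keepLongTerms.
        System.out.println(edgeNGrams("lucene", 3, 5, false, true));
    }
}
```

With both flags off, a two-character query token yields nothing under minGram=3, which is the case where an "all tokens must match" query can never succeed on the n-gram field.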

> NGram filters -- add option to keep short terms
> -----------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for users.  I am not suggesting that the default behavior be changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that defaults to false, to allow the short terms to be preserved.



