Posted to dev@lucene.apache.org by "Shawn Heisey (JIRA)" <ji...@apache.org> on 2018/03/31 21:29:00 UTC

[jira] [Comment Edited] (LUCENE-7960) NGram filters -- add option to keep short terms

    [ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421479#comment-16421479 ] 

Shawn Heisey edited comment on LUCENE-7960 at 3/31/18 9:28 PM:
---------------------------------------------------------------

When I created this issue, I didn't think about long terms.  Somebody probably needs the functionality.

On further reflection, I don't think the new parameter names should be plural.  Using "keepShortTerm" and "keepLongTerm" sounds better to me.  Both could be enabled at once if that's what the user wants.  The same options should be added to all ngram analysis components, not just EdgeNgramFilter.

Let's consider a min length of 4 and a max length of 6, using EdgeNgramFilter.

If the input term is "abcdefgh", here's the basic term list from the filter:

abcd abcde abcdef

If keepShortTerm is enabled with the longer input, there is no change.  If keepLongTerm is enabled with the longer input, then the term list will be:

abcd abcde abcdef abcdefgh

The seven-character string ("abcdefg") would not be created.  If that's what the user wants, they should just increase the max value, rather than enable the new option.

If the input term is "ab", then the filter would not normally produce any terms.  With keepShortTerm, the output would be the input -- "ab".

The keepLongTerm option would have no effect with a short input.  
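
To make the cases above concrete, here is a minimal standalone Java sketch of the proposed behavior.  It is not Lucene code and does not use any Lucene API.  The class name EdgeGramSketch, the method edgeGrams, and its signature are hypothetical, made up purely for illustration, and the flag names are simply the ones suggested in this comment.  Lengths are counted in Java chars, which is an assumption that only matters for input outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a plain-Java model of the proposed options.
// Nothing here is a Lucene class; all names are hypothetical.
class EdgeGramSketch {

    // Returns the front-edge grams of term with lengths minGram..maxGram,
    // optionally preserving a term that is too short or too long.
    static List<String> edgeGrams(String term, int minGram, int maxGram,
                                  boolean keepShortTerm, boolean keepLongTerm) {
        List<String> out = new ArrayList<>();
        int len = term.length();

        if (len < minGram) {
            // Too short to produce any gram: normally nothing is emitted.
            if (keepShortTerm) {
                out.add(term);
            }
            return out;
        }

        // Regular edge grams, capped at the term's own length.
        for (int n = minGram; n <= Math.min(maxGram, len); n++) {
            out.add(term.substring(0, n));
        }

        // Longer than maxGram: optionally append the full original term.
        // Intermediate lengths (e.g. 7 for "abcdefgh") are still skipped.
        if (len > maxGram && keepLongTerm) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(edgeGrams("abcdefgh", 4, 6, false, false)); // [abcd, abcde, abcdef]
        System.out.println(edgeGrams("abcdefgh", 4, 6, false, true));  // [abcd, abcde, abcdef, abcdefgh]
        System.out.println(edgeGrams("ab", 4, 6, true, false));        // [ab]
        System.out.println(edgeGrams("ab", 4, 6, false, true));        // []
    }
}
```

In this sketch the two flags never interact: a term is either shorter than the min or longer than the max, never both, so enabling both is harmless.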

I did glance at the patch, but didn't examine it in detail, so I don't know if it does what I just described or not.


was (Author: elyograg):
When I created this issue, I didn't think about long terms.  Somebody probably needs the functionality.

On further reflection, I don't think that new parameter names should be plural.  Using "keepShortTerm" and "keepLongTerm" sounds better to me.  They could both be enabled if that's what the user wants.  The same options should be added to all ngram analysis components, not just EdgeNgramFilter.

Let's consider a min length of 4 and a max length of 6, using EdgeNgramFilter.

If the input term is "abcdefgh", here's the basic term list from the filter:

abcd abcde abcdef

If keepShortTerm is enabled with the longer input, there is no change.  If keepLongTerm is enabled with the longer input, then the term list will be:

abcd abcde abcdef abcdefgh

The seven-character string would not be created.  If that's what the user wants, they should just increase the max value, rather than enable the new option.

If the input term is "ab", then the filter would not normally produce any terms.  With keepShortTerm, the output would be the input -- "ab".  The three-character term would not be produced.  If the user wants that, they would need to reduce the min value.

The keepLongTerm option would have no effect with a short input.  

I did glance at the patch, but didn't examine it in detail, so I don't know if it does what I just described or not.

> NGram filters -- add option to keep short terms
> -----------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for users.  I am not suggesting that the default behavior be changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that defaults to false, to allow the short terms to be preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
