Posted to dev@lucene.apache.org by "Shawn Heisey (JIRA)" <ji...@apache.org> on 2018/05/01 17:17:00 UTC

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

    [ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459873#comment-16459873 ] 

Shawn Heisey commented on LUCENE-7960:
--------------------------------------

I've taken a look at the PR.

Changing the signature of an existing constructor isn't a good idea.  Lucene is a public API, and user code that calls that constructor must continue to work when Lucene is upgraded.  We should add a new constructor and have the existing constructor(s) call it with default values.
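
To illustrate the constructor approach, here is a minimal sketch.  It uses a hypothetical stand-in class (NGramSketch) that works on plain strings, not the real Lucene filter classes, and the names keepShortTerms and DEFAULT_KEEP_SHORT_TERMS are assumptions taken from the issue description rather than anything in the PR.  The existing two-argument constructor keeps its signature and delegates to the new three-argument one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an n-gram filter; NOT Lucene's actual class. */
final class NGramSketch {
    // Assumed default: keep today's behavior, where short terms are dropped.
    static final boolean DEFAULT_KEEP_SHORT_TERMS = false;

    private final int minGram;
    private final int maxGram;
    private final boolean keepShortTerms;

    /** Existing signature, left unchanged so current callers still compile. */
    NGramSketch(int minGram, int maxGram) {
        this(minGram, maxGram, DEFAULT_KEEP_SHORT_TERMS);
    }

    /** New overload that carries the extra option. */
    NGramSketch(int minGram, int maxGram, boolean keepShortTerms) {
        if (minGram < 1 || minGram > maxGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        this.minGram = minGram;
        this.maxGram = maxGram;
        this.keepShortTerms = keepShortTerms;
    }

    /** Returns the n-grams of one term, ordered by start offset, then length. */
    List<String> grams(String term) {
        List<String> out = new ArrayList<>();
        if (term.length() < minGram) {
            // Too short to produce any gram: emit as-is only if asked to.
            if (keepShortTerms) {
                out.add(term);
            }
            return out;
        }
        for (int start = 0; start + minGram <= term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                out.add(term.substring(start, start + len));
            }
        }
        return out;
    }
}
```

With this shape, code written against the two-argument constructor behaves exactly as before, and only callers that opt in through the new overload see short terms preserved.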

The only question about that is whether the existing constructor should be deprecated in stable and removed in master.  I'm not sure who to ask.

There are some variable renames.  They don't look like problems, especially because the visibility is private, but I'd like to get the opinion of someone who has deeper Lucene knowledge.

I'm having a difficult time following the modifications to the filter logic.  Some of the modifications look like they're not directly related to implementing this issue, but I can't tell for sure.


> NGram filters -- add option to keep short terms
> -----------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for users.  I am not suggesting that the default behavior be changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that defaults to false, to allow the short terms to be preserved.



