Posted to dev@lucene.apache.org by "Todd Feak (JIRA)" <ji...@apache.org> on 2008/12/13 00:46:44 UTC

[jira] Created: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
--------------------------------------------------------------------

                 Key: LUCENE-1491
                 URL: https://issues.apache.org/jira/browse/LUCENE-1491
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.4
            Reporter: Todd Feak


If a token is encountered in the stream that is shorter than the min gram size, the filter will stop processing the token stream.

Working up a unit test now, but may be a few days before I can provide it. Wanted to get it in the system.
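The failure mode and the fix can be sketched outside Lucene. The helper names below are illustrative, not the actual EdgeNGramTokenFilter source; the sketch only mirrors the reported behavior:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {

    // Front-edge n-grams of a single token, from minGram to maxGram characters.
    public static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }

    // Reported buggy behavior: the first token shorter than minGram ends
    // processing, so every token after it in the stream is lost.
    public static List<String> filterBuggy(List<String> tokens, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < minGram) {
                return out; // stops the whole stream: the bug
            }
            out.addAll(edgeNGrams(token, minGram, maxGram));
        }
        return out;
    }

    // Patched behavior: skip the short token and keep consuming the stream.
    public static List<String> filterFixed(List<String> tokens, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < minGram) {
                continue; // skip, don't stop
            }
            out.addAll(edgeNGrams(token, minGram, maxGram));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a", "banana", "kiwi");
        System.out.println(filterBuggy(tokens, 3, 4)); // []
        System.out.println(filterFixed(tokens, 3, 4)); // [ban, bana, kiw, kiwi]
    }
}
```

With minGram=3, the buggy loop hits the one-character token "a" first and produces nothing at all, while the fixed loop still grams "banana" and "kiwi".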

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-1491:
-------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])
    Fix Version/s: 2.9
         Assignee: Otis Gospodnetic



[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715661#action_12715661 ] 

Otis Gospodnetic commented on LUCENE-1491:
------------------------------------------

I'm not 100% sure - I'm not using ngrams at the moment, so I have no place to test this out, but skipping tokens shorter than the minimal ngram size seems like it would result in silent data loss.

Ah, here's an example:
What would happen to "to be or not to be" if min=4 and we relied on ngrams to perform phrase queries?

All of those terms would be dropped, so a search for "to be or not to be" would result in 0 hits.

If the above is correct, I think this sounds like a bad thing that one wouldn't expect...



[jira] Resolved: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-1491.
--------------------------------------

    Resolution: Fixed

Thanks Todd & Co.

Sending        CHANGES.txt
Sending        analyzers/src/java/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.java
Sending        analyzers/src/java/org/apache/lucene/analysis/ngram/NGramTokenFilter.java
Sending        analyzers/src/test/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilterTest.java
Sending        analyzers/src/test/org/apache/lucene/analysis/ngram/NGramTokenFilterTest.java
Transmitting file data .....
Committed revision 794034.




[jira] Updated: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Todd Feak (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Feak updated LUCENE-1491:
------------------------------

    Attachment: LUCENE-1491.patch

Patch includes tests to highlight the broken EdgeNGramTokenFilter and NGramTokenFilter, plus fixes for both.



[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712 ] 

Karl Wettin commented on LUCENE-1491:
-------------------------------------

Although you have a valid point I'd like to argue this a bit. 

My arguments will probably be considered silly by some. Perhaps it's just me that uses ngrams for something completely different than what everybody else does, but here we go: adding the feature as suggested by this patch is, in my view, fixing the symptoms of bad use of character ngrams.

BOL, EOL, whitespace and punctuation are all valid parts of character ngrams that can increase precision/recall quite a bit. EdgeNGrams could sort of be considered such data too. So what I'm saying here is that I consider your example a bad use of character ngrams: the whole sentence should have been grammed up. So in the case of 4-grams the output would end up as: "to b", "o be", " be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so on.
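Gramming the whole sentence, whitespace included, can be sketched like this (illustrative code, not an existing Lucene filter):

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceNGrams {

    // Slide a window of n characters over the entire text, spaces and all.
    public static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> grams = charNGrams("to be or not to be", 4);
        // first few grams: "to b", "o be", " be ", "be o", ...
        System.out.println(grams.subList(0, 4));
    }
}
```

Because the windows cross token boundaries, no word is ever shorter than the gram size, which is why the min-gram problem doesn't arise under this usage.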

Supporting what I suggest will of course mean quite a bit more work: a whole new filter that also does input text normalization, such as removing double spaces and whatnot. That will probably not be implemented anytime soon. But adding the features in the patch to the filter actually means that this use is endorsed by the community, and I'm not sure that's a good idea. I thus think it would be better to have some sort of secondary filter that did the exact same thing as the patch.

Perhaps I should leave this issue alone and do some more work on LUCENE-1306 instead.



[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Todd Feak (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656783#action_12656783 ] 

Todd Feak commented on LUCENE-1491:
-----------------------------------

The NGramTokenFilter is affected by the same bug.





[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715551#action_12715551 ] 

Otis Gospodnetic commented on LUCENE-1491:
------------------------------------------

I agree this is an improvement, but like Hoss I'm worried about silently skipping shorter-than-specified-min-ngram-size tokens.

Perhaps we need a boolean keepSmaller somewhere, so we can explicitly control the behaviour?
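A minimal sketch of the suggested keepSmaller switch, purely hypothetical: the name, signature, and placement are assumptions from this comment, not an existing Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class KeepSmallerSketch {

    // Edge n-gram each token; tokens shorter than minGram are either
    // passed through whole (keepSmaller=true) or dropped (keepSmaller=false).
    public static List<String> filter(List<String> tokens, int minGram, int maxGram,
                                      boolean keepSmaller) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < minGram) {
                if (keepSmaller) {
                    out.add(token); // pass the short token through unchanged
                }
                continue; // either way, keep consuming the stream
            }
            for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
                out.add(token.substring(0, n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("to", "banana");
        System.out.println(filter(tokens, 3, 3, false)); // [ban]
        System.out.println(filter(tokens, 3, 3, true));  // [to, ban]
    }
}
```

The cost is exactly the extra branch per short token; the benefit is that short tokens like "to" survive indexing instead of being silently dropped.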




[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716053#action_12716053 ] 

Otis Gospodnetic commented on LUCENE-1491:
------------------------------------------

I'm getting convinced to just drop ngrams < minNgram.
If nobody complains by the end of the week, I'll commit.




[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715763#action_12715763 ] 

Otis Gospodnetic commented on LUCENE-1491:
------------------------------------------

Karl - LUCENE-1306 - I agree, I think the existing edge and non-edge ngram stuff should be folded into LUCENE-1306 (or the other way around, if it's easier).

But won't the question of what we do with the chunks shorter than the min ngram remain?  Does adding that boolean hurt anything (other than an if test for every ngram :) )?



[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658308#action_12658308 ] 

Hoss Man commented on LUCENE-1491:
----------------------------------

patch looks good ... the one question i have is whether the fix meets user expectations: the patch as posted "skips" any input tokens that are shorter than the minimum ngram length ... is that what most people will expect, or will people expect shorter tokens to be passed through?

ie: should "min" be the minimum token size produced by the filters (a hard min), or should it be the minimum ngram size produced by the filter (a soft min)?

either way this patch is an improvement, i'm just wondering what we want to define the semantics to be (or if we want to make an additional option for this)



[jira] Updated: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Todd Feak (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Feak updated LUCENE-1491:
------------------------------

    Affects Version/s: 3.0
                       2.9
                       2.4.1



[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715567#action_12715567 ] 

Karl Wettin commented on LUCENE-1491:
-------------------------------------

bq. Perhaps we need boolean keepSmaller somewhere, so we can explicitly control the behaviour?

I'm not sure. Is there a use case for this or is it an XY-problem?





[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "viobade (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715849#action_12715849 ] 

viobade commented on LUCENE-1491:
---------------------------------

I think it is better to keep to the main goal of ngrams: groups of characters between min and max. If any practical situation needs a minimum ngram of one or two characters, this can be done by setting the minimum accordingly... otherwise the filter must work in the way that is expected. If I expect subwords with a minimum length of 3, why would I get a token with two characters when it does not satisfy the condition?



[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731034#action_12731034 ] 

Michael McCandless commented on LUCENE-1491:
--------------------------------------------

Otis, this one looks ready to commit?
