Posted to dev@lucene.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2009/06/02 23:53:07 UTC

[jira] Issue Comment Edited: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712 ] 

Karl Wettin edited comment on LUCENE-1491 at 6/2/09 2:51 PM:
-------------------------------------------------------------

Although you have a valid point I'd like to argue this a bit. 

My arguments will probably be considered silly by some. Perhaps it's just me who uses ngrams for something completely different than everybody else, but here we go: adding the feature as suggested by this patch is, in my view, fixing symptoms of bad use of character ngrams.

BOL, EOL, whitespace and punctuation are all valid parts of character ngrams that can increase precision/recall quite a bit. EdgeNGrams could sort of be considered such data too. So what I'm saying here is that I consider your example a bad use of character ngrams: the whole sentence should have been grammed up. In the case of 4-grams the output would end up as: "to b", "o be", " be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so on.
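For illustration, here is a minimal standalone Java sketch of what I mean by gramming up the whole sentence, whitespace included, with an optional "$" boundary marker. This is not Lucene code; the class and method names are made up for this example.

    import java.util.ArrayList;
    import java.util.List;

    public class SentenceNGrams {

        // Build character n-grams over the whole sentence, keeping whitespace.
        // If markBoundaries is true, a "$" marker is added at both ends.
        static List<String> ngrams(String text, int n, boolean markBoundaries) {
            String input = markBoundaries ? "$" + text + "$" : text;
            List<String> grams = new ArrayList<String>();
            for (int i = 0; i + n <= input.length(); i++) {
                grams.add(input.substring(i, i + n));
            }
            return grams;
        }

        public static void main(String[] args) {
            // Prints [$to , to b, o be,  be , be o, ...] for the example above.
            System.out.println(ngrams("to be or not to be", 4, true));
        }
    }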

Supporting what I suggest will of course mean quite a bit more work: a whole new filter that also does input text normalization, such as removing double spaces and whatnot. That will probably not be implemented anytime soon. But adding the features in the patch to the filter actually means that this use is endorsed by the community, and I'm not sure that's a good idea. I thus think it would be better to have some sort of secondary filter that does the exact same thing as the patch.
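As a rough sketch of the two pieces mentioned above, again in plain Java rather than as an actual Lucene TokenFilter (the names are made up for illustration): a whitespace-normalization step, and a separate step that drops tokens shorter than minGram, which is essentially what the patch builds into the ngram filter itself.

    import java.util.ArrayList;
    import java.util.List;

    public class NGramPreFilters {

        // Crude stand-in for the normalization filter suggested above:
        // collapse runs of whitespace to a single space.
        static String normalizeWhitespace(String text) {
            return text.trim().replaceAll("\\s+", " ");
        }

        // The behavior the patch adds to EdgeNGramTokenFilter, expressed as a
        // separate step instead: drop tokens shorter than minGram.
        static List<String> dropShortTokens(List<String> tokens, int minGram) {
            List<String> kept = new ArrayList<String>();
            for (String token : tokens) {
                if (token.length() >= minGram) {
                    kept.add(token);
                }
            }
            return kept;
        }
    }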

Perhaps I should leave this issue alone and do some more work on LUCENE-1306.

> EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
> --------------------------------------------------------------------
>
>                 Key: LUCENE-1491
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1491
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.4, 2.4.1, 2.9, 3.0
>            Reporter: Todd Feak
>            Assignee: Otis Gospodnetic
>             Fix For: 2.9
>
>         Attachments: LUCENE-1491.patch
>
>
> If a token is encountered in the stream that is shorter in length than the min gram size, the filter will stop processing the token stream.
> I'm working up a unit test now, but it may be a few days before I can provide it. I wanted to get this into the system.
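
For illustration only (this is not the actual filter source, and the names are made up), the reported behavior versus the expected behavior amounts to the difference between stopping at the first short token and skipping it:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class SkipVersusStop {

        // Reported (buggy) behavior: the first token shorter than minGram
        // ends processing, so later tokens are never seen.
        static List<String> stopOnShort(List<String> tokens, int minGram) {
            List<String> out = new ArrayList<String>();
            for (String token : tokens) {
                if (token.length() < minGram) {
                    return out;
                }
                out.add(token);
            }
            return out;
        }

        // Expected behavior: short tokens are skipped and the stream continues.
        static List<String> skipShort(List<String> tokens, int minGram) {
            List<String> out = new ArrayList<String>();
            for (String token : tokens) {
                if (token.length() >= minGram) {
                    out.add(token);
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> tokens = Arrays.asList("quick", "a", "brown", "fox");
            System.out.println(stopOnShort(tokens, 2)); // [quick]
            System.out.println(skipShort(tokens, 2));   // [quick, brown, fox]
        }
    }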

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org