Posted to dev@lucene.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2008/05/15 18:07:55 UTC

[jira] Issue Comment Edited: (LUCENE-1224) NGramTokenFilter creates bad TokenStream

    [ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597174#action_12597174 ] 

otis edited comment on LUCENE-1224 at 5/15/08 9:07 AM:
-------------------------------------------------------------------

Hiroaki:
I agree with Grant about unit tests.  I looked at the unit tests and thought the same thing as Grant - why is Hiroaki adding indexing/searching into the mix?  Your change is about modifying the positions of n-grams, and you don't need to index or search for that.  The test will be a lot simpler if you just test for positions, like Grant suggested.

Also, once you change the unit test this way, it will be a lot easier to play with positions and figure out what the "right" way to handle positions is.

Finally, it might turn out that people have different needs or different expectations for n-gram positions.  Thus, when making changes, perhaps you can think of a mechanism that lets the caller tell the n-gram tokenizer which token positioning approach to take (e.g. the "incremental" one, or the one based on the position of the originating token, or...).
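
To make the idea of a caller-selectable positioning approach concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is hypothetical: the class NGramPositionSketch, the PositionMode enum, and the reading of "incremental" (every n-gram advances the position by one) and "originating token" (all n-grams of one input token share that token's position) are assumptions for illustration, not the patch's design or Lucene's API. It also shows that positions can be checked directly, without indexing or searching.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of Lucene and not the attached patch. */
final class NGramPositionSketch {

    /** Illustrative positioning strategies (names are assumptions). */
    enum PositionMode {
        /** Every emitted n-gram advances the position by one. */
        INCREMENTAL,
        /** All n-grams of one input token share that token's position. */
        ORIGINATING_TOKEN
    }

    /** An n-gram plus its position increment relative to the previous gram. */
    static final class Gram {
        final String text;
        final int positionIncrement;

        Gram(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /** Emits the n-grams of a single input token using the requested mode. */
    static List<Gram> ngrams(String term, int min, int max, PositionMode mode) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = min; len <= max && start + len <= term.length(); len++) {
                // The first gram always advances by one; later grams advance
                // only in INCREMENTAL mode.
                boolean first = out.isEmpty();
                int inc = (mode == PositionMode.INCREMENTAL || first) ? 1 : 0;
                out.add(new Gram(term.substring(start, start + len), inc));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A position-only check: no index and no searcher involved.
        for (PositionMode mode : PositionMode.values()) {
            System.out.println(mode + ": " + ngrams("abc", 2, 3, mode));
        }
    }
}
```

With a switch like this, a unit test can assert on the list of (text, increment) pairs for each mode, which keeps the discussion about the "right" positions separate from indexing and query behaviour.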


      was (Author: otis):
    Hiroaki:
I agree with Grant about unit tests.  I looked at the unit tests and thought the same thing as Grant - why is Hiroaki adding indexing/searching into the mix?  Your change is about modifying the positions of n-grams, and you don't need to index or search for that.  The test will be a lot simpler if you just test for positions, like Grant suggested.
  
> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
>
>
> With the current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef", but I can't find it by querying for "abc". If I query for "ab", I get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. A query depends on the order of Tokens in the TokenStream: how stemming or phrases are analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)
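
The two orderings described in the report can be compared with a small stand-alone sketch. It is plain Java and does not use Lucene; the class NGramOrderSketch and its method names are hypothetical. It assumes that in the "current" order every gram advances the position by one, and that in the "expected" order grams sharing a start offset share a position (positionIncrement=0 for the longer ones). Each gram is printed as text@position.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the two emission orders; not Lucene code. */
final class NGramOrderSketch {

    /**
     * Order reported for the current filter: all grams of the shortest
     * length first, then the next length, each at its own position.
     */
    static List<String> byLength(String s, int min, int max) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (int len = min; len <= max; len++) {
            for (int start = 0; start + len <= s.length(); start++) {
                pos += 1; // assumed: every gram advances the position
                out.add(s.substring(start, start + len) + "@" + pos);
            }
        }
        return out;
    }

    /**
     * Order the report expects: grams grouped by start offset, with all
     * grams from the same start sharing one position.
     */
    static List<String> byStart(String s, int min, int max) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (int start = 0; start + min <= s.length(); start++) {
            pos += 1; // one position per start offset
            for (int len = min; len <= max && start + len <= s.length(); len++) {
                out.add(s.substring(start, start + len) + "@" + pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("by length: " + byLength("abc", 2, 4)); // [ab@1, bc@2, abc@3]
        System.out.println("by start:  " + byStart("abc", 2, 4));  // [ab@1, abc@1, bc@2]
    }
}
```

In the first order, a phrase-style query built from "abc" asks for ab, bc, abc at three consecutive positions. In the second, ab and abc occupy the same position, followed by bc, which matches the "(ab|abc) bc" reading in the report.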
