Posted to dev@lucene.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2008/06/13 06:50:46 UTC

[jira] Created: (LUCENE-1306) CombinedNGramTokenFilter

CombinedNGramTokenFilter
------------------------

                 Key: LUCENE-1306
                 URL: https://issues.apache.org/jira/browse/LUCENE-1306
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/analyzers
            Reporter: Karl Wettin
            Assignee: Karl Wettin
            Priority: Trivial


Alternative NGram filter that produces tokens with composite prefix and suffix markers.

{code:java}
ts = new WhitespaceTokenizer(new StringReader("hello"));
ts = new CombinedNGramTokenFilter(ts, 2, 2);
assertNext(ts, "^h");
assertNext(ts, "he");
assertNext(ts, "el");
assertNext(ts, "ll");
assertNext(ts, "lo");
assertNext(ts, "o$");
assertNull(ts.next());
{code}
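For reference, the assertNext helper above is not part of the filter itself; a minimal sketch of it, assuming the old TokenStream API where next() returns a Token (as the assertNull(ts.next()) call implies), could look like this:

{code:java}
// Minimal sketch of the assertNext test helper assumed above: it pulls the
// next token from the stream and compares its term text to the expected gram.
private static void assertNext(TokenStream ts, String expectedTerm) throws IOException {
  Token token = ts.next();
  assertNotNull(token);
  assertEquals(expectedTerm, token.termText());
}
{code}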



[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646971#action_12646971 ] 

Otis Gospodnetic commented on LUCENE-1306:
------------------------------------------

Could/should this not be folded into the existing Ngram code in contrib?




[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Hiroaki Kawai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619480#action_12619480 ] 

Hiroaki Kawai commented on LUCENE-1306:
---------------------------------------

The files look good to me.



[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Hiroaki Kawai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605836#action_12605836 ] 

Hiroaki Kawai commented on LUCENE-1306:
---------------------------------------

First of all, my comment No. 3 was wrong, sorry: we don't have to insert a $^ token into the n-gram stream.

{quote}
I don't want separate fields for the prefix, inner and suffix grams, I want to use the same single filter at query time. 
{quote}

I agree with that. :)

Then, let's consider the phrase query case:
1. At index time, we want to store the sentence "This is a pen".
2. At query time, we want to query with "This is".

At index time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$ ^a a$ ^p pe en n$

At query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$

We can find the stored sequence because it contains the query sequence.
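Just to illustrate (this is not from the patch, and the field name "text" is made up), the query-time grams above could be run as a phrase query against the stored grams, something like:

{code:java}
// Illustration only: building a phrase query from the query-time grams of
// "This is"; assumes the grams were indexed in a field called "text".
PhraseQuery query = new PhraseQuery();
String[] grams = {"^T", "Th", "hi", "is", "s$", "^i", "is", "s$"};
for (String gram : grams) {
  query.add(new Term("text", gram));
}
{code}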

{quote}
If you are creating ngrams over multiple words, say a sentence, then I state that there should only be a prefix at the start of the sentence and a suffix at the end of the sentence, and that the grams will contain whitespace.
{quote}

If so, at query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
"^T","Th","hi","is","s "," i","is","s$"

We can't find the stored sequence because it does not contain the query sequence; an n-gram query is always a phrase query at the micro level.

+1 for prefix and suffix markers in the token.

{quote}
Note, also, that one could use the "flags" to indicate what the token is. I know that's a little up in the air just yet, but it does exist. 
{quote}

Yes, there are flags. Of course, we could use them. But I can't find a way to use them efficiently in this particular case right now.

{quote}
This would mean that no stripping of special chars is required.
{quote}

Unfortunately, stripping is done outside of the ngram filter by WhitespaceTokenizer.



[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Hiroaki Kawai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604744#action_12604744 ] 

Hiroaki Kawai commented on LUCENE-1306:
---------------------------------------

I'm sorry, I could not see what "combined" + "ngram" means from the code above. :( Can I ask you to let me know the intention?




[jira] Updated: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-1306:
--------------------------------

    Attachment: LUCENE-1306.txt

New in this patch:
 * offsets as in NGramTokenFilter
 * token types "^gram", "gram$", "^gram$" and "gram"
 * a bit of javadocs

There is also a todo I'll have to look into some other day.

{code:java}
//  todo
//  /**
//   * if true, the prefix and suffix chars do not count as part of the ngram size.
//   * E.g. '^he' has an n of 2 if true and 3 if false
//   */
//  private boolean usingBoundaryCharsPartOfN = true;
{code}

This was not quite as simple to add as I hoped it would be; I'll try to find some time to fix it before I commit.
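As a side note on the token types listed above, a downstream consumer could branch on them; a rough sketch (not from the patch):

{code:java}
// Rough sketch: with the types "^gram", "gram$", "^gram$" and "gram" described
// above, a downstream filter could tell edge grams from inner grams like this.
String type = token.type();
boolean isEdgeGram = "^gram".equals(type)
    || "gram$".equals(type)
    || "^gram$".equals(type);
{code}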




[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605721#action_12605721 ] 

Karl Wettin commented on LUCENE-1306:
-------------------------------------

I'll refine and document this patch soon. Terribly busy though. Hasty responses:

bq. Should there be a way for the client of this class to specify the prefix and suffix char? 
bq. 1. prefix and suffix chars should be configurable. Because user must choose a char that is not used in the terms.

There are getters and setters, but nothing in the constructor.
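For example, something along these lines should work (the setter names here are a guess, they are not spelled out in this thread):

{code:java}
// Guessed setter names, purely illustrative: configure the boundary chars
// after construction, since the constructor does not take them.
CombinedNGramTokenFilter filter = new CombinedNGramTokenFilter(ts, 2, 2);
filter.setPrefix('^');
filter.setSuffix('$');
{code}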

bq. Is having, for example, "^h" as the first bi-gram token really the right thing to do? Would "^he" make more sense? I know that makes it 3 characters long, but it's 2 chars from the input string. Not sure, so I'm asking.

I always considered 'start of word' and 'end of word' as a single character and a part of n. I might be wrong though. I'll have to take a look at what other people did. It would not be a very hard thing to include a setting for that.

bq. Is this primarily to distinguish between the edge and inner n-grams? If so, would it make more sense to just make use of Token type variable instead?
bq. one could use the "flags" to indicate what the token is. 

I might be missing something in your line of questioning. I don't understand how it would help to have the flag or token type, as they are not stored in the index.

I don't want separate fields for the prefix, inner and suffix grams; I want to use the same single filter at query time. I typically pass down the gram boost in the payload, evaluated on gram size, how far away the gram is from the prefix and suffix, etc.

bq. 3. If you want to do a phrase query (for example, "This is"), we have to generate $^ token in the gap to make the positions valid.

If you are creating ngrams over multiple words, say a sentence, then I state that there should only be a prefix at the start of the sentence and a suffix at the end of the sentence, and that the grams will contain whitespace. I never did phrase queries using grams, but I'd probably want a prefix and suffix around each token. This is another good reason to keep them in the same field with prefix and suffix markers in the token, no?



[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Hiroaki Kawai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605483#action_12605483 ] 

Hiroaki Kawai commented on LUCENE-1306:
---------------------------------------

After thinking for a week, I think this idea is nice.

IMHO, this might simply be renamed to NGramTokenizer. A general n-gram tokenizer accepts a sequence that has no gaps in it. Conceptually, a TokenFilter accepts a token stream (a gapped sequence), and the current NGramTokenFilter does not work well in that sense. CombinedNGramTokenFilter fills the gaps with the prefix (^) and suffix ($) markers, so the token stream virtually becomes a simple stream again and n-grams work nicely again.
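A small illustration of that gap filling, using the 2,2 setup from the description (the expected grams are inferred from the "hello" example above, not taken from a test):

{code:java}
// Each word keeps its own ^ and $ markers, so the gaps between words stay
// visible in the gram stream (old TokenStream API where next() returns a Token).
TokenStream ts = new WhitespaceTokenizer(new StringReader("to be"));
ts = new CombinedNGramTokenFilter(ts, 2, 2);
Token token;
while ((token = ts.next()) != null) {
  System.out.println(token.termText());   // ^t  to  o$  ^b  be  e$
}
{code}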

Comments:
1. The prefix and suffix chars should be configurable, because the user must choose a char that is not used in the terms.
2. The prefix and suffix might be whitespace chars, because most users are not interested in the whitespace itself.
3. If you want to do a phrase query (for example, "This is"), we have to generate a $^ token in the gap to make the positions valid.
4. The n-gram algorithm should be rewritten to make the positions valid. Please see LUCENE-1225.

I think "^h" is OK, because prefix and suffix are the chars that was introduced as a workaround.




[jira] Updated: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-1306:
--------------------------------

    Attachment: LUCENE-1306.txt



[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605704#action_12605704 ] 

Grant Ingersoll commented on LUCENE-1306:
-----------------------------------------

Note, also, that one could use the "flags" to indicate what the token is.  I know that's a little up in the air just yet, but it does exist.  This would mean that no stripping of special chars is required.
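A rough sketch of that idea (the flag constants below are made up; only the Token flags bitset itself, via getFlags/setFlags, is real):

{code:java}
// Made-up flag values; the Token "flags" bitset exists (still experimental),
// so edge grams could be marked without embedding ^/$ characters in the term.
static final int PREFIX_GRAM_FLAG = 1;
static final int SUFFIX_GRAM_FLAG = 1 << 1;

static Token markEdgeGram(Token gram, boolean isPrefix, boolean isSuffix) {
  int flags = gram.getFlags();
  if (isPrefix) flags |= PREFIX_GRAM_FLAG;
  if (isSuffix) flags |= SUFFIX_GRAM_FLAG;
  gram.setFlags(flags);
  return gram;
}
{code}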



[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604756#action_12604756 ] 

Karl Wettin commented on LUCENE-1306:
-------------------------------------

The current NGram analysis in trunk is split in two: one for edge grams and one for inner grams.

This patch combines them in a single filter that uses ^prefix and suffix$ markers if a gram is some sort of edge gram, or both around the complete token if n is large enough. There is also a method to extend if you want to add a payload (more boost to edge grams or something) or do something to the gram tokens depending on what part of the original token they contain.
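A hedged sketch of what such an extension might do with a gram token (where exactly this hook lives in the patch is not shown in this thread, so treat the surrounding shape as an assumption):

{code:java}
// Illustrative only: give edge grams a larger boost via a single-byte payload,
// to be read back at query time when scoring.
static Token boostGram(Token gram) {
  String term = gram.termText();
  boolean edge = term.startsWith("^") || term.endsWith("$");
  byte[] data = { (byte) (edge ? 20 : 10) };
  gram.setPayload(new Payload(data));
  return gram;
}
{code}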



[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605120#action_12605120 ] 

Otis Gospodnetic commented on LUCENE-1306:
------------------------------------------

Should there be a way for the client of this class to specify the prefix and suffix char?

Is having, for example, "^h" as the first bi-gram token really the right thing to do?  Would "^he" make more sense?  I know that makes it 3 characters long, but it's 2 chars from the input string.  Not sure, so I'm asking.

Is this primarily to distinguish between the edge and inner n-grams?  If so, would it make more sense to just make use of the Token type variable instead?

