Posted to dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2013/10/02 10:49:25 UTC

[jira] [Created] (LUCENE-5253) add NGramSynonymTokenizer

Koji Sekiguchi created LUCENE-5253:
--------------------------------------

             Summary: add NGramSynonymTokenizer
                 Key: LUCENE-5253
                 URL: https://issues.apache.org/jira/browse/LUCENE-5253
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Koji Sekiguchi
            Priority: Minor


I'd like to propose another n-gram tokenizer that can process synonyms: NGramSynonymTokenizer. Note that in this ticket the gram size is fixed, i.e. minGramSize = maxGramSize.

Today, I think we have the following problems when using SynonymFilter with NGramTokenizer.
For the purpose of illustration, assume a synonym setting "ABC, DEFG" w/ expand=true and N = 2 (2-gram).

# There is no consensus (I think :-) on how to assign offsets to the generated synonym tokens DE, EF and FG when expanding the source tokens AB and BC.
# If the query looks like XABC or ABCY, it cannot match even if a document containing "…XABCY…" is in the index when autoGeneratePhraseQueries is set to true, because there are no "XA" or "CY" tokens in the index.

NGramSynonymTokenizer can solve these problems by behaving as follows.

* NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and doesn't tokenize registered words, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|ABC|AB/DE/BC/EF/FG|ABC/DEFG|

* Immediately before and after the registered words, NGramSynonymTokenizer generates *extra* tokens w/ posInc=0. e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|XYZABC123|XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23|XY/YZ/Z/ABC/DEFG/1/12/23|

In the above sample, "Z" and "1" are the extra tokens.
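
To make the two rules above concrete, here is a minimal sketch in Python (not the Lucene API and not the proposed implementation) that reproduces the NGramSynonymTokenizer column of both tables. The function names tokenize, emit_segment and ngrams, and the token labels "gram", "extra", "word" and "synonym", are hypothetical and exist only for this illustration. It assumes registered words are matched left to right with the longest word winning, and that an extra token is the N-1 characters adjacent to a registered word, which gives the single characters "Z" and "1" for N = 2; the ticket does not say what should happen for other N. Offsets and position increments are left out on purpose.

```python
def ngrams(text, n):
    """Return all contiguous n-character grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def emit_segment(tokens, segment, n, after_word, before_word):
    """Append tokens for a run of text that contains no registered word.

    Assumption: an extra token is the n-1 characters that touch a
    registered word (one character when n == 2).
    """
    if not segment:
        return
    edge = n - 1
    if edge > 0 and len(segment) <= edge:
        # Too short to form any n-gram: keep it only if it touches a word.
        if after_word or before_word:
            tokens.append((segment, "extra"))
        return
    if edge > 0 and after_word:
        tokens.append((segment[:edge], "extra"))
    tokens.extend((g, "gram") for g in ngrams(segment, n))
    if edge > 0 and before_word:
        tokens.append((segment[-edge:], "extra"))


def tokenize(text, synonym_groups, n=2):
    """Return (term, kind) pairs; registered words are never n-grammed."""
    lookup = {w: group for group in synonym_groups for w in group}
    longest_first = sorted(lookup, key=len, reverse=True)
    tokens = []
    seg_start = 0
    prev_was_word = False
    i = 0
    while i < len(text):
        match = next((w for w in longest_first if text.startswith(w, i)), None)
        if match is None:
            i += 1
            continue
        # Flush the plain text that precedes the registered word.
        emit_segment(tokens, text[seg_start:i], n, prev_was_word, True)
        tokens.append((match, "word"))
        # expand=true: emit the other members of the synonym group.
        tokens.extend((s, "synonym") for s in lookup[match] if s != match)
        i += len(match)
        seg_start = i
        prev_was_word = True
    # Flush whatever follows the last registered word.
    emit_segment(tokens, text[seg_start:], n, prev_was_word, False)
    return tokens


if __name__ == "__main__":
    groups = [["ABC", "DEFG"]]
    for source in ("ABC", "XYZABC123"):
        print(source, "->", "/".join(t for t, _ in tokenize(source, groups)))
```

Run as a script, this prints the token sequence for each of the two source texts used in the tables. A real tokenizer would additionally have to set offsets and position increments for each token, which is exactly the part this sketch does not attempt.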




--
This message was sent by Atlassian JIRA
(v6.1#6144)
