Posted to solr-dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2007/09/15 04:28:32 UTC
[jira] Commented: (SOLR-319) changes SynonymFilterFactory to
"Analyze" synonyms file
[ https://issues.apache.org/jira/browse/SOLR-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527682 ]
Koji Sekiguchi commented on SOLR-319:
-------------------------------------
Absolutely. I'll try to change my patch to implement the fieldtype idea. Thank you.
> changes SynonymFilterFactory to "Analyze" synonyms file
> -------------------------------------------------------
>
> Key: SOLR-319
> URL: https://issues.apache.org/jira/browse/SOLR-319
> Project: Solr
> Issue Type: Improvement
> Reporter: Koji Sekiguchi
> Priority: Minor
> Attachments: SOLR-319.patch
>
>
> WHAT:
> Currently, SynonymFilterFactory works very well with N-gram tokenizers (CJKTokenizer, for example),
> but the rules in synonyms.txt have to be written with care.
> For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for CJK characters) and want C1C2C3 to map to C4C5C6,
> I have to write the rule as follows:
> C1C2 C2C3 => C4C5 C5C6
> But I want to write it as "C1C2C3=>C4C5C6". This patch allows that. It also helps with sharing synonyms.txt.
> HOW:
> A tokenFactory attribute is added to <filter class="solr.SynonymFilterFactory"/>.
> If the attribute is specified, SynonymFilterFactory uses the named TokenizerFactory to create a Tokenizer,
> and then uses that Tokenizer to get tokens from the rules in the synonyms.txt file.
> sample-1: CJKTokenizer
> <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.CJKTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ja.txt"
>             ignoreCase="true" expand="true" tokenFactory="solr.CJKTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.CJKTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldtype>
> sample-2: NGramTokenizer
> <fieldtype name="text_ngram" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ngram.txt"
>             ignoreCase="true" expand="true"
>             tokenFactory="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldtype>
> backward compatibility:
> Yes. If you omit the tokenFactory attribute from the <filter class="solr.SynonymFilterFactory"/> tag, it works as before.
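
To illustrate the idea in the description above, here is a minimal sketch (not the patch itself, which is Java code inside SynonymFilterFactory). It assumes a simplified bi-gram tokenizer as a stand-in for CJKTokenizer and handles only a single "lhs => rhs" rule; comma-separated alternatives and comments in synonyms.txt are not handled. The names bigrams and parse_rule are hypothetical, and the characters 一二三四五六 stand in for C1..C6.

```python
def bigrams(text):
    """Split text into overlapping 2-character tokens.

    Simplified stand-in for a bi-gram tokenizer; the real CJKTokenizer
    has additional behavior that is not modeled here.
    """
    text = text.strip()
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def parse_rule(line, tokenize=str.split):
    """Parse one 'lhs => rhs' rule into (lhs_tokens, rhs_tokens).

    With the default tokenize (whitespace split) the rule must already be
    written token by token. Passing a tokenizer mimics specifying the
    tokenFactory attribute: each side of the rule is analyzed.
    """
    lhs, rhs = line.split("=>", 1)
    return tokenize(lhs), tokenize(rhs)


# Rule written by hand in bi-gram form, parsed without a tokenizer.
by_hand = parse_rule("一二 二三 => 四五 五六")

# Rule written naturally, analyzed with the bi-gram tokenizer.
analyzed = parse_rule("一二三=>四五六", tokenize=bigrams)

print(by_hand == analyzed)  # prints True
```

Leaving tokenize at its default corresponds to omitting tokenFactory: rules are split on whitespace only, so existing hand-written files keep working.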