You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Tomoko Uchida (Jira)" <ji...@apache.org> on 2020/02/01 06:31:00 UTC
[jira] [Resolved] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

     [ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomoko Uchida resolved LUCENE-9123.
-----------------------------------
    Resolution: Fixed

Merged the patches into the master (with a MIGRATE notice) and branch_8x.
Thanks [~h.kazuaki]!

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Kazuaki Hiraga
>            Assignee: Tomoko Uchida
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with both of SynonymGraphFilter and SynonymFilter when JT generates multiple tokens as an output. If we use `mode=normal`, it should be fine. However, we would like to use decomposed tokens that can maximize to chance to increase recall.
> Snippet of schema:
> {code:xml}
>     <fieldType name="text_custom_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>       <analyzer>
>         <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>         <filter class="solr.SynonymGraphFilterFactory"
>                     synonyms="lang/synonyms_ja.txt"
>                     tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>         <filter class="solr.JapaneseBaseFormFilterFactory"/>
>         <!-- Removes tokens with certain part-of-speech tags -->
>         <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>         <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
>         <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" /> -->
>         <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
>         <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>         <!-- Lower-cases romaji characters -->
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> {code}
> An synonym entry that generates error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org