Posted to issues@lucene.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/11/23 07:57:00 UTC

[jira] [Commented] (LUCENE-9581) Clarify discardCompoundToken behavior in the JapaneseTokenizer

    [ https://issues.apache.org/jira/browse/LUCENE-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237176#comment-17237176 ] 

ASF subversion and git services commented on LUCENE-9581:
---------------------------------------------------------

Commit a5d0654a2469c92bf02497e8fd18587058cb1a96 in lucene-solr's branch refs/heads/master from jimczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a5d0654 ]

LUCENE-9581: Japanese tokenizer should discard the compound token instead of disabling the decomposition of long tokens when discardCompoundToken is activated.
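
To make the difference concrete, here is a minimal sketch of the three outputs under discussion. It is a toy model in plain Java, not the Lucene implementation: the class CompoundTokenSketch, the tokens method, and the pre-split parts list are hypothetical names introduced only for illustration, and the romanized strings stand in for a compound such as 関西国際空港 and its parts. The placement of the compound right after the first part is also an assumption of the model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the emitted tokens; hypothetical, not Lucene code. */
public class CompoundTokenSketch {

    /**
     * @param compound surface form of the long token
     * @param parts its decomposition (assumed to be given; the real
     *              tokenizer derives it from the dictionary)
     * @param decompose true models a decompounding mode such as SEARCH,
     *                  false models NORMAL
     * @param discardCompoundToken models the option after this commit:
     *                  keep the parts, drop the compound
     */
    public static List<String> tokens(String compound, List<String> parts,
                                      boolean decompose, boolean discardCompoundToken) {
        List<String> out = new ArrayList<>();
        if (!decompose || parts.isEmpty()) {
            // NORMAL-like behavior: the long token is kept whole.
            out.add(compound);
            return out;
        }
        for (int i = 0; i < parts.size(); i++) {
            out.add(parts.get(i));
            // Without the option, the compound is emitted alongside the parts
            // (in this model, right after the first part).
            if (i == 0 && !discardCompoundToken) {
                out.add(compound);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String compound = "kansaikokusaikuukou";
        List<String> parts = Arrays.asList("kansai", "kokusai", "kuukou");

        System.out.println("NORMAL:                  " + tokens(compound, parts, false, false));
        System.out.println("decompose, keep compound: " + tokens(compound, parts, true, false));
        System.out.println("decompose, discard:       " + tokens(compound, parts, true, true));
    }
}
```

In this model, the behavior described in the issue (option set to true yields the same output as NORMAL) corresponds to the first call, while the behavior described in the commit message corresponds to the third call, where only the parts remain.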


> Clarify discardCompoundToken behavior in the JapaneseTokenizer
> --------------------------------------------------------------
>
>                 Key: LUCENE-9581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9581
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: LUCENE-9581.patch, LUCENE-9581.patch, LUCENE-9581.patch
>
>
> At first sight, the discardCompoundToken option added in LUCENE-9123 seems redundant with the NORMAL mode of the Japanese tokenizer. When set to true, the current behavior is to disable the decomposition of compounds, which is exactly what the NORMAL mode does.
> So I wonder whether the right semantics for the option would be to keep only the decomposition of the compound, or whether the option is really needed at all. If the goal is to make the output compatible with a graph token filter, the current workaround of setting the mode to NORMAL should be enough.
> That's consistent with the mode that should be used to preserve positions in the index, since we don't handle position length on the indexing side.
> Am I missing something regarding the new option? Is there a compelling case where it differs from the NORMAL mode?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
