Posted to dev@lucene.apache.org by "Christian Moen (JIRA)" <ji...@apache.org> on 2012/06/08 10:52:23 UTC

[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

    [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291635#comment-13291635 ] 

Christian Moen commented on SOLR-3524:
--------------------------------------

Hiraga-san, there are different views on how punctuation characters are best handled by tokenizers.  Punctuation characters generally don't convey much meaning that is useful for text search, so they are generally removed in Lucene. (A different point of view is that tokenizers shouldn't remove punctuation and that filters should do this instead.)

The ability to keep punctuation was left as an expert feature in JapaneseTokenizer, and I think we can expose this as an expert feature in Solr as well.  Could you share some details on your use case, just so that I get a better idea of the background and importance of this?
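
To make the idea concrete, here is a minimal sketch of how a factory might read such an option from its configuration arguments. It is only an illustration under stated assumptions: the class name DiscardPunctuationArg and the attribute name "discardPunctuation" are hypothetical, and a plain Map stands in for whatever argument handling the real factory uses. The default of true mirrors the current behavior described in the issue, where punctuation is always discarded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Solr: shows one way a
// factory could resolve an optional boolean attribute from its args.
final class DiscardPunctuationArg {

    // Assumed attribute name; the issue does not specify the spelling.
    static final String DISCARD_PUNCTUATION = "discardPunctuation";

    // Returns true (discard punctuation) when the attribute is absent,
    // which keeps the existing behavior for schemas that do not set it.
    static boolean parseDiscardPunctuation(Map<String, String> args) {
        String value = args.get(DISCARD_PUNCTUATION);
        if (value == null) {
            return true;
        }
        // Boolean.parseBoolean is case-insensitive and treats anything
        // other than "true" as false.
        return Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] unused) {
        Map<String, String> args = new HashMap<>();
        System.out.println(parseDiscardPunctuation(args)); // attribute absent

        args.put(DISCARD_PUNCTUATION, "false");
        System.out.println(parseDiscardPunctuation(args)); // explicitly disabled
    }
}
```

Defaulting to true when the attribute is missing would mean existing schema.xml files keep working unchanged, and only users who explicitly opt in would see punctuation tokens.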

> Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-3524
>                 URL: https://issues.apache.org/jira/browse/SOLR-3524
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6
>            Reporter: Kazuaki Hiraga
>            Priority: Minor
>         Attachments: kuromoji_discard_punctuation.patch.txt
>
>
> JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve punctuation in Japanese text, although it has a parameter to change this behavior.  JapaneseTokenizerFactory always sets the third parameter, which controls this behavior, to true, so punctuation is always removed.
> I would like an option to configure this behavior in the fieldType definition in schema.xml.
