Posted to dev@lucene.apache.org by "Kazuaki Hiraga (JIRA)" <ji...@apache.org> on 2012/06/08 04:57:22 UTC

[jira] [Created] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Kazuaki Hiraga created SOLR-3524:
------------------------------------

             Summary: Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
                 Key: SOLR-3524
                 URL: https://issues.apache.org/jira/browse/SOLR-3524
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
    Affects Versions: 3.6
            Reporter: Kazuaki Hiraga
            Priority: Minor


JapaneseTokenizer (Kuromoji) does not provide a configuration option to preserve punctuation in Japanese text, although it has a parameter to change this behavior.  JapaneseTokenizerFactory always sets the third parameter, which controls this behavior, to true so that punctuation is removed.
I would like an option to configure this behavior in the fieldtype definition in schema.xml.
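
A minimal sketch of the requested option, under stated assumptions: the class below is a stand-in and not the actual JapaneseTokenizerFactory, and the attribute name "discardPunctuation" is hypothetical. It only shows how a factory could read the flag from its fieldtype arguments and default to true, so that existing configurations keep removing punctuation.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for a tokenizer factory; not the actual JapaneseTokenizerFactory. */
class PunctuationOptionSketch {
    // Hypothetical attribute name for the fieldtype definition in schema.xml.
    static final String DISCARD_PUNCTUATION = "discardPunctuation";

    /**
     * Reads the flag from the factory arguments. A missing attribute means
     * true, which matches the current hard-coded behavior.
     */
    static boolean readDiscardPunctuation(Map<String, String> args) {
        String value = args.get(DISCARD_PUNCTUATION);
        return value == null ? true : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] argv) {
        Map<String, String> args = new HashMap<>();
        // No attribute set: punctuation is still discarded.
        System.out.println(readDiscardPunctuation(args));
        // Attribute set to "false": punctuation would be preserved.
        args.put(DISCARD_PUNCTUATION, "false");
        System.out.println(readDiscardPunctuation(args));
    }
}
```

The value returned here would then be passed as the third parameter mentioned above, in place of the constant true.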

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Posted by "Jun Ohtani (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Ohtani updated SOLR-3524:
-----------------------------

    Attachment: kuromoji_discard_punctuation.patch.txt

I have created a patch.
It does not include tests yet.
                


[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Posted by "Kazuaki Hiraga (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291836#comment-13291836 ] 

Kazuaki Hiraga commented on SOLR-3524:
--------------------------------------

Thank you guys!
Christian, some documents contain keywords that consist of letters and punctuation, such as c++ and c#, and we want to match those keywords in their unchanged form. Of course, we will discard punctuation in many cases, but in some cases, especially for short text, we want to preserve it. Therefore, I would like an option to control this behaviour.

Ohtani-san, thank you for your early reply and patch! 
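
The c++ and c# case above can be illustrated with a toy sketch. The tokenizer below is hypothetical and does not reproduce Kuromoji's segmentation; it only shows that once punctuation is discarded, both keywords reduce to the same token and can no longer be told apart.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer for illustration only; it does not reproduce Kuromoji's segmentation. */
class ToyPunctuationTokenizer {
    static List<String> tokenize(String text, boolean discardPunctuation) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isLetterOrDigit(ch)) {
                word.append(ch);
                continue;
            }
            // Any non-alphanumeric character ends the current word.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            // Whitespace is always dropped; other symbols are kept only on request.
            if (!Character.isWhitespace(ch) && !discardPunctuation) {
                tokens.add(String.valueOf(ch));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Discarding punctuation: both keywords collapse to the same token.
        System.out.println(tokenize("c++", true));
        System.out.println(tokenize("c#", true));
        // Preserving punctuation: the two keywords stay distinguishable.
        System.out.println(tokenize("c++", false));
        System.out.println(tokenize("c#", false));
    }
}
```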
                


[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Posted by "Jun Ohtani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291787#comment-13291787 ] 

Jun Ohtani commented on SOLR-3524:
----------------------------------

Hi Christian,

Sorry, I created the patch against version 3.6.0.
                


[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Posted by "Christian Moen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291792#comment-13291792 ] 

Christian Moen commented on SOLR-3524:
--------------------------------------

No trouble.  I'll provide a new patch shortly for {{trunk}} and {{branch_4x}} with a test as well.
                


[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Posted by "Christian Moen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291635#comment-13291635 ] 

Christian Moen commented on SOLR-3524:
--------------------------------------

Hiraga-san, there are different views on how punctuation characters are best handled by tokenizers.  Punctuation characters generally don't convey much meaning that is useful for text search, so they are generally removed in Lucene. (A different point of view is that tokenizers shouldn't remove punctuation and that filters should do this instead.)

The ability to keep punctuation was left as an expert feature in JapaneseTokenizer, and I think we can expose it as an expert feature in Solr as well.  Could you share some details on your use case, so that I get a better idea of the background and importance of this?


  

                


[jira] [Updated] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Posted by "Christian Moen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Moen updated SOLR-3524:
---------------------------------

    Attachment: SOLR-3524.patch
    


[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Posted by "Christian Moen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291643#comment-13291643 ] 

Christian Moen commented on SOLR-3524:
--------------------------------------

Ohtani-san, thanks for the patch!

I've tried it on {{trunk}}, and applying it fails because an {{InitializationException}} is thrown there instead of a {{SolrException}}.  I'll correct this shortly.

We also need some tests here...
                


[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Posted by "Christian Moen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291807#comment-13291807 ] 

Christian Moen commented on SOLR-3524:
--------------------------------------

New patch with tests and documentation changes attached.
                