Posted to dev@lucene.apache.org by "Christian Moen (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 08:55:54 UTC

[jira] [Created] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Need stopwords and stoptags lists for default Japanese configuration
--------------------------------------------------------------------

                 Key: LUCENE-3745
                 URL: https://issues.apache.org/jira/browse/LUCENE-3745
             Project: Lucene - Java
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: Christian Moen


Stopwords and stoptags lists for Japanese need to be developed, tested and integrated into Lucene.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200339#comment-13200339 ] 

Christian Moen commented on LUCENE-3745:
----------------------------------------

I'm attaching some lexical assets that are useful for building stopwords and stoptags lists.

The frequency lists are made from ~1.5 million segmented Japanese Wikipedia documents after some scrubbing and handling.  I'd prefer to use a more balanced corpus, but I believe Wikipedia will be fine for this purpose.

The following files are attached in TSV format using UTF-8 encoding:

* {{top-pos.txt}} - Part-of-speech tag distribution
* {{top-100000.txt}} - Top 100,000 most frequent surface forms and their frequencies
* {{top-1000000-pos.txt}} - Top 1,000,000 most frequent surface form and part-of-speech tag combinations and their frequencies

There's also a tool, {{filter_stoptags.py}}, attached that reads a set of stoptags and evaluates it on {{top-1000000-pos.txt}} to give us an idea of what passes through any given stoptag set.

An example with my current stoptag set is given below.

{noformat}
filter_stoptags.py -s stoptags.txt top-1000000-pos.txt
stop: 、        freq: 14426806  pos: 記号-読点
stop: の        freq: 14212851  pos: 助詞-連体化
stop: 。        freq: 10553747  pos: 記号-句点
stop: は        freq: 8956177   pos: 助詞-係助詞
stop: に        freq: 8757138   pos: 助詞-格助詞-一般
stop: を        freq: 7723958   pos: 助詞-格助詞-一般
stop:           freq: 7417005   pos: 記号-空白
stop: た        freq: 7366368   pos: 助動詞
stop: が        freq: 5427730   pos: 助詞-格助詞-一般
stop: て        freq: 4874861   pos: 助詞-接続助詞
pass: し        freq: 4312613   pos: 動詞-自立
stop: で        freq: 3702106   pos: 助詞-格助詞-一般
stop:           freq: 3485125   pos: 記号-空白
stop: )        freq: 3049861   pos: 記号-括弧閉
stop: (        freq: 3045461   pos: 記号-括弧開
pass: れ        freq: 2722773   pos: 動詞-接尾
pass: さ        freq: 2441965   pos: 動詞-自立
stop: で        freq: 2403133   pos: 助動詞
stop: ・        freq: 2250725   pos: 記号-一般
stop: も        freq: 1962142   pos: 助詞-係助詞
pass: する      freq: 1959374   pos: 動詞-自立
pass: いる      freq: 1937789   pos: 動詞-非自立
stop: と        freq: 1927529   pos: 助詞-格助詞-引用
pass: 年        freq: 1796435   pos: 名詞-接尾-助数詞
stop: 「        freq: 1701848   pos: 記号-括弧開
stop: と        freq: 1697926   pos: 助詞-格助詞-一般
stop: 」        freq: 1672052   pos: 記号-括弧閉
stop: から      freq: 1414661   pos: 助詞-格助詞-一般
stop: ある      freq: 1400235   pos: 助動詞
stop:           freq: 1319235   pos: 記号-空白
pass: こと      freq: 1272503   pos: 名詞-非自立-一般
stop: な        freq: 1254673   pos: 助動詞
stop: が        freq: 1110771   pos: 助詞-接続助詞
pass: の        freq: 1037815   pos: 名詞-非自立-一般
stop: として    freq: 1002940   pos: 助詞-格助詞-連語
stop:           freq: 989166    pos: 記号-空白
pass: い        freq: 923836    pos: 動詞-非自立
(...)
{noformat}
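
For readers without the attachment at hand, the idea behind the output above can be sketched in a few lines of Python. This is not the attached {{filter_stoptags.py}}; it is a minimal sketch that assumes each line of the frequency file holds surface form, part-of-speech tag and frequency separated by tabs, in that order, and that a stoptag matches only when the whole tag is equal. The function names {{load_stoptags}} and {{classify}} are made up for illustration.

```python
import io


def load_stoptags(lines):
    """Collect stoptags from an iterable of lines, skipping blanks and '#' comments."""
    tags = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            tags.add(line)
    return tags


def classify(lines, stoptags):
    """Yield (verdict, surface, freq, pos) for each TSV line.

    Assumed column order: surface<TAB>pos<TAB>freq. This layout is an
    assumption for the sketch, not something confirmed by the attachment.
    """
    for line in lines:
        # Only strip the line terminator: the surface form itself may be whitespace.
        line = line.rstrip("\r\n")
        if not line:
            continue
        surface, pos, freq = line.split("\t")
        verdict = "stop" if pos in stoptags else "pass"
        yield verdict, surface, int(freq), pos


if __name__ == "__main__":
    # Tiny in-memory stand-ins for stoptags.txt and top-1000000-pos.txt.
    stoptags = load_stoptags(io.StringIO("# sample\n助詞-連体化\n記号-読点\n"))
    sample = io.StringIO("、\t記号-読点\t14426806\nし\t動詞-自立\t4312613\n")
    for verdict, surface, freq, pos in classify(sample, stoptags):
        print(f"{verdict}: {surface}\tfreq: {freq}\tpos: {pos}")
```

Because the input is sorted by frequency, the first "pass" lines show which high-frequency terms a given stoptag set lets through.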

                



[jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200461#comment-13200461 ] 

Christian Moen commented on LUCENE-3745:
----------------------------------------

I'll submit a patch for this tomorrow.
                



[jira] [Resolved] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Robert Muir (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3745.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.6

Thanks Christian!
                



[jira] [Updated] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Christian Moen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Moen updated LUCENE-3745:
-----------------------------------

    Attachment: LUCENE-3745.patch
    



[jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200732#comment-13200732 ] 

Robert Muir commented on LUCENE-3745:
-------------------------------------

Thanks for doing this, it will be much nicer to have a properly built configuration here!

I agree with the overall approach of leaning towards the conservative side: if someone wants,
they can always be more aggressive (and use the data on this issue as a guide).




                



[jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200747#comment-13200747 ] 

Christian Moen commented on LUCENE-3745:
----------------------------------------

Thanks a lot for looking at this, Robert.  This was the thinking.  (I've referred to the issue in the stopwords and stoptags files.)
                



[jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200680#comment-13200680 ] 

Christian Moen commented on LUCENE-3745:
----------------------------------------

Please find a patch attached.

I've made {{stoptags.txt}} lighter by not stopping all prefixes and also allowing auxiliary verbs and interjections to pass.  I didn't come across any occurrences of unclassified symbols (記号) in Wikipedia, but they are now stopped, as that seems to align better with our overall approach to stopping symbols.

Many of the most frequent terms that now pass have been re-introduced in {{stopwords.txt}} so they are stopped using a {{StopFilter}} instead of {{KuromojiPartOfSpeechStopFilter}}.  I believe this configuration is more balanced.
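
To make the split between the two lists concrete, here is a minimal Python sketch of the idea. It is not Lucene code and does not use the Lucene API: {{pos_stop_filter}} and {{stopword_filter}} are hypothetical stand-ins for {{KuromojiPartOfSpeechStopFilter}} and {{StopFilter}}, tokens are plain (surface, tag) pairs, tag matching is assumed to be exact, and the order of the two steps is an assumption. The sample tags and words are illustrative, not the contents of the patch.

```python
def pos_stop_filter(tokens, stoptags):
    """Drop tokens whose part-of-speech tag is in stoptags (exact match assumed)."""
    return [(surface, pos) for surface, pos in tokens if pos not in stoptags]


def stopword_filter(tokens, stopwords):
    """Drop tokens whose surface form is in stopwords, regardless of tag."""
    return [(surface, pos) for surface, pos in tokens if surface not in stopwords]


def analyze(tokens, stoptags, stopwords):
    """Apply the tag-based filter first, then the word-based filter."""
    return stopword_filter(pos_stop_filter(tokens, stoptags), stopwords)


if __name__ == "__main__":
    # Illustrative values only; the real lists live in stoptags.txt and stopwords.txt.
    stoptags = {"助詞-格助詞-一般"}   # auxiliary verbs (助動詞) are not listed here
    stopwords = {"ある"}              # so a frequent auxiliary is stopped by word instead
    tokens = [
        ("年", "名詞-接尾-助数詞"),
        ("に", "助詞-格助詞-一般"),
        ("ある", "助動詞"),
    ]
    print(analyze(tokens, stoptags, stopwords))
```

With a lighter tag list, a whole category such as auxiliary verbs passes the first step, and only the specific high-frequency forms named in the word list are removed by the second.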

Overall, I've used the term frequencies attached to this issue as a governing guideline for what to introduce into {{stopwords.txt}}.  It mostly contains hiragana words and expressions and I've deliberately left out common kanji as I'd like to keep the stopping fairly light.
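
One way to pull hiragana-only candidates out of a frequency list is sketched below in Python. This is an illustration of the selection idea, not the procedure actually used for the patch: the helper names are hypothetical, the input is assumed to be already parsed into (surface, frequency) pairs, and "hiragana" is taken to mean the Unicode Hiragana block (U+3040 to U+309F). The result is only a list of candidates that would still need manual review.

```python
def is_hiragana(text):
    """True when text is non-empty and every character is in U+3040..U+309F."""
    return bool(text) and all("\u3040" <= ch <= "\u309f" for ch in text)


def stopword_candidates(rows, limit):
    """Return up to `limit` hiragana-only surface forms, most frequent first.

    `rows` is an iterable of (surface, freq) pairs, for example parsed from a
    frequency list such as top-100000.txt (the parsing step is left out here).
    """
    hiragana = [(surface, freq) for surface, freq in rows if is_hiragana(surface)]
    hiragana.sort(key=lambda row: row[1], reverse=True)
    return [surface for surface, _ in hiragana[:limit]]


if __name__ == "__main__":
    # Made-up frequencies for illustration.
    rows = [("の", 100), ("年", 90), ("する", 80), ("コト", 70), ("こと", 60)]
    print(stopword_candidates(rows, 2))
```

Kanji and katakana forms such as 年 or コト never become candidates in this sketch, which mirrors the choice to keep the stopping light.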

I'll create a separate JIRA for introducing stopwords and stoptags to Solr.
                



[jira] [Updated] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Christian Moen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Moen updated LUCENE-3745:
-----------------------------------

    Attachment: filter_stoptags.py
                top-pos.txt
                top-1000000-pos.txt
                top-100000.txt
    



[jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200753#comment-13200753 ] 

Robert Muir commented on LUCENE-3745:
-------------------------------------

Let's get my previous ad-hoc lists out of there :)

I'll commit this for now and if there are any concerns we can reopen or refine in further issues.
                
