Posted to dev@lucene.apache.org by "Tomoko Uchida (JIRA)" <ji...@apache.org> on 2016/05/05 13:26:12 UTC
[jira] [Created] (LUCENE-7273) New kuromoji TokenFilter to keep tokens by part-of-speech tags
Tomoko Uchida created LUCENE-7273:
-------------------------------------
Summary: New kuromoji TokenFilter to keep tokens by part-of-speech tags
Key: LUCENE-7273
URL: https://issues.apache.org/jira/browse/LUCENE-7273
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Reporter: Tomoko Uchida
Priority: Minor
Kuromoji has JapanesePartOfSpeechStopFilter, which drops tokens by their part-of-speech (POS) tags. In some cases it would be more convenient to do the opposite and keep only the tokens whose POS tags appear in a "keep" list.
Example usage:
{code:java}
// Keep only proper nouns that are location names.
String[] tags = new String[]{"名詞-固有名詞-地域-一般"};
Set<String> keeptags = new HashSet<>();
for (String tag : tags) {
  keeptags.add(tag);
}
JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
// JapanesePartOfSpeechKeepFilter is the new filter proposed in this issue.
JapanesePartOfSpeechKeepFilter stream = new JapanesePartOfSpeechKeepFilter(tokenizer, keeptags);
{code}
{code:xml}
<!-- (Solr) analyzer definition -->
<fieldType name="text_ja_propernoun" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="normal"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.JapanesePartOfSpeechKeepFilterFactory" tags="lang/keeptags_ja.txt" />
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}
Of course, the same result can be achieved with JapanesePartOfSpeechStopFilter. However, because there are about 70 part-of-speech tags, listing every tag to stop is cumbersome when only a few POS tags are of interest.
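To make the relationship between the two approaches concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the PosToken class, the method names, and the three-tag inventory are illustrative assumptions, and the tags on the sample tokens are assigned by hand rather than produced by the tokenizer. It only shows that a keep list behaves like a stop list containing every other known tag, which is the list one would otherwise have to write out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Library-free sketch (not Lucene code): keep-list vs. stop-list filtering by POS tag.
final class PosKeepSketch {

    // Hypothetical token model: a surface form plus its part-of-speech tag.
    static final class PosToken {
        final String surface;
        final String pos;

        PosToken(String surface, String pos) {
            this.surface = surface;
            this.pos = pos;
        }
    }

    // Keep semantics (the proposal): accept a token only if its tag is listed.
    static List<String> keepByTags(List<PosToken> tokens, Set<String> keepTags) {
        List<String> out = new ArrayList<>();
        for (PosToken t : tokens) {
            if (keepTags.contains(t.pos)) {
                out.add(t.surface);
            }
        }
        return out;
    }

    // Stop semantics (the existing approach): drop a token if its tag is listed.
    static List<String> dropByTags(List<PosToken> tokens, Set<String> stopTags) {
        List<String> out = new ArrayList<>();
        for (PosToken t : tokens) {
            if (!stopTags.contains(t.pos)) {
                out.add(t.surface);
            }
        }
        return out;
    }

    // The stop list that emulates a keep list: every known tag except the kept ones.
    static Set<String> complement(Set<String> allTags, Set<String> keepTags) {
        Set<String> stop = new HashSet<>(allTags);
        stop.removeAll(keepTags);
        return stop;
    }

    public static void main(String[] args) {
        // Tags are assigned by hand for illustration, not taken from tokenizer output.
        List<PosToken> tokens = Arrays.asList(
            new PosToken("京都", "名詞-固有名詞-地域-一般"),
            new PosToken("に", "助詞-格助詞-一般"),
            new PosToken("行く", "動詞-自立"));

        // Assumed tag inventory: only the tags seen above (the real one has about 70).
        Set<String> allTags = new HashSet<>();
        for (PosToken t : tokens) {
            allTags.add(t.pos);
        }
        Set<String> keepTags = new HashSet<>(Arrays.asList("名詞-固有名詞-地域-一般"));

        // Both calls select the same tokens; the second needs the full complement list.
        System.out.println(keepByTags(tokens, keepTags));
        System.out.println(dropByTags(tokens, complement(allTags, keepTags)));
    }
}
```

With a keep list, the configuration stays as short as the set of tags of interest; with a stop list, it grows with the size of the whole tag inventory.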
I'll add a patch soon.