Posted to dev@lucene.apache.org by Ahmet Arslan <io...@yahoo.com> on 2014/03/27 10:53:01 UTC

TruncateTokenFilter FixedPrefixStemFilter

Hello,

I would like to ask whether there is interest in adding TruncateTokenFilter to Lucene.

I use this filter as a stemmer for Turkish. It is used in many academic studies (clustering, classification, retrieval), where it is called the Fixed Prefix Stemmer, the Simple Truncation Method, or F5 for short.

Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish in [1]. It is the same work from which some of the entries in stopwords_tr.txt were taken.

[1] "Information Retrieval on Turkish Texts"
http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf
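
The fixed-prefix (F5) truncation described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name FixedPrefixSketch and the method truncate are hypothetical and are not Lucene API, and counting code points rather than chars is a choice made for this sketch, not something the post specifies.

```java
// Illustrative sketch only: FixedPrefixSketch and truncate are hypothetical
// names, not part of Lucene.
final class FixedPrefixSketch {

    // Keep at most prefixLength code points of the term.
    // Terms that are already short enough are returned unchanged.
    static String truncate(String term, int prefixLength) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1");
        }
        int codePoints = term.codePointCount(0, term.length());
        if (codePoints <= prefixLength) {
            return term;
        }
        // Convert the code point count into a char index so that
        // surrogate pairs are never split.
        int end = term.offsetByCodePoints(0, prefixLength);
        return term.substring(0, end);
    }

    public static void main(String[] args) {
        // "kitaplar" and "kitaplarimiz" share the prefix "kitap".
        System.out.println(truncate("kitaplar", 5));
        System.out.println(truncate("kitaplarimiz", 5));
        System.out.println(truncate("ev", 5));
    }
}
```

With prefixLength=5, inflected forms that share their first five characters collapse to the same term, which is the whole stemming effect.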

ElasticSearch has this filter, but it does not respect the keyword attribute.

The main advantage of F5 stemming is that it is not affected by the meaning loss caused by ASCII folding. It works well with ASCII folding.
[2] "Effects of diacritics on Turkish information retrieval" http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf

Here is the full field type I use for customers:

 <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.ApostropheFilterFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.KeywordRepeatFilterFactory"/>
     <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
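
To show why the keyword attribute matters for the last three filters in this chain, here is a minimal plain-Java simulation. It does not use Lucene: the class KeywordAwareChainSketch, its Token type, and all method names are hypothetical. It assumes that the keyword-repeat step emits each token twice at the same position (once flagged as a keyword, once not) and that the remove-duplicates step drops tokens with identical text at the same position.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative simulation only; none of these names are Lucene API.
final class KeywordAwareChainSketch {

    static final class Token {
        final String text;
        final int position;
        final boolean keyword;

        Token(String text, int position, boolean keyword) {
            this.text = text;
            this.position = position;
            this.keyword = keyword;
        }
    }

    // Assumed behavior of the keyword-repeat step: each term is emitted
    // twice at the same position, first as a keyword, then as a normal token.
    static List<Token> keywordRepeat(List<String> terms) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            out.add(new Token(terms.get(i), i, true));
            out.add(new Token(terms.get(i), i, false));
        }
        return out;
    }

    // Truncate to prefixLength chars; keyword tokens are left alone
    // only when respectKeyword is true.
    static List<Token> truncate(List<Token> in, int prefixLength, boolean respectKeyword) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            boolean skip = respectKeyword && t.keyword;
            String text = (skip || t.text.length() <= prefixLength)
                    ? t.text
                    : t.text.substring(0, prefixLength);
            out.add(new Token(text, t.position, t.keyword));
        }
        return out;
    }

    // Drop tokens whose text already occurred at the same position.
    static List<String> removeDuplicates(List<Token> in) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (Token t : in) {
            if (seen.add(t.position + ":" + t.text)) {
                out.add(t.text);
            }
        }
        return out;
    }

    static List<String> analyze(List<String> terms, int prefixLength, boolean respectKeyword) {
        return removeDuplicates(truncate(keywordRepeat(terms), prefixLength, respectKeyword));
    }
}
```

In this sketch, analyzing "kitaplar ev" with respectKeyword=true keeps both the original and the truncated form (kitaplar, kitap, ev), while respectKeyword=false truncates both copies and leaves only kitap and ev, so the keyword-repeat step has no effect.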

I would like to get community opinions on:

1) Is there interest in this? Should I create a JIRA issue and attach what I have?
2) Should the keyword attribute be respected?
3) Package name: analysis.misc versus analysis.tr
4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter

Thanks,
Ahmet

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: TruncateTokenFilter FixedPrefixStemFilter

Posted by Jack Krupansky <ja...@basetechnology.com>.
Sounds interesting.

+1 for FixedPrefixStemFilter.

Default prefixLength to 5.

-- Jack Krupansky


