Posted to dev@lucene.apache.org by "Ahmet Arslan (JIRA)" <ji...@apache.org> on 2014/03/28 18:20:16 UTC

[jira] [Comment Edited] (LUCENE-5558) Add TruncateTokenFilter

    [ https://issues.apache.org/jira/browse/LUCENE-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951064#comment-13951064 ] 

Ahmet Arslan edited comment on LUCENE-5558 at 3/28/14 5:19 PM:
---------------------------------------------------------------

Move to the miscellaneous package, the same as Elasticsearch's [truncate|https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.java] 


was (Author: iorixxx):
move to miscellaneous package. Same as elastic search's [truncate|https://github.com/elasticsearch/elasticsearch/blob/master/modules/elasticsearch/src/main/java/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.java] 

> Add TruncateTokenFilter
> -----------------------
>
>                 Key: LUCENE-5558
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5558
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.7
>            Reporter: Ahmet Arslan
>            Priority: Minor
>              Labels: Turkish, f5
>             Fix For: 4.8
>
>         Attachments: LUCENE-5558.patch, LUCENE-5558.patch, LUCENE-5558.patch
>
>
> I am using this filter as a stemmer for the Turkish language. It is used in many academic studies (classification, retrieval), where it is called the Fixed Prefix Stemmer, the Simple Truncation Method, or F5 for short.
> Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish in [Information Retrieval on Turkish Texts|http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf]. This is the same work from which most of stopwords_tr.txt was taken.
> Elasticsearch has a [truncate|http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html] filter, but it does not respect the keyword attribute. The filter also has a use case similar to TruncateFieldUpdateProcessorFactory.
> The main advantage of F5 stemming is that it is not affected by the meaning loss caused by ASCII folding. It is a diacritics-insensitive stemmer and works well with ASCII folding. See [Effects of diacritics on Turkish information retrieval|http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf].
> Here is the full field type I use for "diacritics-insensitive search" for Turkish:
> {code:xml}
>  <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
>    <analyzer>
>      <tokenizer class="solr.StandardTokenizerFactory"/>
>      <filter class="solr.ApostropheFilterFactory"/>
>      <filter class="solr.TurkishLowerCaseFilterFactory"/>
>      <filter class="solr.ASCIIFoldingFilterFactory"/>
>      <filter class="solr.KeywordRepeatFilterFactory"/>
>      <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
>      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
>  </fieldType>
> {code}
> I would like to get community opinions:
> 1) Is there any interest in this?
> 2) Should the keyword attribute be respected?
> 3) Package name: analysis.misc versus analysis.tr?
> 4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter?
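
For readers unfamiliar with the filter chain in the quoted field type, here is a minimal plain-Java sketch of what its last three filters do to a single token. It does not use Lucene's API: the class name FixedPrefixSketch and the methods truncate and analyzeToken are hypothetical, the boolean keyword flag stands in for the keyword attribute, and the set-based de-duplication only approximates what RemoveDuplicatesTokenFilterFactory does.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the patch attached to LUCENE-5558.
final class FixedPrefixSketch {

    // Keep at most prefixLength chars of the term (F5 when prefixLength is 5).
    // Terms flagged as keywords pass through unchanged.
    // Note: this counts UTF-16 chars, so it could split a surrogate pair.
    static String truncate(String term, int prefixLength, boolean keyword) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1");
        }
        if (keyword || term.length() <= prefixLength) {
            return term;
        }
        return term.substring(0, prefixLength);
    }

    // Rough stand-in for KeywordRepeat -> Truncate -> RemoveDuplicates:
    // emit the original (keyword) form and the truncated form, dropping
    // the second one when both are identical.
    static List<String> analyzeToken(String token, int prefixLength) {
        Set<String> out = new LinkedHashSet<>();
        out.add(truncate(token, prefixLength, true));
        out.add(truncate(token, prefixLength, false));
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        System.out.println(analyzeToken("kitaplar", 5)); // [kitaplar, kitap]
        System.out.println(analyzeToken("kitap", 5));    // [kitap]
    }
}
```

If the filter ignored the keyword flag, the copy emitted by the keyword-repeat step would be truncated as well, and the original form of the token would no longer be indexed. That is the practical reason behind question 2.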



