Posted to solr-dev@lucene.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2009/04/24 18:50:30 UTC

[jira] Commented: (SOLR-822) CharFilter - normalize characters before tokenizer

    [ https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702434#action_12702434 ] 

Otis Gospodnetic commented on SOLR-822:
---------------------------------------

Todd's comment from Oct 23, 2008 caught my attention:

{quote}
It should also work for existing filters like LowerCase. Seems like it has the potential to be faster than the filters, as it doesn't have to perform the same replacement multiple times if a particular character is replicated into multiple tokens, like in NGramTokenizer or CJKTokenizer.
{quote}

Couldn't we replace LowerCaseFilter then?  Or does LCF still have some unique value?  Ah, it does: unlike a CharFilter, it can be placed *after* something like WordDelimiterFilterFactory.  Lowercasing at the very beginning would make it impossible for WDFF to do its job, since it would no longer see the case changes it splits on.  Never mind.  Leaving for posterity.
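
To illustrate the ordering point, here is a minimal sketch of an analyzer where lowercasing happens only after WordDelimiterFilterFactory has run. The field type name, the tokenizer choice, and the attribute values are assumptions for illustration, not taken from the issue:

{code:xml}
<fieldType name="textWdfThenLower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDFF still sees the original case here, so it can split "PowerShot" into "Power" and "Shot" -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnCaseChange="1"/>
    <!-- lowercase only after the case-based split has happened -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}

If a (hypothetical) lowercasing charFilter ran in front of the tokenizer instead, WDFF would receive "powershot" and would have no case change left to split on.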

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: SOLR-822
>                 URL: https://issues.apache.org/jira/browse/SOLR-822
>             Project: Solr
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 1.3
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: character-normalization.JPG, sample_mapping_ja.txt, sample_mapping_ja.txt, SOLR-822-for-1.3.patch, SOLR-822-renameMethod.patch, SOLR-822.patch, SOLR-822.patch, SOLR-822.patch, SOLR-822.patch, SOLR-822.patch
>
>
> A new plugin which can be placed in front of <tokenizer/>.
> {code:xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
>     <tokenizer class="solr.MappingCJKTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> Multiple <charFilter/> elements can be specified (chained). I'll post a JPEG file showing a character normalization sample soon.
> MOTIVATION:
> In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and morphological analyzers.
> A morphological analyzer uses a Japanese dictionary to detect terms,
> so we need to normalize characters before tokenization.
> I'll post a patch soon, too.
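
For reference, the chaining mentioned in the description might look roughly like the sketch below. The field type name and the file name mapping_width.txt are hypothetical, and the assumption is that charFilters are applied in the order they are listed:

{code:xml}
<fieldType name="textCharNormChained" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- assumed: charFilters run in the order listed, each reading the output of the previous one -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_width.txt"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt"/>
    <!-- the tokenizer only ever sees the normalized character stream -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
{code}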

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.