Posted to solr-user@lucene.apache.org by Jan Høydahl / Cominvent <ja...@cominvent.com> on 2010/06/18 16:23:00 UTC

MappingCharFilterFactory equivalent for use after tokenizer?

Hi,

Is there a token filter that does the same job as MappingCharFilterFactory, but runs after the tokenizer and reads the same config file?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com


Re: MappingCharFilterFactory equivalent for use after tokenizer?

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Jun 18, 2010 at 7:11 PM, Lance Norskog <go...@gmail.com> wrote:

> Indeed. Also, it should be possible to output multiple synonyms based
> on the mapping: word_with_umlaut should become word_with_u and
> word_with_ue as synonyms. (Ok, maybe this example is wrong, but it
> illustrates the idea.)
>
>
I don't think we should do this. How many tokens would üüüüüüüüüüüü make?
(Such malformed input exists in the wild, e.g. someone spills beer on their
keyboard and the key gets sticky.)
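
The objection can be made concrete with a little arithmetic. If every ü may be rewritten as either u or ue (the two alternatives from the quoted example; the table below is an assumption for illustration, not a real Solr mapping), a token containing n such characters expands to 2^n variants. A minimal Python sketch, independent of any Solr or Lucene API:

```python
def variant_count(token, alternatives):
    """Count the distinct rewrites of token when each character may be
    replaced by any one of its listed alternatives."""
    total = 1
    for ch in token:
        # Characters without an entry keep exactly one form: themselves.
        total *= len(alternatives.get(ch, (ch,)))
    return total


# Hypothetical table: each umlaut has two acceptable spellings.
ALTERNATIVES = {"ü": ("u", "ue")}

if __name__ == "__main__":
    print(variant_count("ü" * 12, ALTERNATIVES))
```

Twelve ü characters give 2^12 = 4096 variants for a single token, which is why an unbounded expansion would need some cap on token length or variant count.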

-- 
Robert Muir
rcmuir@gmail.com

Re: MappingCharFilterFactory equivalent for use after tokenizer?

Posted by Lance Norskog <go...@gmail.com>.
Indeed. Also, it should be possible to output multiple synonyms based
on the mapping: word_with_umlaut should become word_with_u and
word_with_ue as synonyms. (Ok, maybe this example is wrong, but it
illustrates the idea.)
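
A rough illustration of this idea, assuming a hypothetical one-to-many table in which ü maps to both u and ue. The function name and the table are made up for the sketch; nothing here is an existing Solr filter:

```python
from itertools import product


def expand_variants(token, alternatives):
    """Return every spelling of token obtained by choosing one alternative
    for each mapped character; unmapped characters are kept as they are."""
    choices = [alternatives.get(ch, (ch,)) for ch in token]
    return ["".join(parts) for parts in product(*choices)]


# Hypothetical one-to-many mapping for illustration only.
ALTERNATIVES = {"ü": ("u", "ue")}

if __name__ == "__main__":
    print(expand_variants("für", ALTERNATIVES))
```

In an index, the variants would presumably be emitted at the same position as the original token, the way synonyms are.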

On Fri, Jun 18, 2010 at 12:17 PM, Jan Høydahl / Cominvent
<ja...@cominvent.com> wrote:
> It would be nice to have, because sometimes you want to normalize accents and other characters but want to wait until other filters have run, especially if those filters are dictionary-based and therefore need the original word form.
>
> Do you have any idea how different a CharFilter is from a normal token filter? Perhaps it is a quick port.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com

-- 
Lance Norskog
goksron@gmail.com

Re: MappingCharFilterFactory equivalent for use after tokenizer?

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
It would be nice to have, because sometimes you want to normalize accents and other characters but want to wait until other filters have run, especially if those filters are dictionary-based and therefore need the original word form.

Do you have any idea how different a CharFilter is from a normal token filter? Perhaps it is a quick port.
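
For context on the question: a CharFilter rewrites the character stream before tokenization and has to keep offsets consistent, whereas a token filter only sees one term at a time, so the per-token logic is comparatively simple. The Python sketch below shows that logic under stated assumptions. It parses rules of the form "source" => "target" (a simplified reading of the mapping file syntax that handles only \uXXXX, \" and \\ escapes) and applies them to each token, preferring the longest match at each position. The function names are hypothetical and this is not Lucene code:

```python
import re

# One rule per line, e.g.  "ü" => "ue"   (lines starting with # are comments).
_RULE = re.compile(r'^\s*"((?:[^"\\]|\\.)*)"\s*=>\s*"((?:[^"\\]|\\.)*)"\s*$')
# Assumption: only these escapes are decoded; any other is left untouched.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})|\\(["\\])')


def _unescape(text):
    def repl(match):
        if match.group(1):
            return chr(int(match.group(1), 16))
        return match.group(2)
    return _ESCAPE.sub(repl, text)


def parse_mapping(lines):
    """Build a {source: target} dict from mapping-file style lines."""
    rules = {}
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = _RULE.match(line)
        if match:
            rules[_unescape(match.group(1))] = _unescape(match.group(2))
    return rules


def map_token(token, rules):
    """Rewrite one token, trying the longest source string first."""
    if not rules:
        return token
    longest = max(len(source) for source in rules)
    out = []
    i = 0
    while i < len(token):
        for size in range(min(longest, len(token) - i), 0, -1):
            source = token[i:i + size]
            if source in rules:
                out.append(rules[source])
                i += size
                break
        else:
            # No rule matched here: copy the character unchanged.
            out.append(token[i])
            i += 1
    return "".join(out)


def map_tokens(tokens, rules):
    """Apply the mapping to already-tokenized input."""
    return [map_token(token, rules) for token in tokens]


if __name__ == "__main__":
    sample = ['# accent folding', '"ü" => "ue"', '"\\u00E9" => "e"']
    print(map_tokens(["Müller", "café"], parse_mapping(sample)))
```

Because the tokens are mapped only at this stage, any dictionary-based filter placed earlier in the chain would still see the original word forms.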

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18 June 2010, at 18:38, Ahmet Arslan wrote:

>> Is there a token filter that does the same job as
>> MappingCharFilterFactory, but runs after the tokenizer and
>> reads the same config file?
> 
> No, the closest thing is PatternReplaceFilterFactory.
> 
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceFilterFactory.html


Re: MappingCharFilterFactory equivalent for use after tokenizer?

Posted by Ahmet Arslan <io...@yahoo.com>.
> Is there a token filter that does the same job as
> MappingCharFilterFactory, but runs after the tokenizer and
> reads the same config file?

No, the closest thing is PatternReplaceFilterFactory.

http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceFilterFactory.html
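
To illustrate what this kind of filter does to tokens, here is a small Python sketch of a regex replacement applied to each token after tokenization. It mimics the behaviour only. The function and its replace_all parameter are stand-ins for the filter's pattern, replacement and replace settings, not the Solr API, and Python's regex syntax differs in places from Java's:

```python
import re


def pattern_replace(tokens, pattern, replacement, replace_all=True):
    """Apply one regex replacement to every token.

    replace_all=True rewrites every match in a token; False rewrites
    only the first match.
    """
    regex = re.compile(pattern)
    count = 0 if replace_all else 1  # in re.sub, count=0 means "all"
    return [regex.sub(replacement, token, count=count) for token in tokens]


if __name__ == "__main__":
    print(pattern_replace(["Müller", "über"], "ü", "u"))
```

Each filter of this kind carries a single pattern and replacement, so reproducing a mapping file with many rules would presumably mean chaining one filter per rule, which is what makes it only an approximate substitute.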