Posted to solr-user@lucene.apache.org by Nils Weinander <ni...@gmail.com> on 2011/06/14 11:18:52 UTC

ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory

Hi all, I'm new to the list (but not totally new to Solr).

The documentation states that ISOLatin1AccentFilterFactory is deprecated
in favour of ASCIIFoldingFilterFactory:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory

I see problems with this. If I have understood ASCIIFoldingFilterFactory
correctly, it folds both accented characters like 'é' to 'e' and national
characters like 'ö' to 'o'. The former is desirable; the latter very much
is not when indexing, for example, Scandinavian languages. Is there a way
to limit which characters are folded?

-- 
____________________________________________________________
Nils Weinander

RE: ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory

Posted by Steven A Rowe <sa...@syr.edu>.

On 6/14/2011 at 7:12 AM, Ahmet Arslan wrote:
> --- On Tue, 6/14/11, Nils Weinander <ni...@gmail.com> wrote:
> > The documentation states that ISOLatin1AccentFilterFactory
> > is deprecated in favour of ASCIIFoldingFilterFactory:
[...]
> > Is there a way to limit which characters are folded?
> 
> With MappingCharFilterFactory you have full control over which
> characters are folded. You can see the default mappings in the
> mapping-ISOLatin1Accent.txt file.
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.MappingCharFilterFactory

There is also mapping-FoldToASCII.txt, which, when used with MappingCharFilterFactory, corresponds to ASCIIFoldingFilterFactory.

Steve

Re: ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory

Posted by Nils Weinander <ni...@gmail.com>.

On Tue, Jun 14, 2011 at 1:11 PM, Ahmet Arslan <io...@yahoo.com> wrote:
>
> With MappingCharFilterFactory you have full control over which characters are folded. You can see the default mappings in the
> mapping-ISOLatin1Accent.txt file.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.MappingCharFilterFactory

Thanks Ahmet! Exactly what I needed.
____________________________________________________________
Nils Weinander

Re: ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory

Posted by Ahmet Arslan <io...@yahoo.com>.


--- On Tue, 6/14/11, Nils Weinander <ni...@gmail.com> wrote:

> From: Nils Weinander <ni...@gmail.com>
> Subject: ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory
> To: solr-user@lucene.apache.org
> Date: Tuesday, June 14, 2011, 12:18 PM
> Hi all, I'm new to the list (but not
> totally new to Solr).
> 
> The documentation states that ISOLatin1AccentFilterFactory
> is deprecated
> in favour of ASCIIFoldingFilterFactory:
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
> 
> I see problems with this. If I have understood
> ASCIIFoldingFilterFactory
> correctly it folds both accented characters like 'é' to
> 'e' and national
> characters like 'ö' to 'o'. The former is desirable, the
> latter very much not
> when indexing for example Scandinavian languages. Is there
> a way to
> limit which characters are folded?

With MappingCharFilterFactory you have full control over which characters are folded. You can see the default mappings in the
mapping-ISOLatin1Accent.txt file.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.MappingCharFilterFactory
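
To illustrate the selective folding described above, here is a minimal Python sketch that generates rule lines for a custom mapping file, folding 'é' to 'e' while leaving letters such as 'ö' alone. It assumes the "source" => "target" rule syntax used in mapping-ISOLatin1Accent.txt. The names KEEP, fold_char and mapping_rules are hypothetical helpers, not part of Solr, and the set of letters to keep is only an example for Swedish-style alphabets.

```python
import unicodedata

# Assumption: letters that must NOT be folded (example set, adjust per language).
KEEP = set("åäöÅÄÖ")


def fold_char(ch):
    """Return ch with combining marks stripped, or ch unchanged if it is in KEEP."""
    if ch in KEEP:
        return ch
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def mapping_rules(chars):
    """Yield one mapping-file rule line per character that actually changes."""
    for ch in chars:
        target = fold_char(ch)
        if target != ch:
            # Emit the source as a \uXXXX escape, as the stock mapping files do.
            yield '"\\u{:04X}" => "{}"'.format(ord(ch), target)


# Example: 'ö' and 'å' are kept, so no rule is produced for them.
rules = list(mapping_rules("éèüöåñ"))
for line in rules:
    print(line)
```

The printed lines could be saved to a file and referenced from the mapping attribute of MappingCharFilterFactory in place of the stock mapping file. This only covers characters with a canonical decomposition, so letters like 'ø' or 'ß' would still need hand-written rules.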