Posted to solr-user@lucene.apache.org by Shalin Shekhar Mangar <sh...@gmail.com> on 2010/02/21 22:57:01 UTC

Why ASCIIFoldingFilter is not a CharFilter

Hello,

Looking over the CharFilter franchise, it seems to me that the
ASCIIFoldingFilter is a perfect candidate for being a CharFilter as it
performs character level substitutions like MappingCharFilter. However it is
not a CharFilter. Is there a reason why?

-- 
Regards,
Shalin Shekhar Mangar.
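
As a rough illustration of the similarity the question points at, here is a
minimal Python sketch (not Lucene code). The names MAPPING, map_chars,
tokenize, analyze_with_char_filter and analyze_with_token_filter are
hypothetical, and the whitespace tokenizer is an assumption made only for
this example. It applies the same character mapping either before
tokenization, as a CharFilter would, or to each token afterwards, as a
TokenFilter would.

```python
# Hypothetical mapping table, standing in for a MappingCharFilter-style config.
MAPPING = {
    "\u00e9": "e",   # e with acute accent
    "\u00fc": "u",   # u with diaeresis
    "\u00df": "ss",  # sharp s
}


def map_chars(text, mapping=MAPPING):
    """Replace each character that has an entry in the mapping table."""
    return "".join(mapping.get(ch, ch) for ch in text)


def tokenize(text):
    """Toy tokenizer (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()


def analyze_with_char_filter(text):
    """Mapping runs on the raw character stream, before the tokenizer."""
    return tokenize(map_chars(text))


def analyze_with_token_filter(text):
    """Mapping runs on each token, after the tokenizer."""
    return [map_chars(token) for token in tokenize(text)]


sample = "Caf\u00e9 M\u00fcller Stra\u00dfe"
char_filtered = analyze_with_char_filter(sample)
token_filtered = analyze_with_token_filter(sample)
```

In this toy setup both placements yield the same tokens. A difference in
output can appear once another filter, such as a stemmer, sits in the chain.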

Re: Why ASCIIFoldingFilter is not a CharFilter

Posted by Robert Muir <rc...@gmail.com>.
Shalin, yeah. I guess in my opinion, the diacritics handling in conjunction
with a stemmer is unfortunately not very easy to do without getting weird
results.

for example, the snowball stemmers usually expect these diacritics to be
there, they are looking for something closer to the proper "dictionary form"
(they have different rules for accented letters versus non-accented forms).
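
A minimal Python sketch of that ordering problem, under loud assumptions: the
suffix list in TOY_SUFFIXES is invented for illustration and is NOT the
Snowball French algorithm, and fold_diacritics is only a rough stand-in for
ASCII folding (it drops combining marks and nothing more). The names
toy_stem, fold_diacritics, stem_then_fold and fold_then_stem are
hypothetical.

```python
import unicodedata


def fold_diacritics(text):
    """Rough stand-in for ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Invented suffix rules, longest first. "\u00e9" is e with acute accent.
# These are NOT the real Snowball rules; they only mimic the idea that an
# accented ending is treated differently from an unaccented one.
TOY_SUFFIXES = ("\u00e9es", "\u00e9s", "\u00e9e", "\u00e9", "er")


def toy_stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 chars."""
    token = unicodedata.normalize("NFC", token)
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


word = "parl\u00e9"  # French past participle of "parler"

# Token-filter order: the stemmer sees the accent, folding happens last.
stem_then_fold = fold_diacritics(toy_stem(word))

# Char-filter order: folding happens before the stemmer ever sees the text.
fold_then_stem = toy_stem(fold_diacritics(word))
```

With these toy rules, stemming first conflates the participle with
toy_stem("parler"), while folding first leaves a token that the accent-aware
rule no longer matches. A CharFilter always runs before the tokenizer, so it
would force the second ordering on every stemmer in the chain.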

but in practice, i think people make shortcuts, which is why people have
measured pretty significant improvements in retrieval effectiveness (> 15%)
simply by removing diacritics for some languages:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.4626&rep=rep1&type=pdf

what frustrates me, is there is no way to do this without confusing the
snowball stemmers. I guess there's always the option to try to 'restore'
missing diacritics instead, so stemming works correctly, but this is
complicated and sometimes I wish the snowball stemmers were just less
sensitive instead... sorry to just be complaining out loud :)

On Mon, Feb 22, 2010 at 7:58 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> I wasn't suggesting that they should be changed but trying to understand
> why. This makes sense. Thanks Erik and Robert.



-- 
Robert Muir
rcmuir@gmail.com

Re: Why ASCIIFoldingFilter is not a CharFilter

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
I wasn't suggesting that they should be changed but trying to understand
why. This makes sense. Thanks Erik and Robert.

On Mon, Feb 22, 2010 at 6:16 AM, Robert Muir <rc...@gmail.com> wrote:

> right, most stemmers expect the diacritics to be in their input to work
> correctly, too.



-- 
Regards,
Shalin Shekhar Mangar.

Re: Why ASCIIFoldingFilter is not a CharFilter

Posted by Robert Muir <rc...@gmail.com>.
right, most stemmers expect the diacritics to be in their input to work
correctly, too.

On Sun, Feb 21, 2010 at 5:19 PM, Erik Hatcher <er...@gmail.com> wrote:

> won't some stemmers leave diacritics in the terms that ought to be removed
> before indexing?


-- 
Robert Muir
rcmuir@gmail.com

Re: Why ASCIIFoldingFilter is not a CharFilter

Posted by Erik Hatcher <er...@gmail.com>.
won't some stemmers leave diacritics in the terms that ought to be
removed before indexing?


On Feb 21, 2010, at 4:57 PM, Shalin Shekhar Mangar wrote:

> Hello,
>
> Looking over the CharFilter franchise, it seems to me that the
> ASCIIFoldingFilter is a perfect candidate for being a CharFilter as it
> performs character level substitutions like MappingCharFilter. However it is
> not a CharFilter. Is there a reason why?
>
> -- 
> Regards,
> Shalin Shekhar Mangar.