Posted to solr-user@lucene.apache.org by "Eyal.Naamati@exlibrisgroup.com" <Ey...@exlibrisgroup.com> on 2016/10/30 12:40:30 UTC

Problem with Han character in ICUFoldingFilter

Hi,

I was wondering if anyone ran into the following issue, or a similar one:
In Han script there are two separate characters - 宅 (FA04) and 宅 (5B85).
It seems that ICUFoldingFilter converts FA04 to 5B85, which results in the wrong character being indexed.
Does anyone have any idea if and how this can be resolved? Is there an option to add an exception rule to ICUFoldingFilter?
Thanks,
Eyal

Re: Problem with Han character in ICUFoldingFilter

Posted by Steve Rowe <sa...@gmail.com>.

Among several other foldings, ICUFoldingFilter performs the Unicode NFC transform, which consists of canonical decomposition (NFD) followed by canonical composition.  NFD transforms U+FA04 to U+5B85, and canonical composition leaves U+5B85 as-is.
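
A small sketch of the mapping described above, using Python's standard unicodedata module rather than ICU itself (so it illustrates only the standard Unicode normalization step, not ICUFoldingFilter's full behavior). The helper name codepoints is made up for the illustration.

```python
import unicodedata

compat = "\uFA04"   # CJK COMPATIBILITY IDEOGRAPH-FA04
unified = "\u5B85"  # CJK UNIFIED IDEOGRAPH-5B85

def codepoints(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# U+FA04 carries a canonical (singleton) decomposition to U+5B85.
print("decomposition:", unicodedata.decomposition(compat))

# Decomposition replaces U+FA04; composition never maps U+5B85 back.
for form in ("NFD", "NFC"):
    print(form, codepoints(unicodedata.normalize(form, compat)))
```

Both forms should print U+5B85, which is why the original code point cannot be recovered once the text has been normalized.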

U+FA04 is in the "Pronunciation variants from KS X 1001:1998" sub-block (KS X 1001 is a Korean encoding standard) of the "CJK Compatibility Ideographs" block <http://www.unicode.org/charts/PDF/UF900.pdf>.  I don’t know why these variants were included in Unicode, but the NFD transform includes the compatibility->canonical transform, so it’s likely many other compatibility characters in your data will be affected, not just this one.  If the compatibility->canonical transform is problematic, why are you using ICUFoldingFilter?
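
To get a rough idea of how much of your data would be affected, something like the following could be run over sample text before indexing. It is only a sketch with Python's unicodedata module, not ICU; the function name find_folded_ideographs and the sample string are invented for the illustration. It tests each character by normalizing it rather than by block range alone, since not every code point in the block has a decomposition.

```python
import unicodedata

def find_folded_ideographs(text):
    """Return (original, normalized) pairs for characters in the
    CJK Compatibility Ideographs block (U+F900..U+FAFF) that NFC replaces."""
    hits = []
    for ch in text:
        if 0xF900 <= ord(ch) <= 0xFAFF:
            normalized = unicodedata.normalize("NFC", ch)
            if normalized != ch:
                hits.append((ch, normalized))
    return hits

# Hypothetical sample: U+5BB6, then U+FA04, then U+5B85.
sample = "\u5BB6\uFA04\u5B85"
for original, normalized in find_folded_ideographs(sample):
    print(f"U+{ord(original):04X} -> U+{ord(normalized):04X}")
# prints: U+FA04 -> U+5B85
```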

If you like some of the foldings included in ICUFoldingFilter but not others, check out the “gennorm2” and “gen-utr30-data-files” targets in the Lucene/Solr source code at lucene/analysis/icu/build.xml.  You could build and use a modified binary transform data file; this file is distributed as part of the lucene-analyzers-icu jar at org/apache/lucene/analysis/icu/utr30.nrm.

--
Steve
www.lucidworks.com

> On Oct 30, 2016, at 10:29 AM, Ahmet Arslan <io...@yahoo.com.INVALID> wrote:
> 
> Hi Eyal,
> 
> ICUFoldingFilter uses http://site.icu-project.org under the hood.
> If you think there is a bug, it is better to ask its mailing list.
> 
> Ahmet

Re: Problem with Han character in ICUFoldingFilter

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Eyal,

ICUFoldingFilter uses http://site.icu-project.org under the hood.
If you think there is a bug, it is better to ask its mailing list.

Ahmet