Posted to solr-user@lucene.apache.org by Eyal Naamati <Ey...@exlibrisgroup.com> on 2017/12/18 16:49:57 UTC

ICUTransformFilter with traditional to simplified Chinese

Hi All,
We are using the ICUTransformFilter to normalize traditional Chinese text to simplified Chinese.
We received feedback from some of our Chinese customers that there are some traditional characters that are not converted to their simplified variants.
For example:
"眞" should be converted to "真"
"硏" should be converted to "研"
"夲" should be converted to "本"

Does anyone know if this is indeed a problem with the filter?
Or if there are other options to use instead of this filter that handle more characters?

Thanks for any feedback
Eyal

RE: ICUTransformFilter with traditional to simplified Chinese

Posted by Eyal Naamati <Ey...@exlibrisgroup.com>.

Thanks!
I actually did read the Stanford posts when we implemented our index; they were very helpful!

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org] 
Sent: Tuesday, December 19, 2017 1:31 AM
To: solr-user@lucene.apache.org
Subject: Re: ICUTransformFilter with traditional to simplified Chinese

On 12/18/2017 9:49 AM, Eyal Naamati wrote:
> We are using the ICUTransformFilter to normalize traditional Chinese text to simplified Chinese.
> We received feedback from some of our Chinese customers that there are some traditional characters that are not converted to their simplified variants.
> For example:
> "眞" should be converted to "真"
> "硏" should be converted to "研"
> "夲" should be converted to "本"
>
> Does anyone know if this is indeed a problem with the filter?
> Or if there are other options to use instead of this filter that handle more characters?

I have one index for a website we built for a customer in Japan.  While researching how to effectively handle CJK characters, I came across an entire series of blog posts.  Here's the first post; you can check other posts on the same blog for more posts on the same subject.  There are a lot of them:

http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

One of the filters that Stanford utilized (and we also implemented) is a custom filter that they wrote, apparently specifically because there are things that the ICU filters included with Lucene do not catch.

https://github.com/sul-dlss/CJKFoldingFilter

Looking into the code for the custom filter and checking into your first example, this filter actually seems to go in the reverse direction -- it converts 真 to 眞.  I did not look into the other examples, and I'm completely clueless about CJK characters, so I don't know what those characters are or what the correct action would be.

That third-party custom filter would probably be helpful to you.  Even though it goes in the reverse direction for your first example, as long as the behavior at index time and query time is the same, you should still get matches.  End users would most likely never see the results of the analysis.

Whether or not the behavior you've noticed is a bug with ICUTransformFilter is a question that I cannot answer.  If it is, then the bug will be in ICU, not Lucene.

http://lucene.apache.org/core/7_1_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUTransformFilter.html

Thanks,
Shawn

Re: ICUTransformFilter with traditional to simplified Chinese

Posted by Shawn Heisey <ap...@elyograg.org>.

On 12/18/2017 9:49 AM, Eyal Naamati wrote:
> We are using the ICUTransformFilter to normalize traditional Chinese text to simplified Chinese.
> We received feedback from some of our Chinese customers that there are some traditional characters that are not converted to their simplified variants.
> For example:
> "眞" should be converted to "真"
> "硏" should be converted to "研"
> "夲" should be converted to "本"
>
> Does anyone know if this is indeed a problem with the filter?
> Or if there are other options to use instead of this filter that handle more characters?

I have one index for a website we built for a customer in Japan.  While
researching how to effectively handle CJK characters, I came across an
entire series of blog posts.  Here's the first post; you can check other
posts on the same blog for more posts on the same subject.  There are a
lot of them:

http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

One of the filters that Stanford utilized (and we also implemented) is a
custom filter that they wrote, apparently specifically because there are
things that the ICU filters included with Lucene do not catch.

https://github.com/sul-dlss/CJKFoldingFilter

Looking into the code for the custom filter and checking into your first
example, this filter actually seems to go in the reverse direction -- it
converts 真 to 眞.  I did not look into the other examples, and I'm
completely clueless about CJK characters, so I don't know what those
characters are or what the correct action would be.

That third-party custom filter would probably be helpful to you.  Even
though it goes in the reverse direction for your first example, as long
as the behavior at index time and query time is the same, you should
still get matches.  End users would most likely never see the results of
the analysis.
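
To illustrate the point about direction, here is a minimal Python sketch.
It is not ICU or CJKFoldingFilter code: the table holds only the three
character pairs from this thread, and the names fold and matches are
hypothetical helpers made up for the example.  It only shows that a
document and a query match whichever way the folding goes, provided the
same folding is applied on both sides.

```python
# Illustrative table built from the three pairs quoted in this thread.
# This is an assumption for the example, not data taken from ICU or
# from CJKFoldingFilter.
VARIANT_TO_COMMON = {"眞": "真", "硏": "研", "夲": "本"}

# The same pairs, folded in the opposite direction.
COMMON_TO_VARIANT = {v: k for k, v in VARIANT_TO_COMMON.items()}


def fold(text, table):
    """Replace each character found in the table; leave the rest unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def matches(indexed_text, query_text, table):
    """Compare two strings after applying the same folding to both."""
    return fold(indexed_text, table) == fold(query_text, table)


if __name__ == "__main__":
    # Either direction gives a match when both sides are folded alike.
    print(matches("眞理", "真理", VARIANT_TO_COMMON))
    print(matches("眞理", "真理", COMMON_TO_VARIANT))
```

In Solr terms, this corresponds to using the same filter in both the
index and query analyzers of the field type.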

Whether or not the behavior you've noticed is a bug with
ICUTransformFilter is a question that I cannot answer.  If it is, then
the bug will be in ICU, not Lucene.

http://lucene.apache.org/core/7_1_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUTransformFilter.html

Thanks,
Shawn