Posted to solr-user@lucene.apache.org by Amanda Shuman <am...@gmail.com> on 2018/07/20 07:54:01 UTC

Question regarding searching Chinese characters

Hi all,

We have a problem. Some of our historical documents mix simplified and
traditional Chinese characters. There seems to be no problem when
searching either traditional or simplified separately - that is, if a
particular string/phrase is entirely in traditional or simplified
characters, it is found - but the string/phrase is not found if the two
character sets (one traditional, one simplified) are mixed together in
the SAME string/phrase.

Has anyone ever handled this problem before? I know some libraries seem to
have implemented something that can handle this, but I'm not sure how
they did so!

Amanda
------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925

Re: Question regarding searching Chinese characters

Posted by Tomoko Uchida <to...@gmail.com>.
Hi,

There is ICUTransformFilter (included in the Solr distribution), which
should also work for you.
See the example settings:
https://lucene.apache.org/solr/guide/7_4/filter-descriptions.html#icu-transform-filter

Combine it with HMMChineseTokenizer.
https://lucene.apache.org/solr/guide/7_4/language-analysis.html#hmm-chinese-tokenizer

In other words, replace your SmartChineseAnalyzer settings with an
HMMChineseTokenizer & ICUTransformFilter pipeline.
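For reference, a minimal fieldType along those lines might look like this in
schema.xml (a sketch only: the name "text_zh" and the attribute values are
illustrative, and the ICU factories require Solr's analysis-extras contrib
to be on the classpath):

```xml
<!-- Sketch: tokenize as Simplified Chinese, then fold traditional
     characters to their simplified forms (a token-level workaround). -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory"
            id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```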

----
Here is a somewhat complicated explanation; you can skip it if you do not
want to go into analyzer details.

I do not understand Chinese, but it seems there are no easy or one-stop
solutions in my view. (In Japanese, we have similar problems with Chinese
characters.)

HMMChineseTokenizer expects Simplified Chinese text.
See:
https://lucene.apache.org/core/7_4_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizer.html

So you should transform all traditional Chinese characters **before**
applying HMMChineseTokenizer, using CharFilters; otherwise the Tokenizer
does not work correctly.

Unfortunately, there is no such CharFilter as far as I know.
ICUNormalizer2CharFilter does not handle such transformations, so it is no
help. CJKFoldingFilter and ICUTransformFilter do the
traditional-simplified transformation; however, they are TokenFilters that
work after a Tokenizer is applied.

I think you need two steps if you want to use HMMChineseTokenizer correctly.

1. transform all traditional characters to simplified ones and save them
to temporary files.
    I do not have a clear idea of how to do this, but you could create a
Java program that calls Lucene's ICUTransformFilter
2. then, index to Solr using SmartChineseAnalyzer.
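A rough sketch of step 1 in plain Python (this uses a toy hand-made mapping
table just to show the idea - a real conversion would use the ICU
"Traditional-Simplified" transliterator, which covers thousands of
characters, not this two-entry map):

```python
# Toy traditional -> simplified table; a real converter would use ICU's
# "Traditional-Simplified" transliterator rather than a hand-made map.
TRAD_TO_SIMP = {
    "舊": "旧",
    "說": "说",
}

def to_simplified(text: str) -> str:
    # Replace each known traditional character with its simplified form,
    # leaving everything else (including already-simplified text) intact.
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

# All mixed-script variants collapse to the same simplified form:
print(to_simplified("舊小說"))  # all traditional -> 旧小说
print(to_simplified("旧小說"))  # mixed -> 旧小说
```

Run something like this over the source files before indexing; the output
can then be indexed with SmartChineseAnalyzer as in step 2.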

Regards,
Tomoko

Jul 20, 2018 (Fri) 22:12 Susheel Kumar <su...@gmail.com>:

> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
> then each of A, B, C or D in the query, and they seem to be matching;
> CJKFF is transforming the 舊 to 旧
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <su...@gmail.com>
> wrote:
>
> > I lack Chinese language knowledge, but if you want, I can do a quick
> > test for you in the Analysis tab if you can give me what to put in the
> > index and query window...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <su...@gmail.com>
> > wrote:
> >
> >> Have you tried to use CJKFoldingFilter
> >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this
> >> would cover your use case, but I am using this filter and so far no
> >> issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shuman@gmail.com
> >
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小 说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in the
> >>> same word/phrase is because not all simplified characters were adopted
> at
> >>> the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C
> >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> change
> >>> that at this point... maybe I should figure out how to contact the
> >>> creators
> >>> of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> ------
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>> PhD, University of California, Santa Cruz
> >>> http://www.amandashuman.net/
> >>> http://www.prchistoryresources.org/
> >>> Office: +49 (0) 761 203 4925
> >>>
> >>>
> >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> >>> arafalov@gmail.com>
> >>> wrote:
> >>>
> >>> > This is probably your start, if not read already:
> >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>> >
> >>> > Otherwise, I think your answer would be somewhere around using ICU4J,
> >>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> >>> > (mentioned on the same page above)
> >>> > Specifically, transformations:
> >>> > http://userguide.icu-project.org/transforms/general
> >>> >
> >>> > With that, maybe you map both alphabets into latin. I did that once
> >>> > for Thai for a demo:
> >>> > https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> >>> >
> >>> > The challenge is to figure out all the magic rules for that. You'd
> >>> > have to dig through the ICU documentation and other web pages. I
> found
> >>> > this one for example:
> >>> > http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
> >>> >
> >>> > There is also a 12-part series on Solr and Asian text processing,
> >>> > though it is a bit old now: http://discovery-grindstone.blogspot.com/
> >>> >
> >>> > Hope one of these things help.
> >>> >
> >>> > Regards,
> >>> >    Alex.
> >>> >
> >>> >
> >>> >
> >>>
> >>
> >>
> >
>


-- 
Tomoko Uchida

Re: Question regarding searching Chinese characters

Posted by Tomoko Uchida <to...@gmail.com>.
Hi Amanda,

> do I just need to modify the settings from smartChinese to the ones
> you posted here

Yes, the settings I posted should work for you, at least partially.
If you are happy with the results, it's OK!
But please take this as a starting point because it's not perfect.

> Or do I need to still do something with the SmartChineseAnalyzer?

Try the settings, then if you notice something strange and want to know
why and how to solve it, that may be the time to dive deeper. ;)

I cannot explain here how analyzers work... but you should start off with
the Solr documentation.
https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html

Regards,
Tomoko



Jul 24, 2018 (Tue) 21:08 Amanda Shuman <am...@gmail.com>:

> Hi Tomoko,
>
> Thanks so much for this explanation - I did not even know this was
> possible! I will try it out but I have one question: do I just need to
> modify the settings from smartChinese to the ones you posted here:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory"
> id="Traditional-Simplified"/>
> </analyzer>
>
> Or do I need to still do something with the SmartChineseAnalyzer? I did not
> quite understand this in your first message:
>
> " I think you need two steps if you want to use HMMChineseTokenizer
> correctly.
>
> 1. transform all traditional characters to simplified ones and save them
> to temporary files.
>     I do not have a clear idea of how to do this, but you could create a
> Java program that calls Lucene's ICUTransformFilter
> 2. then, index to Solr using SmartChineseAnalyzer."
>
> My understanding is that with the new settings you posted, I don't need to
> do these steps. Is that correct? Otherwise, I don't really know how to do
> step 1 with the java program....
>
> Thanks!
> Amanda
>
>
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida <
> tomoko.uchida.1111@gmail.com
> > wrote:
>
> > Yes - while the traditional-simplified transformation is outside the
> > scope of Unicode normalization,
> > you may want to add ICUNormalizer2CharFilterFactory anyway :)
> >
> > Let me refine my example settings:
> >
> > <analyzer>
> >   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> >   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >   <filter class="solr.ICUTransformFilterFactory"
> > id="Traditional-Simplified"/>
> > </analyzer>
> >
> > Regards,
> > Tomoko
> >
> >
> > Jul 21, 2018 (Sat) 2:54 Alexandre Rafalovitch <ar...@gmail.com>:
> >
> > > Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > > template of what needs to be done.
> > >
> > > Regards,
> > >    Alex.
> > >
> > > On 20 July 2018 at 12:40, Walter Underwood <wu...@wunderwood.org>
> > wrote:
> > > > Looks like we need a charfilter version of the ICU transforms. That
> > > could run before the tokenizer.
> > > >
> > > > I’ve never built a charfilter, but it seems like this would be a good
> > > first project for someone who wants to contribute.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wunder@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > >
> > > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> > > tomoko.uchida.1111@gmail.com> wrote:
> > > >>
> > > >> Exactly. More concretely, the starting point is: replacing your
> > analyzer
> > > >>
> > > >> <analyzer
> > > class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > > >>
> > > >> to
> > > >>
> > > >> <analyzer>
> > > >>  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> > > >>  <filter class="solr.ICUTransformFilterFactory"
> > > >> id="Traditional-Simplified"/>
> > > >> </analyzer>
> > > >>
> > > >> and see if the results are as expected. Then research other
> > > >> filters if your requirements are not met.
> > > >>
> > > >> Just a reminder: HMMChineseTokenizerFactory does not handle
> > > >> traditional characters, as I noted in my previous post, so
> > > >> ICUTransformFilterFactory is an incomplete workaround.
> > > >>
> > > >> Jul 21, 2018 (Sat) 0:05 Walter Underwood <wu...@wunderwood.org>:
> > > >>
> > > >>> I expect that this is the line that does the transformation:
> > > >>>
> > > >>>   <filter class="solr.ICUTransformFilterFactory"
> > > >>> id="Traditional-Simplified"/>
> > > >>>
> > > >>> This mapping is a standard feature of ICU. More info on ICU
> > transforms
> > > is
> > > >>> in this doc, though not much detail on this particular transform.
> > > >>>
> > > >>> http://userguide.icu-project.org/transforms/general
> > > >>>
> > > >>> wunder
> > > >>> Walter Underwood
> > > >>> wunder@wunderwood.org
> > > >>> http://observer.wunderwood.org/  (my blog)
> > > >>>
> > > >>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2777@gmail.com
> >
> > > >>> wrote:
> > > >>>>
> > > >>>> I think so.  I used the exact as in github
> > > >>>>
> > > >>>> <fieldType name="text_cjk" class="solr.TextField"
> > > >>>> positionIncrementGap="10000" autoGeneratePhraseQueries="false">
> > > >>>> <analyzer>
> > > >>>>   <tokenizer class="solr.ICUTokenizerFactory" />
> > > >>>>   <filter class="solr.CJKWidthFilterFactory"/>
> > > >>>>   <filter
> > > class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > > >>>>   <filter class="solr.ICUTransformFilterFactory"
> > > >>> id="Traditional-Simplified"/>
> > > >>>>   <filter class="solr.ICUTransformFilterFactory"
> > > >>> id="Katakana-Hiragana"/>
> > > >>>>   <filter class="solr.ICUFoldingFilterFactory"/>
> > > >>>>   <filter class="solr.CJKBigramFilterFactory" han="true"
> > > >>>> hiragana="true" katakana="true" hangul="true"
> outputUnigrams="true"
> > />
> > > >>>> </analyzer>
> > > >>>> </fieldType>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> > > amanda.shuman@gmail.com
> > > >>>>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Thanks! That does indeed look promising... This can be added on
> top
> > > of
> > > >>>>> Smart Chinese, right? Or is it an alternative?
> > > >>>>>
> > > >>>>>
> > > >>>>> ------
> > > >>>>> Dr. Amanda Shuman
> > > >>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> > > Project
> > > >>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > > >>>>> PhD, University of California, Santa Cruz
> > > >>>>> http://www.amandashuman.net/
> > > >>>>> http://www.prchistoryresources.org/
> > > >>>>> Office: +49 (0) 761 203 4925
> > > >>>>>
> > > >>>>>
> > > >>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> > > susheel2777@gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> I think CJKFoldingFilter will work for you.  I put 舊小說 in index
> > and
> > > >>> then
> > > >>>>>> each of A, B or C or D in query and they seems to be matching
> and
> > > CJKFF
> > > >>>>> is
> > > >>>>>> transforming the 舊 to 旧
> > > >>>>>>
> > > >>
> > > >> --
> > > >> Tomoko Uchida
> > > >
> > >
> >
> >
> > --
> > Tomoko Uchida
> >
>


-- 
Tomoko Uchida

Re: Question regarding searching Chinese characters

Posted by Amanda Shuman <am...@gmail.com>.
Hi Tomoko,

Thanks so much for this explanation - I did not even know this was
possible! I will try it out but I have one question: do I just need to
modify the settings from smartChinese to the ones you posted here:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory"
id="Traditional-Simplified"/>
</analyzer>

Or do I need to still do something with the SmartChineseAnalyzer? I did not
quite understand this in your first message:

" I think you need two steps if you want to use HMMChineseTokenizer
correctly.

1. transform all traditional characters to simplified ones and save them
to temporary files.
    I do not have a clear idea of how to do this, but you could create a
Java program that calls Lucene's ICUTransformFilter
2. then, index to Solr using SmartChineseAnalyzer."

My understanding is that with the new settings you posted, I don't need to
do these steps. Is that correct? Otherwise, I don't really know how to do
step 1 with the java program....

Thanks!
Amanda


------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


> > >>>>>>>>> The problem seems to be that Solr can easily handle A or B
> above,
> > >>> but
> > >>>>>>>>> NOT C
> > >>>>>>>>> or D using the Smart Chinese analyzer. I'm not really sure how
> to
> > >>>>>> change
> > >>>>>>>>> that at this point... maybe I should figure out how to contact
> > the
> > >>>>>>>>> creators
> > >>>>>>>>> of the analyzer and ask them?
> > >>>>>>>>>
> > >>>>>>>>> Amanda
> > >>>>>>>>>
> > >>>>>>>>> ------
> > >>>>>>>>> Dr. Amanda Shuman
> > >>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> > >>>>> Project
> > >>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > >>>>>>>>> PhD, University of California, Santa Cruz
> > >>>>>>>>> http://www.amandashuman.net/
> > >>>>>>>>> http://www.prchistoryresources.org/
> > >>>>>>>>> Office: +49 (0) 761 203 4925
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> > >>>>>>>>> arafalov@gmail.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> This is probably your start, if not read already:
> > >>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > >>>>>>>>>>
> > >>>>>>>>>> Otherwise, I think your answer would be somewhere around using
> > >>>>> ICU4J,
> > >>>>>>>>>> IBM's library for dealing with Unicode:
> > >>>>> http://site.icu-project.org/
> > >>>>>>>>>> (mentioned on the same page above)
> > >>>>>>>>>> Specifically, transformations:
> > >>>>>>>>>> http://userguide.icu-project.org/transforms/general
> > >>>>>>>>>>
> > >>>>>>>>>> With that, maybe you map both alphabets into latin. I did that
> > once
> > >>>>>>>>>> for Thai for a demo:
> > >>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/
> > >>>>>>>>>> collection1/conf/schema.xml#L34
> > >>>>>>>>>>
> > >>>>>>>>>> The challenge is to figure out all the magic rules for that.
> > You'd
> > >>>>>>>>>> have to dig through the ICU documentation and other web
> pages. I
> > >>>>>> found
> > >>>>>>>>>> this one for example:
> > >>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-
> > >>>>>>>>>> transliterators-available-with-icu4j.html;jsessionid=
> > >>>>>>>>>> BEAB0AF05A588B97B8A2393054D908C0
> > >>>>>>>>>>
> > >>>>>>>>>> There is also 12 part series on Solr and Asian text
> processing,
> > >>>>>> though
> > >>>>>>>>>> it is a bit old now: http://discovery-grindstone.blogspot.com/
> > >>>>>>>>>>
> > >>>>>>>>>> Hope one of these things help.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>>  Alex.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <
> > amanda.shuman@gmail.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have a problem. Some of our historical documents have
> mixed
> > >>>>>>>>> together
> > >>>>>>>>>>> simplified and traditional Chinese characters. There seems to be no
> problem
> > >>>>>> when
> > >>>>>>>>>>> searching either traditional or simplified separately - that
> > is,
> > >>>>>> if a
> > >>>>>>>>>>> particular string/phrase is all in traditional or simplified,
> > it
> > >>>>>>>>> finds
> > >>>>>>>>>> it -
> > >>>>>>>>>>> but it does not find the string/phrase if the two different
> > >>>>>>>>> characters
> > >>>>>>>>>> (one
> > >>>>>>>>>>> traditional, one simplified) are mixed together in the SAME
> > >>>>>>>>>> string/phrase.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Has anyone ever handled this problem before? I know some
> > >>>>> libraries
> > >>>>>>>>> seem
> > >>>>>>>>>> to
> > >>>>>>>>>>> have implemented something that seems to be able to handle
> > this,
> > >>>>>> but
> > >>>>>>>>> I'm
> > >>>>>>>>>>> not sure how they did so!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Amanda
> > >>>>>>>>>>> ------
> > >>>>>>>>>>> Dr. Amanda Shuman
> > >>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist
> Legacy
> > >>>>>>>>> Project
> > >>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > >>>>>>>>>>> PhD, University of California, Santa Cruz
> > >>>>>>>>>>> http://www.amandashuman.net/
> > >>>>>>>>>>> http://www.prchistoryresources.org/
> > >>>>>>>>>>> Office: +49 (0) 761 203 4925
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>>
> > >>
> > >> --
> > >> Tomoko Uchida
> > >
> >
>
>
> --
> Tomoko Uchida
>

Re: Question regarding searching Chinese characters

Posted by Tomoko Uchida <to...@gmail.com>.
Yes, while traditional-to-simplified transformation is out of the scope of
Unicode normalization, you would still want to add
ICUNormalizer2CharFilterFactory anyway :)

Let me refine my example settings:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory"
id="Traditional-Simplified"/>
</analyzer>
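To illustrate what this two-stage pipeline does, here is a rough stdlib-Python sketch. It is only an approximation: the two-character mapping table is a toy (the real ICU "Traditional-Simplified" transform covers thousands of characters), and `unicodedata.normalize` stands in for ICUNormalizer2CharFilterFactory.

```python
import unicodedata

# Toy traditional -> simplified table; ICU's "Traditional-Simplified"
# transform covers thousands of characters.
TRAD_TO_SIMP = str.maketrans({"舊": "旧", "說": "说"})

def fold(text: str) -> str:
    # Stage 1: Unicode normalization, roughly what
    # ICUNormalizer2CharFilterFactory does (e.g. fullwidth "Ａ" -> "A").
    text = unicodedata.normalize("NFKC", text)
    # Stage 2: script folding, what the Traditional-Simplified
    # ICUTransformFilter does to each token.
    return text.translate(TRAD_TO_SIMP)

# All four spellings of "old fiction" from this thread collapse to one form:
variants = ["舊小說", "旧小说", "旧小說", "舊小说"]
print({fold(v) for v in variants})  # {'旧小说'}
```

Since the same analysis runs at index time and at query time, a query in any of the four spellings would then match documents written in any of the others.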

Regards,
Tomoko


On Sat, Jul 21, 2018 at 2:54 Alexandre Rafalovitch <ar...@gmail.com>:

> Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> template of what needs to be done.
>
> Regards,
>    Alex.
>
> On 20 July 2018 at 12:40, Walter Underwood <wu...@wunderwood.org> wrote:
> > Looks like we need a charfilter version of the ICU transforms. That
> could run before the tokenizer.
> >
> > I’ve never built a charfilter, but it seems like this would be a good
> first project for someone who wants to contribute.
> >
> > wunder
> > Walter Underwood
> > wunder@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> tomoko.uchida.1111@gmail.com> wrote:
> >>
> >> Exactly. More concretely, the starting point is: replacing your analyzer
> >>
> >> <analyzer
> class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> >>
> >> with
> >>
> >> <analyzer>
> >>  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >>  <filter class="solr.ICUTransformFilterFactory"
> >> id="Traditional-Simplified"/>
> >> </analyzer>
> >>
> >> and see if the results are as expected. Then research other filters if
> >> your requirements are not met.
> >>
> >> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> >> characters, as I noted in my previous post, so ICUTransformFilterFactory is
> >> an incomplete workaround.
> >>
> >> On Sat, Jul 21, 2018 at 0:05 Walter Underwood <wu...@wunderwood.org>:
> >>
> >>> I expect that this is the line that does the transformation:
> >>>
> >>>   <filter class="solr.ICUTransformFilterFactory"
> >>> id="Traditional-Simplified"/>
> >>>
> >>> This mapping is a standard feature of ICU. More info on ICU transforms
> is
> >>> in this doc, though not much detail on this particular transform.
> >>>
> >>> http://userguide.icu-project.org/transforms/general
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wunder@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <su...@gmail.com>
> >>> wrote:
> >>>>
> >>>> I think so.  I used the exact settings as in the github repo
> >>>>
> >>>> <fieldType name="text_cjk" class="solr.TextField"
> >>>> positionIncrementGap="10000" autoGeneratePhraseQueries="false">
> >>>> <analyzer>
> >>>>   <tokenizer class="solr.ICUTokenizerFactory" />
> >>>>   <filter class="solr.CJKWidthFilterFactory"/>
> >>>>   <filter
> class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> >>>>   <filter class="solr.ICUTransformFilterFactory"
> >>> id="Traditional-Simplified"/>
> >>>>   <filter class="solr.ICUTransformFilterFactory"
> >>> id="Katakana-Hiragana"/>
> >>>>   <filter class="solr.ICUFoldingFilterFactory"/>
> >>>>   <filter class="solr.CJKBigramFilterFactory" han="true"
> >>>> hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> >>>> </analyzer>
> >>>> </fieldType>
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> amanda.shuman@gmail.com
> >>>>
> >>>> wrote:
> >>>>
> >>>>> Thanks! That does indeed look promising... This can be added on top
> of
> >>>>> Smart Chinese, right? Or is it an alternative?
> >>>>>
> >>>>>
> >>>>> ------
> >>>>> Dr. Amanda Shuman
> >>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> Project
> >>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>>>> PhD, University of California, Santa Cruz
> >>>>> http://www.amandashuman.net/
> >>>>> http://www.prchistoryresources.org/
> >>>>> Office: +49 (0) 761 203 4925
> >>>>>
> >>>>>
> >>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> susheel2777@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and
> >>> then
> >>>>>> each of A, B or C or D in query and they seems to be matching and
> CJKFF
> >>>>> is
> >>>>>> transforming the 舊 to 旧
> >>>>>>
> >>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <
> susheel2777@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Lack of my chinese language knowledge but if you want, I can do
> quick
> >>>>>> test
> >>>>>>> for you in Analysis tab if you can give me what to put in index and
> >>>>> query
> >>>>>>> window...
> >>>>>>>
> >>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <
> susheel2777@gmail.com
> >>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Have you tried to use CJKFoldingFilter
> >>>>>>>> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> >>>>> cover
> >>>>>>>> your use case but I am using this filter and so far no issues.
> >>>>>>>>
> >>>>>>>> Thnx
> >>>>>>>>
> >>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> >>>>> amanda.shuman@gmail.com
> >>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks, Alex - I have seen a few of those links but never
> considered
> >>>>>>>>> transliteration! We use lucene's Smart Chinese analyzer. The
> issue
> >>> is
> >>>>>>>>> basically what is laid out in the old blogspot post, namely this
> >>>>> point:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> "Why approach CJK resource discovery differently?
> >>>>>>>>>
> >>>>>>>>> 2.  Search results must be as script agnostic as possible.
> >>>>>>>>>
> >>>>>>>>> There is more than one way to write each word. "Simplified"
> >>>>> characters
> >>>>>>>>> were
> >>>>>>>>> emphasized for printed materials in mainland China starting in
> the
> >>>>>> 1950s;
> >>>>>>>>> "Traditional" characters were used in printed materials prior to
> the
> >>>>>>>>> 1950s,
> >>>>>>>>> and are still used in Taiwan, Hong Kong and Macau today.
> >>>>>>>>> Since the characters are distinct, it's as if Chinese materials
> are
> >>>>>>>>> written
> >>>>>>>>> in two scripts.
> >>>>>>>>> Another way to think about it:  every written Chinese word has at
> >>>>> least
> >>>>>>>>> two
> >>>>>>>>> completely different spellings.  And it can be mix-n-match:  a
> word
> >>>>> can
> >>>>>>>>> be
> >>>>>>>>> written with one traditional  and one simplified character.
> >>>>>>>>> Example:   Given a user query 舊小說  (traditional for old fiction),
> >>> the
> >>>>>>>>> results should include matches for 舊小說 (traditional) and 旧小说
> >>>>>> (simplified
> >>>>>>>>> characters for old fiction)"
> >>>>>>>>>
> >>>>>>>>> So, using the example provided above, we are dealing with
> materials
> >>>>>>>>> produced in the 1950s-1970s that do even weirder things like:
> >>>>>>>>>
> >>>>>>>>> A. 舊小說
> >>>>>>>>>
> >>>>>>>>> can also be
> >>>>>>>>>
> >>>>>>>>> B. 旧小说 (all simplified)
> >>>>>>>>> or
> >>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
> >>>>>>>>> or
> >>>>>>>>> D. 舊小 说 (first character traditional, last character simplified)
> >>>>>>>>>
> >>>>>>>>> Thankfully the middle character was never simplified in recent
> >>> times.
> >>>>>>>>>
> >>>>>>>>> From a historical standpoint, the mixed nature of the characters
> in
> >>>>> the
> >>>>>>>>> same word/phrase is because not all simplified characters were
> >>>>> adopted
> >>>>>> at
> >>>>>>>>> the same time by everyone uniformly (good times...).
> >>>>>>>>>
> >>>>>>>>> The problem seems to be that Solr can easily handle A or B above,
> >>> but
> >>>>>>>>> NOT C
> >>>>>>>>> or D using the Smart Chinese analyzer. I'm not really sure how to
> >>>>>> change
> >>>>>>>>> that at this point... maybe I should figure out how to contact
> the
> >>>>>>>>> creators
> >>>>>>>>> of the analyzer and ask them?
> >>>>>>>>>
> >>>>>>>>> Amanda
> >>>>>>>>>
> >>>>>>>>> ------
> >>>>>>>>> Dr. Amanda Shuman
> >>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> >>>>> Project
> >>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>>>>>>>> PhD, University of California, Santa Cruz
> >>>>>>>>> http://www.amandashuman.net/
> >>>>>>>>> http://www.prchistoryresources.org/
> >>>>>>>>> Office: +49 (0) 761 203 4925
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> >>>>>>>>> arafalov@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> This is probably your start, if not read already:
> >>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>>>>>>>>>
> >>>>>>>>>> Otherwise, I think your answer would be somewhere around using
> >>>>> ICU4J,
> >>>>>>>>>> IBM's library for dealing with Unicode:
> >>>>> http://site.icu-project.org/
> >>>>>>>>>> (mentioned on the same page above)
> >>>>>>>>>> Specifically, transformations:
> >>>>>>>>>> http://userguide.icu-project.org/transforms/general
> >>>>>>>>>>
> >>>>>>>>>> With that, maybe you map both alphabets into latin. I did that
> once
> >>>>>>>>>> for Thai for a demo:
> >>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/
> >>>>>>>>>> collection1/conf/schema.xml#L34
> >>>>>>>>>>
> >>>>>>>>>> The challenge is to figure out all the magic rules for that.
> You'd
> >>>>>>>>>> have to dig through the ICU documentation and other web pages. I
> >>>>>> found
> >>>>>>>>>> this one for example:
> >>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-
> >>>>>>>>>> transliterators-available-with-icu4j.html;jsessionid=
> >>>>>>>>>> BEAB0AF05A588B97B8A2393054D908C0
> >>>>>>>>>>
> >>>>>>>>>> There is also 12 part series on Solr and Asian text processing,
> >>>>>> though
> >>>>>>>>>> it is a bit old now: http://discovery-grindstone.blogspot.com/
> >>>>>>>>>>
> >>>>>>>>>> Hope one of these things help.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>>  Alex.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <
> amanda.shuman@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> We have a problem. Some of our historical documents have mixed
> >>>>>>>>> together
> >>>>>>>>>>> simplified and traditional Chinese characters. There seems to be no problem
> >>>>>> when
> >>>>>>>>>>> searching either traditional or simplified separately - that
> is,
> >>>>>> if a
> >>>>>>>>>>> particular string/phrase is all in traditional or simplified,
> it
> >>>>>>>>> finds
> >>>>>>>>>> it -
> >>>>>>>>>>> but it does not find the string/phrase if the two different
> >>>>>>>>> characters
> >>>>>>>>>> (one
> >>>>>>>>>>> traditional, one simplified) are mixed together in the SAME
> >>>>>>>>>> string/phrase.
> >>>>>>>>>>>
> >>>>>>>>>>> Has anyone ever handled this problem before? I know some
> >>>>> libraries
> >>>>>>>>> seem
> >>>>>>>>>> to
> >>>>>>>>>>> have implemented something that seems to be able to handle
> this,
> >>>>>> but
> >>>>>>>>> I'm
> >>>>>>>>>>> not sure how they did so!
> >>>>>>>>>>>
> >>>>>>>>>>> Amanda
> >>>>>>>>>>> ------
> >>>>>>>>>>> Dr. Amanda Shuman
> >>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> >>>>>>>>> Project
> >>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>>>>>>>>>> PhD, University of California, Santa Cruz
> >>>>>>>>>>> http://www.amandashuman.net/
> >>>>>>>>>>> http://www.prchistoryresources.org/
> >>>>>>>>>>> Office: +49 (0) 761 203 4925
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >>
> >> --
> >> Tomoko Uchida
> >
>


-- 
Tomoko Uchida

Re: Question regarding searching Chinese characters

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
template of what needs to be done.

Regards,
   Alex.

On 20 July 2018 at 12:40, Walter Underwood <wu...@wunderwood.org> wrote:
> Looks like we need a charfilter version of the ICU transforms. That could run before the tokenizer.
>
> I’ve never built a charfilter, but it seems like this would be a good first project for someone who wants to contribute.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <to...@gmail.com> wrote:
>>
>> Exactly. More concretely, the starting point is: replacing your analyzer
>>
>> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>>
>> with
>>
>> <analyzer>
>>  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>>  <filter class="solr.ICUTransformFilterFactory"
>> id="Traditional-Simplified"/>
>> </analyzer>
>>
>> and see if the results are as expected. Then research other filters if
>> your requirements are not met.
>>
>> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
>> characters, as I noted in my previous post, so ICUTransformFilterFactory is
>> an incomplete workaround.
>>
>> On Sat, Jul 21, 2018 at 0:05 Walter Underwood <wu...@wunderwood.org>:
>>
>>> I expect that this is the line that does the transformation:
>>>
>>>   <filter class="solr.ICUTransformFilterFactory"
>>> id="Traditional-Simplified"/>
>>>
>>> This mapping is a standard feature of ICU. More info on ICU transforms is
>>> in this doc, though not much detail on this particular transform.
>>>
>>> http://userguide.icu-project.org/transforms/general
>>>
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <su...@gmail.com>
>>> wrote:
>>>>
>>>> I think so.  I used the exact settings as in the github repo
>>>>
>>>> <fieldType name="text_cjk" class="solr.TextField"
>>>> positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>> <analyzer>
>>>>   <tokenizer class="solr.ICUTokenizerFactory" />
>>>>   <filter class="solr.CJKWidthFilterFactory"/>
>>>>   <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>>   <filter class="solr.ICUTransformFilterFactory"
>>> id="Traditional-Simplified"/>
>>>>   <filter class="solr.ICUTransformFilterFactory"
>>> id="Katakana-Hiragana"/>
>>>>   <filter class="solr.ICUFoldingFilterFactory"/>
>>>>   <filter class="solr.CJKBigramFilterFactory" han="true"
>>>> hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>>> </analyzer>
>>>> </fieldType>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shuman@gmail.com
>>>>
>>>> wrote:
>>>>
>>>>> Thanks! That does indeed look promising... This can be added on top of
>>>>> Smart Chinese, right? Or is it an alternative?
>>>>>
>>>>>
>>>>> ------
>>>>> Dr. Amanda Shuman
>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>> PhD, University of California, Santa Cruz
>>>>> http://www.amandashuman.net/
>>>>> http://www.prchistoryresources.org/
>>>>> Office: +49 (0) 761 203 4925
>>>>>
>>>>>
>>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <su...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and
>>> then
>>>>>> each of A, B or C or D in query and they seems to be matching and CJKFF
>>>>> is
>>>>>> transforming the 舊 to 旧
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <su...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Lack of my chinese language knowledge but if you want, I can do quick
>>>>>> test
>>>>>>> for you in Analysis tab if you can give me what to put in index and
>>>>> query
>>>>>>> window...
>>>>>>>
>>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2777@gmail.com
>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Have you tried to use CJKFoldingFilter
>>>>>>>> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
>>>>> cover
>>>>>>>> your use case but I am using this filter and so far no issues.
>>>>>>>>
>>>>>>>> Thnx
>>>>>>>>
>>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
>>>>> amanda.shuman@gmail.com
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Alex - I have seen a few of those links but never considered
>>>>>>>>> transliteration! We use lucene's Smart Chinese analyzer. The issue
>>> is
>>>>>>>>> basically what is laid out in the old blogspot post, namely this
>>>>> point:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> "Why approach CJK resource discovery differently?
>>>>>>>>>
>>>>>>>>> 2.  Search results must be as script agnostic as possible.
>>>>>>>>>
>>>>>>>>> There is more than one way to write each word. "Simplified"
>>>>> characters
>>>>>>>>> were
>>>>>>>>> emphasized for printed materials in mainland China starting in the
>>>>>> 1950s;
>>>>>>>>> "Traditional" characters were used in printed materials prior to the
>>>>>>>>> 1950s,
>>>>>>>>> and are still used in Taiwan, Hong Kong and Macau today.
>>>>>>>>> Since the characters are distinct, it's as if Chinese materials are
>>>>>>>>> written
>>>>>>>>> in two scripts.
>>>>>>>>> Another way to think about it:  every written Chinese word has at
>>>>> least
>>>>>>>>> two
>>>>>>>>> completely different spellings.  And it can be mix-n-match:  a word
>>>>> can
>>>>>>>>> be
>>>>>>>>> written with one traditional  and one simplified character.
>>>>>>>>> Example:   Given a user query 舊小說  (traditional for old fiction),
>>> the
>>>>>>>>> results should include matches for 舊小說 (traditional) and 旧小说
>>>>>> (simplified
>>>>>>>>> characters for old fiction)"
>>>>>>>>>
>>>>>>>>> So, using the example provided above, we are dealing with materials
>>>>>>>>> produced in the 1950s-1970s that do even weirder things like:
>>>>>>>>>
>>>>>>>>> A. 舊小說
>>>>>>>>>
>>>>>>>>> can also be
>>>>>>>>>
>>>>>>>>> B. 旧小说 (all simplified)
>>>>>>>>> or
>>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>>>>> or
>>>>>>>>> D. 舊小 说 (first character traditional, last character simplified)
>>>>>>>>>
>>>>>>>>> Thankfully the middle character was never simplified in recent
>>> times.
>>>>>>>>>
>>>>>>>>> From a historical standpoint, the mixed nature of the characters in
>>>>> the
>>>>>>>>> same word/phrase is because not all simplified characters were
>>>>> adopted
>>>>>> at
>>>>>>>>> the same time by everyone uniformly (good times...).
>>>>>>>>>
>>>>>>>>> The problem seems to be that Solr can easily handle A or B above,
>>> but
>>>>>>>>> NOT C
>>>>>>>>> or D using the Smart Chinese analyzer. I'm not really sure how to
>>>>>> change
>>>>>>>>> that at this point... maybe I should figure out how to contact the
>>>>>>>>> creators
>>>>>>>>> of the analyzer and ask them?
>>>>>>>>>
>>>>>>>>> Amanda
>>>>>>>>>
>>>>>>>>> ------
>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>>>>> Project
>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>>>>>>>>> arafalov@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This is probably your start, if not read already:
>>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>>>>
>>>>>>>>>> Otherwise, I think your answer would be somewhere around using
>>>>> ICU4J,
>>>>>>>>>> IBM's library for dealing with Unicode:
>>>>> http://site.icu-project.org/
>>>>>>>>>> (mentioned on the same page above)
>>>>>>>>>> Specifically, transformations:
>>>>>>>>>> http://userguide.icu-project.org/transforms/general
>>>>>>>>>>
>>>>>>>>>> With that, maybe you map both alphabets into latin. I did that once
>>>>>>>>>> for Thai for a demo:
>>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/
>>>>>>>>>> collection1/conf/schema.xml#L34
>>>>>>>>>>
>>>>>>>>>> The challenge is to figure out all the magic rules for that. You'd
>>>>>>>>>> have to dig through the ICU documentation and other web pages. I
>>>>>> found
>>>>>>>>>> this one for example:
>>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-
>>>>>>>>>> transliterators-available-with-icu4j.html;jsessionid=
>>>>>>>>>> BEAB0AF05A588B97B8A2393054D908C0
>>>>>>>>>>
>>>>>>>>>> There is also 12 part series on Solr and Asian text processing,
>>>>>> though
>>>>>>>>>> it is a bit old now: http://discovery-grindstone.blogspot.com/
>>>>>>>>>>
>>>>>>>>>> Hope one of these things help.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>  Alex.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <am...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> We have a problem. Some of our historical documents have mixed
>>>>>>>>> together
>>>>>>>>>>> simplified and traditional Chinese characters. There seems to be no problem
>>>>>> when
>>>>>>>>>>> searching either traditional or simplified separately - that is,
>>>>>> if a
>>>>>>>>>>> particular string/phrase is all in traditional or simplified, it
>>>>>>>>> finds
>>>>>>>>>> it -
>>>>>>>>>>> but it does not find the string/phrase if the two different
>>>>>>>>> characters
>>>>>>>>>> (one
>>>>>>>>>>> traditional, one simplified) are mixed together in the SAME
>>>>>>>>>> string/phrase.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone ever handled this problem before? I know some
>>>>> libraries
>>>>>>>>> seem
>>>>>>>>>> to
>>>>>>>>>>> have implemented something that seems to be able to handle this,
>>>>>> but
>>>>>>>>> I'm
>>>>>>>>>>> not sure how they did so!
>>>>>>>>>>>
>>>>>>>>>>> Amanda
>>>>>>>>>>> ------
>>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>>>>>>>>> Project
>>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>> --
>> Tomoko Uchida
>

Re: Question regarding searching Chinese characters

Posted by Walter Underwood <wu...@wunderwood.org>.
Looks like we need a charfilter version of the ICU transforms. That could run before the tokenizer.

I’ve never built a charfilter, but it seems like this would be a good first project for someone who wants to contribute.
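To make the ordering point concrete, here is a hedged Python sketch: a toy two-character mapping table stands in for the ICU transform, and a naive bigram function stands in for CJK bigram tokenization. Because the folding runs on the raw text before tokenization, like a char filter would, a mixed-script phrase and its all-simplified form yield identical terms.

```python
# Sketch of why a *char filter* (runs on raw text, before the tokenizer)
# helps: fold the scripts first, so the tokenizer sees one consistent script.
# The mapping table is a toy; ICU's transform covers thousands of characters.
TRAD_TO_SIMP = str.maketrans({"舊": "旧", "說": "说"})

def char_filter(text: str) -> str:
    return text.translate(TRAD_TO_SIMP)

def bigrams(text: str) -> list:
    # Naive stand-in for CJK bigram tokenization.
    return [text[i:i + 2] for i in range(len(text) - 1)]

mixed = "旧小說"       # variant C from this thread: simplified + traditional
simplified = "旧小说"
print(bigrams(char_filter(mixed)))       # ['旧小', '小说']
print(bigrams(char_filter(simplified)))  # ['旧小', '小说']
```

A token filter that folds after tokenization can still fix up individual terms, but running before the tokenizer also lets segmentation itself see uniform input.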

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <to...@gmail.com> wrote:
> 
> Exactly. More concretely, the starting point is: replacing your analyzer
> 
> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> 
> with
> 
> <analyzer>
>  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>  <filter class="solr.ICUTransformFilterFactory"
> id="Traditional-Simplified"/>
> </analyzer>
> 
> and see if the results are as expected. Then research other filters if
> your requirements are not met.
> 
> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> characters, as I noted in my previous post, so ICUTransformFilterFactory is
> an incomplete workaround.
> 
> On Sat, Jul 21, 2018 at 0:05 Walter Underwood <wu...@wunderwood.org>:
> 
>> I expect that this is the line that does the transformation:
>> 
>>   <filter class="solr.ICUTransformFilterFactory"
>> id="Traditional-Simplified"/>
>> 
>> This mapping is a standard feature of ICU. More info on ICU transforms is
>> in this doc, though not much detail on this particular transform.
>> 
>> http://userguide.icu-project.org/transforms/general
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <su...@gmail.com>
>> wrote:
>>> 
>>> I think so.  I used the exact settings as in the github repo
>>> 
>>> <fieldType name="text_cjk" class="solr.TextField"
>>> positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>> <analyzer>
>>>   <tokenizer class="solr.ICUTokenizerFactory" />
>>>   <filter class="solr.CJKWidthFilterFactory"/>
>>>   <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>   <filter class="solr.ICUTransformFilterFactory"
>> id="Traditional-Simplified"/>
>>>   <filter class="solr.ICUTransformFilterFactory"
>> id="Katakana-Hiragana"/>
>>>   <filter class="solr.ICUFoldingFilterFactory"/>
>>>   <filter class="solr.CJKBigramFilterFactory" han="true"
>>> hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>> </analyzer>
>>> </fieldType>
>>> 
>>> 
>>> 
>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shuman@gmail.com
>>> 
>>> wrote:
>>> 
>>>> Thanks! That does indeed look promising... This can be added on top of
>>>> Smart Chinese, right? Or is it an alternative?
>>>> 
>>>> 
>>>> ------
>>>> Dr. Amanda Shuman
>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>> PhD, University of California, Santa Cruz
>>>> http://www.amandashuman.net/
>>>> http://www.prchistoryresources.org/
>>>> Office: +49 (0) 761 203 4925
>>>> 
>>>> 
>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <su...@gmail.com>
>>>> wrote:
>>>> 
>>>>> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and
>> then
>>>>> each of A, B or C or D in query and they seems to be matching and CJKFF
>>>> is
>>>>> transforming the 舊 to 旧
>>>>> 
>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <su...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Lack of my chinese language knowledge but if you want, I can do quick
>>>>> test
>>>>>> for you in Analysis tab if you can give me what to put in index and
>>>> query
>>>>>> window...
>>>>>> 
>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2777@gmail.com
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Have you tried to use CJKFoldingFilter
>>>>>>> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
>>>> cover
>>>>>>> your use case but I am using this filter and so far no issues.
>>>>>>> 
>>>>>>> Thnx
>>>>>>> 
>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
>>>> amanda.shuman@gmail.com
>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Thanks, Alex - I have seen a few of those links but never considered
>>>>>>>> transliteration! We use lucene's Smart Chinese analyzer. The issue
>> is
>>>>>>>> basically what is laid out in the old blogspot post, namely this
>>>> point:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> "Why approach CJK resource discovery differently?
>>>>>>>> 
>>>>>>>> 2.  Search results must be as script agnostic as possible.
>>>>>>>> 
>>>>>>>> There is more than one way to write each word. "Simplified"
>>>> characters
>>>>>>>> were
>>>>>>>> emphasized for printed materials in mainland China starting in the
>>>>> 1950s;
>>>>>>>> "Traditional" characters were used in printed materials prior to the
>>>>>>>> 1950s,
>>>>>>>> and are still used in Taiwan, Hong Kong and Macau today.
>>>>>>>> Since the characters are distinct, it's as if Chinese materials are
>>>>>>>> written
>>>>>>>> in two scripts.
>>>>>>>> Another way to think about it:  every written Chinese word has at
>>>> least
>>>>>>>> two
>>>>>>>> completely different spellings.  And it can be mix-n-match:  a word
>>>> can
>>>>>>>> be
>>>>>>>> written with one traditional  and one simplified character.
>>>>>>>> Example:   Given a user query 舊小說  (traditional for old fiction),
>> the
>>>>>>>> results should include matches for 舊小說 (traditional) and 旧小说
>>>>> (simplified
>>>>>>>> characters for old fiction)"
>>>>>>>> 
>>>>>>>> So, using the example provided above, we are dealing with materials
>>>>>>>> produced in the 1950s-1970s that do even weirder things like:
>>>>>>>> 
>>>>>>>> A. 舊小說
>>>>>>>> 
>>>>>>>> can also be
>>>>>>>> 
>>>>>>>> B. 旧小说 (all simplified)
>>>>>>>> or
>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>>>> or
>>>>>>>> D. 舊小 说 (first character traditional, last character simplified)
>>>>>>>> 
>>>>>>>> Thankfully the middle character was never simplified in recent
>> times.
>>>>>>>> 
>>>>>>>> From a historical standpoint, the mixed nature of the characters in
>>>> the
>>>>>>>> same word/phrase is because not all simplified characters were
>>>> adopted
>>>>> at
>>>>>>>> the same time by everyone uniformly (good times...).
>>>>>>>> 
>>>>>>>> The problem seems to be that Solr can easily handle A or B above,
>> but
>>>>>>>> NOT C
>>>>>>>> or D using the Smart Chinese analyzer. I'm not really sure how to
>>>>> change
>>>>>>>> that at this point... maybe I should figure out how to contact the
>>>>>>>> creators
>>>>>>>> of the analyzer and ask them?
>>>>>>>> 
>>>>>>>> Amanda
>>>>>>>> 
>>>>>>>> ------
>>>>>>>> Dr. Amanda Shuman
>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>>>> Project
>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>> http://www.amandashuman.net/
>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>>>>>>>> arafalov@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> This is probably your start, if not read already:
>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>>> 
>>>>>>>>> Otherwise, I think your answer would be somewhere around using
>>>> ICU4J,
>>>>>>>>> IBM's library for dealing with Unicode:
>>>> http://site.icu-project.org/
>>>>>>>>> (mentioned on the same page above)
>>>>>>>>> Specifically, transformations:
>>>>>>>>> http://userguide.icu-project.org/transforms/general
>>>>>>>>> 
>>>>>>>>> With that, maybe you map both alphabets into latin. I did that once
>>>>>>>>> for Thai for a demo:
>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/
>>>>>>>>> collection1/conf/schema.xml#L34
>>>>>>>>> 
>>>>>>>>> The challenge is to figure out all the magic rules for that. You'd
>>>>>>>>> have to dig through the ICU documentation and other web pages. I
>>>>> found
>>>>>>>>> this one for example:
>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-
>>>>>>>>> transliterators-available-with-icu4j.html;jsessionid=
>>>>>>>>> BEAB0AF05A588B97B8A2393054D908C0
>>>>>>>>> 
>>>>>>>>> There is also 12 part series on Solr and Asian text processing,
>>>>> though
>>>>>>>>> it is a bit old now: http://discovery-grindstone.blogspot.com/
>>>>>>>>> 
>>>>>>>>> Hope one of these things help.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>>  Alex.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <am...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> We have a problem. Some of our historical documents have mixed
>>>>>>>> together
>>>>>>>>>> simplified and Chinese characters. There seems to be no problem
>>>>> when
>>>>>>>>>> searching either traditional or simplified separately - that is,
>>>>> if a
>>>>>>>>>> particular string/phrase is all in traditional or simplified, it
>>>>>>>> finds
>>>>>>>>> it -
>>>>>>>>>> but it does not find the string/phrase if the two different
>>>>>>>> characters
>>>>>>>>> (one
>>>>>>>>>> traditional, one simplified) are mixed together in the SAME
>>>>>>>>> string/phrase.
>>>>>>>>>> 
>>>>>>>>>> Has anyone ever handled this problem before? I know some
>>>> libraries
>>>>>>>> seem
>>>>>>>>> to
>>>>>>>>>> have implemented something that seems to be able to handle this,
>>>>> but
>>>>>>>> I'm
>>>>>>>>>> not sure how they did so!
>>>>>>>>>> 
>>>>>>>>>> Amanda
>>>>>>>>>> ------
>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>>>>>>>> Project
>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 
> 
> -- 
> Tomoko Uchida


Re: Question regarding searching Chinese characters

Posted by Tomoko Uchida <to...@gmail.com>.
Exactly. More concretely, the starting point is replacing your analyzer

<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>

to

<analyzer>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory"
id="Traditional-Simplified"/>
</analyzer>

and see if the results are as expected. Then investigate other filters if
your requirements are not met.

Just a reminder: HMMChineseTokenizerFactory does not handle traditional
characters, as I noted in my previous post, so ICUTransformFilterFactory is
an incomplete workaround.
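To make the effect of the Traditional-Simplified fold concrete, here is a toy Python sketch. It does not use ICU; the two-entry character map is a hand-rolled assumption covering only this thread's example characters, where the real ICUTransformFilterFactory draws on the full Unicode Han transliteration data. The point is that folding both the indexed text and the query onto one script collapses all mixed-script spellings to a single form:

```python
# Toy stand-in for the Traditional-Simplified transform (NOT ICU):
# a minimal hand-rolled map with just the characters from this thread.
TRAD_TO_SIMP = {"舊": "旧", "說": "说"}

def fold(text: str) -> str:
    """Map each traditional character we know about to its simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

# The four spellings of "old fiction" discussed in this thread
# (all-traditional, all-simplified, and the two mixed forms)
# all collapse to the same indexed/queried string:
variants = ["舊小說", "旧小说", "旧小說", "舊小说"]
folded = {fold(v) for v in variants}
assert folded == {"旧小说"}
```

Because the same analyzer runs at index and query time, any of the four variants used as a query then matches any of them in a document.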

On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wu...@wunderwood.org> wrote:


-- 
Tomoko Uchida

Re: Question regarding searching Chinese characters

Posted by Walter Underwood <wu...@wunderwood.org>.
I expect that this is the line that does the transformation:

   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>

This mapping is a standard feature of ICU. More info on ICU transforms is in this doc, though not much detail on this particular transform. 

http://userguide.icu-project.org/transforms/general

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Question regarding searching Chinese characters

Posted by Susheel Kumar <su...@gmail.com>.
I think so. I used the exact configuration from GitHub:

<fieldType name="text_cjk" class="solr.TextField"
positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true"
hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
  </analyzer>
</fieldType>
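Two of the filters in this chain can be illustrated with a short Python sketch. This is a rough model under stated assumptions, not the real filters: CJKWidthFilterFactory's fullwidth/halfwidth normalization is approximated here by Unicode NFKC, and the bigram function only models what CJKBigramFilterFactory emits for a single run of Han characters (the actual filter operates on the token stream and handles script boundaries and positions):

```python
import unicodedata

# Width folding: CJKWidthFilterFactory normalizes fullwidth ASCII and
# halfwidth Katakana; NFKC normalization is a close stand-in for that.
assert unicodedata.normalize("NFKC", "Ｓｏｌｒ") == "Solr"

def cjk_bigrams(han_run: str, output_unigrams: bool = True) -> list:
    """Rough model of CJKBigramFilterFactory over one run of Han characters:
    emit overlapping bigrams, plus the unigrams when outputUnigrams="true"."""
    tokens = [han_run[i:i + 2] for i in range(len(han_run) - 1)]
    if output_unigrams:
        tokens += list(han_run)
    return tokens

# 旧小说 ("old fiction", simplified) indexes as two bigrams plus three
# unigrams, so both single-character and two-character queries can match.
assert cjk_bigrams("旧小说") == ["旧小", "小说", "旧", "小", "说"]
```

Note that the bigrams are produced after the Traditional-Simplified transform in the chain above, which is why mixed-script input still yields matching bigrams.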



On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <am...@gmail.com>
wrote:


Re: Question regarding searching Chinese characters

Posted by Amanda Shuman <am...@gmail.com>.
Thanks! That does indeed look promising... This can be added on top of
Smart Chinese, right? Or is it an alternative?


------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <su...@gmail.com>
wrote:

> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
> then each of A, B, C, or D in the query, and they all seem to match; CJKFF
> is transforming 舊 to 旧.
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <su...@gmail.com>
> wrote:
>
> > I lack Chinese language knowledge, but if you want, I can do a quick
> > test for you in the Analysis tab if you can give me what to put in the
> > index and query windows...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <su...@gmail.com>
> > wrote:
> >
> >> Have you tried to use CJKFoldingFilter
> >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
> >> your use case but I am using this filter and so far no issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shuman@gmail.com
> >
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小 说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in the
> >>> same word/phrase is because not all simplified characters were adopted
> at
> >>> the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C
> >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> change
> >>> that at this point... maybe I should figure out how to contact the
> >>> creators
> >>> of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> ------
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>> PhD, University of California, Santa Cruz
> >>> http://www.amandashuman.net/
> >>> http://www.prchistoryresources.org/
> >>> Office: +49 (0) 761 203 4925
> >>>
> >>>
> >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> >>> arafalov@gmail.com>
> >>> wrote:
> >>>
> >>> > This is probably your start, if not read already:
> >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>> >
> >>> > Otherwise, I think your answer would be somewhere around using ICU4J,
> >>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> >>> > (mentioned on the same page above)
> >>> > Specifically, transformations:
> >>> > http://userguide.icu-project.org/transforms/general
> >>> >
> >>> > With that, maybe you map both alphabets into latin. I did that once
> >>> > for Thai for a demo:
> >>> > https://github.com/arafalov/solr-thai-test/blob/master/
> >>> > collection1/conf/schema.xml#L34
> >>> >
> >>> > The challenge is to figure out all the magic rules for that. You'd
> >>> > have to dig through the ICU documentation and other web pages. I
> found
> >>> > this one for example:
> >>> > http://avajava.com/tutorials/lessons/what-are-the-system-
> >>> > transliterators-available-with-icu4j.html;jsessionid=
> >>> > BEAB0AF05A588B97B8A2393054D908C0
> >>> >
> >>> > There is also 12 part series on Solr and Asian text processing,
> though
> >>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
> >>> >
> >>> > Hope one of these things help.
> >>> >
> >>> > Regards,
> >>> >    Alex.
> >>> >
> >>> >
> >>> > On 20 July 2018 at 03:54, Amanda Shuman <am...@gmail.com>
> >>> wrote:
> >>> > > Hi all,
> >>> > >
> >>> > > We have a problem. Some of our historical documents have mixed
> >>> > > together simplified and traditional Chinese characters. There
> >>> > > seems to be no problem
> when
> >>> > > searching either traditional or simplified separately - that is,
> if a
> >>> > > particular string/phrase is all in traditional or simplified, it
> >>> finds
> >>> > it -
> >>> > > but it does not find the string/phrase if the two different
> >>> characters
> >>> > (one
> >>> > > traditional, one simplified) are mixed together in the SAME
> >>> > string/phrase.
> >>> > >
> >>> > > Has anyone ever handled this problem before? I know some libraries
> >>> seem
> >>> > to
> >>> > > have implemented something that seems to be able to handle this,
> but
> >>> I'm
> >>> > > not sure how they did so!
> >>> > >
> >>> > > Amanda
> >>> > > ------
> >>> > > Dr. Amanda Shuman
> >>> > > Post-doc researcher, University of Freiburg, The Maoist Legacy
> >>> Project
> >>> > > <http://www.maoistlegacy.uni-freiburg.de/>
> >>> > > PhD, University of California, Santa Cruz
> >>> > > http://www.amandashuman.net/
> >>> > > http://www.prchistoryresources.org/
> >>> > > Office: +49 (0) 761 203 4925
> >>> >
> >>>
> >>
> >>
> >
>

Re: Question regarding searching Chinese characters

Posted by Susheel Kumar <su...@gmail.com>.
I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then
each of A, B, C, or D in the query, and they all seem to match; CJKFF is
transforming 舊 to 旧.
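
In case it helps, here is roughly how such a filter would be wired in. This
is an untested sketch: the factory class name is the one the
CJKFoldingFilter project ships (verify it against the jar you build), the
jar has to be on Solr's classpath, and the exact filter ordering is worth
checking in the Analysis screen.

```xml
<!-- Sketch: Smart Chinese tokenization plus CJK variant folding -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- HMM-based Smart Chinese tokenizer (works best on Simplified text) -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- Folds Traditional (and other CJK variant) forms to Simplified,
         so 舊小說, 旧小说, 旧小說 and 舊小说 all produce the same tokens -->
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Use the same analyzer at index and query time so both sides are folded
identically.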

On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <su...@gmail.com>
wrote:

> I lack Chinese language knowledge, but if you want, I can do a quick test
> for you in the Analysis tab if you can give me what to put in the index and
> query windows...
>
> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <su...@gmail.com>
> wrote:
>
>> Have you tried to use CJKFoldingFilter
>> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
>> your use case but I am using this filter and so far no issues.
>>
>> Thnx
>>
>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <am...@gmail.com>
>> wrote:
>>
>>> Thanks, Alex - I have seen a few of those links but never considered
>>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>>> basically what is laid out in the old blogspot post, namely this point:
>>>
>>>
>>> "Why approach CJK resource discovery differently?
>>>
>>> 2.  Search results must be as script agnostic as possible.
>>>
>>> There is more than one way to write each word. "Simplified" characters
>>> were
>>> emphasized for printed materials in mainland China starting in the 1950s;
>>> "Traditional" characters were used in printed materials prior to the
>>> 1950s,
>>> and are still used in Taiwan, Hong Kong and Macau today.
>>> Since the characters are distinct, it's as if Chinese materials are
>>> written
>>> in two scripts.
>>> Another way to think about it:  every written Chinese word has at least
>>> two
>>> completely different spellings.  And it can be mix-n-match:  a word can
>>> be
>>> written with one traditional  and one simplified character.
>>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>>> characters for old fiction)"
>>>
>>> So, using the example provided above, we are dealing with materials
>>> produced in the 1950s-1970s that do even weirder things like:
>>>
>>> A. 舊小說
>>>
>>> can also be
>>>
>>> B. 旧小说 (all simplified)
>>> or
>>> C. 旧小說 (first character simplified, last character traditional)
>>> or
>>> D. 舊小 说 (first character traditional, last character simplified)
>>>
>>> Thankfully the middle character was never simplified in recent times.
>>>
>>> From a historical standpoint, the mixed nature of the characters in the
>>> same word/phrase is because not all simplified characters were adopted at
>>> the same time by everyone uniformly (good times...).
>>>
>>> The problem seems to be that Solr can easily handle A or B above, but
>>> NOT C
>>> or D using the Smart Chinese analyzer. I'm not really sure how to change
>>> that at this point... maybe I should figure out how to contact the
>>> creators
>>> of the analyzer and ask them?
>>>
>>> Amanda
>>>
>>> ------
>>> Dr. Amanda Shuman
>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>> PhD, University of California, Santa Cruz
>>> http://www.amandashuman.net/
>>> http://www.prchistoryresources.org/
>>> Office: +49 (0) 761 203 4925
>>>
>>>
>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>>> arafalov@gmail.com>
>>> wrote:
>>>
>>> > This is probably your start, if not read already:
>>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>> >
>>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>>> > (mentioned on the same page above)
>>> > Specifically, transformations:
>>> > http://userguide.icu-project.org/transforms/general
>>> >
>>> > With that, maybe you map both alphabets into latin. I did that once
>>> > for Thai for a demo:
>>> > https://github.com/arafalov/solr-thai-test/blob/master/
>>> > collection1/conf/schema.xml#L34
>>> >
>>> > The challenge is to figure out all the magic rules for that. You'd
>>> > have to dig through the ICU documentation and other web pages. I found
>>> > this one for example:
>>> > http://avajava.com/tutorials/lessons/what-are-the-system-
>>> > transliterators-available-with-icu4j.html;jsessionid=
>>> > BEAB0AF05A588B97B8A2393054D908C0
>>> >
>>> > There is also 12 part series on Solr and Asian text processing, though
>>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
>>> >
>>> > Hope one of these things help.
>>> >
>>> > Regards,
>>> >    Alex.
>>> >
>>> >
>>> > On 20 July 2018 at 03:54, Amanda Shuman <am...@gmail.com>
>>> wrote:
>>> > > Hi all,
>>> > >
>>> > > We have a problem. Some of our historical documents have mixed
>>> > > together simplified and traditional Chinese characters. There seems
>>> > > to be no problem when
>>> > > searching either traditional or simplified separately - that is, if a
>>> > > particular string/phrase is all in traditional or simplified, it
>>> finds
>>> > it -
>>> > > but it does not find the string/phrase if the two different
>>> characters
>>> > (one
>>> > > traditional, one simplified) are mixed together in the SAME
>>> > string/phrase.
>>> > >
>>> > > Has anyone ever handled this problem before? I know some libraries
>>> seem
>>> > to
>>> > > have implemented something that seems to be able to handle this, but
>>> I'm
>>> > > not sure how they did so!
>>> > >
>>> > > Amanda
>>> > > ------
>>> > > Dr. Amanda Shuman
>>> > > Post-doc researcher, University of Freiburg, The Maoist Legacy
>>> Project
>>> > > <http://www.maoistlegacy.uni-freiburg.de/>
>>> > > PhD, University of California, Santa Cruz
>>> > > http://www.amandashuman.net/
>>> > > http://www.prchistoryresources.org/
>>> > > Office: +49 (0) 761 203 4925
>>> >
>>>
>>
>>
>

Re: Question regarding searching Chinese characters

Posted by Susheel Kumar <su...@gmail.com>.
I lack Chinese language knowledge, but if you want, I can do a quick test
for you in the Analysis tab if you can give me what to put in the index and
query windows...

On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <su...@gmail.com>
wrote:

> Have you tried to use CJKFoldingFilter
> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> cover your use case but I am using this filter and so far no issues.
>
> Thnx
>
> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <am...@gmail.com>
> wrote:
>
>> Thanks, Alex - I have seen a few of those links but never considered
>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>> basically what is laid out in the old blogspot post, namely this point:
>>
>>
>> "Why approach CJK resource discovery differently?
>>
>> 2.  Search results must be as script agnostic as possible.
>>
>> There is more than one way to write each word. "Simplified" characters
>> were
>> emphasized for printed materials in mainland China starting in the 1950s;
>> "Traditional" characters were used in printed materials prior to the
>> 1950s,
>> and are still used in Taiwan, Hong Kong and Macau today.
>> Since the characters are distinct, it's as if Chinese materials are
>> written
>> in two scripts.
>> Another way to think about it:  every written Chinese word has at least
>> two
>> completely different spellings.  And it can be mix-n-match:  a word can be
>> written with one traditional  and one simplified character.
>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>> characters for old fiction)"
>>
>> So, using the example provided above, we are dealing with materials
>> produced in the 1950s-1970s that do even weirder things like:
>>
>> A. 舊小說
>>
>> can also be
>>
>> B. 旧小说 (all simplified)
>> or
>> C. 旧小說 (first character simplified, last character traditional)
>> or
>> D. 舊小 说 (first character traditional, last character simplified)
>>
>> Thankfully the middle character was never simplified in recent times.
>>
>> From a historical standpoint, the mixed nature of the characters in the
>> same word/phrase is because not all simplified characters were adopted at
>> the same time by everyone uniformly (good times...).
>>
>> The problem seems to be that Solr can easily handle A or B above, but NOT
>> C
>> or D using the Smart Chinese analyzer. I'm not really sure how to change
>> that at this point... maybe I should figure out how to contact the
>> creators
>> of the analyzer and ask them?
>>
>> Amanda
>>
>> ------
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> <http://www.maoistlegacy.uni-freiburg.de/>
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>>
>>
>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>> arafalov@gmail.com>
>> wrote:
>>
>> > This is probably your start, if not read already:
>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>> >
>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>> > (mentioned on the same page above)
>> > Specifically, transformations:
>> > http://userguide.icu-project.org/transforms/general
>> >
>> > With that, maybe you map both alphabets into latin. I did that once
>> > for Thai for a demo:
>> > https://github.com/arafalov/solr-thai-test/blob/master/
>> > collection1/conf/schema.xml#L34
>> >
>> > The challenge is to figure out all the magic rules for that. You'd
>> > have to dig through the ICU documentation and other web pages. I found
>> > this one for example:
>> > http://avajava.com/tutorials/lessons/what-are-the-system-
>> > transliterators-available-with-icu4j.html;jsessionid=
>> > BEAB0AF05A588B97B8A2393054D908C0
>> >
>> > There is also 12 part series on Solr and Asian text processing, though
>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
>> >
>> > Hope one of these things help.
>> >
>> > Regards,
>> >    Alex.
>> >
>> >
>> > On 20 July 2018 at 03:54, Amanda Shuman <am...@gmail.com>
>> wrote:
>> > > Hi all,
>> > >
>> > > We have a problem. Some of our historical documents have mixed
>> > > together simplified and traditional Chinese characters. There seems
>> > > to be no problem when
>> > > searching either traditional or simplified separately - that is, if a
>> > > particular string/phrase is all in traditional or simplified, it finds
>> > it -
>> > > but it does not find the string/phrase if the two different characters
>> > (one
>> > > traditional, one simplified) are mixed together in the SAME
>> > string/phrase.
>> > >
>> > > Has anyone ever handled this problem before? I know some libraries
>> seem
>> > to
>> > > have implemented something that seems to be able to handle this, but
>> I'm
>> > > not sure how they did so!
>> > >
>> > > Amanda
>> > > ------
>> > > Dr. Amanda Shuman
>> > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> > > <http://www.maoistlegacy.uni-freiburg.de/>
>> > > PhD, University of California, Santa Cruz
>> > > http://www.amandashuman.net/
>> > > http://www.prchistoryresources.org/
>> > > Office: +49 (0) 761 203 4925
>> >
>>
>
>

Re: Question regarding searching Chinese characters

Posted by Susheel Kumar <su...@gmail.com>.
Have you tried to use CJKFoldingFilter
https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
cover your use case but I am using this filter and so far no issues.

Thnx

On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <am...@gmail.com>
wrote:

> Thanks, Alex - I have seen a few of those links but never considered
> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> basically what is laid out in the old blogspot post, namely this point:
>
>
> "Why approach CJK resource discovery differently?
>
> 2.  Search results must be as script agnostic as possible.
>
> There is more than one way to write each word. "Simplified" characters were
> emphasized for printed materials in mainland China starting in the 1950s;
> "Traditional" characters were used in printed materials prior to the 1950s,
> and are still used in Taiwan, Hong Kong and Macau today.
> Since the characters are distinct, it's as if Chinese materials are written
> in two scripts.
> Another way to think about it:  every written Chinese word has at least two
> completely different spellings.  And it can be mix-n-match:  a word can be
> written with one traditional  and one simplified character.
> Example:   Given a user query 舊小說  (traditional for old fiction), the
> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
> characters for old fiction)"
>
> So, using the example provided above, we are dealing with materials
> produced in the 1950s-1970s that do even weirder things like:
>
> A. 舊小說
>
> can also be
>
> B. 旧小说 (all simplified)
> or
> C. 旧小說 (first character simplified, last character traditional)
> or
> D. 舊小 说 (first character traditional, last character simplified)
>
> Thankfully the middle character was never simplified in recent times.
>
> From a historical standpoint, the mixed nature of the characters in the
> same word/phrase is because not all simplified characters were adopted at
> the same time by everyone uniformly (good times...).
>
> The problem seems to be that Solr can easily handle A or B above, but NOT C
> or D using the Smart Chinese analyzer. I'm not really sure how to change
> that at this point... maybe I should figure out how to contact the creators
> of the analyzer and ask them?
>
> Amanda
>
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafalov@gmail.com
> >
> wrote:
>
> > This is probably your start, if not read already:
> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >
> > Otherwise, I think your answer would be somewhere around using ICU4J,
> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> > (mentioned on the same page above)
> > Specifically, transformations:
> > http://userguide.icu-project.org/transforms/general
> >
> > With that, maybe you map both alphabets into latin. I did that once
> > for Thai for a demo:
> > https://github.com/arafalov/solr-thai-test/blob/master/
> > collection1/conf/schema.xml#L34
> >
> > The challenge is to figure out all the magic rules for that. You'd
> > have to dig through the ICU documentation and other web pages. I found
> > this one for example:
> > http://avajava.com/tutorials/lessons/what-are-the-system-
> > transliterators-available-with-icu4j.html;jsessionid=
> > BEAB0AF05A588B97B8A2393054D908C0
> >
> > There is also 12 part series on Solr and Asian text processing, though
> > it is a bit old now: http://discovery-grindstone.blogspot.com/
> >
> > Hope one of these things help.
> >
> > Regards,
> >    Alex.
> >
> >
> > On 20 July 2018 at 03:54, Amanda Shuman <am...@gmail.com> wrote:
> > > Hi all,
> > >
> > > We have a problem. Some of our historical documents have mixed together
> > > simplified and traditional Chinese characters. There seems to be no problem when
> > > searching either traditional or simplified separately - that is, if a
> > > particular string/phrase is all in traditional or simplified, it finds
> > it -
> > > but it does not find the string/phrase if the two different characters
> > (one
> > > traditional, one simplified) are mixed together in the SAME
> > string/phrase.
> > >
> > > Has anyone ever handled this problem before? I know some libraries seem
> > to
> > > have implemented something that seems to be able to handle this, but
> I'm
> > > not sure how they did so!
> > >
> > > Amanda
> > > ------
> > > Dr. Amanda Shuman
> > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > <http://www.maoistlegacy.uni-freiburg.de/>
> > > PhD, University of California, Santa Cruz
> > > http://www.amandashuman.net/
> > > http://www.prchistoryresources.org/
> > > Office: +49 (0) 761 203 4925
> >
>

Re: Question regarding searching Chinese characters

Posted by Amanda Shuman <am...@gmail.com>.
Thanks, Alex - I have seen a few of those links but never considered
transliteration! We use Lucene's Smart Chinese analyzer. The issue is
basically what is laid out in the old blogspot post, namely this point:


"Why approach CJK resource discovery differently?

2.  Search results must be as script agnostic as possible.

There is more than one way to write each word. "Simplified" characters were
emphasized for printed materials in mainland China starting in the 1950s;
"Traditional" characters were used in printed materials prior to the 1950s,
and are still used in Taiwan, Hong Kong and Macau today.
Since the characters are distinct, it's as if Chinese materials are written
in two scripts.
Another way to think about it:  every written Chinese word has at least two
completely different spellings.  And it can be mix-n-match:  a word can be
written with one traditional  and one simplified character.
Example:   Given a user query 舊小說  (traditional for old fiction), the
results should include matches for 舊小說 (traditional) and 旧小说 (simplified
characters for old fiction)"

So, using the example provided above, we are dealing with materials
produced in the 1950s-1970s that do even weirder things like:

A. 舊小說

can also be

B. 旧小说 (all simplified)
or
C. 旧小說 (first character simplified, last character traditional)
or
D. 舊小 说 (first character traditional, last character simplified)

Thankfully the middle character was never simplified in recent times.

From a historical standpoint, the mixed nature of the characters in the
same word/phrase is because not all simplified characters were adopted at
the same time by everyone uniformly (good times...).

The problem seems to be that Solr can easily handle A or B above, but NOT C
or D using the Smart Chinese analyzer. I'm not really sure how to change
that at this point... maybe I should figure out how to contact the creators
of the analyzer and ask them?
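
One workaround we could try before that: fold traditional variants to
simplified before the tokenizer ever sees the text, with a mapping char
filter, so that mixed forms like C and D are reduced to B on both the index
and query side. A sketch only: the mapping file name and entries below are
invented for illustration, and a real traditional-to-simplified table would
be far longer.

```xml
<!-- Sketch: the char filter runs before tokenization, so segmentation
     also happens on the normalized (Simplified) text -->
<fieldType name="text_zh_norm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="trad-to-simp.txt"/>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
```

with entries in trad-to-simp.txt such as "舊" => "旧" and "說" => "说".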

Amanda

------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> This is probably your start, if not read already:
> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>
> Otherwise, I think your answer would be somewhere around using ICU4J,
> IBM's library for dealing with Unicode: http://site.icu-project.org/
> (mentioned on the same page above)
> Specifically, transformations:
> http://userguide.icu-project.org/transforms/general
>
> With that, maybe you map both alphabets into latin. I did that once
> for Thai for a demo:
> https://github.com/arafalov/solr-thai-test/blob/master/
> collection1/conf/schema.xml#L34
>
> The challenge is to figure out all the magic rules for that. You'd
> have to dig through the ICU documentation and other web pages. I found
> this one for example:
> http://avajava.com/tutorials/lessons/what-are-the-system-
> transliterators-available-with-icu4j.html;jsessionid=
> BEAB0AF05A588B97B8A2393054D908C0
>
> There is also 12 part series on Solr and Asian text processing, though
> it is a bit old now: http://discovery-grindstone.blogspot.com/
>
> Hope one of these things help.
>
> Regards,
>    Alex.
>
>
> On 20 July 2018 at 03:54, Amanda Shuman <am...@gmail.com> wrote:
> > Hi all,
> >
> > We have a problem. Some of our historical documents have mixed together
> > simplified and traditional Chinese characters. There seems to be no problem when
> > searching either traditional or simplified separately - that is, if a
> > particular string/phrase is all in traditional or simplified, it finds
> it -
> > but it does not find the string/phrase if the two different characters
> (one
> > traditional, one simplified) are mixed together in the SAME
> string/phrase.
> >
> > Has anyone ever handled this problem before? I know some libraries seem
> to
> > have implemented something that seems to be able to handle this, but I'm
> > not sure how they did so!
> >
> > Amanda
> > ------
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > <http://www.maoistlegacy.uni-freiburg.de/>
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925
>

Re: Question regarding searching Chinese characters

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
This is probably your starting point, if you have not read it already:
https://lucene.apache.org/solr/guide/7_4/language-analysis.html

Otherwise, I think your answer would be somewhere around using ICU4J,
IBM's library for dealing with Unicode: http://site.icu-project.org/
(mentioned on the same page above)
Specifically, transformations:
http://userguide.icu-project.org/transforms/general

With that, maybe you map both scripts into Latin. I did that once
for Thai for a demo:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
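
For Chinese specifically there is also a ready-made ICU system transform,
Traditional-Simplified, which Solr can apply at analysis time through
ICUTransformFilterFactory (part of analysis-extras; the ICU jars must be on
the classpath). A sketch, not a tested config:

```xml
<!-- Sketch: normalize Traditional forms to Simplified with an ICU transform -->
<fieldType name="text_zh_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- "Traditional-Simplified" is a standard ICU transform id -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```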

The challenge is to figure out all the magic rules for that. You'd
have to dig through the ICU documentation and other web pages. I found
this one for example:
http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0

There is also a 12-part series on Solr and Asian text processing, though
it is a bit old now: http://discovery-grindstone.blogspot.com/

Hope one of these things helps.

Regards,
   Alex.


On 20 July 2018 at 03:54, Amanda Shuman <am...@gmail.com> wrote:
> Hi all,
>
> We have a problem. Some of our historical documents have mixed together
> simplified and traditional Chinese characters. There seems to be no problem when
> searching either traditional or simplified separately - that is, if a
> particular string/phrase is all in traditional or simplified, it finds it -
> but it does not find the string/phrase if the two different characters (one
> traditional, one simplified) are mixed together in the SAME string/phrase.
>
> Has anyone ever handled this problem before? I know some libraries seem to
> have implemented something that seems to be able to handle this, but I'm
> not sure how they did so!
>
> Amanda
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925