Posted to solr-user@lucene.apache.org by Christopher Beer <ca...@stanford.edu> on 2018/08/14 17:01:50 UTC

Re: Question regarding searching Chinese characters

Hi all,

Thanks for this enlightening thread. As it happens, at Stanford Libraries we’re currently working on upgrading from Solr 4 to 7 and we’re looking forward to using the new dictionary-based word splitting in the ICUTokenizer.

We have many of the same challenges Amanda mentioned, and thanks to the advice on this thread we’ve taken a stab at a CharFilter to do the traditional -> simplified transformation [1]. It looks promising, and we’ve sent it to our subject matter experts for evaluation.

Thanks,
Chris

[1] https://github.com/sul-dlss/CJKFilterUtils/blob/master/src/main/java/edu/stanford/lucene/analysis/ICUTransformCharFilter.java

On 2018/07/24 12:54:35, Tomoko Uchida <t....@gmail.com> wrote:
Hi Amanda,

> do I just need to change the settings from smartChinese to the ones
> you posted here

Yes, the settings I posted should work for you, at least partially.
If you are happy with the results, it's OK!
But please take this as a starting point because it's not perfect.

> Or do I need to still do something with the SmartChineseAnalyzer?

Try the settings, then if you notice something strange and want to know why
and how to solve it, that may be the time to dive deeper. ;)

I cannot explain how analyzers work here... but you should start off with
the Solr documentation:
https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html

Regards,
Tomoko



On Tue, Jul 24, 2018 at 21:08 Amanda Shuman <am...@gmail.com> wrote:

Hi Tomoko,

Thanks so much for this explanation - I did not even know this was
possible! I will try it out, but I have one question: do I just need to
change the settings from smartChinese to the ones you posted here:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>
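
(For anyone following along: in a Solr schema, an analyzer chain like this normally sits inside a fieldType definition that you then reference from your field. A minimal sketch - the type name "text_zh" is just illustrative, not something from this thread:

```xml
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```

Since the transform runs as an index-time and query-time filter, both traditional and simplified queries should match the same documents.)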

Or do I need to still do something with the SmartChineseAnalyzer? I did not
quite understand this in your first message:

"I think you need two steps if you want to use HMMChineseTokenizer
correctly.

1. transform all traditional characters to simplified ones and save to
temporary files. I do not have clear idea for doing this, but you can
create a Java program that calls Lucene's ICUTransformFilter
2. then, index to Solr using SmartChineseAnalyzer."

My understanding is that with the new settings you posted, I don't need to
do these steps. Is that correct? Otherwise, I don't really know how to do
step 1 with the Java program....
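
(If anyone does want to pre-convert files as in step 1, here is a minimal sketch using ICU4J's Transliterator directly - the same "Traditional-Simplified" transform the ICUTransformFilter wraps. It assumes the ICU4J jar (com.ibm.icu) is on the classpath, and the class/method names are illustrative:

```java
import com.ibm.icu.text.Transliterator;

// Sketch of "step 1": convert traditional Chinese text to simplified
// before handing the documents to Solr / SmartChineseAnalyzer.
public class TradToSimp {
    // ICU transform that maps traditional characters to simplified ones;
    // this is the same transform id used in the analyzer config above.
    private static final Transliterator TRAD_TO_SIMP =
            Transliterator.getInstance("Traditional-Simplified");

    public static String convert(String text) {
        return TRAD_TO_SIMP.transliterate(text);
    }

    public static void main(String[] args) {
        // e.g. traditional 漢語 becomes simplified 汉语
        System.out.println(convert("漢語"));
    }
}
```

You would run this over each source file and write the converted text to the temporary files Tomoko mentioned, then index those with SmartChineseAnalyzer.)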

Thanks!
Amanda


------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925