Posted to solr-user@lucene.apache.org by Christopher Beer <ca...@stanford.edu> on 2018/08/14 17:01:50 UTC
Re: Question regarding searching Chinese characters
Hi all,
Thanks for this enlightening thread. As it happens, at Stanford Libraries we’re currently working on upgrading from Solr 4 to 7 and we’re looking forward to using the new dictionary-based word splitting in the ICUTokenizer.
We have many of the same challenges Amanda mentioned, and thanks to the advice on this thread we've taken a stab at a CharFilter to do the traditional -> simplified transformation [1]. It looks promising, and we've sent it out to our subject matter experts for evaluation.
Thanks,
Chris
[1] https://github.com/sul-dlss/CJKFilterUtils/blob/master/src/main/java/edu/stanford/lucene/analysis/ICUTransformCharFilter.java
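For anyone wondering how such a char filter might be wired into a Solr field type, here is a minimal config sketch. Note the factory class name below is an assumption based on the repository layout, not confirmed by the thread; check the project's README for the actual name:

<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- hypothetical factory name for the linked CharFilter; verify against the repo -->
    <charFilter class="edu.stanford.lucene.analysis.ICUTransformCharFilterFactory"
                id="Traditional-Simplified"/>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  </analyzer>
</fieldType>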
On 2018/07/24 12:54:35, Tomoko Uchida <t....@gmail.com> wrote:
Hi Amanda,

> is all I need to do to change the settings from smartChinese to the ones
> you posted here

Yes, the settings I posted should work for you, at least partially.
If you are happy with the results, it's OK!
But please take this as a starting point, because it's not perfect.

> Or do I still need to do something with the SmartChineseAnalyzer?

Try the settings; then, if you notice something strange and want to know why
and how to solve it, that may be the time to dive deeper. ;)
I cannot explain how analyzers work here... but you should start off with
the Solr documentation:
https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html

Regards,
Tomoko
On Tue, Jul 24, 2018 at 21:08, Amanda Shuman <am...@gmail.com> wrote:
Hi Tomoko,

Thanks so much for this explanation - I did not even know this was
possible! I will try it out, but I have one question: is all I need to do
to change the settings from smartChinese to the ones you posted here:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Or do I still need to do something with the SmartChineseAnalyzer? I did not
quite understand this in your first message:

"I think you need two steps if you want to use HMMChineseTokenizer
correctly.
1. Transform all traditional characters to simplified ones and save them to
temporary files. I do not have a clear idea of how to do this, but you can
create a Java program that calls Lucene's ICUTransformFilter.
2. Then index to Solr using SmartChineseAnalyzer."

My understanding is that with the new settings you posted, I don't need to
do these steps. Is that correct? Otherwise, I don't really know how to do
step 1 with the Java program....

Thanks!
Amanda
------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925
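[Editor's note] For readers curious what "step 1" of the quoted two-step pipeline might look like in practice, here is a minimal sketch using ICU4J's Transliterator directly (the component that Lucene's ICUTransformFilter wraps). This is an illustration, not the thread authors' code: the class name TradToSimp and the file-argument convention are invented here, and ICU4J must be on the classpath. It requires Java 11+ for Files.readString/writeString.

```java
import com.ibm.icu.text.Transliterator;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TradToSimp {
    public static void main(String[] args) throws IOException {
        // Obtain ICU's built-in Traditional -> Simplified Chinese transform.
        Transliterator t2s = Transliterator.getInstance("Traditional-Simplified");

        // args[0] = input file (traditional), args[1] = output file (simplified).
        String traditional = Files.readString(Path.of(args[0]), StandardCharsets.UTF_8);
        String simplified = t2s.transliterate(traditional);
        Files.writeString(Path.of(args[1]), simplified, StandardCharsets.UTF_8);
    }
}
```

The converted files could then be indexed with the SmartChineseAnalyzer, as the quoted message describes; the inline analyzer settings Tomoko posted avoid this offline step entirely.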