Posted to solr-user@lucene.apache.org by waynelam <wa...@ln.edu.hk> on 2011/06/20 11:41:27 UTC

Searching in Traditional / Simplified Chinese Record

Hi,

  I've recently changed my schema.xml to support importing Chinese
records.
What I want to do is search both Traditional Chinese (TC) (e.g. ??)
and Simplified Chinese (SC) (e.g. ??) records
with the same query. I know I can do that by converting all SC records
to TC, but I would rather change the way the records are indexed
than change the records themselves.

I would much appreciate it if anyone could show me how.


Thanks

Wayne


-- 
-----------------------------------------
Wayne Lam
Assistant Library Officer I
Systems Development & Support
Fong Sum Wood Library
Lingnan University
8 Castle Peak Road
Tuen Mun, New Territories
Hong Kong SAR
China
Phone:   +852 26168585
Email:   waynelam@ln.edu.hk
Website: http://www.library.ln.edu.hk


Re: Searching in Traditional / Simplified Chinese Record

Posted by waynelam <wa...@ln.edu.hk>.
By "changing the record", I mean translating the records word by word using software.
Sorry, I am new to this kind of modification. For the synonyms filter, wouldn't
it need a big table and degrade indexing performance?

I have tried using a filter like ICUTransformFilterFactory, but it does not
seem to work:

<analyzer type="index" class="org.apache.lucene.analysis.cjk.CJKAnalyzer">
  <tokenizer class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
  <filter class="solr.ISOLatin1AccentFilterFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Am I setting it up wrong?


Regards,

Wayne
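
Two things in the analyzer above may be worth checking; this is a hedged sketch, not a confirmed diagnosis. When an <analyzer> element carries a class attribute, Solr instantiates that analyzer class directly, and in Solr 3.x the nested <tokenizer> and <filter> elements then appear to be ignored, which would explain why the ICU transform never runs. The <tokenizer> entry also names a Lucene tokenizer class where Solr expects a factory class. The fragment below assumes Solr 3.x with the analysis-extras (ICU) contrib jars on the classpath; the field type name text_zh is hypothetical, and the English-oriented filters are omitted for brevity.

```xml
<!-- Hypothetical field type name; assumes the analysis-extras (ICU) jars are on the classpath. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <!-- No class attribute on <analyzer>, so the tokenizer and filter chain below is what runs. -->
  <!-- No type attribute either, so the same chain applies at index time and at query time. -->
  <analyzer>
    <!-- A tokenizer factory, not the Lucene tokenizer class itself. -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Fold Traditional Chinese characters to Simplified so TC and SC text yield the same terms. -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```

Because the same chain runs on queries, a TC query would be folded to SC before matching, so either script should find both kinds of record. Existing documents need to be reindexed after the analyzer changes.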



On 6/21/2011 2:30 AM, François Schiettecatte wrote:
> Wayne
>
> I am not sure what you mean by 'changing the record'.
>
> One option would be to implement something like the synonyms filter to generate the TC for SC when you index the document, which would index both the TC and the SC in the same location. That way your users would be able to search with either TC or SC.
>
> Another option would be to use the same synonyms filter but do the expansion at search time.
>
> Cheers
>
> François




Re: Searching in Traditional / Simplified Chinese Record

Posted by François Schiettecatte <fs...@gmail.com>.
Wayne

I am not sure what you mean by 'changing the record'.

One option would be to implement something like the synonyms filter to generate the TC for SC when you index the document, which would index both the TC and the SC in the same location. That way your users would be able to search with either TC or SC.

Another option would be to use the same synonyms filter but do the expansion at search time.
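
The two synonym options can be sketched as a schema fragment. This is only an illustration under stated assumptions: the field type name text_zh_syn and the file name tc_sc_synonyms.txt are hypothetical, and the entries in the file must match the tokens the chosen tokenizer actually emits. The sketch uses solr.StandardTokenizerFactory and shows the first option, expansion at index time.

```xml
<!-- Hypothetical names: text_zh_syn and tc_sc_synonyms.txt are placeholders, not from this thread. -->
<fieldType name="text_zh_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- Option 1: expand at index time, so the TC and SC forms are indexed at the same position. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="tc_sc_synonyms.txt" ignoreCase="false" expand="true"/>
  </analyzer>
  <!-- Query side left plain. For option 2, move the synonym filter here instead. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Each line of the synonyms file lists equivalent forms separated by commas, for example a Traditional character and its Simplified counterpart: 圖, 图. Index-time expansion makes the index larger but keeps queries simple; query-time expansion avoids reindexing when the table changes.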

Cheers

François

