
Posted to solr-user@lucene.apache.org by Floyd Wu <fl...@gmail.com> on 2011/10/21 07:33:10 UTC

Want to support "did you mean xxx" but is Chinese

Does anybody know how to implement this idea in Solr? Please kindly
point me in the right direction.

For example, a user wants to enter the Chinese keyword "貝多芬" (this is
Beethoven in Chinese)
but keys in the wrong combination of characters, "背多分" (this is
pronounced the same as the previous keyword "貝多芬").

The token "貝多芬" actually exists in the Solr index. How can I hit the
documents that contain "貝多芬" when "背多分" is entered?

This is a basic function of commercial search engines, especially for
Chinese processing. I wonder how to implement it in Solr and where
to start.

Floyd

Re: Want to support "did you mean xxx" but is Chinese

Posted by Ken Krugler <kk...@transpac.com>.

Hi Floyd,

Typically you'd do this by creating a custom analyzer that

 - segments Chinese text into words
 - converts the words to pinyin or zhuyin

Your index would have both the actual Hanzi characters and (via copyField) this phonetic version.

During search, you'd use dismax to search against both fields, with a higher weighting to the Hanzi field.

But segmentation can be error-prone, and high-quality results typically require embedding specialized code that you license from a commercial vendor.

So my first cut approach would be to use the current synonym support to map each Hanzi to all possible pronunciations. There are numerous open source datasets that contain this information. Note that there might be performance issues with having such a huge set of synonyms.

Then, by weighting phrase matches sufficiently high (again using dismax) I think you could get reasonable results.
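
A minimal sketch of the synonym-based first cut, for illustration only. It generates explicit-mapping lines in the Solr synonyms file format from a pronunciation table. The table here is a tiny hand-written sample and `synonym_lines` is a hypothetical helper name; a real table would come from one of the open source datasets and list every reading of every character. It also assumes the field is tokenized one character at a time, so that each Hanzi reaches the synonym filter as its own token.

```python
# Hypothetical sample data: toneless pinyin readings per character.
# A real table would be loaded from an open pronunciation dataset.
PRONUNCIATIONS = {
    "貝": ["bei"],
    "背": ["bei"],
    "多": ["duo"],
    "芬": ["fen"],
    "分": ["fen"],
    "長": ["chang", "zhang"],
}


def synonym_lines(table):
    """Yield one 'hanzi => hanzi, reading, ...' line per character.

    The character itself is kept on the right-hand side so that exact
    Hanzi matches still work alongside the phonetic tokens.
    """
    for hanzi, readings in sorted(table.items()):
        yield f"{hanzi} => {hanzi}, {', '.join(readings)}"


if __name__ == "__main__":
    for line in synonym_lines(PRONUNCIATIONS):
        print(line)
```

With such a file applied at both index and query time, "貝" and "背" would both emit the token "bei", which is what lets the misspelled query reach the indexed term.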

-- Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: Want to support "did you mean xxx" but is Chinese

Posted by Floyd Wu <fl...@gmail.com>.

Hi Li Li,

Thanks for your detailed explanation. My implementation is basically
similar to yours. I just wanted to know whether there is a better,
more complete solution. I'll keep trying and will share any improvements
I make with you and the community.

Any ideas or advice are welcome.

Floyd




Re: Want to support "did you mean xxx" but is Chinese

Posted by Li Li <fa...@gmail.com>.

We have implemented "did you mean" and prefix suggestion for Chinese.
However, our work is based on Solr 1.4 with many modifications, so it
would take time to integrate it into current Solr/Lucene.

Here is our solution. Any advice is welcome.

1. Offline word and phrase discovery.
   We discover new words and new phrases by mining query logs.

2. Online matching algorithm.
   Each word, e.g. 贝多芬, is converted to pinyin (bei duo fen) and then
   indexed as n-grams, i.e. gram3:bei gram3:eid ...
   To get a "did you mean" result, we convert the query 背朵分 into
   n-grams and run them as a boolean OR query. This returns many
   results, with the words whose pinyin is similar to the query ranked
   at the top.
   We then rerank the top 500 results with a fine-grained algorithm.
   We use edit distance to align the query and each result, and we also
   take the characters themselves into consideration. For example, for
   the query 十度 the matches are 十渡 and 是度. Their pinyin is exactly
   the same, but 十渡 is better than 是度 because 十 occurs in both the
   query and the match.
   You also need to consider the hotness (popularity) of different
   words/phrases, which can be learned from query logs.
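
The two-stage matching described above can be sketched roughly as follows. This is not the actual implementation from the post: `LEXICON`, its popularity numbers, and the function names are made up for illustration, the recall stage is a plain in-memory loop standing in for the boolean OR query against the index, and the rerank key (edit distance, then shared characters, then popularity) is only one plausible way to combine the signals mentioned.

```python
from collections import Counter

# Hypothetical lexicon: word -> (pinyin syllables, popularity from query logs).
LEXICON = {
    "贝多芬": (["bei", "duo", "fen"], 900),
    "十渡": (["shi", "du"], 300),
    "是度": (["shi", "du"], 5),
}


def trigrams(text):
    """Character trigrams of a pinyin string, e.g. 'beiduo' -> bei, eid, idu, duo."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def edit_distance(a, b):
    """Levenshtein distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(query, query_pinyin, top_n=500):
    """Return lexicon words ordered from best to worst suggestion."""
    q = "".join(query_pinyin)
    q_grams = Counter(trigrams(q))

    # Stage 1: coarse recall. Any word sharing at least one trigram with the
    # query is a candidate, scored by the number of shared trigrams.
    coarse = []
    for word, (syllables, _hotness) in LEXICON.items():
        w_grams = Counter(trigrams("".join(syllables)))
        overlap = sum((q_grams & w_grams).values())
        if overlap:
            coarse.append((overlap, word))
    coarse.sort(reverse=True)

    # Stage 2: fine-grained rerank of the top candidates.
    def fine_key(word):
        syllables, hotness = LEXICON[word]
        distance = edit_distance(q, "".join(syllables))
        shared = len(set(query) & set(word))  # characters in both strings
        return (distance, -shared, -hotness)

    return sorted((w for _, w in coarse[:top_n]), key=fine_key)
```

For example, `suggest("背朵分", ["bei", "duo", "fen"])` puts 贝多芬 first because its pinyin is identical to the query's. The pinyin for the query is passed in by the caller here, since converting Hanzi to pinyin is a separate problem.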

Another issue is converting Chinese into pinyin, because some
characters have more than one pinyin reading.
For example, take 长沙 and 长大: the pinyin of 长 is chang in 长沙 (but
zhang in 长大). You should therefore segment the query and the
words/phrases first. Word segmentation is a basic problem in Chinese IR.
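
One simple way to handle such polyphonic characters, shown here only as a sketch, is to look up whole phrases before falling back to single characters. Both dictionaries and the `to_pinyin` helper are hypothetical stand-ins, and the greedy longest-match loop below is a deliberately naive substitute for a real word segmenter.

```python
# Hypothetical dictionaries. A real system would load a large phrase
# lexicon and a full per-character pronunciation table.
PHRASE_PINYIN = {
    "长沙": ["chang", "sha"],
    "长大": ["zhang", "da"],
}
CHAR_PINYIN = {
    "长": ["chang", "zhang"],
    "沙": ["sha"],
    "大": ["da"],
}


def to_pinyin(text, max_len=4):
    """Convert text to pinyin syllables using greedy longest phrase match."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest phrase starting at position i first (length >= 2).
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in PHRASE_PINYIN:
                result.extend(PHRASE_PINYIN[chunk])
                i += size
                break
        else:
            # No phrase matched: use the first listed reading of the single
            # character, or keep the character itself if it is unknown.
            result.append(CHAR_PINYIN.get(text[i], [text[i]])[0])
            i += 1
    return result
```

With these sample tables, 长沙 resolves 长 to "chang" and 长大 resolves it to "zhang", while a bare 长 falls back to the first listed reading.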

