Posted to solr-user@lucene.apache.org by Andy <an...@yahoo.com> on 2011/03/16 02:07:36 UTC

Tokenizing Chinese & multi-language search

Hi,

I remember reading on this list a while ago that Solr will only tokenize on whitespace, even when using CJKAnalyzer. That would make Solr unusable for Chinese or any other language that doesn't use whitespace as a separator.

1) I remember reading about a workaround. Unfortunately I can't find the post that mentioned it. Could someone give me pointers on how to address this issue?
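
For what it's worth, CJKAnalyzer does not split on whitespace; it emits overlapping bigrams for runs of CJK characters, which is easy to check directly at the Lucene level. A minimal sketch against the Lucene 3.x API (class names follow that release line; adjust the Version constant to your build):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class CjkTokenCheck {
        public static void main(String[] args) throws Exception {
            // CJKAnalyzer emits overlapping bigrams for runs of CJK text,
            // so it does not rely on whitespace between words.
            CJKAnalyzer analyzer = new CJKAnalyzer(Version.LUCENE_31);
            TokenStream ts = analyzer.tokenStream("text", new StringReader("我爱搜索"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Expected bigrams: 我爱, 爱搜, 搜索
                System.out.println(term.toString());
            }
            ts.close();
        }
    }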

2) Let's say I have fixed this issue and have properly analyzed and indexed the Chinese documents. My documents are in multiple languages. I plan to use separate fields for documents in different languages: text_en, text_zh, text_ja, text_fr, etc. Each field will be associated with the appropriate analyzer. 
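
The Lucene-level equivalent of this per-field setup is PerFieldAnalyzerWrapper; in Solr the same effect comes from declaring one field type per language in schema.xml. A sketch, assuming the Lucene 3.x API and the contrib analyzers (SmartChineseAnalyzer ships in the smartcn contrib; CJKAnalyzer as the Japanese choice is a stand-in, since it only produces bigrams):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class PerLanguageAnalyzers {
        public static Analyzer build() {
            // Any field not listed below falls back to StandardAnalyzer.
            PerFieldAnalyzerWrapper wrapper =
                    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_31));
            wrapper.addAnalyzer("text_en", new StandardAnalyzer(Version.LUCENE_31));
            wrapper.addAnalyzer("text_zh", new SmartChineseAnalyzer(Version.LUCENE_31));
            wrapper.addAnalyzer("text_ja", new CJKAnalyzer(Version.LUCENE_31));
            wrapper.addAnalyzer("text_fr", new FrenchAnalyzer(Version.LUCENE_31));
            return wrapper;
        }
    }
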
My problem now is how to deal with the query string. I don't know what language the query is in, so I won't be able to select the appropriate analyzer for it. If I just use the standard analyzer on the query string, any query in Chinese won't be tokenized correctly. Would the whole system still work in this case?
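
If the language really is unknown at query time, one fallback is to parse the query against every language field and OR the results together, letting per-field analysis handle each field in its own language. A sketch, assuming the Lucene 3.x MultiFieldQueryParser and the per-field wrapper sketched above (field names are the ones proposed in this thread):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class AllLanguagesQuery {
        // Parse the raw input against every language field; the per-field
        // wrapper makes each field analyze the query with its own analyzer,
        // and the per-field queries are combined as SHOULD clauses.
        public static Query parse(String userInput, Analyzer perFieldWrapper)
                throws Exception {
            String[] fields = {"text_en", "text_zh", "text_ja", "text_fr"};
            MultiFieldQueryParser parser =
                    new MultiFieldQueryParser(Version.LUCENE_31, fields, perFieldWrapper);
            return parser.parse(userInput);
        }
    }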

Handling multi-language search must be a pretty common use case. What is the recommended way of dealing with this problem?

Thanks.
Andy

Re: Tokenizing Chinese & multi-language search

Posted by Andy <an...@yahoo.com>.
Hi Otis,

It doesn't look like the last two options would work for me, so I guess my best bet is to ask the user to specify the language when they type in the query.

Once I get that information from the user, how do I dynamically pick an analyzer for the query string?
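
Since a Solr analyzer is bound to the field type rather than chosen per request, "picking an analyzer" at query time usually reduces to picking which field to search. A sketch of that routing, assuming the field names proposed earlier and a hypothetical text_en fallback:

    import java.util.HashMap;
    import java.util.Map;

    public class LanguageFieldRouter {
        private static final Map<String, String> FIELD_BY_LANG = new HashMap<String, String>();
        static {
            FIELD_BY_LANG.put("en", "text_en");
            FIELD_BY_LANG.put("zh", "text_zh");
            FIELD_BY_LANG.put("ja", "text_ja");
            FIELD_BY_LANG.put("fr", "text_fr");
        }

        // In Solr the analyzer is bound to the field type, so choosing an
        // analyzer per request amounts to choosing which field to query.
        // Escaping of query syntax characters is ignored here.
        public static String solrQueryParam(String lang, String userInput) {
            String field = FIELD_BY_LANG.get(lang);
            if (field == null) {
                field = "text_en"; // assumed default when the language is unknown
            }
            return field + ":(" + userInput + ")";
        }
    }

For example, solrQueryParam("zh", "我爱搜索") yields text_zh:(我爱搜索), which can be sent as the q parameter so the text_zh field's query analyzer does the tokenizing.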

Thanks

Andy


Re: Tokenizing Chinese & multi-language search

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Andy,

Is the "I don't know what language the query is in" something you could change 
by...
- asking the user
- deriving from HTTP request headers
- identifying the query language (if queries are long enough and "texty"; a rough sketch of this follows below)
- ...
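
A very rough first cut at the language-identification option: a query containing CJK ideographs can at least be routed to the Chinese/Japanese fields by checking Unicode blocks. This is only a heuristic, not a real language identifier (distinguishing, say, French from English would need a proper detection library):

    public class QueryLanguageGuess {
        // Heuristic only: true if the query contains CJK ideographs.
        public static boolean containsCjk(String query) {
            for (int i = 0; i < query.length(); i++) {
                Character.UnicodeBlock block = Character.UnicodeBlock.of(query.charAt(i));
                if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                        || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A) {
                    return true;
                }
            }
            return false;
        }
    }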

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


