Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2010/11/11 20:35:23 UTC

Retrieving indexed content containing multiple languages

My Solr corpus is currently created by indexing metadata from a 
relational database as well as content pointed to by URLs from the 
database.  I'm using a pretty generic out-of-the-box Solr schema.  The 
search results are presented via an AJAX-enabled HTML page.

When I perform a search, the document title (for example) has a mix of 
English and Chinese characters.  Everything there is fine - I can see 
the English and Chinese returned from a facet query on title.  I can 
search against the title using English words it contains and I get back 
the expected result.  I asked a Chinese friend to perform the same search 
using Chinese and nothing is returned.

How should I go about getting this search to work?  Chinese is just one 
language; I'll probably need to support more in the future.

My thought is that the Chinese characters are indexed as their Unicode 
equivalent, so all I'll need to do is make sure the query is encoded 
appropriately and just perform a regular search as I would if the terms 
were in English.  For some reason that sounds too easy.

I see there is a CJK tokenizer that would help here.  Do I need that for 
my situation?  Is there a fairly detailed tutorial on how to handle 
these types of language challenges?


Thanks in advance - Tod

Re: Retrieving indexed content containing multiple languages

Posted by Tod <li...@gmail.com>.
On 11/11/2010 3:24 PM, Dennis Gearon wrote:
> I look forward to the answers to this one.

Well, it seems it was as easy as adding the CJKTokenizerFactory:

<fieldtype name="text_cjk" class="solr.TextField" 
positionIncrementGap="100">
  <analyzer>
   <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldtype>
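
For anyone wondering why a CJK-aware tokenizer makes a difference: as far as I understand it, the CJK tokenizer indexes runs of CJK characters as overlapping two-character tokens (bigrams), so a query for part of a title can match without spaces between words.  The snippet below is only a rough illustration of that idea in JavaScript, not Lucene's actual implementation; the function name cjkBigrams is made up for this example.

```javascript
// Rough illustration only: split a run of CJK characters into
// overlapping two-character tokens. This is NOT Lucene's code.
function cjkBigrams(text) {
  // Array.from iterates by code point rather than UTF-16 code unit.
  const chars = Array.from(text);
  if (chars.length === 0) return [];
  // A single character has no neighbour, so it is kept as-is.
  if (chars.length === 1) return [chars[0]];
  const grams = [];
  for (let i = 0; i + 1 < chars.length; i++) {
    grams.push(chars[i] + chars[i + 1]);
  }
  return grams;
}

console.log(cjkBigrams("中文搜索"));
```

With a four-character input this produces three overlapping pairs, which is why a two-character query term can line up with a token in the index.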


Once I did that and reindexed I could search for both English and 
Chinese using the default 'text' field.  The next hurdle was getting the 
JavaScript to cooperate.  The Chinese characters were getting corrupted 
on the way to the AJAX call against the Solr server.

As it turned out I was performing a POST to Solr using the jQuery .ajax 
API call.  Apparently, when executing a POST you need to make sure the 
characters entered into the input field of the form are converted to 
Unicode escape sequences (\u7968 for example) prior to the AJAX call to 
Solr.  Conversely, if executing a GET you need to percent-encode the 
characters as UTF-8 (%E7%A5%A8).
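
To make the two encodings concrete, here is a minimal sketch in plain JavaScript.  The helper names (toPercentEncodedUtf8, toUnicodeEscapes, buildSelectUrl) and the base URL in the usage line are assumptions for illustration only; encodeURIComponent is the standard built-in and always encodes as UTF-8.

```javascript
// Percent-encode a term as UTF-8, suitable for a GET query string.
function toPercentEncodedUtf8(term) {
  return encodeURIComponent(term);
}

// Replace every non-ASCII UTF-16 code unit with a \uXXXX escape.
function toUnicodeEscapes(term) {
  let out = "";
  for (let i = 0; i < term.length; i++) {
    const code = term.charCodeAt(i);
    out += code > 0x7f
      ? "\\u" + code.toString(16).padStart(4, "0")
      : term[i];
  }
  return out;
}

// Hypothetical helper: build a select URL for a GET request.
// The "/select" path and "wt=json" parameter are assumed here.
function buildSelectUrl(baseUrl, field, term) {
  const q = encodeURIComponent(field + ":" + term);
  return baseUrl + "/select?q=" + q + "&wt=json";
}

console.log(toPercentEncodedUtf8("票"));
console.log(toUnicodeEscapes("票"));
console.log(buildSelectUrl("http://localhost:8983/solr", "title", "票"));
```

Which form the server accepts for a POST may depend on the request's content type and the servlet container's character encoding settings, so treat this as a starting point for testing rather than a rule.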

So now my customers are happily finding the appropriate documents using 
English and Chinese.

If someone could check my math I would appreciate it.  If it looks 
reasonable and there is nothing else written about it on the wiki I'll 
create a tutorial to give everybody else a leg up.


- Tod





Re: Retrieving indexed content containing multiple languages

Posted by Dennis Gearon <ge...@sbcglobal.net>.
I look forward to the answers to this one.

 Dennis Gearon



