Posted to java-user@lucene.apache.org by Eric Chow <er...@gmail.com> on 2005/04/11 11:54:53 UTC

Urgent, please help Index/Search in UTF-8 ???

Hello,


I am a beginner with Lucene.


My files contain several languages (English, Chinese, Portuguese,
Japanese, and other Asian, non-Latin languages), often mixed within a
single file. Therefore, I have to save the contents as UTF-8.

I am now developing a web-based search engine. I use Lucene to index
those files and search the index from a web page. The charset of the
web page is UTF-8, but searches return nothing.

I tried several analyzers (CJKAnalyzer, ChineseAnalyzer,
StandardAnalyzer, SimpleAnalyzer), but they all failed.

Finally, I tested with the original charsets. For example, with the
Chinese contents saved as BIG5, I can search them very well. English,
of course, is no problem.

But I can't use UTF-8 as the charset for the documents. Any suggestions
or examples?


Best regards,
Eric

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Urgent, please help Index/Search in UTF-8 ???

Posted by Zilverline info <in...@zilverline.org>.

For instance, look at http://www.zilverline.org/zilverlineweb/space/faq

Michael



Re: Urgent, please help Index/Search in UTF-8 ???

Posted by Karl Øie <ka...@gan.no>.

If you use a servlet and an HTML form to feed queries to the
QueryParser, check every encoding setting around the servlet container.
If you, like me, use Tomcat, you might have to re-decode the query as
UTF-8 before you pass it to Lucene, because the container may have
decoded the request bytes with a different charset (ISO-8859-1 is the
default in older Tomcat versions).


Read this:

http://www.crazysquirrel.com/compgen/form-encoding.php


Then, in your receiving servlet:

// Assumes the container decoded the parameter as ISO-8859-1.
String raw = request.getParameter("query");

// Recover the bytes the browser sent and decode them as UTF-8.
String query_string = new String(raw.getBytes("ISO-8859-1"), "UTF-8");

Then pass query_string to Lucene. This way the string fetched by
getParameter() is decoded with the right encoding instead of the
container's default.
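
As a rough illustration of the re-decoding step described above, here is
a minimal sketch that runs without a servlet container. It assumes the
container decoded UTF-8 bytes as ISO-8859-1; the class name QueryRecoder
and the method recode are hypothetical, and StandardCharsets needs Java 7
or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the class and method names are illustrative only.
final class QueryRecoder {

    // Undo a wrong ISO-8859-1 decoding of bytes that were really UTF-8.
    static String recode(String misdecoded) {
        if (misdecoded == null) {
            return null; // parameter was absent from the request
        }
        // ISO-8859-1 maps chars 0..255 back to the same byte values,
        // so this recovers the bytes that were originally sent.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the container: UTF-8 bytes wrongly read as ISO-8859-1.
        String typed = "\u4e2d\u6587 caf\u00e9";
        String misdecoded = new String(
                typed.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        // Compare instead of printing, to avoid console encoding issues.
        System.out.println(recode(misdecoded).equals(typed));
    }
}
```

This only helps when the parameter really was mis-decoded as ISO-8859-1;
applied to a string that is already correct, it will damage any
non-ASCII characters. For POST requests, calling
request.setCharacterEncoding("UTF-8") before the first getParameter()
call may make the re-decoding unnecessary, depending on the container.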

Hope this helps!

Regards, Karl Øie

- ...I wonder if the really nerdy Klingons learn how to speak english?

