Posted to user@nutch.apache.org by J B <be...@hotmail.com> on 2005/05/30 19:46:02 UTC

Searching with Ö and Ä?

Hello,

Is there anyone who can help me configure Nutch so that I can use it for
Swedish or German websites containing characters like "ö" and "ä"? Crawling
and indexing seem to work fine; it's just the searching that goes wrong.
When I enter a search string like "Köln", knowing that it appears in the
text, the results page says that there are no matching results, and the "ö"
is replaced by random characters...

I have searched the docs and the web, but I can't find the answer to my 
problem.

Best regards,

Jon

P.S. Sorry if two versions of this message reached the list, I am quite new 
to this...



Re: Searching with Ö and Ä?

Posted by Andrzej Bialecki <ab...@getopt.org>.
J B wrote:
> Hello,
> 
> Is there anyone who can help me configure Nutch so that I can use it for
> Swedish or German websites containing characters like "ö" and "ä"?
> Crawling and indexing seem to work fine; it's just the searching that
> goes wrong. When I enter a search string like "Köln", knowing that it
> appears in the text, the results page says that there are no matching
> results, and the "ö" is replaced by random characters...
> 
> I have searched the docs and the web, but I can't find the answer to my 
> problem.


The characters are not random: they correspond to a URL-encoding of the
UTF-8 encoding of Latin-1 characters, whereas they should be a
URL-encoding of the UTF-8 encoding of UTF-8 characters.

;-)

For the US-ASCII range both of the above give the same result, but for
all other characters the first one gives wrong results.
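
To make the mismatch concrete, here is a minimal Java sketch. It assumes
the browser sends the query as percent-encoded UTF-8 and the receiving side
decodes it as ISO-8859-1 (Latin-1); that is one common way to end up with
this symptom, not necessarily the exact path in any given setup. The class
and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryEncodingDemo {
    // Percent-encode a query term using the UTF-8 bytes of each character.
    public static String encode(String term) {
        try {
            return URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decode a percent-encoded term, interpreting the bytes in the given charset.
    public static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sent = encode("K\u00f6ln");            // "Köln" becomes K%C3%B6ln
        String wrong = decode(sent, "ISO-8859-1");    // two Latin-1 chars instead of "ö"
        String right = decode(sent, "UTF-8");         // the original term again
        System.out.println(sent);
        System.out.println(wrong.length() + " chars vs " + right.length() + " chars");
    }
}
```

A pure ASCII term such as "Nutch" decodes identically either way, which is
why the problem only appears for characters like "ö" and "ä".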

Please make sure that you set the page encoding to UTF-8 in your JSPs and
HTML files, and preferably also set UTF-8 as the default character
encoding somewhere in the configuration of your servlet engine. As the old
hands say: "choose UTF-8 and stick to it religiously".

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Searching with Ö and Ä?

Posted by Chirag Chaman <de...@filangy.com>.
Jon,

You'll need to set encoding to UTF-8. 
We don't use the default Nutch JSP pages, so I'm not sure if they have it or
not, but here's the simplified process.

1. Make sure your JSP files have something like this at the top:
<%@ page contentType="text/html; charset=utf-8" pageEncoding="utf-8" %>

2. The Connector in your Tomcat server.xml should set URIEncoding="UTF-8":
     <Connector port="80"
               maxThreads="250" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               connectionTimeout="15000" disableUploadTimeout="180000"
               URIEncoding="UTF-8" useBodyEncodingForURI="false" />

This should take care of it. 
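
If changing server.xml is not an option, a commonly used fallback is to
re-decode the parameter in application code. The sketch below assumes the
container handed you a string that was decoded as ISO-8859-1 from bytes
that were really UTF-8; the class and method names are hypothetical, not
part of Nutch or Tomcat.

```java
import java.nio.charset.StandardCharsets;

public class ParamRepair {
    // Recover the original bytes via Latin-1, then decode them as UTF-8.
    // Only valid when the input really was UTF-8 mis-decoded as ISO-8859-1.
    public static String redecodeAsUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String broken = "K\u00c3\u00b6ln";   // what a Latin-1 decode of K%C3%B6ln yields
        String fixed = redecodeAsUtf8(broken);
        System.out.println(fixed.equals("K\u00f6ln")); // prints "true"
    }
}
```

Applying this to a string that was already decoded correctly will corrupt
it, so fixing the configuration as described above remains the cleaner
approach.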

Regards,
CC

--------------------------------------------
Filangy, Inc.
Interested in Improving Search? Join our Team!
http://filangy.com/jointheteam.jsp 


