Posted to user@nutch.apache.org by Daniel López <D....@uib.es> on 2006/12/11 15:03:58 UTC

Nutching different languages and encodings

Hi,

Now that I can search and display the content type and other fields, I
have run into a new problem: my web pages, encoded in ISO-8859-1, are
not indexed properly. The summaries and titles are missing the
non-ASCII characters (the ones whose ISO-8859-1 bytes are not valid
UTF-8).

I tried specifying the property
*******************************************************************
<property>
    <name>parser.character.encoding.default</name>
    <value>ISO-8859-1</value>
</property>
*******************************************************************
but it made no difference.
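
To show the symptom outside Nutch, here is a minimal Python sketch of
what I assume is happening: the ISO-8859-1 bytes get decoded as UTF-8
somewhere along the way. This is only my guess at the cause, I have not
traced it in the Nutch code, and the sample word "Català" is just an
illustration.

```python
# Illustration only: "Català" stands in for any page text that
# contains a non-ASCII character.
text = "Català"

# The bytes as they would be served for an ISO-8859-1 page.
latin1_bytes = text.encode("iso-8859-1")   # b'Catal\xe0'

# Decoding with the wrong charset: the lone byte 0xE0 is not valid
# UTF-8, so the character is either replaced or dropped, depending
# on the error handling.
as_utf8_replace = latin1_bytes.decode("utf-8", errors="replace")
as_utf8_ignore = latin1_bytes.decode("utf-8", errors="ignore")

# Decoding with the right charset round-trips cleanly.
as_latin1 = latin1_bytes.decode("iso-8859-1")

print(repr(as_utf8_replace))
print(repr(as_utf8_ignore))
print(repr(as_latin1))
```

The "ignore" case matches what I see in the summaries and titles: the
accented characters simply disappear.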

On a related note, the "language-identifier" plugin has identified my
documents correctly, and I can see the "lang" detail on the hits.
However, when I try to restrict a search to documents in one given
language, I cannot get the query to pick up the language.
I tried the same syntax used to search the documents from one site
("site:my.site.com criteria"), with lang, language, Language... as the
field name, but nothing works, and the logs show:
**************
061211 145357 10 query: lang:ca CRUE
061211 145357 10 Language: null <<<<<<< here it should read ca??
061211 145357 10 searching for 20 raw hits
**************
I browsed the documentation and searched the web, but I could not find
explicit information on how to build a query that uses that field, even
though I know the documents are properly indexed.

Any hints on either issue?

Thanks in advance,
D.