Posted to user@nutch.apache.org by Daniel López <D....@uib.es> on 2006/12/11 15:03:58 UTC
Nutching different languages and encodings
Hi,
Now that search works and shows the content type, etc., I have run
into another problem: my web pages, encoded in ISO-8859-1, are not
indexed "properly". The summaries and titles are missing the non-ASCII
characters (accented letters and the like).
I tried setting the following property:
*******************************************************************
<property>
  <name>parser.character.encoding.default</name>
  <value>ISO-8859-1</value>
</property>
*******************************************************************
but it made no difference.
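To illustrate the symptom, here is a minimal sketch in plain Python (it is not Nutch code, and the sample word is a made-up example rather than text from my pages). It assumes the cause is that ISO-8859-1 bytes get decoded as UTF-8 somewhere along the way, which I have not confirmed. Whether the affected characters are dropped or replaced depends on the decoder; this one replaces them.

```python
# Hypothetical sample text, not taken from the actual pages.
text = "Català"

# The bytes as a server would send them for an ISO-8859-1 page.
raw = text.encode("iso-8859-1")

# Decoding with the wrong charset: the byte 0xE0 ("à" in ISO-8859-1)
# is not a valid UTF-8 sequence here, so it becomes U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the page was actually written in.
right = raw.decode("iso-8859-1")

print(repr(wrong))
print(repr(right))
```

If the parser really used ISO-8859-1 as its default, I would expect the second result, which is why the property above looked like the right thing to try.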
On a related note, my documents are correctly identified by the
"language-identifier" plugin, and the "lang" detail shows up on the
hits. However, when I try to restrict a search to documents in one
given language, I cannot get the query to recognise which language I
mean.
I tried the same syntax used to search within one site,
"site:my.site.com criteria", with lang, language, Language... as the
field name, but nothing works, and the logs show:
**************
061211 145357 10 query: lang:ca CRUE
061211 145357 10 Language: null <<<<<<< here it should read ca??
061211 145357 10 searching for 20 raw hits
**************
I browsed the documentation and searched the web, but could not find
explicit information on how to build a query that uses that field,
now that I know the documents are indexed with it.
Any hints on either subject?
Thanks in advance,
D.