Posted to user@nutch.apache.org by charlie w <sp...@gmail.com> on 2007/11/14 18:28:44 UTC

results display for languages other than English

I am doing open-web crawls that include a number of pages that are not
in English, or at least have content on them that is not in English.

The specific case is a blog post in English with a comment or
two in Russian.  Crawling, indexing and searching all seem to work
fine.  That is, I can put some Russian characters into the search box
and get appropriate looking results back.  (I don't speak Russian or
have any clue what the characters are; somebody else here in the
office gave me the search term.)

But the Nutch results page displays weird characters for the page
summary excerpt.  I can click through to the resulting page, and the
Russian characters are correctly displayed there.

I am using Firefox 2.0.0.9, set to Unicode (UTF-8) encoding for
display.  I've switched the encoding around, but can't get the page to
look right.
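
The symptom above is consistent with, though not proof of, UTF-8 bytes
being decoded under a single-byte charset somewhere between the index
and the browser.  A minimal Java sketch of that kind of mismatch,
assuming ISO-8859-1 as the wrong charset (a guess, not something I have
confirmed in Nutch); the class name, method names and sample word are
hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    // This is the assumed failure mode, not a confirmed Nutch code path.
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the wrong decode: ISO-8859-1 maps every byte to one char,
    // so the original UTF-8 bytes can be recovered and decoded properly.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A six-letter Russian word, written with escapes so the source
        // file's own encoding does not matter.
        String original = "\u043f\u0440\u0438\u0432\u0435\u0442";
        String garbled = garble(original);

        // Each Cyrillic letter is two bytes in UTF-8, so the garbled
        // string has twice as many characters as the original.
        System.out.println(original.length() + " -> " + garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

If the excerpt text in the results page looks like the output of
garble(), with roughly two odd characters per Russian letter, that would
point at a decoding step rather than at the index contents.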

I've searched the list, and it seems that language concerns revolve
around stemming and the like, which is not the problem I have here.

Is there some sort of configuration knob I can turn on the search
page?  Is it possible to detect result character sets on the fly and
"do the right thing" on the results page?  Is there any kind of
documentation I can consult about support for this kind of thing in
Nutch?

Thanks