Posted to user@nutch.apache.org by charlie w <sp...@gmail.com> on 2007/11/27 01:04:04 UTC

Problems with mixed English/Russian page

I have crawled a page with both English and Russian (I think) content
into my index but can't seem to get search results when using a
Russian search term.

The page is: http://englishrussia.com/?p=845

The search term is: воды

The term appears in one of the comments ('Comment by Henry').

I've dumped the segment in which the page content is stored, and the
correct UTF-8 characters are stored there so it seems the fetch was
fine.

This is, of course, only an example; I've had similar results with
different terms and other similar pages.  I don't know Russian, but
have tried enough different words that I think I am not using the
equivalent of "the" as a search term.  I had been having issues with
character encodings in the servlet, but seem to have worked those out.
After adding some extra logging to the search servlet, as far as I can
tell the Query object built by the parser is correct.
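
To illustrate the kind of servlet encoding problem I mean, here is a
minimal standalone sketch. It is not Nutch or Lucene code; the class and
method names (EncodingCheck, misdecode, repair, codePoints) are made up
for this example. It assumes the typical failure, where UTF-8 request
bytes are decoded as ISO-8859-1, and shows how logging the code points
of the term makes that visible.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
public class EncodingCheck {

    // Simulates the assumed failure: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses that mistake: recover the raw bytes, then decode as UTF-8.
    static String repair(String mangled) {
        byte[] raw = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Formats each code point as U+XXXX so the term can be checked in a log.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String term = "\u0432\u043e\u0434\u044b"; // the search term from this post
        String mangled = misdecode(term);

        System.out.println("original: " + codePoints(term));
        System.out.println("mangled:  " + codePoints(mangled));
        System.out.println("repaired: " + codePoints(repair(mangled)));
    }
}
```

If the term logged inside the servlet shows code points in the U+0400
range, it arrived as Cyrillic; a run of code points below U+0100 would
suggest it was decoded with the wrong charset.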

Can Nutch support this kind of searching in mixed-language content, or
is the problem with Lucene?  How can I make this work?

Thanks.