Posted to user@nutch.apache.org by Thomas B <av...@gmail.com> on 2011/09/15 13:31:25 UTC

Handling URLs with non-ASCII characters

I've run into a small issue with my deployment of Nutch. Some of the sites I
crawl use non-ASCII characters such as æøå in their URLs, and these URLs
never seem to be parsed properly. Is there any way to get around this? I tried
adding the Unicode escapes (e.g. '\u00e5') to regex-normalize.xml, but I
suspect the characters are already mis-decoded by the time the page is
fetched, so the normalizer never sees them as e.g. U+00E5. Any suggestions
would be much appreciated.
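
To illustrate what I suspect is happening, here is a minimal standalone sketch
in Python (not Nutch code; the function names and the sample path are made up
for illustration). It shows how the same path percent-encodes differently
under UTF-8 and ISO-8859-1, and how UTF-8 bytes read as ISO-8859-1 no longer
contain U+00E5 at all, which would explain why a '\u00e5' rule never matches.

```python
from urllib.parse import quote


def percent_encode_path(path: str, encoding: str = "utf-8") -> str:
    """Percent-encode non-ASCII characters in a URL path, leaving '/' as-is."""
    return quote(path, safe="/", encoding=encoding)


def misdecode_utf8_as_latin1(text: str) -> str:
    """Simulate reading UTF-8 bytes with the wrong charset (ISO-8859-1)."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    path = "/æøå"  # hypothetical path, for illustration only

    # The same characters produce different escapes depending on the charset.
    print(percent_encode_path(path, "utf-8"))       # /%C3%A6%C3%B8%C3%A5
    print(percent_encode_path(path, "iso-8859-1"))  # /%E6%F8%E5

    # Once mis-decoded, 'å' becomes two other characters, so a rule
    # looking for U+00E5 has nothing to match.
    garbled = misdecode_utf8_as_latin1("å")
    print("\u00e5" in garbled)  # False
```

If that is indeed the cause, matching on the percent-encoded form, or fixing
the charset used when the links are extracted, seems more promising than
matching the raw characters.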