Posted to user@nutch.apache.org by Árni Hermann Reynisson <ar...@hugsmidjan.is> on 2007/06/15 12:46:47 UTC

URLs and encoding problems

Greetings

I've been using Nutch to crawl and index a rather large and complex website. I
discovered that some of the linked PDF files didn't come up when searching
for keywords that should have matched them.

I did some digging and found that the cause is the URLs of the PDF files. Some
of them contain whitespace and even characters like "ó", "ý", "æ", "þ" or "ö",
none of which are percent-encoded, which somehow causes Nutch to fail to fetch
the document with either the http or the httpclient plugin.
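
To illustrate what "encoded properly" would mean here, below is a minimal
sketch of percent-encoding such a URL in plain Java. The class name
UrlEscaper and the example.com URL are made up for illustration and are not
part of Nutch. It assumes the server expects UTF-8 bytes in the path (a server
expecting ISO-8859-1 would need a different charset), and it leaves "%" alone
so that links which are already encoded are not encoded twice.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Nutch class: percent-encodes whitespace and
// non-ASCII characters in a URL while leaving URL syntax characters intact.
public final class UrlEscaper {
    private UrlEscaper() {}

    // Punctuation passed through unchanged: the RFC 3986 unreserved and
    // reserved characters, plus '%' (assumption: an existing '%' starts a
    // valid escape sequence and must not be encoded again).
    private static final String SAFE = "-._~:/?#[]@!$&'()*+,;=%";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    public static String escape(String url) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        // Work on UTF-8 bytes so each non-ASCII character becomes one
        // %XX sequence per byte.
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum || SAFE.indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append('%').append(HEX[c >> 4]).append(HEX[c & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00f6 is "ö"; written as an escape to avoid source-encoding issues.
        System.out.println(escape("http://example.com/skj\u00f6l/my file.pdf"));
        // prints http://example.com/skj%C3%B6l/my%20file.pdf
    }
}
```

Whether something like this belongs inside Nutch or has to be done by whoever
publishes the links is exactly what I am unsure about.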

Do you know whether there is a solution to this problem on Nutch's side, or
whether I need to take measures myself, either by "fixing" this in Nutch or by
trying to get people to properly encode every URL that is linked to on the web?

Best regards,
Árni Hermann Reynisson
arni@hugsmidjan.is