Posted to user@nutch.apache.org by 钟逊 <kk...@gmail.com> on 2014/03/02 07:43:27 UTC

How to crawl files named in Chinese characters

Hi everyone, I am testing Nutch on a Chinese website and have run into a problem. A few .doc files have names in Chinese characters, and Nutch cannot fetch them.
To find out why, I ran parsechecker, and it turns out Nutch re-encodes my URL: it inserts "25" after every "%", so the request returns a 404:
bin/nutch parsechecker http://yanjiusheng.bistu.edu.cn/files/%E5%85%B3%E4%BA%8E2013%E5%B9%B412%E6%9C%88%E7%A0%94%E7%A9%B6%E7%94%9F%E8%8B%B1%E8%AF%AD%E5%9B%9B%E5%85%AD%E7%BA%A7%E8%80%83%E8%AF%95%E6%8A%A5%E5%90%8D%E7%9A%84%E9%80%9A%E7%9F%A5.doc
fetching: http://yanjiusheng.bistu.edu.cn/files/%25E5%2585%25B3%25E4%25BA%258E2013%25E5%25B9%25B412%25E6%259C%2588%25E7%25A0%2594%25E7%25A9%25B6%25E7%2594%259F%25E8%258B%25B1%25E8%25AF%25AD%25E5%259B%259B%25E5%2585%25AD%25E7%25BA%25A7%25E8%2580%2583%25E8%25AF%2595%25E6%258A%25A5%25E5%2590%258D%25E7%259A%2584%25E9%2580%259A%25E7%259F%25A5.doc
Fetch failed with protocol status: notfound(14), lastModified=0: http://yanjiusheng.bistu.edu.cn/files/%25E5%2585%25B3%25E4%25BA%258E2013%25E5%25B9%25B412%25E6%259C%2588%25E7%25A0%2594%25E7%25A9%25B6%25E7%2594%259F%25E8%258B%25B1%25E8%25AF%25AD%25E5%259B%259B%25E5%2585%25AD%25E7%25BA%25A7%25E8%2580%2583%25E8%25AF%2595%25E6%258A%25A5%25E5%2590%258D%25E7%259A%2584%25E9%2580%259A%25E7%259F%25A5.doc
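
To show what I think is happening, here is a minimal Python sketch of the double encoding. It is not Nutch code; the shortened path and the encode_once helper are made up for illustration, and I am only assuming the re-encoding works along these lines.

```python
from urllib.parse import quote, unquote

# Shortened, illustrative form of the already percent-encoded path
# (first few escapes of the real URL only).
encoded_path = "/files/%E5%85%B3%E4%BA%8E2013.doc"

# Percent-encoding a path that is already encoded escapes every "%"
# as "%25", which matches the pattern in the "fetching:" line above.
double_encoded = quote(encoded_path)


def encode_once(path: str) -> str:
    """Hypothetical guard: decode first, then encode exactly once."""
    return quote(unquote(path))


print(double_encoded)
print(encode_once(encoded_path))
```

With a guard like this, a path that is already encoded comes back unchanged, and a raw path with Chinese characters is encoded once.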


How can I fix this? Please help, thanks!




钟逊