Posted to user@nutch.apache.org by 钟逊 <kk...@gmail.com> on 2014/03/02 07:45:53 UTC

how to crawl files named in chinese characters (nutch 1.7)

Hi everyone, I am testing Nutch 1.7 on a Chinese website and have run into a problem. A few .doc files have names in Chinese characters, and Nutch cannot fetch them.
To find out why, I tried parsechecker, and it turns out Nutch re-encodes my URL: it adds "25" after every "%", so the fetch fails with a 404:
bin/nutch parsechecker http://yanjiusheng.bistu.edu.cn/files/%E5%85%B3%E4%BA%8E2013%E5%B9%B412%E6%9C%88%E7%A0%94%E7%A9%B6%E7%94%9F%E8%8B%B1%E8%AF%AD%E5%9B%9B%E5%85%AD%E7%BA%A7%E8%80%83%E8%AF%95%E6%8A%A5%E5%90%8D%E7%9A%84%E9%80%9A%E7%9F%A5.doc
fetching: http://yanjiusheng.bistu.edu.cn/files/%25E5%2585%25B3%25E4%25BA%258E2013%25E5%25B9%25B412%25E6%259C%2588%25E7%25A0%2594%25E7%25A9%25B6%25E7%2594%259F%25E8%258B%25B1%25E8%25AF%25AD%25E5%259B%259B%25E5%2585%25AD%25E7%25BA%25A7%25E8%2580%2583%25E8%25AF%2595%25E6%258A%25A5%25E5%2590%258D%25E7%259A%2584%25E9%2580%259A%25E7%259F%25A5.doc
Fetch failed with protocol status: notfound(14), lastModified=0: http://yanjiusheng.bistu.edu.cn/files/%25E5%2585%25B3%25E4%25BA%258E2013%25E5%25B9%25B412%25E6%259C%2588%25E7%25A0%2594%25E7%25A9%25B6%25E7%2594%259F%25E8%258B%25B1%25E8%25AF%25AD%25E5%259B%259B%25E5%2585%25AD%25E7%25BA%25A7%25E8%2580%2583%25E8%25AF%2595%25E6%258A%25A5%25E5%2590%258D%25E7%259A%2584%25E9%2580%259A%25E7%259F%25A5.doc
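
To illustrate the symptom in the log above: percent-encoding a path that is already percent-encoded escapes each "%" as "%25". The following is a minimal Python sketch, not Nutch code. The shortened file name and the helper looks_double_encoded are hypothetical, chosen only for illustration.

```python
import re
from urllib.parse import quote, unquote

# Shortened, illustrative file name (an assumption, not the full path):
# "%E5%85%B3%E4%BA%8E" is the UTF-8 percent-encoding of "关于",
# the first two characters of the file name in the log above.
encoded_once = "%E5%85%B3%E4%BA%8E2013.doc"

# Encoding a second time escapes every "%" as "%25".
encoded_twice = quote(encoded_once, safe="")


def looks_double_encoded(path: str) -> bool:
    """Hypothetical heuristic: '%25' followed by two hex digits."""
    return re.search(r"%25[0-9A-Fa-f]{2}", path) is not None


print(encoded_twice)
print(unquote(encoded_twice) == encoded_once)
print(looks_double_encoded(encoded_once), looks_double_encoded(encoded_twice))
```

Decoding the doubly encoded form once gives back the original encoded path, which is why the server sees a different, nonexistent file name when the extra layer is left in place.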


My system is Ubuntu 12.04 LTS, 32-bit.

How can I fix this? Please help, thanks!



钟逊

Re: how to crawl files named in chinese characters (nutch 1.7)

Posted by feng lu <am...@gmail.com>.

I could not download the content at the URL you provided; the content does
not exist. But when I tried another URL from the same website, it succeeded:

bin/nutch plugin protocol-httpclient org.apache.nutch.protocol.httpclient.Http http://yanjiusheng.bistu.edu.cn/files/%E6%80%9D%E6%94%BF%E5%B7%A5%E4%BD%9C/%E6%AC%A3%E8%B5%8F%E5%88%AB%E6%A0%B7%E7%BE%8E%E6%99%AF%EF%BC%8C%E5%A2%9E%E8%BF%9B%E4%B8%AD%E6%BE%B3%E6%84%9F%E6%83%85%E2%80%94%E2%80%94%E7%A0%94%E7%A9%B6%E7%94%9F%E9%83%A8%E7%BB%84%E7%BB%87%E6%BE%B3%E5%A4%A7%E5%88%A9%E4%BA%9A%E6%9D%A5%E8%AE%BF%E5%9B%A2%E7%95%85%E6%B8%B8%E5%A5%A5%E6%9E%97%E5%8C%B9%E5%85%8B%E6%A3%AE%E6%9E%97%E5%85%AC%E5%9B%AD.doc






-- 
Don't Grow Old, Grow Up... :-)