Posted to user@nutch.apache.org by adu <du...@hzduozhun.com> on 2014/08/20 10:54:05 UTC

Nutch 1.7 content encoding problem

Hi all,
I want to crawl a JSON file from a URL.

When I fetch it with "wget url", the Chinese characters in the resulting
file are wrongly encoded. After I run "iconv -f gbk -t utf-8 file.json",
I get the correct result.
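
For readers who want to reproduce the conversion step outside the shell, here is a minimal Python sketch of what "iconv -f gbk -t utf-8" does. The function name gbk_to_utf8 and the sample string are assumptions for illustration; they do not come from the original file.

```python
def gbk_to_utf8(data: bytes) -> bytes:
    """Re-encode GBK bytes as UTF-8, like `iconv -f gbk -t utf-8`."""
    return data.decode("gbk").encode("utf-8")


# Hypothetical sample: Chinese text as a GBK server would send it.
sample_text = "中文"
gbk_bytes = sample_text.encode("gbk")

# Reading the GBK bytes as if they were UTF-8 gives garbled output.
garbled = gbk_bytes.decode("utf-8", errors="replace")

# Converting first restores the original text.
utf8_bytes = gbk_to_utf8(gbk_bytes)
restored = utf8_bytes.decode("utf-8")
```

The point is that the raw bytes are intact and only the assumed charset is wrong, which is why a plain re-encoding fixes the wget output.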

Then I crawl the same URL with Nutch and use the readseg dump to inspect
the result. The ParseText part contains the correct JSON, but the Content
part has wrongly encoded characters, and this time 'iconv' does not fix
them.
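
One possible explanation, which the post does not confirm, is that the dump step decodes the raw GBK bytes with a different charset before writing them out. If undecodable bytes are replaced during that step, the original bytes are lost and no later iconv call can recover them. The sketch below only illustrates that general effect in Python; the sample string and the choice of UTF-8 as the wrong charset are assumptions, and it is not a description of Nutch internals.

```python
# Hypothetical sample: Chinese text encoded as GBK, as fetched from the server.
sample_text = "中文"
gbk_bytes = sample_text.encode("gbk")

# Assumed faulty step: bytes decoded with the wrong charset, with invalid
# sequences replaced by U+FFFD, then written out again as UTF-8.
wrongly_decoded = gbk_bytes.decode("utf-8", errors="replace")
dumped_bytes = wrongly_decoded.encode("utf-8")

# Trying to "convert from GBK" afterwards cannot restore the text,
# because the original byte values are no longer present.
recovered = dumped_bytes.decode("gbk", errors="replace")
```

If this is what happens, the fix has to be applied where the bytes are first decoded, not to the dumped file.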

Is this problem caused by "parser.character.encoding.default" or by the
local environment? How can I fix it?


Thanks.