Posted to user@nutch.apache.org by adu <du...@hzduozhun.com> on 2014/08/20 10:54:05 UTC
Nutch 1.7 content encoding problem
Hi all,
I want to crawl a JSON file from a URL.
I ran "wget url" and found that the Chinese characters in the downloaded
file were garbled. Then I ran "iconv -f gbk -t utf-8 file.json" and got
the correct result.
Then I crawled the same URL with Nutch and used the readseg dump to
inspect the result. The ParseText part is the correct JSON, but
the Content part has garbled characters, and running 'iconv' on it
does not fix them.
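To make the symptom concrete, here is a minimal Python sketch of one possible reason iconv helps on the wget output but not on the dumped text. It rests on an assumption I have not confirmed in Nutch: that somewhere in the dump the raw GBK bytes are decoded with a different charset (UTF-8 here), and the undecodable bytes are replaced. The sample string is made up; it is not taken from my real file.

```python
# Hypothetical sample; the real JSON file is not shown in this post.
original = '{"city": "杭州"}'

# What wget saves: the text as raw GBK bytes.
gbk_bytes = original.encode("gbk")

# Equivalent of `iconv -f gbk -t utf-8`: decode as GBK, re-encode as UTF-8.
utf8_bytes = gbk_bytes.decode("gbk").encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == original

# Assumed failure mode (not confirmed for Nutch): the GBK bytes are decoded
# as UTF-8 and every undecodable byte becomes U+FFFD. This step is lossy.
damaged = gbk_bytes.decode("utf-8", errors="replace")

# Running the same GBK -> UTF-8 conversion on the damaged text cannot
# bring the original characters back, because the original bytes are gone.
recovered = damaged.encode("utf-8").decode("gbk", errors="replace")

print("iconv-style conversion of raw bytes works:", round_trip_ok)
print("damaged text contains U+FFFD:", "\ufffd" in damaged)
print("conversion of damaged text recovers original:", recovered == original)
```

If that assumption holds, the fix would have to happen before the bytes are decoded, not afterwards on the dump.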
Is this problem caused by "parser.character.encoding.default" or by the
local environment? How can I fix it?
Thanks.