Posted to user@nutch.apache.org by Onur Deniz <de...@yahoo.com> on 2008/09/01 10:36:19 UTC
getting content from url - encoding problem
hi,
I am using Nutch just to crawl some websites; I'm not using the search facility.
I drive Nutch only through its command-line tools. I did not change anything in the source code, only in the conf files (such as the URL filters)...
I call the command-line tools from shell scripts and execute those scripts with Runtime.getRuntime().exec(...) in Java. (A slightly longer route, but it seemed easier than running from Eclipse at first.)
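Not part of the original mail, but the script-driving route described above can be sketched roughly like this in Java (the Nutch home directory, segment path, and URL key below are hypothetical placeholders, and ProcessBuilder is used here in place of the raw Runtime.exec call):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class NutchRunner {
    // Builds the readseg -get command line; all paths/keys are example placeholders.
    static List<String> buildCommand(String nutchHome, String segment, String urlKey) {
        return Arrays.asList(nutchHome + "/bin/nutch", "readseg", "-get", segment, urlKey,
                "-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext");
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("/opt/nutch", "crawl/segments/20080901", "http://example.com/");
        // Only try to run the tool if the nutch script actually exists on this machine.
        if (new File(cmd.get(0)).canExecute()) {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            // Note: the charset chosen here decides how the tool's output bytes are interpreted.
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    System.out.println(line);
                }
            }
            p.waitFor();
        }
    }
}
```

The InputStreamReader charset in this sketch is exactly where the decoding question below comes in: reading the child process's bytes as UTF-8 is only correct if the tool really emits UTF-8.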
I know how to get the content/parse text of a URL from the command line (bin/nutch readseg -get ...).
Getting the parse text is fine, because Nutch handles the site's encoding. But when I try to get the raw content of the page with that command (bin/nutch readseg -get), I run into an encoding problem:
the page is in windows-1254, but I think the command returns the content as UTF-8, because some special characters (ş, ç, ğ, ü, ı) are displayed as the replacement character (<?>).
so, my questions are,
how does the command (bin/nutch readseg -get ... -nofetch -nogenerate -noparse -noparsedata -noparsetext) return the content of the page? I mean, does it decode the content according to the page's own encoding, or does it return the content as UTF-8 by default?
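Not from the original mail, but the symptom described above (ş, ç, ğ, ü, ı turning into replacement characters) is what happens when windows-1254 bytes are decoded as UTF-8; a minimal sketch of the mismatch:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Decode the same raw bytes under a given charset, to compare interpretations.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        Charset cp1254 = Charset.forName("windows-1254");
        // Turkish letters that are single bytes in windows-1254 (e.g. 0xFE for 'ş');
        // those byte values are not valid UTF-8 sequences.
        byte[] raw = "şçğüı".getBytes(cp1254);
        String wrong = decode(raw, StandardCharsets.UTF_8); // yields U+FFFD replacement chars
        String right = decode(raw, cp1254);                 // round-trips correctly
        System.out.println("decoded as UTF-8:        " + wrong);
        System.out.println("decoded as windows-1254: " + right);
    }
}
```

So if the raw bytes stored in the segment are still windows-1254, decoding them with the page's declared charset instead of UTF-8 recovers the original characters.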
any suggestions? any solutions?
thanks all.
onur deniz