Posted to user@nutch.apache.org by Onur Deniz <de...@yahoo.com> on 2008/09/01 10:36:19 UTC
getting content from url - encoding problem
hi,
I am using Nutch just to crawl some websites; I'm not using the search facility.
I drive Nutch only through its command-line tools. I did not change anything in the source code, only in the conf files (such as the URL filters)...
I call the command-line tools from shell scripts and execute those scripts with Runtime.getRuntime().exec(...) in Java. (A slightly longer route, but it seemed easier than running from Eclipse at first.)
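Not part of the original mail, but the script-driving route described above can be sketched roughly like this in Java (the Nutch home directory, segment path, and URL key below are hypothetical placeholders, and ProcessBuilder is used here in place of the raw Runtime.exec call):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class NutchRunner {
    // Builds the readseg -get command line; all paths/keys are example placeholders.
    static List<String> buildCommand(String nutchHome, String segment, String urlKey) {
        return Arrays.asList(nutchHome + "/bin/nutch", "readseg", "-get", segment, urlKey,
                "-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext");
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("/opt/nutch", "crawl/segments/20080901", "http://example.com/");
        // Only try to run the tool if the nutch script actually exists on this machine.
        if (new File(cmd.get(0)).canExecute()) {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            // Note: the charset chosen here decides how the tool's output bytes are interpreted.
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    System.out.println(line);
                }
            }
            p.waitFor();
        }
    }
}
```

The InputStreamReader charset in this sketch is exactly where the decoding question below comes in: reading the child process's bytes as UTF-8 is only correct if the tool really emits UTF-8.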
I know how to get the content/parse text of a URL from the command line (bin/nutch readseg -get ...).
Getting the parse text is fine, because Nutch handles the site's encoding. But when I try to get the raw content of the page with that command (bin/nutch readseg -get), I run into an encoding problem:
the page is in windows-1254, but I think the command returns the content as UTF-8, because some special characters (ş, ç, ğ, ü, ı) are displayed as the replacement character (<?>).
so, my questions are,
how does the command (bin/nutch readseg -get ... -nofetch -nogenerate -noparse -noparsedata -noparsetext) return the content of the page? I mean, does it decode the content according to the page's own encoding, or does it return the content as UTF-8 by default?
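Not from the original mail, but the symptom described above (ş, ç, ğ, ü, ı turning into replacement characters) is what happens when windows-1254 bytes are decoded as UTF-8; a minimal sketch of the mismatch:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Decode the same raw bytes under a given charset, to compare interpretations.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        Charset cp1254 = Charset.forName("windows-1254");
        // Turkish letters that are single bytes in windows-1254 (e.g. 0xFE for 'ş');
        // those byte values are not valid UTF-8 sequences.
        byte[] raw = "şçğüı".getBytes(cp1254);
        String wrong = decode(raw, StandardCharsets.UTF_8); // yields U+FFFD replacement chars
        String right = decode(raw, cp1254);                 // round-trips correctly
        System.out.println("decoded as UTF-8:        " + wrong);
        System.out.println("decoded as windows-1254: " + right);
    }
}
```

So if the raw bytes stored in the segment are still windows-1254, decoding them with the page's declared charset instead of UTF-8 recovers the original characters.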
any suggestions? any solutions?
thanks all.
onur deniz