Posted to dev@nutch.apache.org by "Edward Ackroyd (JIRA)" <ji...@apache.org> on 2013/02/12 19:39:13 UTC
[jira] [Created] (NUTCH-1530) Umlauts (üäö) garbled when fetch and parse in separate calls (OK when fetcher.parse is true)
Edward Ackroyd created NUTCH-1530:
-------------------------------------
Summary: Umlauts (üäö) garbled when fetch and parse in separate calls (OK when fetcher.parse is true)
Key: NUTCH-1530
URL: https://issues.apache.org/jira/browse/NUTCH-1530
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.1
Environment: Using Cassandra-1.2.1 as data store.
Reporter: Edward Ackroyd
When crawling http://www.spiegel.de (a popular German news site) with separate fetch and parse calls (nutch fetch, then nutch parse, with fetcher.parse=false), the following lands in Cassandra, with all umlauts garbled (for example, '�' instead of 'ö'):
[default@webpage] list p;
RowKey: de.spiegel.www:http/
=> (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE NACHRICHTEN Home Politik Deutschland Ausland Wirtschaft B�rse Verbraucher & Service Unternehmen & M�rkte Staat & Soziales Jobsuche Immowelt Panorama Justiz Leute Gesellschaft Partnersuche Eurojackpot Tarifvergleiche Sport Wintersport Fu�ball Bundesliga...
However, when fetcher.parse=true and the fetch call does the parsing, the correct umlauts land in Cassandra:
[default@webpage] list p;
RowKey: de.spiegel.www:http/
=> (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE NACHRICHTEN Home Politik Deutschland Ausland Wirtschaft Börse Verbraucher & Service Unternehmen & Märkte Staat & Soziales Jobsuche Immowelt Panorama Justiz Leute Gesellschaft Partnersuche Eurojackpot Tarifvergleiche Sport Wintersport Fußball Bundesliga...
It seems the content is over-encoded when fetching and parsing happen in separate calls.
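For illustration only, here is a minimal Python sketch of one way the '�' character (U+FFFD) can arise. It is not Nutch code and not a confirmed diagnosis of this issue. It assumes the page bytes are ISO-8859-1 and that something in the separate parse step decodes them as UTF-8 with replacement of malformed input; both are assumptions, and the sample string is taken from the output above. Note that a true double encoding would more typically produce sequences such as 'Ã¶', whereas U+FFFD is what a decoder emits for bytes it cannot interpret.

```python
# Hypothetical illustration; none of this is taken from the Nutch code base.
text = "Börse Märkte Fußball"

# Assumption: the server delivers the page as ISO-8859-1 (Latin-1) bytes.
raw = text.encode("iso-8859-1")

# Decoding with the matching charset round-trips cleanly.
good = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails on every non-ASCII byte;
# errors="replace" substitutes U+FFFD for each malformed sequence.
bad = raw.decode("utf-8", errors="replace")

print(good)
print(bad)
```

If this is what happens, the garbling would point to the charset used when the stored content is decoded in the parse step, rather than to an extra encoding pass.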