Posted to dev@nutch.apache.org by "Roland (JIRA)" <ji...@apache.org> on 2013/02/12 23:13:14 UTC

[jira] [Comment Edited] (NUTCH-1530) Umlauts (üäö) garbled when fetch and parse in separate calls (OK when fetcher.parse is true)

    [ https://issues.apache.org/jira/browse/NUTCH-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577094#comment-13577094 ] 

Roland edited comment on NUTCH-1530 at 2/12/13 10:13 PM:
---------------------------------------------------------

Ok, here it is:

get f['de.spiegel.www:http/'];
{code}
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
[...]
<meta name="description" content="Deutschlands f�hrende Nachrichtenseite. Alles Wichtige aus Politik, Wirtschaft, Sport, Kultur, Wissenschaft, Technik und mehr." />
[...]
Reaktionen auf Papst-R�cktritt
{code}

get sc['de.spiegel.www:http/'];
{code}
=> (super_column=h,
     (column=Cache-Control, value=max-age=120, timestamp=1360703830403000)
     (column=Connection, value=close, timestamp=1360703830404004)
     (column=Content-Encoding, value=gzip, timestamp=1360703830404000)
     (column=Content-Type, value=text/html;charset=ISO-8859-1, timestamp=1360703830404006)
[...]
{code}

get p['de.spiegel.www:http/'];
{code}
[default@webpage] get p['de.spiegel.www:http/'];
=> (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE NACHRICHTEN Home Politik Deutschland Ausland   Wirtschaft B�rse Verbraucher & Service Unternehmen & M�rkte Sta...
{code}

You're right, it seems to be a problem with ISO charsets. (It looks like the ISO-8859-1 content is treated as UTF-8 and then saved again.)
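
The suspected mechanism can be reproduced outside Nutch with a minimal Java sketch. This is not Nutch code: the class and method names are hypothetical, and it only assumes that the fetched ISO-8859-1 bytes get decoded as UTF-8 somewhere before being written back. A lone ISO-8859-1 umlaut byte is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD, and re-encoding stores that replacement character in place of the original byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
public class CharsetMismatchSketch {

    /** Encodes text the way a server sending charset=ISO-8859-1 would. */
    public static byte[] asLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /**
     * Simulates the suspected bug: ISO-8859-1 bytes are decoded as UTF-8
     * and the resulting string is stored again as UTF-8.
     */
    public static byte[] decodeAsUtf8AndSaveAgain(byte[] latin1Bytes) {
        // Invalid UTF-8 input is replaced with U+FFFD by this constructor.
        String wronglyDecoded = new String(latin1Bytes, StandardCharsets.UTF_8);
        return wronglyDecoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00F6 is 'ö'; the escape keeps the source file pure ASCII.
        byte[] fetched = asLatin1("B\u00F6rse");

        // Decoding with the charset the server declared keeps the umlaut.
        String correct = new String(fetched, StandardCharsets.ISO_8859_1);

        // Decoding as UTF-8 and saving again loses it for good.
        byte[] stored = decodeAsUtf8AndSaveAgain(fetched);
        String garbled = new String(stored, StandardCharsets.UTF_8);

        System.out.println(correct.equals("B\u00F6rse"));
        System.out.println(garbled.indexOf('\uFFFD') >= 0);
        System.out.println(fetched.length + " bytes fetched, " + stored.length + " bytes stored");
    }
}
```

Once the replacement character has been stored, the original byte cannot be recovered from it, which would match the dumps above if the assumption holds.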

--Roland
                
> Umlauts (üäö) garbled when fetch and parse in separate calls (OK when fetcher.parse is true)
> --------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1530
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1530
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.1
>         Environment: Using Cassandra-1.2.1 as data store.
>            Reporter: Edward Ackroyd
>
> When crawling http://www.spiegel.de (popular German news site) in separate fetch and parse calls (nutch fetch, then nutch parse, fetcher.parse=false) this lands in Cassandra (umlauts all garbled, for example '�' instead of 'ö'):
> [default@webpage] list p;
> RowKey: de.spiegel.www:http/
> => (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE NACHRICHTEN Home Politik Deutschland Ausland   Wirtschaft B�rse Verbraucher & Service Unternehmen & M�rkte Staat & Soziales Jobsuche Immowelt   Panorama Justiz Leute Gesellschaft Partnersuche Eurojackpot Tarifvergleiche   Sport Wintersport Fu�ball Bundesliga...
> However, when fetcher.parse=true and the fetch call does the parsing, the correct umlauts land in Cassandra:
> [default@webpage] list p;
> RowKey: de.spiegel.www:http/
> => (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE NACHRICHTEN Home Politik Deutschland Ausland   Wirtschaft Börse Verbraucher & Service Unternehmen & Märkte Staat & Soziales Jobsuche Immowelt   Panorama Justiz Leute Gesellschaft Partnersuche Eurojackpot Tarifvergleiche   Sport Wintersport Fußball Bundesliga...
> Seems the content is over-encoded when fetching/parsing in separate calls.
