Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2017/10/26 15:33:17 UTC
Wrong encoding
Hello,
According to parsechecker, the URL below has Content-Type=text/html; charset=windows-1252, which is incorrect. The metadata also contains Content-Type=text/html; charset=utf-8, which matches what I find in the HTML: at least I see <meta charset="utf-8">. This is Nutch 1.14-SNAPSHOT.
In any case, the extracted text is garbled: most, though not all, accented characters are unreadable.
I have no idea what causes this. Do you?
Many thanks,
Markus
https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
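The symptom described above is consistent with UTF-8 bytes being decoded as windows-1252. A minimal Java sketch of that effect follows. It is an illustration only, not Nutch or Tika code: the class name MojibakeDemo, the method misdecode, and the sample string are hypothetical, and whether this is the actual cause for the page in question is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch or Tika.
public class MojibakeDemo {

    // Encode the text as UTF-8, then decode those bytes with another charset,
    // mimicking a parser that picked the wrong encoding for a UTF-8 page.
    static String misdecode(String text, String wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        // Illustrative sample containing one non-ASCII letter (o with stroke).
        String original = "gr\u00f8nt";
        String garbled = misdecode(original, "windows-1252");

        // ASCII letters are unchanged; the accented letter becomes two characters.
        System.out.println("original length: " + original.length());
        System.out.println("garbled length:  " + garbled.length());
        System.out.println(original + " -> " + garbled);
    }
}
```

Each non-ASCII character takes two or more bytes in UTF-8, and windows-1252 maps every byte to its own character, so accented letters turn into multi-character sequences while plain ASCII text stays readable.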
RE: Wrong encoding
Posted by Markus Jelsma <ma...@openindex.io>.
Note: setting parser.character.encoding.default to UTF-8 does not fix it.
Many thanks,
Markus
RE: Wrong encoding
Posted by Markus Jelsma <ma...@openindex.io>.
Update: the problem occurs only in the TikaParser!
Ideas?
Markus
-----Original message-----
> From:Markus Jelsma <ma...@openindex.io>
> Sent: Thursday 26th October 2017 17:53
> To: user@nutch.apache.org; User <us...@nutch.apache.org>
> Subject: RE: Wrong encoding
>
> Note: setting parser.character.encoding.default to UTF-8 doesn't work.
>
> Many thanks,
> Markus