Posted to user@tika.apache.org by reinhard schwab <re...@aon.at> on 2010/09/30 14:24:20 UTC

encoding detected by HtmlParser

How does Tika determine the encoding of HTML files?
I'm asking because I have to parse files that declare the wrong encoding in
the HTML header.

example:
http://www.brz.gv.at/Portal.Node/brz/public/content/aktuelles/pressemeldungen/41263.html

<meta http-equiv="content-type" content="application/xhtml+xml;
charset=iso-8859-1" />

In reality, the encoding is UTF-8.
This is also the encoding given in the HTTP response header.
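
One way to confirm that a page is really UTF-8 despite its meta tag is to decode the raw bytes strictly. The following is only a minimal sketch using the JDK; the class and method names (Utf8Check, isValidUtf8) are made up for illustration and are not part of Tika. Note that pure ASCII content passes this check as well, so a positive result is a hint rather than proof.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "f\u00fcr" contains one non-ASCII character (u with umlaut).
        byte[] asUtf8 = "f\u00fcr".getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = "f\u00fcr".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(asUtf8));
        System.out.println(isValidUtf8(asLatin1));
    }
}
```

ISO-8859-1 text with non-ASCII characters almost always fails this check, which is why it helps to tell the two encodings apart.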

Looking at the code in HtmlParser, the method
private String getEncoding(InputStream stream, Metadata metadata)
tries to identify the encoding by checking the meta tags.
If it finds an encoding there, it returns that encoding.
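
To illustrate what "checking the meta tags" can look like, here is a rough sketch that scans the first bytes of a document for a charset declaration. This is not the actual HtmlParser code; the class name MetaCharsetSniffer, the regular expression and the 8192-byte prefix length are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Tika's implementation.
class MetaCharsetSniffer {

    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumed limit on how much of the document is inspected.
    private static final int PREFIX_LENGTH = 8192;

    // Returns the declared charset name, or null if no meta tag declares one.
    static String sniff(byte[] html) {
        int len = Math.min(html.length, PREFIX_LENGTH);
        // Markup is ASCII-compatible in the encodings of interest here,
        // so decoding the prefix as ASCII is enough to find the tag.
        String head = new String(html, 0, len, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<meta http-equiv=\"content-type\" "
                + "content=\"application/xhtml+xml;\ncharset=iso-8859-1\" />";
        System.out.println(sniff(page.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A sniffer like this trusts whatever the page declares, which is exactly the problem when the declaration is wrong.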

I set both the content type and the content encoding in the Tika metadata to
bias the HtmlParser, but these are not consulted first.
They are only used later, when no encoding is found in the meta tags.

So how will Tika handle such situations in the future, when
a/ the encoding in the meta tag is wrong
b/ the encoding in the HTTP response header is correct and differs from the one
in the meta tag?

regards
reinhard






Re: encoding detected by HtmlParser

Posted by reinhard schwab <re...@aon.at>.
I have modified the getEncoding method in HtmlParser.
It first checks for a charset in the meta tags.
If no incomingCharset is provided, the meta-tag charset is returned.
If an incomingCharset is provided, it is compared to the charset found
in the meta tags.
If they are equal, that charset is returned.
If they are not equal, encoding detection is used instead.
I can open an issue and contribute a patch.
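
The decision logic described above can be sketched roughly as follows. This is not the patch itself: the class name CharsetResolver, the method signature and the Supplier standing in for the detection step are assumptions for illustration. The branch for a missing meta charset (fall back to incomingCharset, then to detection) is also an assumption, based on the existing behaviour of using the metadata only when no meta tag charset is found.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the described decision logic, not the actual patch.
class CharsetResolver {

    static String resolve(String metaCharset,
                          String incomingCharset,
                          Supplier<String> detector) {
        if (metaCharset == null) {
            // Assumption: without a meta charset, prefer the incoming one,
            // and only run detection if that is missing too.
            return incomingCharset != null ? incomingCharset : detector.get();
        }
        if (incomingCharset == null) {
            // Nothing to compare against: trust the meta tag.
            return metaCharset;
        }
        if (metaCharset.equalsIgnoreCase(incomingCharset)) {
            // Both sources agree.
            return metaCharset;
        }
        // The sources disagree: let detection on the content decide.
        return detector.get();
    }

    public static void main(String[] args) {
        // Meta tag says iso-8859-1, HTTP header says UTF-8; detection decides.
        System.out.println(resolve("iso-8859-1", "UTF-8", () -> "UTF-8"));
    }
}
```

The comparison here is a plain case-insensitive string match, so aliases of the same charset (for example latin1 and ISO-8859-1) would count as a mismatch and trigger detection.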

regards
reinhard

