Posted to dev@tika.apache.org by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org> on 2011/10/25 23:12:32 UTC

[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

     [ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-539:
-----------------------------------

    Fix Version/s:     (was: 1.0)
                   1.1

- Pushed out to 1.1 while preparing for 1.0.
                
> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 0.8, 0.9, 0.10
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 1.1
>
>         Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding declared in the meta tag is wrong, that encoding is still detected,
> even when the correct encoding was already set in the metadata (for example, from the HTTP response header).
> Test code to reproduce:
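> import java.io.ByteArrayInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.SAXException;
>
> // The class name is a placeholder; the report supplied only the members below.
> public class Tika539Repro {
> 	// The meta tag claims iso-8859-1, but the bytes handed to the parser are UTF-8.
> 	static String content = "<html><head>\n"
> 			+ "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
> 			+ "</head><body>Über den Wolken\n</body></html>";
>
> 	public static void main(String[] args) throws IOException, SAXException,
> 			TikaException {
> 		// Hints set by the caller before parsing, e.g. from an HTTP response header.
> 		Metadata metadata = new Metadata();
> 		metadata.set(Metadata.CONTENT_TYPE, "text/html");
> 		metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
> 		System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
> 		InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
> 		AutoDetectParser parser = new AutoDetectParser();
> 		BodyContentHandler h = new BodyContentHandler(10000);
> 		parser.parse(in, h, metadata, new ParseContext());
> 		// Print the extracted body text and the encoding recorded after parsing.
> 		System.out.print(h.toString());
> 		System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
> 	}
> }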
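
The visible symptom of the mismatch described above can be reproduced without Tika. The following is only a minimal sketch using the JDK's own charset support: it decodes the UTF-8 bytes of the test string once with the correct charset and once with the ISO-8859-1 charset that the meta tag wrongly declares. The class name `CharsetMismatchDemo` and the helper `decode` are hypothetical names introduced for illustration. They are not part of Tika or of the attached patches.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or model any Tika API.
public class CharsetMismatchDemo {

    // Decode raw bytes with the charset a detector decided on.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u00DC" is the letter Ü; in UTF-8 it is the two bytes 0xC3 0x9C.
        String text = "\u00DCber den Wolken";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the original text round-trips unchanged.
        String right = decode(utf8, StandardCharsets.UTF_8);
        // Wrong charset (as declared in the meta tag): each byte becomes
        // its own character, so the Ü turns into two unrelated characters.
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(text));
        System.out.println(wrong.equals(text));
        System.out.println(right.length() + " vs " + wrong.length());
    }
}
```

Since ISO-8859-1 maps every byte to a character, the wrong decoding never fails outright. It silently produces a longer, corrupted string, which is why a misleading meta tag is hard to notice from the parser output alone.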
