Posted to dev@tika.apache.org by "Jukka Zitting (Resolved) (JIRA)" <ji...@apache.org> on 2011/11/05 20:32:51 UTC

[jira] [Resolved] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

     [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-772.
--------------------------------

    Resolution: Cannot Reproduce
      Assignee: Jukka Zitting

Works for me:

{code}
$ for f in *.html; do echo -n "$f: "; java -jar tika-app-1.0.jar --detect < $f; done
bg.html: text/html
cs.html: text/html
da.html: text/html
de.html: text/html
el.html: text/html
en.html: text/html
es.html: text/html
et.html: text/html
fi.html: text/html
fr.html: text/html
hu.html: text/html
it.html: text/html
lt.html: text/html
lv.html: text/html
mt.html: text/html
nl.html: text/html
pl.html: text/html
pt.html: text/html
ro.html: text/html
sk.html: text/html
sl.html: text/html
sv.html: text/html
{code}
                
> media type detection fails for html documents, results in text/plain instead of text/html
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-772
>                 URL: https://issues.apache.org/jira/browse/TIKA-772
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.10
>            Reporter: Joseph Vychtrle
>            Assignee: Jukka Zitting
>              Labels: detection, media-type
>         Attachments: html.zip
>
>
> Hey, I was testing media type detection on most of the major document types. HTML documents of about 5000 words that start with:
> <?xml version="1.0" encoding="UTF-8"?>
> and consist only of a root "html" element and "p" elements are always detected as text/plain instead of text/html.
> {code:title=Bar.java|borderStyle=solid}
> // Document and DocumentProvider are the reporter's own test helpers.
> @Test
> public void testMediaType() throws Exception {
>     List<Document> allDocs = DocumentProvider.docsAsList();
>     Map<Document, String> failed = new HashMap<Document, String>();
>     Tika tika = new Tika(); // one facade instance is enough for all documents
>     for (Document doc : allDocs) {
>         InputStream stream = TikaInputStream.get(doc.getFile());
>         try {
>             String type = tika.detect(stream);
>             if (!doc.getMediaType().toString().equals(type)) {
>                 failed.put(doc, type);
>             }
>         } finally {
>             stream.close(); // do not leak file handles across documents
>         }
>     }
>
>     for (Document doc : failed.keySet()) {
>         log.error("expected: " + doc.getMediaTypeString() + "; actual: " + failed.get(doc) + ";  path to file: " + doc.getFile().getAbsolutePath());
>     }
>     assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : " + failed.size() + " documents;");
> }
> {code}
> Am I doing anything wrong?
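
For anyone trying to reproduce this without the html.zip attachment, here is a minimal sketch that generates a file of the shape described in the report: an XML declaration, a root "html" element, and nothing but "p" elements. The class name SampleHtml, the filler words, and the 50-words-per-paragraph split are all hypothetical; the real attachments may differ in ways that matter for detection.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SampleHtml {

    // Assumption: paragraph size is arbitrary; the report only gives the total word count.
    private static final int WORDS_PER_PARAGRAPH = 50;

    /** Builds an XML-declared document with a root html element and p elements only. */
    public static String build(int words) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<html>\n");
        for (int i = 0; i < words; i++) {
            if (i % WORDS_PER_PARAGRAPH == 0) {
                sb.append("<p>");
            }
            sb.append("word").append(i); // filler text, not taken from the attachment
            boolean endOfParagraph =
                    i % WORDS_PER_PARAGRAPH == WORDS_PER_PARAGRAPH - 1 || i == words - 1;
            sb.append(endOfParagraph ? "</p>\n" : " ");
        }
        sb.append("</html>\n");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Writes roughly 5000 words, matching the size mentioned in the report.
        Path out = Paths.get(args.length > 0 ? args[0] : "sample.html");
        Files.write(out, build(5000).getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + out);
    }
}
```

The resulting file can then be checked the same way as in the comment above, with java -jar tika-app-1.0.jar --detect < sample.html.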

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira