Posted to dev@tika.apache.org by "Joseph Vychtrle (Created) (JIRA)" <ji...@apache.org> on 2011/11/03 22:15:32 UTC

[jira] [Created] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

media type detection fails for html documents, results in text/plain instead of text/html
-----------------------------------------------------------------------------------------

                 Key: TIKA-772
                 URL: https://issues.apache.org/jira/browse/TIKA-772
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 0.10
            Reporter: Joseph Vychtrle


Hey, I was testing media type detection on most of the major document types. For HTML documents of about 5000 words that start with:
<?xml version="1.0" encoding="UTF-8"?>

and consist only of a root "html" element and "p" elements, detection always returns text/plain instead of text/html.

{code:title=Bar.java|borderStyle=solid}
// Document and DocumentProvider are my own test helpers; log is the test class's logger.
@Test
public void testMediaType() throws Exception {
    List<Document> allDocs = DocumentProvider.docsAsList();
    Map<Document, String> failed = new HashMap<Document, String>();
    Tika tika = new Tika();
    for (Document doc : allDocs) {
        // Detect from the stream, as in the failing case, and close it afterwards.
        InputStream stream = TikaInputStream.get(doc.getFile());
        try {
            String type = tika.detect(stream);
            if (!doc.getMediaType().toString().equals(type)) {
                failed.put(doc, type);
            }
        } finally {
            stream.close();
        }
    }

    for (Document doc : failed.keySet()) {
        log.error("expected: " + doc.getMediaTypeString() + "; actual: " + failed.get(doc) + ";  path to file: " + doc.getFile().getAbsolutePath());
    }
    // TestNG's assertTrue takes the condition first and the message second.
    assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : " + failed.size() + " documents;");
}
{code}

Am I doing anything wrong?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144763#comment-13144763 ] 

Jukka Zitting commented on TIKA-772:
------------------------------------

Can you attach an example document that illustrates this problem?

PS. You can simplify (and improve) your code by using {{tika.detect(doc.getFile())}} instead of {{tika.detect(TikaInputStream.get(doc.getFile()))}}.
                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144849#comment-13144849 ] 

Jukka Zitting commented on TIKA-772:
------------------------------------

The latter method (tika.detect(doc.getFile())) also makes the .html suffix available to the detector, which helps Tika guess the type of the document. Even so, Tika should be able to detect the correct type with the former version as well.

Can you check what output you get from the following two commands:

{code}
$ java -jar tika-app-0.10.jar --detect < it.html
$ java -jar tika-app-0.10.jar --detect it.html
{code}

These calls are roughly equivalent to the two method calls you mentioned. On my computer both return text/html.
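
To illustrate why the two calls can disagree, here is a toy sketch in plain Java. It is not Tika's implementation: the class DetectionHints and its methods are hypothetical, and the content check is deliberately simplistic (skip an optional XML declaration, then look for "<html"). It only shows that a detector which sees bytes alone has less evidence than one that also receives a file name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; not Tika code.
class DetectionHints {

    /** Guess from the file name alone; returns null when the suffix is unknown. */
    public static String guessFromName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".html") || lower.endsWith(".htm")) {
            return "text/html";
        }
        return null;
    }

    /** Guess from leading bytes alone: skip an optional XML declaration, then look for "<html". */
    public static String guessFromContent(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1).trim();
        if (text.startsWith("<?xml")) {
            int end = text.indexOf("?>");
            if (end < 0) {
                return "text/plain"; // declaration runs past the sniffed prefix
            }
            text = text.substring(end + 2).trim();
        }
        return text.toLowerCase(Locale.ROOT).startsWith("<html") ? "text/html" : "text/plain";
    }

    /** Content evidence first; the file name, when present, is only a fallback hint. */
    public static String detect(byte[] head, String fileNameOrNull) {
        String byContent = guessFromContent(head);
        if (!"text/plain".equals(byContent) || fileNameOrNull == null) {
            return byContent;
        }
        String byName = guessFromName(fileNameOrNull);
        return byName != null ? byName : byContent;
    }

    public static void main(String[] args) {
        byte[] xmlHtml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html><p>Parere</p></html>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] plain = "plain words".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(xmlHtml, null));    // stream only
        System.out.println(detect(plain, "it.html")); // stream plus file name
        System.out.println(detect(plain, null));      // stream only, no HTML evidence
    }
}
```

In this sketch the file name rescues content that gives no HTML evidence, while a stream on its own has to be recognized from its bytes.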
                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144836#comment-13144836 ] 

Jukka Zitting commented on TIKA-772:
------------------------------------

I piped the files to tika-app to prevent it from seeing the file extension.

I wonder if the problem has something to do with the way you're providing the files to Tika. Can you try the following code and paste the output here?

{code}
File file = doc.getFile();
Tika tika = new Tika();
String type = tika.detect(file);
if (!"text/html".equals(type)) {
    System.out.println(file.getName() + ": " + type);
    // Print the first 200 characters of any file that is not detected as HTML.
    Reader reader = new FileReader(file);
    try {
        char[] c = new char[200];
        int n = reader.read(c);
        if (n > 0) { // read() returns -1 for an empty file
            System.out.println(new String(c, 0, n));
        }
    } finally {
        reader.close();
    }
}
{code}

Alternatively, can you modify your code to something I could run without access to the rest of your codebase? Without a test case that I can execute locally it's hard to tell where the problem may be.
                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144851#comment-13144851 ] 

Joseph Vychtrle commented on TIKA-772:
--------------------------------------

Weird,
{noformat}
java -jar tika-app-0.10.jar --detect < /tmp/docProv/html/it.html 
text/html
java -jar tika-app-0.10.jar --detect /tmp/docProv/html/it.html 
text/html
{noformat}

You can reproduce it like this:
{code}
@Test
public void test2Tika() throws Exception {
    File file = new File("/tmp/docProv/html/it.html");
    Tika tika = new Tika();
    InputStream stream = TikaInputStream.get(file);
    try {
        String type = tika.detect(stream);
        System.out.println(type);
    } finally {
        stream.close();
    }
}
{code}

Output:
{noformat}text/plain{noformat}
                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144772#comment-13144772 ] 

Joseph Vychtrle commented on TIKA-772:
--------------------------------------

Hey Jukka, 

I found that it happens only for HTML documents that start with

{noformat}<?xml version="1.0" encoding="UTF-8"?>{noformat}

I attached a zip archive. The first two documents (bg.html, cs.html), which do not have this XML declaration, are detected correctly.
                

[jira] [Updated] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Vychtrle updated TIKA-772:
---------------------------------

    Attachment: html.zip
    

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144853#comment-13144853 ] 

Joseph Vychtrle commented on TIKA-772:
--------------------------------------

But to be honest, it makes sense. Tika doesn't have a detector that would detect the HTML media type, does it? The document has no magic prefix and Tika doesn't know the file name, so all the other detectors are irrelevant here. What am I missing?
                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144828#comment-13144828 ] 

Joseph Vychtrle commented on TIKA-772:
--------------------------------------

The MimeType detector doesn't find it, the name of the file is not specified in the metadata, and the ZipContainer and POIFS detectors don't find it either.

I think that "java -jar tika-app-1.0.jar --detect" succeeds because it decides on the media type based on the file name and its .html extension.
                

[jira] [Resolved] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Jukka Zitting (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-772.
--------------------------------

    Resolution: Cannot Reproduce
      Assignee: Jukka Zitting

Works for me:

{code}
$ for f in *.html; do printf '%s: ' "$f"; java -jar tika-app-1.0.jar --detect < "$f"; done
bg.html: text/html
cs.html: text/html
da.html: text/html
de.html: text/html
el.html: text/html
en.html: text/html
es.html: text/html
et.html: text/html
fi.html: text/html
fr.html: text/html
hu.html: text/html
it.html: text/html
lt.html: text/html
lv.html: text/html
mt.html: text/html
nl.html: text/html
pl.html: text/html
pt.html: text/html
ro.html: text/html
sk.html: text/html
sl.html: text/html
sv.html: text/html
{code}
                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144865#comment-13144865 ] 

Joseph Vychtrle commented on TIKA-772:
--------------------------------------

Funny thing, Jukka: I will talk to Cedric Beust about it, because it happens only when the code runs as part of a TestNG test :-) I just ran it from a main method instead of an @Test method and it works.
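
The thread does not establish why the TestNG run behaves differently. One thing that could be compared between the two runs is what each of them finds on the classpath; a differing classpath is only an assumption here, not a confirmed cause. The sketch below uses a hypothetical helper, ClasspathProbe, and the resource name in main is assumed to be where tika-core keeps its media type definitions. Running it from both the main method and the @Test method would show whether the two environments see the same resources.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or TestNG.
class ClasspathProbe {

    /** Returns every classpath location that provides the named resource. */
    public static List<String> locations(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        List<String> found = new ArrayList<String>();
        try {
            for (URL url : Collections.list(loader.getResources(resourceName))) {
                found.add(url.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed resource path; pass another name as the first argument if it differs.
        String name = args.length > 0 ? args[0] : "org/apache/tika/mime/tika-mimetypes.xml";
        List<String> found = locations(name);
        System.out.println(found.size() + " location(s) for " + name);
        for (String location : found) {
            System.out.println("  " + location);
        }
    }
}
```

If the test run lists no location, or more than one, while the main run lists exactly one, that difference would be worth investigating first.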
                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144854#comment-13144854 ] 

Jukka Zitting commented on TIKA-772:
------------------------------------

The test case you added prints out "text/html" for me when run against the it.html file included in the zip you attached. Can you attach the exact "/tmp/docProv/html/it.html" file that produces the "text/plain" output?
                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144862#comment-13144862 ] 

Jukka Zitting commented on TIKA-772:
------------------------------------

The metacharacters you mention do sound suspicious. Here's what the attached it.html looks like inside; no weird metacharacters here:

{noformat}
$ od -c it.html | head
0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
0000020   .   0   "       e   n   c   o   d   i   n   g   =   "   U   T
0000040   F   -   8   "   ?   >  \n   <   h   t   m   l   >   <   p   >
0000060   P   a   r   e   r   e       d   e   l       C   o   m   i   t
0000100   a   t   o       e   c   o   n   o   m   i   c   o       e
0000120   s   o   c   i   a   l   e       e   u   r   o   p   e   o
0000140   s   u   l       t   e   m   a       I   l       r   u   o   l
0000160   o       d   e   l   l   a       s   o   c   i   e   t 303 240
0000200       c   i   v   i   l   e       n   e   l   l   e       r   e
0000220   l   a   z   i   o   n   i       U   E   -   S   e   r   b   i
{noformat}

I still get "text/html" when running the test against this file.
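
Where od is not available, roughly the same check can be done from Java. The following is only a sketch: the class name LeadingBytes and the render helper are made up for illustration and are not part of Tika. It prints the first bytes of a file as printable ASCII, C-style escapes, or three-digit octal values, so that any unexpected bytes between the XML declaration and <html> would stand out. Unlike od, it does not print offsets or align columns.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LeadingBytes {

    /**
     * Renders up to 'limit' bytes, separated by single spaces:
     * printable ASCII as-is, \n \r \t as escapes, everything else
     * as a three-digit octal value (similar in spirit to `od -c`).
     */
    public static String render(byte[] data, int limit) {
        StringBuilder out = new StringBuilder();
        int n = Math.min(limit, data.length);
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            if (i > 0) {
                out.append(' ');
            }
            if (b == '\n') {
                out.append("\\n");
            } else if (b == '\r') {
                out.append("\\r");
            } else if (b == '\t') {
                out.append("\\t");
            } else if (b >= 0x20 && b < 0x7F) {
                out.append((char) b);
            } else {
                out.append(String.format("%03o", b));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // With no argument, fall back to a small built-in sample
        // (an assumption for demonstration, not the attached file).
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html><p>"
                        .getBytes(StandardCharsets.UTF_8);
        System.out.println(render(data, 64));
    }
}
```

Run as `java LeadingBytes it.html` to inspect the start of the attached file; a byte order mark or other stray bytes would show up as octal values before the first `<`.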
                

[jira] [Updated] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Vychtrle updated TIKA-772:
---------------------------------

    Attachment: tika.png

I don't know then. Take a look at my results with Tika 0.10 (see the attached tika.png).
                

[jira] [Updated] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Vychtrle updated TIKA-772:
---------------------------------

    Attachment: it.html
    

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144840#comment-13144840 ] 

Joseph Vychtrle commented on TIKA-772:
--------------------------------------

Got it. If I do

{code}tika.detect(TikaInputStream.get(doc.getFile())){code}

detection fails (except for the 2 documents without the encoding header).

But this:

{code}tika.detect(doc.getFile()){code}

succeeds for all of them.

                

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

Posted by "Joseph Vychtrle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144855#comment-13144855 ] 

Joseph Vychtrle commented on TIKA-772:
--------------------------------------

Attached... I'm on Linux, using UTF-8 as the default encoding for both the OS and Java... Although now that I'm looking at the file in vim, there are some weird metacharacters between the encoding declaration and <html> ...
                