Posted to dev@tika.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/03/17 22:59:39 UTC

[jira] [Commented] (TIKA-1365) Incorrectly MimeType detection for Apache Lucene web site

    [ https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366176#comment-14366176 ] 

ASF GitHub Bot commented on TIKA-1365:
--------------------------------------

GitHub user mkr opened a pull request:

    https://github.com/apache/tika/pull/35

    TIKA-1365: Lower priority for XML starting with comment

    TIKA-1365: Lower priority for XML starting with comment, allow HTML starting with comment to be detected as text/html

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mkr/tika TIKA-1365

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/35.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #35
    
----
commit f9655d44978af188018bee81b2d554770ddcd7f9
Author: Matthias Krueger <mk...@mkr.io>
Date:   2015-03-17T21:45:36Z

    TIKA-1365: Lower priority for XML starting with comment, allow HTML starting with comment to be detected as text/html

----
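
As a rough illustration of the idea in the pull request title above (give the
"XML starting with comment" match a lower priority so that an HTML match can
win), here is a minimal, self-contained sketch. It is not Tika's detector and
does not reproduce the contents of this patch: the class name
MagicPrioritySketch, the Rule type, the three patterns, and the numeric
priorities are all assumptions made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch only: NOT Tika's detector and NOT the rules from this patch.
final class MagicPrioritySketch {

    // Hypothetical magic rule: a pattern, whether it may appear anywhere in the
    // inspected prefix (true) or must sit at offset 0 (false), and a priority.
    static final class Rule {
        final String type;
        final String pattern;
        final boolean anywhere;
        final int priority;

        Rule(String type, String pattern, boolean anywhere, int priority) {
            this.type = type;
            this.pattern = pattern;
            this.anywhere = anywhere;
            this.priority = priority;
        }
    }

    // Assumed priorities, chosen only to show the effect of ranking the
    // "leading comment" rule below the HTML rule.
    static final Rule[] RULES = {
        new Rule("application/xml", "<?xml", false, 50),
        new Rule("text/html", "<html", true, 40),
        new Rule("application/xml", "<!--", false, 30),
    };

    // Returns the type of the highest-priority rule that matches the prefix,
    // or application/octet-stream when no rule matches.
    static String detect(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        Rule best = null;
        for (Rule rule : RULES) {
            boolean hit = rule.anywhere
                    ? head.contains(rule.pattern)
                    : head.startsWith(rule.pattern);
            if (hit && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.type;
    }

    public static void main(String[] args) {
        String htmlWithComment = "<!-- license header -->\n<html><head></head></html>";
        String xmlWithComment = "<!-- license header -->\n<root/>";
        System.out.println(detect(htmlWithComment.getBytes(StandardCharsets.UTF_8)));
        System.out.println(detect(xmlWithComment.getBytes(StandardCharsets.UTF_8)));
    }
}
```

With these assumed rules, a document that opens with a comment and then
contains an <html> tag resolves to text/html, while a document that opens with
a comment and has no HTML marker still falls back to application/xml. If the
comment rule outranked the HTML rule instead, both inputs would resolve to
application/xml, which is the kind of result the issue below reports.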


> Incorrectly MimeType detection for Apache Lucene web site
> ---------------------------------------------------------
>
>                 Key: TIKA-1365
>                 URL: https://issues.apache.org/jira/browse/TIKA-1365
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>            Reporter: Tien Nguyen Manh
>         Attachments: discussion.html
>
>
> Tika 1.5 detects many pages from the Apache Lucene web site as XML, for example this page:
> http://lucene.apache.org/core/discussion.html
> Here is the error log. Parsing failed because the XML parser was used:
> Apache Tika was unable to parse the document
> at http://lucene.apache.org/core/discussion.html.
> The full exception stack trace is included below:
> org.apache.tika.exception.TikaException: XML parse error
> 	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
> 	at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
> 	at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
> 	at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)


