Posted to dev@tika.apache.org by "Brett S. (JIRA)" <ji...@apache.org> on 2010/02/09 20:01:28 UTC

[jira] Created: (TIKA-377) Error parsing HTML partial with AutoDetect parser

Error parsing HTML partial with AutoDetect parser
-------------------------------------------------

                 Key: TIKA-377
                 URL: https://issues.apache.org/jira/browse/TIKA-377
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.6
            Reporter: Brett S.
         Attachments: test.html

I get the following error when parsing an HTML file that contains a partial HTML document:

TIKA-237: Illegal SAXException from org.apache.tika.parser.xml.DcXMLParser@3a43af 

The following conditions need to exist in the file for the error to be thrown:

+ An HTML comment before any HTML tags
+ More than one top-level HTML tag

I will attach a test file to reproduce the problem.
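
For illustration, here is a minimal sketch of why such a fragment fails once it is handed to a strict XML parser. It uses only the JDK's built-in SAX parser, not Tika itself, and the sample fragment is a made-up stand-in for the attached test.html (whose actual contents are not shown here). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictXmlDemo {

    // Returns true if the text parses as well-formed XML with the JDK's SAX parser.
    static boolean isWellFormedXml(String text) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // A strict parser reports any well-formedness violation as a SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed stand-in for test.html: a comment, then two top-level elements.
        String partial = "<!-- fragment -->\n<p>first</p>\n<p>second</p>\n";
        // Same content wrapped in a single root element.
        String wrapped = "<!-- fragment -->\n<div><p>first</p><p>second</p></div>\n";

        System.out.println(isWellFormedXml(partial)); // prints false
        System.out.println(isWellFormedXml(wrapped)); // prints true
    }
}
```

The leading comment alone is legal XML; it is the second top-level element that violates the single-root rule and triggers the exception.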

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-377) Error parsing HTML partial with AutoDetect parser

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-377.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7
         Assignee: Jukka Zitting

In cases like this, the Tika type detection code is fooled into thinking that the document is XML, and any draconian XML parser will of course reject such a document.

In revisions 908554 and 908560 I added some more heuristics to Tika to better detect such tag-soup HTML. With these changes the attached test document is correctly recognized as HTML and parsed with the lenient HTML parser.
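
As a rough idea of what a heuristic of this kind can look like, the sketch below skips leading comments and checks whether the first element name is a common HTML tag. This is NOT the code from revisions 908554 or 908560; the class name, method name, and tag list are assumptions made up for this example.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSoupSniffer {

    // Hypothetical, deliberately short list of element names treated as HTML.
    private static final Set<String> HTML_TAGS = Set.of(
            "html", "head", "body", "title", "div", "p", "span", "table", "a", "ul", "ol");

    // Matches XML/HTML comments, including ones that span several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Matches the name of an opening tag; skips "<?xml", "<!DOCTYPE" and closing tags.
    private static final Pattern FIRST_TAG = Pattern.compile("<([A-Za-z][A-Za-z0-9]*)");

    // Returns true if the first element in the document prefix is a known HTML tag.
    static boolean looksLikeHtml(String prefix) {
        String withoutComments = COMMENT.matcher(prefix).replaceAll("");
        Matcher m = FIRST_TAG.matcher(withoutComments);
        return m.find() && HTML_TAGS.contains(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Comment followed by two top-level tags, as in the report.
        System.out.println(looksLikeHtml("<!-- note -->\n<p>first</p>\n<p>second</p>"));
        // A generic XML document whose root is not an HTML tag.
        System.out.println(looksLikeHtml("<?xml version=\"1.0\"?><metadata><item/></metadata>"));
    }
}
```

A check like this would also classify well-formed XHTML as HTML, which is harmless when the lenient HTML parser can handle both.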



[jira] Updated: (TIKA-377) Error parsing HTML partial with AutoDetect parser

Posted by "Brett S. (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brett S. updated TIKA-377:
--------------------------

    Attachment: test.html

Sample HTML document that reproduces the parse error described in TIKA-377.
