Posted to dev@tika.apache.org by "Brett S. (JIRA)" <ji...@apache.org> on 2010/02/09 20:01:28 UTC
[jira] Created: (TIKA-377) Error parsing HTML partial with AutoDetect parser
Error parsing HTML partial with AutoDetect parser
-------------------------------------------------
Key: TIKA-377
URL: https://issues.apache.org/jira/browse/TIKA-377
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.6
Reporter: Brett S.
Attachments: test.html
I get the following error when parsing an HTML file that contains a partial HTML document:
TIKA-237: Illegal SAXException from org.apache.tika.parser.xml.DcXMLParser@3a43af
The following conditions need to exist in the file for the error to be thrown:
+ An HTML comment before any HTML tags
+ More than one top-level HTML tag
I will attach a test file to reproduce the error.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (TIKA-377) Error parsing HTML partial with AutoDetect parser
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-377.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.7
Assignee: Jukka Zitting
In cases like this, the Tika type detection code is fooled into thinking that the document is XML, and any draconian XML parser will reject such documents.
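The rejection can be reproduced without Tika. The following is a minimal sketch that uses only the JDK's built-in SAX parser; the class name StrictXmlCheck and the sample strings are made up for illustration and are not the attached test.html. It shows that a comment followed by two top-level elements is not well-formed XML, while the same content wrapped in a single root element is.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Tika.
class StrictXmlCheck {

    /** Returns true if the text is accepted by the JDK's strict SAX parser. */
    static boolean isWellFormedXml(String text) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(text)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: a comment, then two top-level elements.
        String partial = "<!-- fragment -->\n<p>first</p>\n<p>second</p>\n";
        // Same content under a single root element.
        String wrapped = "<!-- fragment -->\n<div><p>first</p><p>second</p></div>\n";

        System.out.println(isWellFormedXml(partial)); // prints "false"
        System.out.println(isWellFormedXml(wrapped)); // prints "true"
    }
}
```

The leading comment alone is legal XML; it is the second top-level element that makes the strict parser fail.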
In revisions 908554 and 908560 I added some more heuristics to Tika to better detect such tag-soup HTML. With these changes the attached test document is correctly recognized as HTML and parsed with the lenient HTML parser.
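To give a rough idea of what such a heuristic could look like, here is a hypothetical sketch. It is not the code from revisions 908554 and 908560, and the class name TagSoupSniffer, its methods, and the short list of element names are all assumptions made for illustration. It skips leading whitespace and comments, then checks whether the first element name is a common HTML one.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not Tika's actual detection logic.
class TagSoupSniffer {

    // Assumed, deliberately short list of common HTML element names.
    private static final Set<String> HTML_NAMES = Set.of(
            "html", "head", "body", "title", "div", "p", "span",
            "table", "a", "h1", "ul", "ol", "br", "img");

    /**
     * Returns the lower-cased name of the first element that follows any
     * leading whitespace and comments, or "" if there is none.
     */
    static String firstElementName(String text) {
        int i = 0;
        while (true) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (!text.startsWith("<!--", i)) {
                break;
            }
            int end = text.indexOf("-->", i + 4);
            if (end < 0) {
                return ""; // unterminated comment
            }
            i = end + 3;
        }
        if (i >= text.length() || text.charAt(i) != '<') {
            return "";
        }
        int start = i + 1;
        int j = start;
        while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) {
            j++;
        }
        return text.substring(start, j).toLowerCase(Locale.ROOT);
    }

    /** Guesses HTML when the first element name is a known HTML name. */
    static boolean looksLikeHtml(String text) {
        return HTML_NAMES.contains(firstElementName(text));
    }
}
```

With this sketch, TagSoupSniffer.looksLikeHtml("<!-- c -->\n<p>a</p><p>b</p>") returns true, while a document starting with an XML declaration or an unknown root element returns false. A real detector would also need to handle DOCTYPE declarations, byte order marks, and character encodings.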
[jira] Updated: (TIKA-377) Error parsing HTML partial with AutoDetect parser
Posted by "Brett S. (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brett S. updated TIKA-377:
--------------------------
Attachment: test.html
Sample HTML document to produce parse error described in ticket TIKA-377