Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2015/12/08 17:20:10 UTC
[jira] [Commented] (TIKA-1808) Head section closed too eager
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047029#comment-15047029 ]
Ken Krugler commented on TIKA-1808:
-----------------------------------
Hi Markus - I don't think this is actually a bug. I created a similar doc using an HTML editor, with <div style="scroll:no;"></div> in the <head> section. That's not valid HTML, and the HTML checker reports:
untitled text 5:6: Optional open tag for element “<body>” not specified.
untitled text 5:7: Document type does not permit element “<title>” in content of element “<body>”.
untitled text 5:8: Close element “</head>” found but element wasn't open.
untitled text 5:9: Element “<body>” implicitly closed here.
So it's doing the same thing as Tika, by implicitly closing the <head> element when it sees a tag (the <div>) that can only be in the <body>.
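As a rough illustration of the behaviour described above, here is a toy Java sketch of "close the head as soon as a body-only tag shows up". It is not TagSoup's or Tika's actual code: the class name HeadCloseSketch, the HEAD_ONLY set and the event strings are all made up for this example, and a real parser tracks far more state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/**
 * Toy model of implicit head closing. This is NOT TagSoup's or Tika's code;
 * HEAD_ONLY and the event strings are assumptions made for illustration.
 */
public class HeadCloseSketch {

    // Elements this sketch allows inside <head> (an illustrative subset).
    private static final Set<String> HEAD_ONLY =
            Set.of("title", "meta", "link", "base", "style", "script");

    /** Turns the start tags that follow <head> into a flat list of events. */
    public static List<String> events(List<String> startTags) {
        List<String> out = new ArrayList<>();
        out.add("<head>");
        boolean inHead = true;
        for (String tag : startTags) {
            if (inHead && !HEAD_ONLY.contains(tag)) {
                // A tag that cannot live in the head was seen: close the
                // head implicitly and open the body before emitting it.
                out.add("</head> (implicit)");
                out.add("<body> (implicit)");
                inHead = false;
            }
            out.add("<" + tag + ">");
        }
        if (inHead) {
            // No body-only tag was seen, so the head closes normally.
            out.add("</head>");
        }
        return out;
    }

    public static void main(String[] args) {
        // <div> before <title>: the title is emitted after the implicit <body>.
        events(List.of("meta", "div", "title")).forEach(System.out::println);
    }
}
```

With the input meta, div, title, the <title> event only arrives after the implicit <body>, which is the same ordering the checker complains about and the reason a handler that looks for <title> inside the head never sees it.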
If this were going to be fixed up anywhere, it would have to be in TagSoup (before it gets to Tika), but that's apparently not happening. I wonder what JSoup does with this type of broken HTML?
> Head section closed too eager
> -----------------------------
>
> Key: TIKA-1808
> URL: https://issues.apache.org/jira/browse/TIKA-1808
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Reporter: Markus Jelsma
> Fix For: 1.12
>
>
> XHTMLContentHandler has some logic that closes the head section too early, or else this is a problem in TagSoup. In this [1] case a <div> element appears in the head, causing the head to be closed. Subsequent <head> elements do not appear in custom ContentHandlers, so I cannot read the document's title or any other meta tags.
> It can be fixed by using a custom HTMLSchema in the ParseContext, e.g. schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0); but this isn't really an elegant solution.
> [1] http://www.aljazeera.com/news/2015/05/150516182251747.html