Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2015/12/08 17:20:10 UTC

[jira] [Commented] (TIKA-1808) Head section closed too eager

    [ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047029#comment-15047029 ] 

Ken Krugler commented on TIKA-1808:
-----------------------------------

Hi Markus - I don't think this is actually a bug. I created a similar doc using an HTML editor, with "<div style="scroll:no;"></div>" in the <head> section. It's not valid HTML, and the HTML checker reports:

untitled text 5:6:  Optional open tag for element “<body>” not specified.
untitled text 5:7:  Document type does not permit element “<title>” in content of element “<body>”.
untitled text 5:8:  Close element “</head>” found but element wasn't open.
untitled text 5:9:  Element “<body>” implicitly closed here.

So the checker is doing the same thing as Tika: it implicitly closes the <head> element when it sees a tag (the <div>) that can only appear in the <body>.
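
To make that rule concrete, here is a toy sketch in Java. It is not TagSoup's or Tika's actual code; the class name, the method, and the short list of head-only elements are all made up for illustration. It only models the behaviour the checker output above describes: the first element that isn't allowed in <head> closes the head and opens the body, and everything after it (including <title>) ends up in the body.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of implicit head closing; NOT TagSoup's or Tika's implementation. */
class ImplicitHeadClose {

    // Hypothetical and deliberately incomplete list of elements allowed in <head>.
    private static final Set<String> HEAD_ONLY = new HashSet<>(
            Arrays.asList("title", "meta", "link", "style", "script", "base"));

    /**
     * Takes the start tags found between <head> and </head>, in document
     * order, and returns the events a lenient parser following this rule
     * would emit.
     */
    static List<String> normalize(List<String> tagsInHead) {
        List<String> events = new ArrayList<>();
        events.add("<head>");
        boolean inHead = true;
        for (String tag : tagsInHead) {
            if (inHead && !HEAD_ONLY.contains(tag)) {
                // A body-only element ends the head for good.
                events.add("</head>");
                events.add("<body>");
                inHead = false;
            }
            events.add("<" + tag + ">");
        }
        if (inHead) {
            events.add("</head>");
        }
        return events;
    }

    public static void main(String[] args) {
        // Mirrors the test document: a <div> placed ahead of <title>.
        System.out.println(normalize(Arrays.asList("meta", "div", "title")));
        // prints [<head>, <meta>, </head>, <body>, <div>, <title>]
    }
}
```

Note that once the head has been closed, the <title> start tag is emitted inside the body, which is why a handler that only looks for <title> in the head never sees it.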

If this were going to be fixed anywhere, it would have to be in TagSoup (before the markup gets to Tika), but that apparently isn't happening. I wonder what JSoup does with this type of broken HTML?

> Head section closed too eager
> -----------------------------
>
>                 Key: TIKA-1808
>                 URL: https://issues.apache.org/jira/browse/TIKA-1808
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>             Fix For: 1.12
>
>
> XHTMLContentHandler has some logic that closes the head section too early, or this is a problem in TagSoup. In this case [1], a <div> element appears in the head, causing the head to be closed. Subsequent <head> elements do not appear in custom ContentHandlers, so I cannot read the document's title or any other meta tags.
> It can be fixed by using a custom HTMLSchema in the ParseContext, e.g. schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0); but this isn't really an elegant solution.
> [1] http://www.aljazeera.com/news/2015/05/150516182251747.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)