Posted to dev@tika.apache.org by "Benson Margulies (JIRA)" <ji...@apache.org> on 2009/10/09 01:34:31 UTC

[jira] Created: (TIKA-303) XHTMLContentHandler mishandles headers

XHTMLContentHandler mishandles headers
--------------------------------------

                 Key: TIKA-303
                 URL: https://issues.apache.org/jira/browse/TIKA-303
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
            Reporter: Benson Margulies


XHTMLContentHandler.startDocument does not record that it has been called. As a result, lazyStartDocument still runs later and emits an extra layer of head/title/body elements.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-303) XHTMLContentHandler mishandles headers

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763935#action_12763935 ] 

Jukka Zitting commented on TIKA-303:
------------------------------------

Hmm, I'm not sure I understand the problem. Do you have a test case that illustrates the issue that the proposed fix solves?



[jira] Commented: (TIKA-303) XHTMLContentHandler mishandles headers

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763783#action_12763783 ] 

Benson Margulies commented on TIKA-303:
---------------------------------------

Here's a patch:

diff -r apache-tika-0.4/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java apache-tika-0.4-mod/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
101a102,103
>     private boolean lazyStarted;
> 
115a118
>         started = true;
140a144
>             lazyStarted = true;
155,157c159,163
<         endElement("body");
<         endElement("html");
<         endPrefixMapping("");
---
>         if (lazyStarted) {
>             endElement("body");
>             endElement("html");
>             endPrefixMapping("");
>         }

Yes, it's contributed to the ASF. I'm a member.



[jira] Updated: (TIKA-303) XHTMLContentHandler mishandles headers

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated TIKA-303:
----------------------------------

    Attachment: tika-tc.patch

Patch to 1.5 that adds a test case for this issue.



[jira] Resolved: (TIKA-303) XHTMLContentHandler mishandles headers

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-303.
--------------------------------

    Resolution: Invalid
      Assignee: Jukka Zitting

The XHTMLContentHandler is not meant to be used the way you use it in the test case.

The purpose of the XHTMLContentHandler wrapper is to make it easier for Tika Parser implementations to generate valid XHTML output. There's no need for code that calls the Parser interface to use XHTMLContentHandler, as the parse() method is already guaranteed to produce valid XHTML.
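
To illustrate the division of labour described above, here is a minimal sketch using only the JDK's org.xml.sax classes. SkeletonWriter, ToyParser and PlainCaller are hypothetical names invented for this example; they are not Tika classes and do not reproduce Tika's API. The assumption is simply that the parser owns the wrapper that emits the XHTML skeleton, while the caller passes in an ordinary ContentHandler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical parser-side helper: emits the XHTML skeleton around parser output.
class SkeletonWriter {
    private static final Attributes NONE = new AttributesImpl();
    private final ContentHandler out;

    SkeletonWriter(ContentHandler out) {
        this.out = out;
    }

    void open() throws SAXException {
        out.startDocument();
        out.startElement("", "html", "html", NONE);
        out.startElement("", "head", "head", NONE);
        out.endElement("", "head", "head");
        out.startElement("", "body", "body", NONE);
    }

    void paragraph(String text) throws SAXException {
        out.startElement("", "p", "p", NONE);
        out.characters(text.toCharArray(), 0, text.length());
        out.endElement("", "p", "p");
    }

    void close() throws SAXException {
        out.endElement("", "body", "body");
        out.endElement("", "html", "html");
        out.endDocument();
    }
}

// Hypothetical parser: the wrapper is an internal detail of parse().
class ToyParser {
    void parse(String input, ContentHandler handler) throws SAXException {
        SkeletonWriter xhtml = new SkeletonWriter(handler);
        xhtml.open();
        xhtml.paragraph(input);
        xhtml.close();
    }
}

// Caller side: a plain handler that records tags, with no wrapper of its own.
class PlainCaller {
    static String run(String input) {
        StringBuilder sb = new StringBuilder();
        ContentHandler plain = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                sb.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                sb.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }
        };
        try {
            new ToyParser().parse(input, plain);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("hello"));
    }
}
```

In this sketch the caller sees a single html/head/body skeleton, because only the parser applies the wrapper.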




[jira] Updated: (TIKA-303) XHTMLContentHandler mishandles headers

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated TIKA-303:
----------------------------------

    Affects Version/s: 1.0



[jira] Commented: (TIKA-303) XHTMLContentHandler mishandles headers

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763976#action_12763976 ] 

Benson Margulies commented on TIKA-303:
---------------------------------------

Feed in any HTML page that already has a title. First the regular startDocument will be called, then the document's html/head/title will be produced. Then lazyStartDocument will add another layer.

You get

<html>
<head>
<title>title</title>
</head>
<body>
<html>
<head><title>...</title></head><body>  the body
</body>
</html>
</body>
</html>

I'll attach a code example later on.
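
Until a real test case is attached, the nesting described above can be approximated with a simplified model built only on the JDK's org.xml.sax classes. LazyStartModel, TagRecorder and LazyStartDemo are hypothetical names made up for this sketch; the model is an assumption about the general shape of a lazy-start wrapper, not a copy of XHTMLContentHandler, and its skeleton omits the title element for brevity.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified stand-in for a lazy-start wrapper; not Tika's actual code.
class LazyStartModel extends DefaultHandler {
    private final ContentHandler out;
    private boolean started; // set only on the lazy path

    LazyStartModel(ContentHandler out) {
        this.out = out;
    }

    // Emits the wrapper's own html/head/body skeleton once, before the first event.
    private void lazyStart() throws SAXException {
        if (!started) {
            started = true;
            Attributes none = new AttributesImpl();
            out.startElement("", "html", "html", none);
            out.startElement("", "head", "head", none);
            out.endElement("", "head", "head");
            out.startElement("", "body", "body", none);
        }
    }

    @Override
    public void startDocument() throws SAXException {
        out.startDocument(); // does not touch the started flag
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        lazyStart();
        out.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        out.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        lazyStart();
        out.characters(ch, start, length);
    }

    @Override
    public void endDocument() throws SAXException {
        lazyStart();
        out.endElement("", "body", "body");
        out.endElement("", "html", "html");
        out.endDocument();
    }
}

// Records the event stream as a flat string of tags and text.
class TagRecorder extends DefaultHandler {
    final StringBuilder sb = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        sb.append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        sb.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        sb.append(ch, start, length);
    }
}

class LazyStartDemo {
    // Feeds a complete html/head/title/body event stream through the wrapper.
    static String run() {
        TagRecorder rec = new TagRecorder();
        LazyStartModel h = new LazyStartModel(rec);
        Attributes none = new AttributesImpl();
        try {
            h.startDocument();
            h.startElement("", "html", "html", none);
            h.startElement("", "head", "head", none);
            h.startElement("", "title", "title", none);
            h.characters("t".toCharArray(), 0, 1);
            h.endElement("", "title", "title");
            h.endElement("", "head", "head");
            h.startElement("", "body", "body", none);
            h.characters("text".toCharArray(), 0, 4);
            h.endElement("", "body", "body");
            h.endElement("", "html", "html");
            h.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return rec.sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With this model the recorded stream is `<html><head></head><body><html><head><title>t</title></head><body>text</body></html></body></html>`, which has the same doubled shape as the output shown above.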

