Posted to dev@tika.apache.org by "John Conwell (Created) (JIRA)" <ji...@apache.org> on 2012/03/30 20:54:27 UTC

[jira] [Created] (TIKA-889) XHTMLContentHandler wont emit newline when html element matches ENDLINE set

XHTMLContentHandler wont emit newline when html element matches ENDLINE set
---------------------------------------------------------------------------

                 Key: TIKA-889
                 URL: https://issues.apache.org/jira/browse/TIKA-889
             Project: Tika
          Issue Type: Bug
            Reporter: John Conwell


XHTMLContentHandler.endElement checks whether the element is in the ENDLINE set to decide whether to emit a newline. The HTML element names in ENDLINE are all lowercase, but the HtmlParser class uses XHTMLDowngradeHandler to uppercase all HTML element names. As a result, none of the HTML elements in the web page will match the entries in the ENDLINE set.

The same problem affects the INDENT set.
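
To make the reported mismatch concrete, here is a minimal sketch of a case-sensitive set lookup. It is not Tika's code: the class name, the method name, and the set contents are assumptions chosen for illustration, and only the idea of a lowercase ENDLINE set comes from the report.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the lookup described in the report; not Tika's actual code.
class EndlineCaseSketch {
    // Assumed contents: a few lowercase element names, for illustration only.
    static final Set<String> ENDLINE = new HashSet<>(Arrays.asList("p", "li", "ul", "div"));

    // Case-sensitive membership test, which is what the report says endElement relies on.
    static boolean emitsNewline(String elementName) {
        return ENDLINE.contains(elementName);
    }

    public static void main(String[] args) {
        System.out.println(emitsNewline("li"));                           // true
        System.out.println(emitsNewline("LI"));                           // false
        System.out.println(emitsNewline("LI".toLowerCase(Locale.ROOT)));  // true
    }
}
```

If element names really did reach the lookup in uppercase, every check would behave like the second call. Whether they do depends on what happens to the names between the parser and the handler.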

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-889) XHTMLContentHandler wont emit newline when html element matches ENDLINE set

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432189#comment-13432189 ] 

Ken Krugler commented on TIKA-889:
----------------------------------

Hi John - I tried this with trunk, and it works as expected.

Yes, it's true that XHTMLDowngradeHandler will uppercase the element names, but then DefaultHtmlMapper.mapSafeElement() lower-cases them (I know, seems odd to me too). So the comparison works, and I see the expected output.

I'm adding a test case to validate behavior, at least for a simple <ul><li>xxx</li></ul> example.
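
For readers following along, the name flow described in this comment can be modeled roughly as below. This is a simplified sketch, not Tika source and not the test that was committed: the class, the helper bodies, and the ENDLINE contents are assumptions, and the two helpers only stand in for the uppercasing and lowercasing steps named above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified model of the name flow described in the comment; none of this is Tika source.
class NameFlowSketch {
    // Assumed subset of the lowercase ENDLINE names.
    static final Set<String> ENDLINE = new HashSet<>(Arrays.asList("li", "ul", "p"));

    // Stand-in for the uppercasing attributed to XHTMLDowngradeHandler.
    static String downgrade(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    // Stand-in for the lowercasing attributed to DefaultHtmlMapper.mapSafeElement().
    static String mapSafeElement(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // Models <ul><li>text</li></ul>: appends a newline after each closed
    // element whose mapped name is found in ENDLINE.
    static String renderList(String text) {
        StringBuilder out = new StringBuilder(text);
        for (String closed : new String[] {"li", "ul"}) {
            String name = mapSafeElement(downgrade(closed));
            if (ENDLINE.contains(name)) {
                out.append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Both closing tags match ENDLINE after the round trip, so two newlines follow the text.
        System.out.println(renderList("xxx").equals("xxx\n\n"));
    }
}
```

Because the lowercasing step runs after the uppercasing step, the names compared against the set are lowercase again, which is consistent with the behavior observed on trunk.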
                


        

[jira] [Updated] (TIKA-889) XHTMLContentHandler wont emit newline when html element matches ENDLINE set

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-889:
-----------------------------------

    Component/s: parser

- classify
                


        

[jira] [Assigned] (TIKA-889) XHTMLContentHandler wont emit newline when html element matches ENDLINE set

Posted by "Ken Krugler (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-889:
--------------------------------

    Assignee: Ken Krugler
    


        

[jira] [Resolved] (TIKA-889) XHTMLContentHandler wont emit newline when html element matches ENDLINE set

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-889.
------------------------------

       Resolution: Cannot Reproduce
    Fix Version/s: 1.3

Added unit test to validate in r137506
                
