Posted to issues@maven.apache.org by "Benjamin Bentmann (JIRA)" <ji...@codehaus.org> on 2008/02/23 00:37:28 UTC

[jira] Created: (DOXIA-226) Make XML based parsers better handle whitespace

Make XML based parsers better handle whitespace
-----------------------------------------------

                 Key: DOXIA-226
                 URL: http://jira.codehaus.org/browse/DOXIA-226
             Project: Maven Doxia
          Issue Type: Improvement
            Reporter: Benjamin Bentmann


Regarding whitespace in XML documents, one needs to consider the following aspects:
- ignorable whitespace, i.e. view "{{<tr> <td/> </tr>}}" and "{{<tr><td/></tr>}}" as equivalent
- collapsible whitespace, i.e. view "{{Text   Text}}" and "{{Text Text}}" as equivalent
- trimmable whitespace, i.e. view "{{<p>  Text  </p>}}" and "{{<p>Text</p>}}" as equivalent

These distinctions require a DTD/XSD in combination with a validating parser, application-specific knowledge, or both. For robustness, Doxia parsers for XML-based formats should not depend on the existence of a schema definition in order to reliably deliver events into the sinks. Hence I suggest hard-coding the required logic for proper whitespace handling into each parser.

Currently, whitespace handling is rather static, e.g. {{XhtmlBaseParser}} pushes all input whitespace into the sink. This might cause trouble with sinks that do not expect to receive ignorable whitespace. To address this issue, it would be helpful if {{AbstractXmlParser}} provided a default implementation of {{handleText()}} that subclasses can control via state flags instead of implementing {{handleText()}} from scratch in each parser. Copy and paste, which caused DOXIA-225, needs to be avoided.

More precisely, I imagine the following changes:
- Have {{AbstractXmlParser}} maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element
- Have {{AbstractXmlParser}} push a tuple onto this stack before calling {{handleStartTag()}} and pop it after calling {{handleEndTag()}}
- Have {{AbstractXmlParser}} provide setters to allow subclasses to control the desired whitespace handling in their {{handleStartTag()}} implementation
- Have {{AbstractXmlParser}} implement {{handleText()}} where it evaluates the top-most tuple from the stack
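
The four changes listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Doxia code: the names {{WhitespaceAwareParser}}, {{WhitespaceMode}}, {{setWhitespaceMode()}}, {{startTag()}}, {{endTag()}} and {{DemoParser}} are hypothetical, a plain list stands in for the sink, and the signatures of {{handleStartTag()}}, {{handleEndTag()}} and {{handleText()}} are simplified to take strings. The default tuple (preserve everything) is likewise an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the proposed per-element whitespace stack; not Doxia code. */
abstract class WhitespaceAwareParser {
    /** One tuple (ignorable, collapsible, trimmable) for a single element. */
    static final class WhitespaceMode {
        final boolean ignorable;
        final boolean collapsible;
        final boolean trimmable;

        WhitespaceMode(boolean ignorable, boolean collapsible, boolean trimmable) {
            this.ignorable = ignorable;
            this.collapsible = collapsible;
            this.trimmable = trimmable;
        }
    }

    /** Assumed default: pass all whitespace through unchanged. */
    static final WhitespaceMode PRESERVE = new WhitespaceMode(false, false, false);

    private final Deque<WhitespaceMode> modes = new ArrayDeque<WhitespaceMode>();

    /** Stand-in for a sink: collects the text events that would be emitted. */
    final List<String> sink = new ArrayList<String>();

    /** Setter for subclasses; only valid while handleStartTag() is running. */
    protected final void setWhitespaceMode(boolean ignorable, boolean collapsible, boolean trimmable) {
        modes.pop();
        modes.push(new WhitespaceMode(ignorable, collapsible, trimmable));
    }

    protected abstract void handleStartTag(String name); // simplified signature (assumption)

    protected abstract void handleEndTag(String name); // simplified signature (assumption)

    /** Push a default tuple first, then let the subclass adjust it. */
    final void startTag(String name) {
        modes.push(PRESERVE);
        handleStartTag(name);
    }

    /** Notify the subclass, then pop the tuple of the element that just ended. */
    final void endTag(String name) {
        handleEndTag(name);
        modes.pop();
    }

    /** Default text handling driven by the top-most tuple. */
    final void handleText(String text) {
        WhitespaceMode mode = modes.isEmpty() ? PRESERVE : modes.peek();
        if (mode.ignorable && text.trim().isEmpty()) {
            return; // whitespace-only text between child elements is dropped
        }
        if (mode.collapsible) {
            text = text.replaceAll("\\s+", " ");
        }
        if (mode.trimmable) {
            text = text.trim();
        }
        if (!text.isEmpty()) {
            sink.add(text);
        }
    }
}

/** Hypothetical subclass: rows ignore blank text, paragraphs collapse and trim. */
class DemoParser extends WhitespaceAwareParser {
    @Override
    protected void handleStartTag(String name) {
        if ("tr".equals(name)) {
            setWhitespaceMode(true, false, false);
        } else if ("p".equals(name)) {
            setWhitespaceMode(false, true, true);
        }
    }

    @Override
    protected void handleEndTag(String name) {
        // nothing to do in this sketch
    }

    public static void main(String[] args) {
        DemoParser parser = new DemoParser();
        parser.startTag("p");
        parser.handleText("  Text   Text  ");
        parser.endTag("p");
        System.out.println(parser.sink);
    }
}
```

With this shape a concrete parser only declares the flags per element in {{handleStartTag()}}; the text logic lives in one place. Note that trimming each text event separately would also remove a lone space between two inline children, so a real implementation would probably need to treat mixed content more carefully.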


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.codehaus.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (DOXIA-226) Make XML based parsers better handle whitespace

Posted by "Lukas Theussl (JIRA)" <ji...@codehaus.org>.
    [ http://jira.codehaus.org/browse/DOXIA-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=169064#action_169064 ] 

Lukas Theussl commented on DOXIA-226:
-------------------------------------

In addition, whitespace is never ignorable/collapsible/trimmable within verbatim blocks, i.e. within <source></source> or <pre></pre> in xdocs.
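
As a rough illustration of this constraint, the check below leaves text untouched whenever any enclosing element is verbatim. The class {{VerbatimWhitespace}} and its methods are hypothetical names, and treating nested elements inside a verbatim block as verbatim too is an assumption about the intended behaviour.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of Doxia. */
final class VerbatimWhitespace {
    /** Verbatim elements in xdocs, as named in the comment above. */
    private static final Set<String> VERBATIM = new HashSet<String>(Arrays.asList("source", "pre"));

    private VerbatimWhitespace() {
    }

    /** Returns true if any currently open element is a verbatim element. */
    static boolean insideVerbatim(Iterable<String> openElements) {
        for (String name : openElements) {
            if (VERBATIM.contains(name)) {
                return true;
            }
        }
        return false;
    }

    /** Collapses and trims whitespace unless the text sits inside a verbatim block. */
    static String normalize(String text, Iterable<String> openElements) {
        if (insideVerbatim(openElements)) {
            return text; // verbatim content is passed through unchanged
        }
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Here {{openElements}} is the list of element names that are open at the point where the text occurs, outermost first.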

> Make XML based parsers better handle whitespace
> -----------------------------------------------
>
>                 Key: DOXIA-226
>                 URL: http://jira.codehaus.org/browse/DOXIA-226
>             Project: Maven Doxia
>          Issue Type: Improvement
>            Reporter: Benjamin Bentmann
>             Fix For: 1.2

[jira] Updated: (DOXIA-226) Make XML based parsers better handle whitespace

Posted by "Vincent Siveton (JIRA)" <ji...@codehaus.org>.
     [ http://jira.codehaus.org/browse/DOXIA-226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Siveton updated DOXIA-226:
----------------------------------

    Fix Version/s:     (was: 1.0-beta-1)
                   1.0-beta-2


[jira] Commented: (DOXIA-226) Make XML based parsers better handle whitespace

Posted by "Benjamin Bentmann (JIRA)" <ji...@codehaus.org>.
    [ http://jira.codehaus.org/browse/DOXIA-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=147859#action_147859 ] 

Benjamin Bentmann commented on DOXIA-226:
-----------------------------------------

bq. Right but space is important in <p><b>word</b> <i>word</i></p>
Right, that's what I intended to say with
bq. maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the *currently parsed* element
i.e. these flags should be associated with individual markup elements. They are definitely not meant to be global for a parser instance.
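
To make the per-element point concrete, here is a small sketch that decides whether to drop whitespace-only text based on the element that directly encloses it. It uses the JDK's StAX API only to keep the example self-contained; the class {{PerElementWhitespace}}, the method {{textEvents()}} and the choice of ignorable elements are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical illustration of per-element whitespace flags; not Doxia code. */
final class PerElementWhitespace {
    /** Assumed set of elements whose whitespace-only text is ignorable. */
    private static final Set<String> IGNORABLE = new HashSet<String>(Arrays.asList("table", "tr"));

    private PerElementWhitespace() {
    }

    /** Returns the text events that survive, deciding per enclosing element. */
    static List<String> textEvents(String xml) {
        List<String> texts = new ArrayList<String>();
        Deque<String> open = new ArrayDeque<String>(); // currently open elements, innermost on top
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    open.push(reader.getLocalName());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.pop();
                } else if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.SPACE) {
                    String text = reader.getText();
                    boolean blank = text.trim().isEmpty();
                    boolean ignorable = !open.isEmpty() && IGNORABLE.contains(open.peek());
                    if (!(blank && ignorable)) {
                        texts.add(text); // kept: either real text or significant whitespace
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML in example input", e);
        }
        return texts;
    }
}
```

For "<tr> <td/> </tr>" the blank text belongs to <tr> and is dropped, while in "<p><b>word</b> <i>word</i></p>" the single space belongs to <p>, which is not in the ignorable set, so it is kept.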


[jira] Updated: (DOXIA-226) Make XML based parsers better handle whitespace

Posted by "Vincent Siveton (JIRA)" <ji...@codehaus.org>.
     [ http://jira.codehaus.org/browse/DOXIA-226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Siveton updated DOXIA-226:
----------------------------------

    Fix Version/s: 1.0-beta-1


[jira] Commented: (DOXIA-226) Make XML based parsers better handle whitespace

Posted by "Vincent Siveton (JIRA)" <ji...@codehaus.org>.
    [ http://jira.codehaus.org/browse/DOXIA-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=147834#action_147834 ] 

Vincent Siveton commented on DOXIA-226:
---------------------------------------

First implementation in [r694807|http://svn.apache.org/viewvc?rev=694807&view=rev] which solves DOXIA-251

bq. ignorable whitespace, i.e. view "<tr> <td/> </tr>" and "<tr><td/></tr>" as equivalent

Right but space is important in <p><b>word</b> <i>word</i></p>, so we need to preserve spaces for some HTML style tags, but not for XML or HTML table tags and others.
