Posted to issues@commons.apache.org by "Nguyen Thanh Son Daniel (JIRA)" <ji...@apache.org> on 2008/03/12 13:10:46 UTC

[jira] Created: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

digesting xml content with NodeCreateRule swallows spaces.
----------------------------------------------------------

                 Key: DIGESTER-120
                 URL: https://issues.apache.org/jira/browse/DIGESTER-120
             Project: Commons Digester
          Issue Type: Bug
    Affects Versions: 1.8
         Environment: jdk 1.4.2_08, digester 1.8
            Reporter: Nguyen Thanh Son Daniel


I need to process an XML file that contains entities, for example:

<?xml version="1.0" encoding="UTF-8"?>
<top>
<body>&#65; &#65;</body>
</top>

I'm using Digester as follows:

Digester digester = new Digester ();
digester.addRule ("top", new ObjectCreateRule (MyContent.class));
digester.addRule ("top/body", new NodeCreateRule ());
digester.addSetNext ("top/body", "setBody");

then
...
digester.parse (file);

The MyContent class transforms the node into text as follows:

public class MyContent
{
 public void setBody (Element node)
 {
  String content = serializeNode (node);
  System.out.println (content);
 }
 ...
}

In this case the content displayed is: <body>AA</body>

If the body were written in the XML file as <top><body>A A</body></top>, the content would be displayed correctly as:
<body>A A</body>

Looking at the NodeCreateRule.NodeBuilder.characters() implementation, the following code causes the problem:
String str = new String(ch, start, length);
if (str.trim().length() > 0) {
 top.appendChild(doc.createTextNode(str));
}

When entities are used, the characters() method is called three times in the first case, with 'A', ' ' and 'A'. The ' ' chunk contains only whitespace, so it fails the trim() check and is dropped. In the second case, the method is called once with 'A A'.
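
The effect of that check can be modelled without a parser. The sketch below is a stand-alone illustration, not the Digester source: the class and method names are hypothetical, and the chunk lists simply assume the call pattern described above ('A', ' ', 'A' versus a single 'A A').

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of the quoted check; not the Digester source.
class WhitespaceChunkDemo {

    // Mirrors the quoted logic: a chunk is appended only if it contains
    // at least one non-whitespace character.
    static String dropWhitespaceOnlyChunks(List<String> chunks) {
        StringBuilder text = new StringBuilder();
        for (String chunk : chunks) {
            if (chunk.trim().length() > 0) {
                text.append(chunk);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumed call pattern for <body>&#65; &#65;</body>
        List<String> entityCase = Arrays.asList("A", " ", "A");
        // Assumed call pattern for <body>A A</body>
        List<String> literalCase = Arrays.asList("A A");

        System.out.println(dropWhitespaceOnlyChunks(entityCase));  // prints "AA"
        System.out.println(dropWhitespaceOnlyChunks(literalCase)); // prints "A A"
    }
}
```

The same text yields different results purely because of how it was split into chunks, which is why the check cannot safely be applied per call.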



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

Posted by "Nguyen Thanh Son Daniel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DIGESTER-120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579044#action_12579044 ] 

Nguyen Thanh Son Daniel commented on DIGESTER-120:
--------------------------------------------------

oops.

Simon,

I forgot to mention this, and I was not very clear about it: to reproduce the problem, you must use entities in your XML. The following file should help reproduce the problem:

<?xml version="1.0" encoding="UTF-8"?>
<top>
<body>&#65; &#65;</body>
</top>



[jira] Commented: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

Posted by "Nguyen Thanh Son Daniel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DIGESTER-120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579054#action_12579054 ] 

Nguyen Thanh Son Daniel commented on DIGESTER-120:
--------------------------------------------------

Simon,

Yes, the problem is fixed now.

Thanks,


[jira] Commented: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

Posted by "Simon Kitching (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DIGESTER-120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579049#action_12579049 ] 

Simon Kitching commented on DIGESTER-120:
-----------------------------------------

OK, using entities I was able to reproduce this pretty quickly. That also makes sense: the parser has the entity already cached as a string, so of course it makes a separate call to the characters method for it.

I have committed a patch to the trunk, and deployed a new 1.8.1-SNAPSHOT version to the apache maven snapshot repository. Could you please try it out and confirm it fixes the problem for you?

Thanks, Simon
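
The patch itself is not reproduced in this thread. As an illustration only, one way to avoid the problem is to buffer everything passed to characters() and decide about whitespace once per run of text, when the next element boundary arrives. The handler below is a hypothetical sketch under that assumption (class and method names are invented, attributes are ignored for brevity) and should not be read as the committed change.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX-to-DOM builder; not the committed Digester patch.
class BufferingNodeBuilder extends DefaultHandler {
    private final Document doc;
    private Node top;
    private final StringBuilder pending = new StringBuilder();

    BufferingNodeBuilder(Document doc) {
        this.doc = doc;
        this.top = doc;
    }

    // Emit one text node per run of character data. A run is skipped only
    // if the whole run, not an individual chunk, is whitespace.
    private void flushText() {
        if (pending.length() > 0) {
            String text = pending.toString();
            pending.setLength(0);
            if (text.trim().length() > 0) {
                top.appendChild(doc.createTextNode(text));
            }
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        Node element = doc.createElement(qName); // attributes omitted in this sketch
        top.appendChild(element);
        top = element;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        top = top.getParentNode();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        pending.append(ch, start, length); // may be called many times per run
    }

    // Parses the XML and returns the text content of the first <body> element.
    static String bodyText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new BufferingNodeBuilder(doc));
            return doc.getElementsByTagName("body").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this approach, BufferingNodeBuilder.bodyText("<top>\n<body>&#65; &#65;</body>\n</top>") should return the body text with its space intact regardless of how the parser splits the character data, while the whitespace-only linefeeds between elements are still discarded.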


[jira] Updated: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

Posted by "Simon Kitching (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/DIGESTER-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Kitching updated DIGESTER-120:
------------------------------------

    Attachment: digester-patch.txt

Patch to fix over-eager discarding of whitespace


[jira] Resolved: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

Posted by "Simon Kitching (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/DIGESTER-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Kitching resolved DIGESTER-120.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.8.1

Fixed.


[jira] Commented: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

Posted by "Nguyen Thanh Son Daniel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DIGESTER-120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579043#action_12579043 ] 

Nguyen Thanh Son Daniel commented on DIGESTER-120:
--------------------------------------------------

Simon,

1- The parser being used is identified by the following maven artifact:

groupId: xerces
artifactId: xercesImpl
version: 2.6.2

Also, stepping through the code confirms that this is the parser being used.

2- I am not sure where I should get the patch. Could you let me know how I can access it?



[jira] Updated: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

Posted by "Nguyen Thanh Son Daniel (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/DIGESTER-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nguyen Thanh Son Daniel updated DIGESTER-120:
---------------------------------------------

    Attachment: simple.xml

This time, the (real) file.


[jira] Commented: (DIGESTER-120) digesting xml content with NodeCreateRule swallows spaces.

Posted by "Simon Kitching (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DIGESTER-120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579019#action_12579019 ] 

Simon Kitching commented on DIGESTER-120:
-----------------------------------------

Thanks very much for the bug report, and your great analysis of the problem.

I think you are right that the NodeBuilder.characters method is being too eager to strip whitespace.

However, I am unable to actually trigger the bug using a unit test case. The existing unit test code for Digester has something very similar to your example, but with all the XML on one line. I therefore added linefeeds as you suggested, but the unit test still passes.

The unit tests I'm talking about can be found here:
  http://svn.apache.org/repos/asf/commons/proper/digester/trunk/src/test/org/apache/commons/digester/NodeCreateRuleTestCase.java

If you could manage to create a unit test for Digester that reproduces the problem, that would be really helpful. I'm not sure it is going to be easy though; I think the cause of the problem is that the SAX API says that characters(..) can be invoked by the XML parser as many times as the parser wants. Digester currently assumes (as you point out) that it is invoked only once with the complete block of text. But adding linefeeds to the input doesn't appear to force the XML parser I'm using to call characters() with partial strings.

In short, I think that this bug is very much dependent on which xml parser implementation is used, and what its internal parsing buffer size happens to be set to etc. That will make it difficult to get a unit test to reproduce this bug.
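
To see how a particular parser splits character data, a small handler can record each characters() call. This is a minimal sketch using whatever SAX parser the JDK's SAXParserFactory returns; the class name is hypothetical, and the number and boundaries of the recorded chunks depend on the parser, so no specific split is claimed here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler: records every characters() call made
// while inside a <body> element.
class CharactersCallLogger extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();
    private boolean inBody;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            chunks.add(new String(ch, start, length));
        }
    }

    // Parses the XML and returns the chunks delivered for the <body> text.
    static List<String> chunksFor(String xml) {
        try {
            CharactersCallLogger handler = new CharactersCallLogger();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.chunks;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The chunk boundaries printed here are parser-dependent.
        System.out.println(chunksFor("<top><body>&#65; &#65;</body></top>"));
        System.out.println(chunksFor("<top><body>A A</body></top>"));
    }
}
```

Whatever the split, concatenating the chunks should always give back the full text, which is the only property a handler can rely on.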

Which XML parser are you using?

I've created a patch anyway, which looks ok to me and passes the existing digester test cases. Could you try applying this patch and seeing if it fixes your problem?
