Posted to j-users@xalan.apache.org by Dick Deneer <di...@donkeydevelopment.com> on 2007/02/19 13:40:08 UTC

Formatting question serializing DOM with pretty-print

I build a DOM from the following XML:
<root><child1>text</child1><child2>text</child2></root>
After serializing it with the Xalan serializer and the pretty-print option, I
get:
<?xml version="1.0" encoding="UTF-16"?><root>
   <child1>text</child1>
   <child2>text</child2>
</root>

So the opening tag of the root element ends up on the same line as the XML
declaration.

Second, when the input contains newlines or spaces, this seriously affects
the formatting.
For instance, when I build a DOM from:
<root>\n\n\n<child1>text</child1> <child2>text</child2></root>
After serializing I get:
<?xml version="1.0" encoding="UTF-16"?><root>


<child1>text</child1> <child2>text</child2>
</root>

In both cases the Xerces serializer returns:
<?xml version="1.0" encoding="UTF-16"?>
<root>
    <child1>text</child1>
    <child2>text</child2>
</root>

Can you tell me if this behaviour is right?
http://www.nabble.com/file/6633/TestSerializer.java TestSerializer.java 
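
The attached TestSerializer.java is not reproduced in this archive. A minimal
sketch of a comparable test follows, assuming the DOM Level 3 Load and Save
API (LSSerializer) that ships with the JDK rather than the Xalan or Xerces
jars used in the thread. The class and method names are hypothetical, and the
exact whitespace in the output depends on which serializer implementation is
on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical test harness; not the original TestSerializer.java.
final class PrettyPrintSketch {

    /** Parses XML text into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    /** Serializes the document, requesting pretty-printing if it is supported. */
    static String serialize(Document doc) {
        try {
            DOMImplementationLS ls = (DOMImplementationLS)
                    DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            LSSerializer serializer = ls.createLSSerializer();
            DOMConfiguration config = serializer.getDomConfig();
            // "format-pretty-print" is an optional DOM L3 parameter, so check first.
            if (config.canSetParameter("format-pretty-print", Boolean.TRUE)) {
                config.setParameter("format-pretty-print", Boolean.TRUE);
            }
            return serializer.writeToString(doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        // The two inputs discussed in this message.
        System.out.println(serialize(parse(
                "<root><child1>text</child1><child2>text</child2></root>")));
        System.out.println(serialize(parse(
                "<root>\n\n\n<child1>text</child1> <child2>text</child2></root>")));
    }
}
```

Running both inputs side by side makes it easy to see whether a given
implementation keeps, replaces, or ignores the existing whitespace.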
-- 
View this message in context: http://www.nabble.com/Fomatting-question-serializing-DOM-with-pretty-print-tf3252607.html#a9041632
Sent from the Xalan - J - Users mailing list archive at Nabble.com.


Re: Formatting question serializing DOM with pretty-print

Posted by Dick Deneer <di...@donkeydevelopment.com>.
Brian,

Thanks for the explanation.
As already noted in a comment in the example test code, it is possible to
filter out whitespace in the calling program.
Working around the missing newline before the root element may be more
complicated.
I hope that in the future the Xalan serializer will be fully backwards
compatible with the deprecated Xerces one.

Dick Deneer
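
Filtering out whitespace in the calling program could look roughly like the
sketch below. It assumes that every whitespace-only text node in the document
is insignificant, which is not true for mixed content, and the class and
method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes whitespace-only text nodes carry no meaning.
final class WhitespaceFilterSketch {

    /** Parses XML text into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    /** Recursively removes text nodes that contain only whitespace. */
    static void stripWhitespaceOnlyText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first, because removal detaches the child.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespaceOnlyText(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<root>\n\n\n<child1>text</child1> <child2>text</child2></root>");
        stripWhitespaceOnlyText(doc.getDocumentElement());
        // Only the two element children should remain under <root>.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

After this step the DOM is equivalent to one built from the input without any
whitespace, so a pretty-printer has no existing whitespace to work around.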






Re: Formatting question serializing DOM with pretty-print

Posted by Eric Kolotyluk <er...@kodak.com>.
OK, done & done.

Thanks, Eric





Re: Formatting question serializing DOM with pretty-print

Posted by Henry Zongaro <zo...@ca.ibm.com>.
Hi, Eric.

Eric Kolotyluk <er...@kodak.com> wrote on 2007-06-29 10:54:22 AM:
> The first obvious problem is that the first element of our document does
> not have a linebreak before it - it's on the same line as the <?xml ... ?>

This sounds like it qualifies as a bug.  Please open a bug report in Jira.[1]

> The second obvious problem is that elements with a long list of attributes
> are not wrapped and indented. They should be wrapped after some reasonable
> line limit (e.g. 60, 80, 100 characters - pick one). It would be nice if
> there was a way to specify this through the API. Also, when they are
> wrapped, they should be indented.

That sounds like a good suggestion.  Please open that as an improvement in
Jira.[1]

Thanks,

Henry
[1] http://issues.apache.org/jira/secure/CreateIssue!default.jspa
------------------------------------------------------------------
Henry Zongaro      XSLT Processors Development
IBM SWS Toronto Lab   T/L 969-6044;  Phone +1 905 413-6044
mailto:zongaro@ca.ibm.com

Re: Formatting question serializing DOM with pretty-print

Posted by Eric Kolotyluk <er...@kodak.com>.
Here is an example of our traffic log using XMLSerializer:

2007-06-26 13:22:16.066
<?xml version="1.0" encoding="UTF-8"?>
<User clientName="EKolotyluk_380" clientPlatform="Windows XP (5.1)"
    clientProtocolVersion="{DB4AEBDF-A4A9-4521-880B-02310D12723B}"
    clientType="Admin" clientVersion="0.0.0.0"
    cookie="1a741296:11369b28de8:-7fd8" isoLanguageCode="en"
    sendCompressed="true" type="checkProtocolVersion"/>

2007-06-26 13:22:16.379
<?xml version="1.0" encoding="UTF-8"?>
<Server cookie="1a741296:11369b28de8:-7fd8" deviceType="Admin"
    failureText="Protocol Version not supported" friendlyName="CSMP2610"
    ipAddress="10.1.41.70" licenseStatus="0"
    macAddress="00-14-22-38-AA-43" result="Failed" serialNumber="09665"
type="checkProtocolVersion">
    <VersionInfo>
        <AdminServer versionBuild="23" versionMajor="3" versionMinor="1"
            versionOther="5" versionPatch="5"/>
    </VersionInfo>
    <Event eid="85008" hr="0" timeStamp="1182889209"/>
</Server>

We add the timestamp to the log ourselves. Here is the same XML using
LSSerializer:

2007-06-29 07:42:06.774
<?xml version="1.0" encoding="UTF-8"?><User clientName="EKolotyluk_380"
clientPlatform="Windows XP (5.1)"
clientProtocolVersion="{DB4AEBDF-A4A9-4521-880B-02310D12723B}"
clientType="Admin" clientVersion="0.0.0.0"
cookie="-7dfe3a9b:11377ee20de:-7fdd" isoLanguageCode="en"
sendCompressed="true" type="checkProtocolVersion"/>

2007-06-29 07:42:07.039
<?xml version="1.0" encoding="UTF-8"?><Server
cookie="-7dfe3a9b:11377ee20de:-7fdd" deviceType="Admin"
failureText="Protocol Version not supported" friendlyName="CSMP2610"
ipAddress="10.1.41.70" licenseStatus="0" macAddress="00-14-22-38-AA-43"
result="Failed" serialNumber="09665" type="checkProtocolVersion">
   <VersionInfo>
      <AdminServer versionBuild="23" versionMajor="3" versionMinor="1"
versionOther="5" versionPatch="5"/>
   </VersionInfo>
   <Event eid="85008" hr="0" timeStamp="1183127997"/>
</Server>

The first obvious problem is that the first element of our document does not
have a line break before it - it's on the same line as the <?xml ... ?>
declaration.

The second obvious problem is that elements with a long list of attributes
are not wrapped and indented. They should be wrapped after some reasonable
line limit (e.g. 60, 80, or 100 characters - pick one). It would be nice if
there was a way to specify this through the API. Also, when they are
wrapped, they should be indented.

What I am trying to say is: whatever XMLSerializer does now, LSSerializer
should do at least the same.

Cheers, Eric
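
One possible workaround for the missing line break is to suppress the
serializer's own XML declaration and write one by hand. The sketch below
assumes the standard DOM Level 3 "xml-declaration" parameter is honored by
the LSSerializer in use and that the caller later writes the string as UTF-8
bytes; the class and method names are hypothetical. It does nothing about
attribute wrapping.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical workaround; not part of Xalan or Xerces.
final class DeclarationNewlineSketch {

    /** Parses XML text into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    /** Serializes without a declaration, then prepends one followed by a newline. */
    static String serialize(Document doc) {
        try {
            DOMImplementationLS ls = (DOMImplementationLS)
                    DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            LSSerializer serializer = ls.createLSSerializer();
            DOMConfiguration config = serializer.getDomConfig();
            // Ask the serializer not to emit its own <?xml ... ?> header.
            config.setParameter("xml-declaration", Boolean.FALSE);
            if (config.canSetParameter("format-pretty-print", Boolean.TRUE)) {
                config.setParameter("format-pretty-print", Boolean.TRUE);
            }
            // trim() guards against implementations that emit leading whitespace.
            String body = serializer.writeToString(doc).trim();
            // Assumption: the caller encodes this string as UTF-8 when writing it out.
            return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + body;
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(parse(
                "<root><child1>text</child1><child2>text</child2></root>")));
    }
}
```

Because the declaration is written by the caller, the line break after it no
longer depends on the serializer's choice.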





Re: Formatting question serializing DOM with pretty-print

Posted by ke...@us.ibm.com.
>the pretty-printing is so bad - it's not all that pretty.

If you were specific about what you want done differently, that would be
helpful.

Note too that if you want *really* pretty, the right answer may be to write
a stylesheet that expresses precisely the formatting you want rather than
taking the (relatively simple-minded) default.

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)
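
As a starting point for the stylesheet approach, here is a minimal sketch
using the standard javax.xml.transform API: an identity transform that strips
whitespace-only text and asks the processor to indent. The class name and the
stylesheet are illustrative assumptions, and the indentation width is still
chosen by the XSLT processor; formatting rules beyond this would need more
specific templates.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example; the stylesheet is an illustration, not a recommended default.
final class StylesheetFormatSketch {

    // Identity transform: copies every node, drops whitespace-only text,
    // and requests indented output.
    private static final String XSLT =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' indent='yes'/>"
          + "<xsl:strip-space elements='*'/>"
          + "<xsl:template match='@*|node()'>"
          + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML text and returns the result. */
    static String format(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format(
                "<root>\n\n\n<child1>text</child1> <child2>text</child2></root>"));
    }
}
```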

Re: Formatting question serializing DOM with pretty-print

Posted by Eric Kolotyluk <er...@kodak.com>.
I think you've hit the nail right on the head. Right now I want to convert my
code that calls the deprecated XMLSerializer to use LSSerializer and I can't
because the pretty-printing is so bad - it's not all that pretty. When using
the pretty-printing option the emphasis should be on maximizing readability.

We use pretty-printing for two main purposes: (1) we log our XML network
traffic and we have to be able to quickly and easily understand what is
going on, and (2) we store some basic data structures in files and during
troubleshooting we need to be able to quickly and easily understand what is
going on.

If XMLSerializer can do a good job why can't LSSerializer?

- Eric





Re: Formatting question serializing DOM with pretty-print

Posted by ke...@us.ibm.com.
> The DOM spec doesn't specify what pretty printing does. I believe what
> Xerces is doing is fine.

By definition, pretty-printing changes whitespace and should not be used in
situations where the whitespace is significant. If you want to be sure
you're preserving document semantics, use basic DOM serialization
instead... or set up a much more detailed prettyprint which understands
exactly where whitespace is and isn't significant in this kind of document.

Re: Formatting question serializing DOM with pretty-print

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Brian,

Brian Minchau/Toronto/IBM@IBMCA wrote on 02/19/2007 12:29:32 PM:

<snip/>

> Again Xalan does not inject any whitespace between the xml header and the
> document element, for the same reasons as given before.  I'm not sure about
> the other whitespace differences. It looks like Xalan has decided that it
> won't add whitespace to existing whitespace and effectively does no
> indentation.  Xerces serializer however rips out the whitespace from the
> document being serialized and replaces it with nicer looking whitespace.
> I'm not sure if that is OK to do that, perhaps someone from Xerces will
> comment on the differences.
> (Michael?)

The DOM spec doesn't specify what pretty printing does. I believe what
Xerces is doing is fine.

> - Brian
> - - - - - - - - - - - - - - - - - - - -
> Brian Minchau, Ph.D.
> XSLT Development, IBM Toronto
> e-mail:        minchau@ca.ibm.com

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org


Re: Formatting question serializing DOM with pretty-print

Posted by Brian Minchau <mi...@ca.ibm.com>.
Hi Dick,
I ran the program at the URL you gave and got the same results.

For this input:
String xml = "<root><child1>text</child1><child2>text</child2></root>";

With the Xalan serializer and then the Xerces serializer I get this:

Program started
Xalan serializer will be used
<?xml version="1.0" encoding="UTF-16"?><root>
   <child1>text</child1>
   <child2>text</child2>
</root>


Program started
Xerces serializer will be used
<?xml version="1.0" encoding="UTF-16"?>
<root>
    <child1>text</child1>
    <child2>text</child2>
</root>


The differences here arise because the Xalan serializer has historically
assumed that its XML output might be used as an external general parsed
entity and included in another XML file.  In that case we don't know where
it will be included, and the extra newline that Xerces inserts after the
XML header could end up next to non-whitespace text and become part of that
text node.  Whether or not indentation is requested, extra whitespace before
the document element is not always correct, so Xalan doesn't add it.

I looked at the code recently and saw that the DOM3 save support does
indeed indent 3 spaces per indentation level. This code was contributed
to Xalan by people on the Xerces team, so I don't know why Xerces indents
by 4 spaces.  The difference is not important: there is no "right" way to
do indentation, it depends on the implementation.



Then, with some whitespace added, for this input:
String xml = "<root>\n\n\n<child1>text</child1> <child2>text</child2></root>";

Program started
Xalan serializer will be used
<?xml version="1.0" encoding="UTF-16"?><root>


<child1>text</child1> <child2>text</child2>
</root>




Program started
Xerces serializer will be used
<?xml version="1.0" encoding="UTF-16"?>
<root>
    <child1>text</child1>
    <child2>text</child2>
</root>


Again Xalan does not inject any whitespace between the XML header and the
document element, for the same reasons as given before.  I'm not sure about
the other whitespace differences. It looks like Xalan has decided that it
won't add whitespace next to existing whitespace, and so effectively does no
indentation.  The Xerces serializer, however, strips the whitespace from the
document being serialized and replaces it with nicer-looking whitespace.
I'm not sure whether that is OK to do; perhaps someone from Xerces will
comment on the differences.
(Michael?)


- Brian
- - - - - - - - - - - - - - - - - - - -
Brian Minchau, Ph.D.
XSLT Development, IBM Toronto
e-mail:        minchau@ca.ibm.com



                                                                           