Posted to j-dev@xerces.apache.org by Alex Rosen <ar...@silverstream.com> on 2002/09/19 00:55:29 UTC

Round-tripping again

A few weeks ago I e-mailed this list, asking about adding round-tripping
support to Xerces - i.e. the ability to output the exact same XML file as
was read in, or at least very close to it. In other words, preserving more
of the non-infoset information that normally gets dropped.

I spent some time working on this, and have a prototype done, which uses
Augmentations to pass in more information about the "raw text" of the
original document than Xerces normally gives. An example is the amount of
whitespace between attributes. Saving this extra information (and using it
on output) means that if the user puts each attribute on its own line, that
will be preserved on output, instead of collapsing them back onto one line.
These sorts of modifications are semantically equivalent, but it really
annoys users when you reformat their document out from under them.
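
To make the attribute-whitespace case concrete, here is a minimal sketch in plain Java. It is not the prototype's code and it does not use the Xerces Augmentations API; RawAttribute and RoundTripStartTag are hypothetical names invented for illustration. The assumption is that the parser has already handed over, for each attribute, the raw whitespace that preceded it and the quote character that was used, and that attribute values are kept in their raw (already escaped) form.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder: one attribute plus the raw source details a parser normally drops. */
final class RawAttribute {
    final String name;
    final String rawValue;          // value exactly as written, still escaped
    final String leadingWhitespace; // e.g. " " or "\n    "; null if nothing was recorded
    final char quote;               // '"' or '\''

    RawAttribute(String name, String rawValue, String leadingWhitespace, char quote) {
        this.name = name;
        this.rawValue = rawValue;
        this.leadingWhitespace = leadingWhitespace;
        this.quote = quote;
    }
}

/** Hypothetical writer: reuses the recorded whitespace instead of normalizing it. */
final class RoundTripStartTag {
    static String write(String elementName, List<RawAttribute> attrs, String trailingWhitespace) {
        StringBuilder out = new StringBuilder("<").append(elementName);
        for (RawAttribute a : attrs) {
            // Fall back to a single space when no raw whitespace was recorded,
            // for example for an attribute added after parsing.
            boolean recorded = a.leadingWhitespace != null && !a.leadingWhitespace.isEmpty();
            out.append(recorded ? a.leadingWhitespace : " ");
            out.append(a.name).append('=').append(a.quote).append(a.rawValue).append(a.quote);
        }
        if (trailingWhitespace != null) {
            out.append(trailingWhitespace); // whitespace before the closing '>'
        }
        return out.append('>').toString();
    }

    /** Builds a two-attribute tag where each attribute sits on its own line. */
    static String example() {
        List<RawAttribute> attrs = new ArrayList<>();
        attrs.add(new RawAttribute("id", "1", "\n    ", '"'));
        attrs.add(new RawAttribute("name", "x", "\n    ", '\''));
        return write("item", attrs, "");
    }
}
```

With this shape, nodes that were never touched write back exactly as they were read, and nodes created or edited in memory fall back to the usual single-space formatting.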

The particular project that needs this is a dom4j project, so I also created
a special dom4j reader that takes this extra information that's given by the
parser and stores it in each dom4j node it creates, and a writer that uses
this saved information to write out a more accurate version of the output
document. (This could easily be extended to DOM and JDOM.) I've attached an
example. Sample.xml is the source file, rt-output.xml is the output using
the new round-trip-enabled Xerces/dom4j code, and the other two are the
output using standard Xerces/dom4j (in both standard and pretty-printing
modes). Not everything is identical, but it's much, much better.

I think it would be nice if this feature were added to Xerces. I think it
fulfills a significant need, and I don't think it adds any overhead when
it's not turned on, and probably minimal overhead with it turned on. It
currently doesn't cover many of the less-used areas of XML (notations, etc.)
but I think it does a very good job of covering the common cases.

There also happened to be a similar thread going on at the same time as my
original post, that I'd like to respond to:

http://marc.theaimsgroup.com/?l=xerces-j-dev&m=103029884901546&w=2

> I can understand the cases in which people would like to
> be able to do this but I also realize what it would take
> to implement it. ;)

I don't think the implementation is too bad. It's not trivial, but it's not
unreasonably complex either.

> The "limited usefulness" that I was referring to was the
> fact that reporting character offsets only works if the
> parsed source is already a character stream. If it's
> anything else (say a byte stream in UTF8 or Shift_JIS)
> then the application can't map those offsets back to the
> source without re-reading the file.

But there's *always* a character stream (Reader). Xerces creates one if it's
not handed one. The easy way is to have Xerces send the actual text along to
the user. (The other way is to have the user override createReader() to get
his hands on the relevant character stream, which turns out to be a little
ugly, but works fine.) Thus it's always applicable, even when you hand
Xerces an InputStream. And I think it would be useful to a significant
number of users.
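
As a rough illustration of the second approach, the sketch below wraps whatever Reader ends up feeding the parser and keeps a copy of the characters that pass through, so character offsets can be mapped back to raw text even when the original input was a byte stream. It uses only java.io and is not tied to Xerces; RecordingReader, rawText() and drain() are hypothetical names, and how the wrapper would actually be hooked in (for instance through createReader()) is left out.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical wrapper: remembers every character the consumer has read. */
final class RecordingReader extends FilterReader {
    private final StringBuilder seen = new StringBuilder();

    RecordingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c >= 0) {
            seen.append((char) c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            seen.append(buf, off, n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Read instead of skipping so that offsets stay aligned with the recording.
        long skipped = 0;
        while (skipped < n && read() != -1) {
            skipped++;
        }
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would replay characters and corrupt the offsets
    }

    /** Raw source text between two character offsets (start inclusive, end exclusive). */
    String rawText(int start, int end) {
        return seen.substring(start, end);
    }

    /** Stand-in for a parser: decodes the byte stream and reads it to the end. */
    static RecordingReader drain(InputStream bytes, Charset charset) {
        RecordingReader r = new RecordingReader(new InputStreamReader(bytes, charset));
        char[] buf = new char[1024];
        try {
            while (r.read(buf, 0, buf.length) != -1) {
                // a real parser would consume buf here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return r;
    }
}
```

The offsets here are character offsets, not byte offsets, so a multi-byte character earlier in a UTF-8 or Shift_JIS file does not throw them off. The obvious cost is that the whole document text is held in memory, which is one reason to keep such a feature off by default.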

So is there any chance of this modification making it into Xerces? I'd be
happy to send a patch once it's cleaned up a bit.

Thanks,
Alex

Re: Round-tripping again

Posted by Libor Kramolis <Li...@Sun.COM>.
I would appreciate round-tripping support in Xerces. It is really 
necessary for XML editors/tools -- broken user indentation is annoying.

+1

Regards,
Libor



-- 
Libor Kramolis, Software Engineer      | <li...@sun.com>
NetBeans/Sun Microsystems, XML Project | http://xml.netbeans.org/


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org


RE: Round-tripping again

Posted by Joseph Kesselman <ke...@us.ibm.com>.
On Thursday, 09/19/2002 at 10:10 AST, "Alex Rosen" 
<ar...@silverstream.com> wrote:
> Not sure I understand completely - I'm trying to provide better XML-aware
> tools. It's true that if you never look at the raw text of an XML document
> then you don't care about round-tripping, but I doubt that the need to do
> that occasionally is going to go away any time soon. That certainly hasn't
> happened in the 5 years since XML has been around.

The frequency of folks looking at the textfile has decreased SUBSTANTIALLY 
over the past five years; these days it's mostly done when debugging. 

For that purpose, providing locator references and using a viewer/editor 
which will show you the point specified by the locator is generally both 
sufficient and more useful than round-tripping annotations would be. In 
other words: Rather than regenerating the source, _capture_ the source and 
point back to it.
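
A minimal sketch of the locator approach, using only the standard SAX API as shipped with the JDK. LocatorCapture and locate() are hypothetical names for illustration; the sketch simply records, for each start tag, the line and column that the parser's Locator reports, which a viewer or editor could then use to jump to that point in the captured source.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records where each element was seen in the source. */
final class LocatorCapture extends DefaultHandler {
    private Locator locator;
    final List<String> positions = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Supplied by the parser before startDocument; parsers are not required to provide one.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator == null) {
            positions.add(qName + "@unknown");
            return;
        }
        // SAX reports the position just after the text of the event,
        // which for a start tag means just past its closing '>'.
        positions.add(qName + "@" + locator.getLineNumber() + ":" + locator.getColumnNumber());
    }

    /** Parses the given XML text and returns one "name@line:column" entry per element. */
    static List<String> locate(String xml) {
        try {
            LocatorCapture handler = new LocatorCapture();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.positions;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Calling LocatorCapture.locate("<a>\n  <b/>\n</a>") yields one entry per element, with a on line 1 and b on line 2. Note that the Locator gives a position in the source rather than the raw text itself, so the application still has to keep the source around to show it.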



______________________________________
Joe Kesselman  / IBM Research



RE: Round-tripping again

Posted by Alex Rosen <ar...@silverstream.com>.
> Not violently opposed to the feature, ***IF*** it has no adverse
> performance impact when not in use.

I totally agree.

> On the other hand: In my experience, most folks who think they want
> round-tripping really want better XML-aware tools -- editors, file
> compares, etc. -- and have only been forced to worry about round-tripping
> because they're trying to process the XML as text, or because their
> current tools are flat-out broken. I honestly think it's more of a service
> to the community to push back on this request as much as
> possible/reasonable.

Not sure I understand completely - I'm trying to provide better XML-aware
tools. It's true that if you never look at the raw text of an XML document
then you don't care about round-tripping, but I doubt that the need to do
that occasionally is going to go away any time soon. That certainly hasn't
happened in the 5 years since XML has been around.

Alex




Re: Round-tripping again

Posted by Joseph Kesselman <ke...@us.ibm.com>.
Not violently opposed to the feature, ***IF*** it has no adverse 
performance impact when not in use. If it costs any significant number of 
cycles, you're asking those who don't want it to "pay the rent" for those 
who do, and given that this is essentially a "non-XML behavior" I'm not at 
all convinced that's acceptable.

On the other hand: In my experience, most folks who think they want 
round-tripping really want better XML-aware tools -- editors, file 
compares, etc. -- and have only been forced to worry about round-tripping 
because they're trying to process the XML as text, or because their 
current tools are flat-out broken. I honestly think it's more of a service 
to the community to push back on this request as much as 
possible/reasonable.

______________________________________
Joe Kesselman  / IBM Research
