Posted to j-dev@xerces.apache.org by Dream Catcher <is...@leonis.nus.edu.sg> on 2001/01/18 05:33:02 UTC

how to ignore whitespace in XML instance document

Hi, all.
I am parsing a well-formed XML instance document into a DOM tree, but there
are a lot of annoying empty (whitespace-only) text nodes in the tree
structure. I guess this is due to the line breaks in the original XML
document. How can I ignore this whitespace when generating the DOM tree?

thank you...

Best Regards
Wang Yue

    *------------------------------*
      Attitude makes the difference
    *------------------------------*


Re: ignorable whitespaces, comments and serialization

Posted by Sebastien Ponce <se...@cern.ch>.
I managed to solve these problems. A patch against xerces 1.3 is attached.
I hope it will be integrated into the current repository soon.

Sebastien


Sebastien Ponce wrote:

> I'm trying to serialize a tree that was built using xerces. The problem is that this tree contains many
> ignorable whitespace nodes (basically every second node).
>
> When I serialize with the options setPreserveSpace(false) and setIndenting(true), the indentation is not
> done correctly. Basically, no line break is emitted. After looking at
> org.apache.xml.serialize.BaseMarkupSerializer and org.apache.xml.serialize.XmlSerializer, it appears that
> there are two main problems:
>     - The content() method is called in BaseMarkupSerializer for a text node and changes the element state
>       from empty to non-empty. Thus, when the first children of an element are an ignorable text node
>       followed by an element, the state becomes non-empty while the ignorable text node is serialized, and
>       the element does not print a line break because, when it arrives, the state is neither empty nor
>       afterElement.
>     - In the same way, content() sets state.afterElement to false. So with the sequence element, ignorable
>       text node, element, the two elements end up on the same line.
>
> Finally, comments are treated as text, so no line break is printed before or after them. So if your
> XML has a comment every two lines explaining what the data is, everything is serialized on a single line.
>
> Sebastien

ignorable whitespaces, comments and serialization

Posted by Sebastien Ponce <se...@cern.ch>.
I'm trying to serialize a tree that was built using xerces. The problem is that this tree contains many
ignorable whitespace nodes (basically every second node).

When I serialize with the options setPreserveSpace(false) and setIndenting(true), the indentation is not
done correctly. Basically, no line break is emitted. After looking at
org.apache.xml.serialize.BaseMarkupSerializer and org.apache.xml.serialize.XmlSerializer, it appears that
there are two main problems:
    - The content() method is called in BaseMarkupSerializer for a text node and changes the element state
      from empty to non-empty. Thus, when the first children of an element are an ignorable text node
      followed by an element, the state becomes non-empty while the ignorable text node is serialized, and
      the element does not print a line break because, when it arrives, the state is neither empty nor
      afterElement.
    - In the same way, content() sets state.afterElement to false. So with the sequence element, ignorable
      text node, element, the two elements end up on the same line.

Finally, comments are treated as text, so no line break is printed before or after them. So if your
XML has a comment every two lines explaining what the data is, everything is serialized on a single line.
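
Independently of the patch, one possible workaround is to drop whitespace-only text nodes from the tree
before serializing, so the serializer never sees them. The sketch below is not the patch and does not use
org.apache.xml.serialize; it assumes the standard JAXP Transformer with OutputKeys.INDENT instead, and the
class and method names (WhitespaceStripper, strip, stripAndIndent) are hypothetical. Note that it removes
every whitespace-only text node, including ones that may be significant in mixed content.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class WhitespaceStripper {

    /** Removes every whitespace-only Text node below the given node; returns how many were removed. */
    static int strip(Node node) {
        int removed = 0;
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
                removed++;
            } else {
                removed += strip(child);
            }
            child = next;
        }
        return removed;
    }

    /** Parses a string into a DOM document (non-validating, whitespace kept). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Strips whitespace-only text nodes, then serializes with indentation switched on. */
    static String stripAndIndent(String xml) {
        try {
            Document doc = parse(xml);
            strip(doc);
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stripAndIndent("<root>\n<a>1</a> <b>2</b>\n</root>"));
    }
}
```

The exact indentation width depends on the serializer in use, so the sketch makes no claim about it.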

Sebastien


ignorable whitespaces and cloneNode

Posted by Sebastien Ponce <se...@cern.ch>.
It seems that text nodes containing ignorable whitespace are cloned as non-ignorable whitespace.
I can't understand why, since the ignorable flag is stored in the flags field of NodeImpl, which should be
cloned correctly, but I have observed it (with xerces 1.3 this time).

Does anyone understand this behavior?
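
One way to check this report against a current parser is sketched below. It relies on
Text.isElementContentWhitespace() from DOM Level 3, which postdates xerces 1.3, so it is an assumption
about the reader's environment and not the API available at the time. The sample document, with an
internal DTD that declares <root> as element-only content, and the class name CloneWhitespaceCheck are
hypothetical. The sketch only reports the flag before and after cloneNode(); it makes no claim about what
the clone's value will be.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class CloneWhitespaceCheck {

    // Hypothetical sample: the DTD tells the parser that whitespace inside <root> is ignorable.
    static final String SAMPLE =
        "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (a)>\n"
        + "  <!ELEMENT a (#PCDATA)>\n"
        + "]>\n"
        + "<root>\n  <a>1</a>\n</root>";

    /** Returns { flag on the original whitespace node, flag on its clone }. */
    static boolean[] flagsBeforeAndAfterClone(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ignorable whitespace is kept in the tree by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The first child of <root> is the whitespace between <root> and <a>.
            Text original = (Text) doc.getDocumentElement().getFirstChild();
            Text clone = (Text) original.cloneNode(true);
            return new boolean[] {
                original.isElementContentWhitespace(),
                clone.isElementContentWhitespace()
            };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] flags = flagsBeforeAndAfterClone(SAMPLE);
        System.out.println("original ignorable: " + flags[0]);
        System.out.println("clone ignorable:    " + flags[1]);
    }
}
```

If the two printed values differ, the parser in use shows the behavior described above.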

Sebastien


Re: how to ignore whitespace in XML instance document

Posted by Sebastien Ponce <se...@cern.ch>.
I posted a patch on this subject some time ago. It deals with the fact that
not all ignorable whitespace is removed when the feature
http://apache.org/xml/features/dom/include-ignorable-whitespace is set to false. See
http://archive.covalent.net/xml/xerces-j-dev/2000/12/0246.xml.

Here is the patch again. I hope someone will take a look at it and apply
it if it is correct (at least it works fine for my application).
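
For readers who want to try the switch itself, here is a minimal sketch using standard JAXP instead of the
Xerces feature URI: DocumentBuilderFactory.setIgnoringElementContentWhitespace(true). A parser can only
tell that whitespace is ignorable when a grammar such as a DTD declares the element as element-only
content, so the sample document carries an internal DTD and validation is switched on; for a merely
well-formed document without a DTD, the whitespace nodes would remain. The class name, method name and
sample document are illustrative assumptions, and the sketch says nothing about the cases the patch fixes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class IgnorableWhitespaceDemo {

    // Hypothetical sample: the DTD declares <root> as element-only content.
    static final String SAMPLE =
        "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (a, b)>\n"
        + "  <!ELEMENT a (#PCDATA)>\n"
        + "  <!ELEMENT b (#PCDATA)>\n"
        + "]>\n"
        + "<root>\n  <a>1</a>\n  <b>2</b>\n</root>";

    /** Parses the document and returns the number of child nodes of the root element. */
    static int countRootChildren(String xml, boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // needed so the parser consults the DTD
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Fail loudly on validation problems instead of printing to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Whitespace kept: text, <a>, text, <b>, text.
        System.out.println(countRootChildren(SAMPLE, false)); // prints 5
        // Whitespace ignored: only <a> and <b> remain.
        System.out.println(countRootChildren(SAMPLE, true));  // prints 2
    }
}
```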

Sebastien


Dream Catcher wrote:

> Hi, all.
> I am parsing a well-formed XML instance document into a DOM tree, but there
> are a lot of annoying empty (whitespace-only) text nodes in the tree
> structure. I guess this is due to the line breaks in the original XML
> document. How can I ignore this whitespace when generating the DOM tree?
>
> thank you...
>
> Best Regards
> Wang Yue
>
>     *------------------------------*
>       Attitude makes the difference
>     *------------------------------*