Posted to j-users@xerces.apache.org by Rob Davis-5 <te...@robertjdavis.co.uk> on 2008/12/09 17:55:06 UTC

Filtering whitespace outside of xml elements using LSParserFilter

I want to filter out any whitespace, including carriage returns and line
feeds, that lies outside of XML elements.

I need this so that I can use the W3C DOM method Node.isEqualNode() to
compare Documents that have identical elements and attributes but differing
amounts of whitespace - e.g. different indentation, tabs instead of spaces,
or documents produced on different platforms (Unix / Windows) where the
line endings (carriage return, line feed) vary.

This relates to the thread
http://www.nabble.com/How-to-compare-Documents--Existing-library-method-available--or-use-DOMTreeWalker--td20856968.html


If I use isEqualNode on 2 Documents that have identical Elements and
Attributes but which have whitespace that varies (as described above), the
Documents are still regarded by isEqualNode as different.
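
A minimal sketch of the problem described above, assuming the DOM Load/Save
implementation bundled with the JDK. The class name WhitespaceCompareDemo and
the helpers parse and sameTree are made up for illustration and are not part
of the original post. Two documents with the same elements and attributes
compare as different as soon as one of them is indented.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical demo class, not from the original thread.
class WhitespaceCompareDemo {

    // Parse an XML string with a default LSParser (no whitespace handling).
    static Document parse(String xml) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls =
                (DOMImplementationLS) registry.getDOMImplementation("LS");
            LSParser parser =
                ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
            LSInput input = ls.createLSInput();
            input.setStringData(xml);
            return parser.parse(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM Load/Save implementation found", e);
        }
    }

    // Compare two XML strings as DOM trees using Node.isEqualNode().
    static boolean sameTree(String a, String b) {
        return parse(a).isEqualNode(parse(b));
    }

    public static void main(String[] args) {
        String compact = "<a><b id=\"1\"/></a>";
        String indented = "<a>\n\t<b id=\"1\"/>\n</a>";
        System.out.println(sameTree(compact, compact));  // prints true
        System.out.println(sameTree(compact, indented)); // prints false
    }
}
```

The indented version differs only because the line breaks and tab become
extra Text nodes under <a>.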

I have done some searching and found three (3) options:
1) http://apache.org/xml/features/dom/include-ignorable-whitespace  feature
setting - but this is applicable to 
javax.xml.parsers.DocumentBuilderFactory - I am using and wish to remain
using LSParser which is produced by DOMImplementationLS which does not have
the setFeature method.

I think the problem here is that LSParser is written to comply with W3C DOM
interfaces and DocumentBuilderFactory is JAXP interfaces: how can I connect
the two so that I configure the LSParser via the DocumentBuilderFactory
setFeature method?
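
For comparison only, here is a sketch of the JAXP route mentioned in option
1. It uses the standard DocumentBuilderFactory setter
setIgnoringElementContentWhitespace rather than the Xerces feature URI, on
the assumption that both control the same behaviour. The class name, the
sample DTD and the sample documents are invented for illustration. It does
not show how to configure an LSParser through the factory.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not from the original thread.
class JaxpIgnoreWhitespaceDemo {

    // Made-up internal DTD subset: <a> may contain only one <b> element.
    static final String DTD =
        "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>"
        + "<!ATTLIST b id CDATA #REQUIRED>]>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The JAXP documentation says this setting relies on the content
            // model, so the parser is put in validating mode.
            factory.setValidating(true);
            factory.setIgnoringElementContentWhitespace(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        Document compact = parse(DTD + "<a><b id=\"1\"/></a>");
        Document indented = parse(DTD + "<a>\n\t<b id=\"1\"/>\n</a>");
        // With the DTD in place the whitespace-only Text nodes should be dropped.
        System.out.println(compact.isEqualNode(indented));
    }
}
```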


2) The LSParser is configured with a DOMConfiguration instance, and there is
an option:
 "element-content-whitespace"

    true
        [required] (default) Keep all whitespaces in the document.
    false
        [optional] Discard all Text nodes that contain whitespaces in
        element content, as described in [element content whitespace]. The
        implementation is expected to use the attribute
        Text.isElementContentWhitespace to determine if a Text node should
        be discarded or not.

BUT I'm not concerned with whitespace *within* the elements. I'm only
interested in whitespace outside of elements, as explained above.


3) LSParserFilter interface - this seems like the most suitable solution, but
searching the web I have found *NO* implementations of this interface. I
have also bought the O'Reilly Java and XML book (edition 3) and there is no
mention of it there either.
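
For option 3, the following is one possible LSParserFilter, offered as a
sketch rather than a definitive implementation. The class names
WhitespaceTextFilter and FilteredParseDemo are invented here. The filter
rejects every Text node that consists only of whitespace, so it needs no DTD,
but it would also drop whitespace-only text that is meaningful inside a
mixed-content element.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

// Hypothetical filter: drops Text nodes that contain nothing but whitespace.
class WhitespaceTextFilter implements LSParserFilter {

    public short startElement(Element element) {
        return FILTER_ACCEPT; // keep every element
    }

    public short acceptNode(Node node) {
        // trim() removes all characters up to U+0020, which covers the XML
        // whitespace characters (space, tab, CR, LF).
        if (node.getNodeType() == Node.TEXT_NODE
                && node.getNodeValue().trim().isEmpty()) {
            return FILTER_REJECT;
        }
        return FILTER_ACCEPT;
    }

    public int getWhatToShow() {
        return NodeFilter.SHOW_TEXT; // only Text nodes go through acceptNode
    }
}

// Hypothetical demo class showing how the filter is attached to an LSParser.
class FilteredParseDemo {

    static Document parse(String xml) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls =
                (DOMImplementationLS) registry.getDOMImplementation("LS");
            LSParser parser =
                ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
            parser.setFilter(new WhitespaceTextFilter());
            LSInput input = ls.createLSInput();
            input.setStringData(xml);
            return parser.parse(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM Load/Save implementation found", e);
        }
    }

    public static void main(String[] args) {
        Document compact = parse("<a><b id=\"1\">x</b></a>");
        Document indented = parse("<a>\n\t<b id=\"1\">x</b>\n</a>");
        System.out.println(compact.isEqualNode(indented));
    }
}
```

Because the filter runs while the tree is being built, the rejected nodes
never reach the Document that isEqualNode later compares.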


Thoughts please on the above. Thanks.

-- 
View this message in context: http://www.nabble.com/Filtering-whitespace-outside-of-xml-elements-using-LSParserFilter-tp20918689p20918689.html
Sent from the Xerces - J - Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


Re: Filtering whitespace outside of xml elements using LSParserFilter

Posted by Rob Davis-5 <te...@robertjdavis.co.uk>.
I think I have found a solution, using "element-content-whitespace" set to
false. Based on my existing code derived from O'Reilly Java and XML edition
3, I set the "element-content-whitespace" parameter to false on the
DOMConfiguration instance.

The code fragment below, taken from a method, is used for parsing different
kinds of XML Document - i.e. with different DTDs. For some of them, setting
"element-content-whitespace" to false may be unnecessary and could in fact
change their data undesirably. In most cases I don't want the data enclosed
in text elements to be cropped, so I provide a flag to turn the option on
or off.

I only need to ignore whitespace for one particular XML Document type, where
I need to compare just the elements, attributes and child elements of these
Documents - for more detail on this see my response to you after the code
below...


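// existing code (a fragment from inside a method; it needs these imports)
// import org.w3c.dom.DOMConfiguration;
// import org.w3c.dom.bootstrap.DOMImplementationRegistry;
// import org.w3c.dom.ls.DOMImplementationLS;
// import org.w3c.dom.ls.LSParser;

    // newInstance() can throw ClassNotFoundException, InstantiationException
    // and IllegalAccessException; the enclosing method must handle or
    // declare them.
    DOMImplementationRegistry registry =
        DOMImplementationRegistry.newInstance();

    DOMImplementationLS lsImpl =
        (DOMImplementationLS) registry.getDOMImplementation("LS");

    LSParser parser =
        lsImpl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);

    // Set options on the parser
    DOMConfiguration config = parser.getDomConfig();
    config.setParameter("validate", Boolean.TRUE);
    config.setParameter("error-handler", aDomErrorHandler);

// end existing code

// additional code

    // Discard whitespace-only Text nodes in element content (relies on the DTD)
    if (ignoreWhitespace)
    {
        config.setParameter("element-content-whitespace", Boolean.FALSE);
    }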
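
To make the effect of the flag easier to try out, here is a self-contained
sketch built around the same DOMConfiguration calls. The class name
ElementContentWhitespaceDemo, the tiny internal DTD and the sample documents
are made up for illustration; the error handler from the fragment is left
out to keep the sketch short.

```java
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical demo class, not from the original thread.
class ElementContentWhitespaceDemo {

    // Made-up internal DTD subset: <a> may contain only one <b> element.
    static final String DTD =
        "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>"
        + "<!ATTLIST b id CDATA #REQUIRED>]>";

    static Document parse(String xml, boolean ignoreWhitespace) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls =
                (DOMImplementationLS) registry.getDOMImplementation("LS");
            LSParser parser =
                ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);

            DOMConfiguration config = parser.getDomConfig();
            config.setParameter("validate", Boolean.TRUE);
            if (ignoreWhitespace) {
                // Only whitespace in element-only content is discarded.
                config.setParameter("element-content-whitespace", Boolean.FALSE);
            }

            LSInput input = ls.createLSInput();
            input.setStringData(xml);
            return parser.parse(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM Load/Save implementation found", e);
        }
    }

    public static void main(String[] args) {
        String compact = DTD + "<a><b id=\"1\"/></a>";
        String indented = DTD + "<a>\n\t<b id=\"1\"/>\n</a>";
        // Whitespace kept: the indented tree has extra Text nodes.
        System.out.println(parse(compact, false).isEqualNode(parse(indented, false)));
        // Whitespace in element content discarded: the trees should match.
        System.out.println(parse(compact, true).isEqualNode(parse(indented, true)));
    }
}
```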



Michael Glavassevich-3 wrote:
> 
> 
> Hi Rob,
> 
> Whitespace outside an element is inside of another one (except for
> whitespace outside of the root element). Whether this whitespace is
> "ignorable" depends on your application and/or whether you have a grammar
> which declares that the content of an element is only other elements.
> 

My particular xml Document doesn't care about whitespace at all, it doesn't
have any enclosing elements like <text>....</text> which could contain
whitespace. All I'm interested in is the elements themselves, their
attributes enclosed within the < and /> and their child elements.


Michael Glavassevich-3 wrote:
> 
> The "include-ignorable-whitespace" and "element-content-whitespace"
> features have the same behaviour, however they only apply to DTDs. If you
> have no DTD then I suggest that you use an LSParserFilter. 
> 

I have a DTD defined so I can use these.

The XML document format is a bespoke one, designed by me for a particular
purpose, and I have used the utilities from net.sourceforge.saxon to
generate the DTD from the XML document.




-- 
View this message in context: http://www.nabble.com/Filtering-whitespace-outside-of-xml-elements-using-LSParserFilter-tp20918689p20933774.html
Sent from the Xerces - J - Users mailing list archive at Nabble.com.




Re: Filtering whitespace outside of xml elements using LSParserFilter

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Rob,

Whitespace outside an element is inside of another one (except for
whitespace outside of the root element). Whether this whitespace is
"ignorable" depends on your application and/or whether you have a grammar
which declares that the content of an element is only other elements.
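
The point about where the whitespace lives can be checked with a short
sketch. The class name ChildNodeListing and its helper methods are invented
for illustration. It lists the child node names of the Document and of the
root element for a small indented document.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical demo class, not from the original thread.
class ChildNodeListing {

    static Document parse(String xml) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls =
                (DOMImplementationLS) registry.getDOMImplementation("LS");
            LSParser parser =
                ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
            LSInput input = ls.createLSInput();
            input.setStringData(xml);
            return parser.parse(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM Load/Save implementation found", e);
        }
    }

    // Collect the node names of the direct children of a node.
    static List<String> childNames(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            names.add(children.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<a>\n  <b/>\n</a>\n");
        // Whitespace after the root element produces no node at all.
        System.out.println(childNames(doc));                      // prints [a]
        // Whitespace around <b> shows up as Text children of <a>.
        System.out.println(childNames(doc.getDocumentElement())); // prints [#text, b, #text]
    }
}
```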

The "include-ignorable-whitespace" and "element-content-whitespace"
features have the same behaviour, however they only apply to DTDs. If you
have no DTD then I suggest that you use an LSParserFilter. This has come up
before on this list. You may want to take a look at the previous discussion
[1] in the archives.

Thanks.

[1] http://marc.info/?t=115874050200003&r=1&w=2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
