Posted to c-dev@xerces.apache.org by Julia Larson <Ju...@nativeminds.com> on 2001/08/30 22:09:42 UTC

Stripping out whitespace

Hello all,

Sorry if this has been brought up before, but
is there a way to tell the parser to strip out or
ignore extraneous tabs and spaces?

The problem I'm having is parsing the following text:

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XML Spy v3.5 NT (http://www.xmlspy.com) by Julia Larson
(NativeMinds) -->
<!DOCTYPE NSAPI SYSTEM "NSAPI.dtd">
<NSAPI>
	<Domain DomainID="0"/>
	<Domain DomainID="1"/>
</NSAPI>

When this is parsed by Xerces, it tells me that
there is one node NSAPI and that it has only one
child.  The child's name is "#text" and its content
is a tab.

Why isn't it telling me about the other children?

If I send it the following text instead:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE NSAPI SYSTEM
"NSAPI.dtd"><NSAPI><Domain DomainID="0"/><Domain DomainID="1"/></NSAPI>

It works as expected.
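
A possible explanation, offered only as an assumption since the code that
walks the tree is not shown: a DOM tree normally keeps the whitespace
between elements as "#text" nodes, so in the indented document the first
child of NSAPI is a text node and the Domain elements are its siblings.
The sketch below models that layout with a hypothetical Node struct (it is
not the Xerces API) and shows one way to skip whitespace-only text nodes.
The names Node, isWhitespaceText, significantChildren and
indentedNsapiChildren are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; this is NOT the Xerces API.
struct Node {
    std::string name;    // element name, or "#text" for character data
    std::string value;   // text content; empty for elements
    Node(const std::string& n, const std::string& v) : name(n), value(v) {}
};

// True if the node is a text node holding only spaces, tabs, CR or LF.
bool isWhitespaceText(const Node& n)
{
    return n.name == "#text" &&
           n.value.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Returns the children that are not whitespace-only text nodes.
std::vector<Node> significantChildren(const std::vector<Node>& children)
{
    std::vector<Node> kept;
    for (std::size_t i = 0; i < children.size(); ++i) {
        if (!isWhitespaceText(children[i])) {
            kept.push_back(children[i]);
        }
    }
    return kept;
}

// Children of <NSAPI> as a DOM tree would typically hold them for the
// indented document (assumption): text nodes sit between the elements.
std::vector<Node> indentedNsapiChildren()
{
    std::vector<Node> c;
    c.push_back(Node("#text", "\n\t"));
    c.push_back(Node("Domain", ""));
    c.push_back(Node("#text", "\n\t"));
    c.push_back(Node("Domain", ""));
    c.push_back(Node("#text", "\n"));
    return c;
}
```

Under that assumption the indented document gives NSAPI five children, of
which two are Domain elements. The single-line document has no whitespace
between the tags, so only the two elements are present.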

Help!

Thank you
-Julie



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


Re: Stripping out whitespace

Posted by "Scott A. Herod" <he...@interact-tv.com>.
Hi Julie,

I've got a generic whitespace stripper that I use a lot.  It uses
STL strings but should be easy to adjust.  Notice that it turns any
run of whitespace into a single space character.  It also trims
whitespace off the front and rear of the string.

Scott

================================================================

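/** Trims whitespace ( newlines, returns, spaces, tabs ) off both ends
 *  of a string and collapses each interior run of whitespace into a
 *  single space.
 * 
 *  \param toTrim The string to trim.
 *
 *  \param noQuotes If true, single and double quotes are treated like
 *  whitespace: removed at the ends, turned into a space elsewhere.
 *  Defaults to false.
 * 
 *  \return The string with leading and trailing WS removed and
 *  interior WS collapsed.
 * */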
string itvSetUp::stringTrim( string toTrim, bool noQuotes )
{
    const char SPACE = ' ';
    const char TAB = '\t';
    const char NEWLINE = '\n';
    const char RETURN = '\r';
    const char SINGLE_QUOTE = '\'';
    const char DOUBLE_QUOTE = '\"';
    int len;
    int cnt = 0;
    char testchar;
    bool canDelete = true;

    string newString ( toTrim );

    // Get rid of leading whitespace
    len = newString.size();
    while ( len > 0 && cnt < len ) {
	testchar = newString[cnt];

	if ( testchar == SPACE || testchar == TAB ||
	     testchar == NEWLINE || testchar == RETURN ||
	     ( noQuotes && testchar == SINGLE_QUOTE ) ||
	     ( noQuotes && testchar == DOUBLE_QUOTE ) ) {
	    if ( canDelete ) {
		newString.erase( cnt, 1 );
		len--;
	    }
	    else {
		newString.replace( cnt, 1, 1, SPACE );
		canDelete = true;
		cnt++;
	    }
	}
	else {
	    canDelete = false;
	    cnt++;
	}
    }

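    // Get rid of the last ws.  Test the length before indexing: for an
    // empty or all-whitespace input cnt is 0 here, and newString[cnt-1]
    // would read out of range.
    if ( len > 0 ) {
	testchar = newString[cnt-1];
	if ( testchar == SPACE || testchar == TAB ||
	     testchar == NEWLINE || testchar == RETURN ||
	     ( noQuotes && testchar == SINGLE_QUOTE ) ||
	     ( noQuotes && testchar == DOUBLE_QUOTE ) ) {
	    newString.erase( cnt-1, 1 );
	}
    }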
    // cerr << "|" << newString << "|" << endl;

    return newString;
}


================================================================
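
For readers who want to try the same idea outside the itvSetUp class,
here is a minimal free-function sketch. It is not the author's code: the
name normalizeWhitespace is hypothetical, it builds a new string instead
of editing in place, and it leaves out the noQuotes option. It treats the
same four characters as whitespace (space, tab, newline, carriage return).

```cpp
#include <cassert>
#include <string>

// Hypothetical free-function variant of the trimmer above: collapses each
// run of spaces, tabs, newlines and carriage returns into one space and
// drops leading and trailing whitespace.
std::string normalizeWhitespace(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    bool pendingSpace = false;   // a whitespace run follows copied text

    for (std::string::size_type i = 0; i < input.size(); ++i) {
        const char c = input[i];
        const bool isWs = (c == ' ' || c == '\t' || c == '\n' || c == '\r');
        if (isWs) {
            // Remember the run only if some text was already copied;
            // this is what drops leading whitespace.
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace) {
                out += ' ';          // one space for the whole run
                pendingSpace = false;
            }
            out += c;
        }
    }
    return out;   // a trailing run is never flushed, so it is dropped
}

// Small usage example.
void normalizeWhitespaceExample()
{
    assert(normalizeWhitespace("  a \t b\n") == "a b");
    assert(normalizeWhitespace("\t").empty());
}
```

Building the result in a second string avoids the repeated erase() calls
of the in-place version, at the cost of one extra allocation.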
