Posted to j-users@xerces.apache.org by James Richardson <ja...@db.com> on 2001/05/31 18:41:20 UTC

Ignorable Whitespace ( and 'terminating with </>' )


For those of you who replied with suggestions about my </> issue, I can report that the following snippet fixed the problem:

cat bad.xml | perl -p -e 's/\<(.*?)\>(.*)\<\/\>/<$1>$2<\/$1>/' > good.xml

I can place this as a filter on my input stream, and we will be well away (as long as we can guarantee that each element sits on a single line).

Anyway, now a little question on SAX.

Why are strings matching [\n\t ]* reported as character data, rather than as ignorable whitespace?

For the text:

<FinHdr>
    <Instrument Act="Subscribe" Dest=":Instrument:">
        <Exchange>Test</Exchange>
    </Instrument>
</FinHdr>

my handler logs the following events:


31 May 2001 11:38:51,105 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - Start Document
31 May 2001 11:38:51,295 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - startElement: uri=, localName=FinHdr, raw=FinHdr
31 May 2001 11:38:51,311 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - Got character data at p=36, l=5, Content ='
    '
31 May 2001 11:38:51,328 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - startElement: uri=, localName=Instrument, raw=Instrument
31 May 2001 11:38:51,329 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - Got character data at p=89, l=9, Content ='
        '
31 May 2001 11:38:51,333 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - startElement: uri=, localName=Exchange, raw=Exchange
31 May 2001 11:38:51,336 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - Got character data at p=108, l=5, Content ='Test'
31 May 2001 11:38:51,347 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - endElement namespaceURI=, localName = Exchange, qName = Exchange
31 May 2001 11:38:51,349 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - Got character data at p=124, l=5, Content ='
    '
31 May 2001 11:38:51,349 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - endElement namespaceURI=, localName = Instrument, qName = Instrument
31 May 2001 11:38:51,350 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - Got character data at p=142, l=1, Content ='
'
31 May 2001 11:38:51,351 [           main] DEBUG b.ged.ovgw.main.OVStreamReader  - endElement namespaceURI=, localName = FinHdr, qName = FinHdr
31 May 2001 11:38:51,358 [           main] WARN  b.ged.ovgw.main.OVStreamReader  - Exception


Thanks for any help!

James





---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: Ignorable Whitespace ( and 'terminating with </>' )

Posted by Andy Clark <an...@apache.org>.
James Richardson wrote:
> cat bad.xml | perl -p -e 's/\<(.*?)\>(.*)\<\/\>/<$1>$2<\/$1>/' > good.xml

As long as your file is ASCII, this should be fine. But XML
is based on Unicode, which can be stored in any number of
encodings. Unless your Perl handles the file's encoding (not
likely), this only works for ASCII (and "ASCII-transparent") files.
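
To illustrate the encoding point, here is a minimal Java sketch of the same
one-line fix that decodes and re-encodes with an explicit charset. The class
and method names (CloseTagFixer, fixLine, fix) are made up for this example,
and the caller is assumed to know the file's encoding. Unlike the Perl
one-liner, it copies only the element name into the close tag, so attributes
are not repeated there. It still assumes each element sits on one line, and
it normalizes line endings to "\n".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Xerces or any library.
class CloseTagFixer {
    // <name attrs>content</>  ->  <name attrs>content</name>
    // Group 1 is the element name, group 2 any attributes, group 3 the content.
    private static final Pattern EMPTY_CLOSE =
        Pattern.compile("<([^\\s>]+)([^>]*)>(.*)</>");

    // Rewrites the first "</>" on a single line, like s/// without /g.
    static String fixLine(String line) {
        return EMPTY_CLOSE.matcher(line).replaceFirst("<$1$2>$3</$1>");
    }

    // Decodes with the given charset, fixes each line, re-encodes the result.
    static void fix(InputStream in, OutputStream out, Charset cs) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs));
        Writer writer = new OutputStreamWriter(out, cs);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(fixLine(line));
            writer.write('\n');
        }
        writer.flush();
    }
}
```

Because the matching happens on decoded characters rather than raw bytes,
this should behave the same for a UTF-16 file as for an ASCII one.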

> Why are strings matching [\n\t ]* reported as character data, rather
> than as ignorable whitespace?

Without a grammar present, the parser cannot tell meaningful
character data from ignorable whitespace. So if you want the
[\n\t ]* reported as ignorable whitespace, you have to
associate a grammar (a DTD, for example) with the document.
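
As a rough illustration, the sketch below parses the sample document twice
through the standard JAXP/SAX API: once as posted, and once with an internal
DTD subset and validation switched on. The DTD is invented for this example
(the real content model for FinHdr is not known here), and WhitespaceDemo and
count are hypothetical names. It counts how many whitespace-only chunks
arrive through characters() and how many through ignorableWhitespace().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, written only for this sketch.
class WhitespaceDemo {
    // The sample document from the original question.
    static final String BODY =
          "<FinHdr>\n"
        + "    <Instrument Act=\"Subscribe\" Dest=\":Instrument:\">\n"
        + "        <Exchange>Test</Exchange>\n"
        + "    </Instrument>\n"
        + "</FinHdr>\n";

    // Assumed grammar: FinHdr and Instrument have element-only content.
    static final String DTD =
          "<!DOCTYPE FinHdr [\n"
        + "  <!ELEMENT FinHdr (Instrument)>\n"
        + "  <!ELEMENT Instrument (Exchange)>\n"
        + "  <!ATTLIST Instrument Act CDATA #REQUIRED Dest CDATA #REQUIRED>\n"
        + "  <!ELEMENT Exchange (#PCDATA)>\n"
        + "]>\n";

    // Returns { whitespace-only characters() calls, ignorableWhitespace() calls }.
    static int[] count(String xml, boolean validate) throws Exception {
        final int[] n = new int[2];
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    if (new String(ch, start, length).trim().isEmpty()) n[0]++;
                }
                @Override
                public void ignorableWhitespace(char[] ch, int start, int length) {
                    n[1]++;
                }
                @Override
                public void error(SAXParseException e) throws SAXParseException {
                    throw e; // surface validity errors instead of ignoring them
                }
            });
        return n;
    }

    public static void main(String[] args) throws Exception {
        int[] plain = count(BODY, false);
        int[] withDtd = count(DTD + BODY, true);
        System.out.println("no grammar: characters=" + plain[0] + " ignorable=" + plain[1]);
        System.out.println("with DTD:   characters=" + withDtd[0] + " ignorable=" + withDtd[1]);
    }
}
```

With the grammar in place, the indentation between elements should move from
the characters() count to the ignorableWhitespace() count, while the text
"Test" is still delivered through characters().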

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org
