Posted to j-users@xerces.apache.org by "Leon Radley(digiPlant AB)" <le...@digiplant.se> on 2010/02/15 12:24:54 UTC
Parser removes whitespace before and after html entity, is this a bug?
I've parsed an RSS feed from Twitter; it contains HTML entities such as &#246;.
The problem is that if a word begins or ends with an entity, the whitespace before or after it gets stripped out.
The parser seems to decode the entities correctly but simply removes too much whitespace.
Is this a bug, or can I somehow tell the parser not to decode the HTML entities?
Cheers
Leon
Re: Parser removes whitespace before and after html entity, is this a bug?
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
There is a lot that you haven't said about your application but the most
likely reason for white space getting stripped is attribute value
normalization [1]. XML parsers are required to normalize attribute values
before passing them to the application. You cannot turn this process off.
Thanks.
[1] http://www.w3.org/TR/2006/REC-xml-20060816/#AVNormalize
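To make the distinction concrete, here is a minimal sketch, assuming the default JAXP parser (Xerces-based in most JDKs) and no DTD, so the attribute is treated as CDATA. The sample document, the attribute name "title", and the class and method names are hypothetical and only serve as illustration. It puts the same characters into an attribute value and into element content: per the normalization rules in [1], the literal tab and newline in the attribute should each come back as a single space, while the element content should keep them. In both places the spaces around the character reference should survive. For attributes declared with a non-CDATA type in a DTD, the specification additionally requires leading and trailing spaces to be discarded and runs of spaces to be collapsed; that case is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AttrNormalizationDemo {
    // Hypothetical sample: the same characters (tab, newline, spaces and a
    // numeric character reference) appear in an attribute and in element content.
    static final String SAMPLE =
        "<item title=\"a\tb\nc &#246; d\">a\tb\nc &#246; d</item>";

    // Parses the XML string with the default JAXP parser and returns the root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse sample", e);
        }
    }

    // Makes tabs and newlines visible so the two results can be compared by eye.
    static String visible(String s) {
        return s.replace("\t", "\\t").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        Element root = parseRoot(SAMPLE);
        // Attribute value: subject to attribute value normalization.
        System.out.println("attribute: [" + visible(root.getAttribute("title")) + "]");
        // Element content: not subject to attribute value normalization.
        System.out.println("text:      [" + visible(root.getTextContent()) + "]");
    }
}
```

If the whitespace in question disappears from element content rather than from an attribute, attribute value normalization is not the cause and the application code handling the parser's output is worth a closer look.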
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
Leon Radley <le...@digiplant.se> wrote on 02/15/2010 06:24:54 AM:
> I've parsed an RSS feed from Twitter; it contains HTML entities such as
> &#246;.
> The problem is that if a word begins or ends with an entity, the
> whitespace before or after it gets stripped out.
> The parser seems to decode the entities correctly but simply removes
> too much whitespace.
>
> Is this a bug, or can I somehow tell the parser not to decode the
> HTML entities?
>
> Cheers
> Leon
Re: Parser removes whitespace before and after html entity, is this a bug?
Posted by Michael Ludwig <mi...@gmx.de>.
Leon Radley(digiPlant AB) wrote on 15.02.2010 at 12:24:54 (+0100):
> I've parsed an RSS feed from Twitter; it contains HTML entities
> such as &#246;
Okay, some nitpicking seems in order here: That's not an HTML entity,
but a numeric character reference. And "&auml;" isn't an HTML entity
either, but an entity reference; and "ä" (= &#228;) would be the entity
in this case.
--
Michael Ludwig