
Posted to j-users@xerces.apache.org by "Leon Radley(digiPlant AB)" <le...@digiplant.se> on 2010/02/15 12:24:54 UTC

Parser removes whitespace before and after html entity, is this a bug?

I've parsed an RSS feed from Twitter; it contains HTML entities such as &#246;.
The problem I have is that if a word ends or begins with an entity, the whitespace before or after it gets stripped out.
The parser seems to decode the entities correctly but simply removes too much whitespace.

Is this a bug, or can I somehow tell the parser to not decode the html entities?


Cheers
Leon

Re: Parser removes whitespace before and after html entity, is this a bug?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

There is a lot that you haven't said about your application but the most
likely reason for white space getting stripped is attribute value
normalization [1]. XML parsers are required to normalize attribute values
before passing them to the application. You cannot turn this process off.
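
To make the normalization rules concrete, here is a minimal sketch using the JAXP DOM API. The class name, the sample document and its attribute names are made up for illustration, and it assumes the parser behind JAXP is Xerces or the JDK's built-in derivative of it. It compares a CDATA attribute, an attribute declared as NMTOKENS in the internal subset, and plain element content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Xerces.
public class AttrNormalizationDemo {

    // Made-up sample: the same "f&#246;r<newline>&#246;l" string appears once
    // as an attribute value and once as element content.
    static final String SAMPLE =
        "<!DOCTYPE item [<!ATTLIST item tags NMTOKENS #IMPLIED>]>"
        + "<item title=\"f&#246;r\n&#246;l\" tags=\"  a   b  \">"
        + "f&#246;r\n&#246;l"
        + "</item>";

    // Parse an XML string with the default JAXP parser and return the root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element item = parseRoot(SAMPLE);

        // CDATA attribute (the default type): the literal newline is
        // replaced by a single space, nothing else is removed.
        System.out.println("[" + item.getAttribute("title") + "]");

        // NMTOKENS attribute: leading and trailing spaces are dropped and
        // runs of spaces collapse to one.
        System.out.println("[" + item.getAttribute("tags") + "]");

        // Element content is not subject to attribute value normalization,
        // so the newline between the two words is kept.
        System.out.println("[" + item.getTextContent() + "]");
    }
}
```

If the text in question comes from element content (as it usually does in an RSS item) rather than from an attribute, attribute value normalization would not explain the missing white space, and the cause is more likely elsewhere in the application.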

Thanks.

[1] http://www.w3.org/TR/2006/REC-xml-20060816/#AVNormalize

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Leon Radley <le...@digiplant.se> wrote on 02/15/2010 06:24:54 AM:

> I've parsed an RSS feed from Twitter; it contains HTML entities such as
> &#246;
> The problem I have is that if a word ends or begins with an entity,
> the whitespace before or after it gets stripped out.
> The parser seems to decode the entities correctly but simply removes
> too much whitespace.
>
> Is this a bug, or can I somehow tell the parser to not decode the
> html entities?
>
> Cheers
> Leon

Re: Parser removes whitespace before and after html entity, is this a bug?

Posted by Michael Ludwig <mi...@gmx.de>.

Leon Radley(digiPlant AB) schrieb am 15.02.2010 um 12:24:54 (+0100):
> I've parsed an RSS feed from Twitter; it contains HTML entities
> such as &#246;

Okay, some nitpicking seems in order here: That's not an HTML entity,
but a numeric character reference. And "&auml;" isn't an HTML entity
either, but an entity reference; and "ä" (= &#228;) would be the entity
in this case.
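
The distinction matters to an XML parser, as the following sketch tries to show. The class and method names are hypothetical, and it assumes the default JAXP parser (Xerces or the JDK's built-in derivative). A numeric character reference always resolves, while an entity reference such as "&auml;" only resolves if the document declares that entity; XML itself predefines just lt, gt, amp, apos and quot.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of Xerces.
public class ReferenceKindsDemo {

    // Parse the string and return the text content of the root element.
    static String textOf(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
    }

    // True if the parser accepts the document, false if it reports a
    // parse error (the default error handler may also log it to stderr).
    static boolean parses(String xml) {
        try {
            textOf(xml);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Decimal and hexadecimal character references for U+00E4.
        System.out.println("[" + textOf("<t>a &#228; b</t>") + "]");
        System.out.println("[" + textOf("<t>a &#xE4; b</t>") + "]");

        // An undeclared entity reference is a well-formedness error here.
        System.out.println(parses("<t>a &auml; b</t>"));

        // Declaring the entity in the internal subset makes the reference legal.
        String declared =
            "<!DOCTYPE t [<!ENTITY auml \"&#228;\">]><t>a &auml; b</t>";
        System.out.println("[" + textOf(declared) + "]");
    }
}
```

In each accepted case the spaces on either side of the reference are part of the element content and come back unchanged.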

-- 
Michael Ludwig
