Posted to j-users@xerces.apache.org by Jacob Kjome <ho...@visi.com> on 2006/12/15 19:55:38 UTC
DOMNormalizer question
Based on something Michael Glavassevich said about validating an HTML
document in memory using normalizeDocument() [1] (to get "id"
attributes registered as type "ID", for optimized getElementById()
lookup), I tried an experiment. I parsed an HTML document using the
Xerces DOMParser, providing it with the NekoHTML
HTMLConfiguration. First I tried validating against the HTML 4.01
DTD, but an XML parser can't read it since it is an SGML DTD rather than
an XML one (XHTML Basic 1.0 DTD is invalid and now this? who writes
these flippin things????). So I took the XHTML 1.0 Strict DTD, changed
all the element declarations to upper case (and removed the "xmlns" and
"xml:space" stuff), and obtained its local URL via a Catalog-based
entity resolver. I set the following parameters...
config.setParameter("validate", Boolean.TRUE);
config.setParameter("schema-type", javax.xml.XMLConstants.XML_DTD_NS_URI);
config.setParameter("schema-location", url.toExternalForm());
config.setParameter("namespaces", Boolean.FALSE);
config.setParameter("well-formed", Boolean.FALSE);
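As background on the goal stated above (having "id" attributes typed as
ID so that getElementById() can find them), here is a minimal sketch
using only the standard DOM Level 3 API in the JDK. It does not use DTD
validation or normalizeDocument() at all: it flags the attribute by hand
with Element.setIdAttribute(), which is a different route to the same
end state. The class name and the sample markup are made up for
illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class IdLookupSketch {

    // Parses a small document from a string (sample input only).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p id=\"intro\">Hello</p></body></html>");
        Element p = (Element) doc.getElementsByTagName("p").item(0);

        // No DTD or schema is involved, so "id" is an ordinary attribute
        // here and is not registered as an ID.
        System.out.println("before: " + doc.getElementById("intro"));

        // Flag the attribute as an ID by hand (DOM Level 3 Core).
        p.setIdAttribute("id", true);
        System.out.println("found after flagging: "
                + (doc.getElementById("intro") == p));
    }
}
```

Flagging attributes one element at a time does not scale to a whole
document, which is presumably why validating against a DTD in memory is
the more attractive option here.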
With those parameters set, the document loads just fine, but
normalization fails with a NullPointerException in HTMLElementImpl when
DOMNormalizer.startElement() calls getAttributeNodeNS() (see line 1790)...
for (int i = 0; i < attrCount; i++) {
    attributes.getName(i, fAttrQName);
    Attr attr = null;
    attr = currentElement.getAttributeNodeNS(fAttrQName.uri,
                                             fAttrQName.localpart);
    ....
    ....
}
This is because HTMLElementImpl, on line 158, calls toLowerCase() on
the localName...
return super.getAttributeNode( localName.toLowerCase(Locale.ENGLISH) );
The reason the localName is null in this case is that the "for"
loop above iterates over *all* possible attributes of the element
without checking attributes.isSpecified(i) before calling
getAttributeNodeNS(). If the attribute is not specified, of course
it is going to be null, so why bother calling it?
I worked around this by modifying
HTMLElementImpl.getAttributeNodeNS() to return null if the provided
'localName' is null, avoiding the inevitable NullPointerException
upon the toLowerCase() call. The in-memory validation works after
this change! Yippie!
So, the question is, where should this properly be fixed? I suppose
HTMLElementImpl ought to check for null before lower-casing the
string anyway, so maybe that should be patched regardless. However,
shouldn't the first line in the "for" loop of
DOMNormalizer.startElement() be...
if (!attributes.isSpecified(i)) continue;
If the attribute isn't specified, why attempt to get the attribute
node? It's already known that it's going to be null, isn't
it? Wouldn't this even be a minor optimization? Is there a good
reason not to do this?
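To make the proposed guard concrete, here is a small sketch of skipping
unspecified (defaulted) attributes. It uses the standard DOM method
Attr.getSpecified() on a parsed document rather than the XNI
attributes.isSpecified(i) call inside DOMNormalizer, so it illustrates
the idea only and is not a patch to Xerces. The class name, method name,
and sample document are assumptions for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

class SpecifiedAttributesSketch {

    // Returns the names of the root element's attributes that were written
    // in the document itself, skipping those supplied as DTD defaults.
    static List<String> specifiedAttributeNames(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }

        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            if (!attr.getSpecified()) {
                continue; // defaulted from the DTD, not present in the source
            }
            names.add(attr.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // The internal DTD subset gives "class" a default value.
        String xml = "<!DOCTYPE p [\n"
                + "<!ELEMENT p (#PCDATA)>\n"
                + "<!ATTLIST p id ID #IMPLIED class CDATA \"plain\">\n"
                + "]>\n"
                + "<p id=\"intro\">Hello</p>";
        System.out.println(specifiedAttributeNames(xml));
    }
}
```

With the JDK's default parser, the defaulted "class" attribute is
present in the DOM but reports getSpecified() == false, so only "id"
should survive the filter.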
Jake
[1] http://issues.apache.org/jira/browse/XERCESJ-1200
---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org
Re: DOMNormalizer question
Posted by Jacob Kjome <ho...@visi.com>.
Whatever the root cause turns out to be, I think it would still make sense to
check for null in the getAttribute*() methods of HTMLElementImpl. Regardless of
which DOM normalization issues remain, this simple change allows normalization
to succeed. The rest of the issues can be addressed as they are discovered.
So, the first line of these methods would be, essentially...
if (localName == null) return null;
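A minimal sketch of that guard, using a hypothetical stand-in class
rather than the real HTMLElementImpl: names are lower-cased before
lookup, and a null name short-circuits to null instead of throwing a
NullPointerException. The class, its map-backed storage, and the method
names are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CaseInsensitiveAttributes {

    // Attribute values keyed by lower-cased name.
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) {
        values.put(name.toLowerCase(Locale.ENGLISH), value);
    }

    // Returns the value stored under the lower-cased name, or null when
    // the name is null or no such attribute exists.
    String get(String localName) {
        if (localName == null) {
            return null; // guard: nothing to lower-case, nothing to find
        }
        return values.get(localName.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        CaseInsensitiveAttributes attrs = new CaseInsensitiveAttributes();
        attrs.set("id", "intro");
        System.out.println(attrs.get("ID"));   // case-insensitive hit
        System.out.println(attrs.get(null));   // no exception thrown
    }
}
```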
Jake
Re: DOMNormalizer question
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Jake,
The code you found in DOMNormalizer is looping over the attributes in the
document, not all of the possible attributes in the DTD. If a defaulted
attribute is missing from the DOM then there's probably a bug somewhere
else in the class, which wouldn't surprise me. Around this time last year
[1], in-memory DTD validation was completely broken. I spent a couple of weeks
fixing most of the major issues [2][3][4][5][6][7][8], but I didn't get
through all of them and haven't found the time to clear up the rest.
Thanks.
[1] http://marc.theaimsgroup.com/?l=xerces-j-dev&m=113285279019052&w=2
[2] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113333032523512&w=2
[3] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113338115425840&w=2
[4] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113402200124272&w=2
[5] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113337500722384&w=2
[6] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113389841006312&w=2
[7] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113399680924552&w=2
[8] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113330014128271&w=2
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org