Posted to j-users@xerces.apache.org by Jacob Kjome <ho...@visi.com> on 2006/12/15 19:55:38 UTC

DOMNormalizer question

Based on something Michael Glavassevich said about validating an HTML 
document in memory using normalizeDocument() [1] (to get "id" 
attributes registered as type "ID", for optimized getElementById() 
lookup), I tried an experiment.  I parsed an HTML document using the 
Xerces DOMParser, providing it with the NekoHTML 
HTMLConfiguration.  First I tried validating against the HTML 4.01 
DTD, but since it's totally malformed (XHTML Basic 1.0 DTD is invalid 
and now this? who writes these flippin things????), I took the XHTML 
1.0 Strict DTD and changed all the elements to be declared in upper 
case (and removed "xmlns" and "xml:space" stuff) and obtained the 
local URL via a Catalog-based entity resolver.  I set the following 
parameters...

     config.setParameter("validate", Boolean.TRUE);
     config.setParameter("schema-type", javax.xml.XMLConstants.XML_DTD_NS_URI);
     config.setParameter("schema-location", url.toExternalForm());
     config.setParameter("namespaces", Boolean.FALSE);
     config.setParameter("well-formed", Boolean.FALSE);
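
As background for the "ID" remark above, here is a minimal sketch of why the attribute type matters for getElementById(). It uses the JDK's bundled JAXP parser on a tiny XML string rather than NekoHTML, and the class name IdTypeSketch and the sample markup are made up for illustration; nothing here is taken from the Xerces or NekoHTML sources.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Xerces or NekoHTML.
public class IdTypeSketch {

    // Parses an XML string with the default (non-validating) JAXP DOM parser.
    public static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        String body = "<root><item id=\"a\"/></root>";
        // No DTD: "id" is an ordinary attribute, so it is not registered as an ID.
        Document untyped = parse(body);
        // Internal DTD subset declares "id" on <item> as type ID.
        Document typed =
                parse("<!DOCTYPE root [<!ATTLIST item id ID #IMPLIED>]>" + body);

        System.out.println("without declaration: " + untyped.getElementById("a"));
        System.out.println("with ID declaration: " + typed.getElementById("a"));
    }
}
```

The idea behind validating in memory with normalizeDocument() is to get the same type information attached to a document that was built without it.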

It all loads up just fine, but fails because of a 
NullPointerException in HTMLElementImpl when calling 
getAttributeNodeNS() inside DOMNormalizer.startElement() (see line 1790)...

for (int i = 0; i < attrCount; i++) {
     attributes.getName(i, fAttrQName);
     Attr attr = null;

     attr = currentElement.getAttributeNodeNS(fAttrQName.uri, fAttrQName.localpart);
     ....
     ....
}

This is because HTMLElementImpl, on line 158, calls toLowerCase() on 
the localName...

return super.getAttributeNode( localName.toLowerCase(Locale.ENGLISH) );


The reason the localName is null in this case is that the "for" 
loop above loops over *all* possible attributes of the element 
without checking attributes.isSpecified(i) before calling 
getAttributeNodeNS().  If the attribute is not specified, of course 
it is going to be null, so why bother calling it?

I worked around this by modifying 
HTMLElementImpl.getAttributeNodeNS() to return null if the provided 
'localName' is null, avoiding the inevitable NullPointerException 
upon the toLowerCase() call.  The in-memory validation works after 
this change!  Yippie!
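
A minimal, self-contained sketch of the kind of null guard described above. This is not the actual HTMLElementImpl code: the class NullSafeLookupSketch and its map-backed storage are hypothetical stand-ins, used only to show the guard sitting in front of the toLowerCase() call.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an element that stores attribute names in lower case.
public class NullSafeLookupSketch {

    private final Map<String, String> attributes = new HashMap<>();

    // Stores the value under the lower-cased attribute name.
    public void setAttribute(String name, String value) {
        attributes.put(name.toLowerCase(Locale.ENGLISH), value);
    }

    // Returns the stored value, or null if the name is null or unknown.
    public String getAttribute(String localName) {
        if (localName == null) {
            // Guard: without this, toLowerCase() below would throw
            // a NullPointerException for a null name.
            return null;
        }
        return attributes.get(localName.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        NullSafeLookupSketch element = new NullSafeLookupSketch();
        element.setAttribute("ID", "main");
        System.out.println(element.getAttribute("Id"));
        System.out.println(element.getAttribute(null));
    }
}
```

The guard only hides the symptom; it says nothing about why a null name reaches the method in the first place.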

So, the question is, where should this properly be fixed?  I suppose it 
would be smart for HTMLElementImpl to check for null before 
attempting to lowercase the string, so maybe that should be patched 
regardless.  However, shouldn't the first line in the "for" loop of 
DOMNormalizer.startElement() be...

if (!attributes.isSpecified(i)) continue;

If the attribute isn't specified, why attempt to get the attribute 
node?  It's already known that it's going to be null, isn't 
it?  Wouldn't this even be a minor optimization?  Is there a good 
reason not to do this?


Jake


[1] http://issues.apache.org/jira/browse/XERCESJ-1200 


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org

Re: DOMNormalizer question

Posted by Jacob Kjome <ho...@visi.com>.

No matter what the root cause here is, I think it would still make sense to
check for null in the getAttribute*() methods in HTMLElementImpl.  No matter
what DOM normalization issues continue to exist, this simple change allows
normalization to succeed.  The rest of the issues can be addressed as they are
discovered.

So, the first line of these methods would be, essentially...

if (localName == null) return null;

Jake

Quoting Michael Glavassevich <mr...@ca.ibm.com>:

> Hi Jake,
>
> The code you found in DOMNormalizer is looping over the attributes in the
> document not all of the possible attributes in the DTD. If a defaulted
> attribute is missing from the DOM then there's probably a bug somewhere
> else in the class which wouldn't surprise me. Around this time last year
> [1] in memory DTD validation was completely broken. I spent a couple weeks
> fixing most of the major issues [2][3][4][5][6][7][8] but I didn't get
> through all of them and haven't found the time to clear up the rest.
>
> Thanks.
>
> [1] http://marc.theaimsgroup.com/?l=xerces-j-dev&m=113285279019052&w=2
> [2] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113333032523512&w=2
> [3] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113338115425840&w=2
> [4] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113402200124272&w=2
> [5] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113337500722384&w=2
> [6] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113389841006312&w=2
> [7] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113399680924552&w=2
> [8] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113330014128271&w=2
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org

Re: DOMNormalizer question

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Jake,

The code you found in DOMNormalizer is looping over the attributes in the 
document, not all of the possible attributes in the DTD. If a defaulted 
attribute is missing from the DOM then there's probably a bug somewhere 
else in the class, which wouldn't surprise me. Around this time last year 
[1] in-memory DTD validation was completely broken. I spent a couple of weeks 
fixing most of the major issues [2][3][4][5][6][7][8] but I didn't get 
through all of them and haven't found the time to clear up the rest.
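
To illustrate the distinction between specified and defaulted attributes in a DOM, here is a rough sketch using only the standard org.w3c.dom API and the JDK's bundled parser. The class name SpecifiedAttrSketch and the sample document are invented for this example; it does not reproduce the DOMNormalizer code path discussed in this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Xerces.
public class SpecifiedAttrSketch {

    // Sample document: "lang" has a DTD default, "title" is written in the markup.
    public static final String SAMPLE =
            "<!DOCTYPE root [<!ATTLIST root lang CDATA \"en\">]>"
            + "<root title=\"t\"/>";

    // Parses an XML string with the default (non-validating) JAXP DOM parser.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        NamedNodeMap attrs = parse(SAMPLE).getDocumentElement().getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            // getSpecified() is false for attributes supplied by a DTD default.
            System.out.println(attr.getName() + " specified=" + attr.getSpecified());
        }
    }
}
```

A defaulted attribute is still a node in the tree, so a loop over the document's attributes would be expected to find a node for each one, specified or not.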

Thanks.

[1] http://marc.theaimsgroup.com/?l=xerces-j-dev&m=113285279019052&w=2
[2] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113333032523512&w=2
[3] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113338115425840&w=2
[4] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113402200124272&w=2
[5] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113337500722384&w=2
[6] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113389841006312&w=2
[7] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113399680924552&w=2
[8] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113330014128271&w=2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
