Posted to c-users@xerces.apache.org by Ben Griffin <be...@redsnapper.net> on 2011/05/06 15:02:48 UTC

A xercesc api access for a digest ?

Within any of the DOM/etc. frameworks that Xerces-C implements, is there a digest of a DOMDocument available, or will I have to write the document out and then digest it myself?
Primarily, I am looking for a way to identify whether a particular DOMDocument is the same as another, as part of a rapid-access hashmap - so I need something that is fast.
Typically, there will be no more than a few hundred hashmap insertions, of which 80% will be insertion clashes (duplicate documents), but there will be hundreds of thousands of finds.
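
As a rough illustration of the lookup structure described above, here is a minimal C++ sketch. It is not Xerces-C API. It assumes each document has already been serialized to a byte string in one consistent encoding. The names fnv1a64 and DocumentStore are hypothetical, and FNV-1a merely stands in for whatever digest is actually used. Because a 64-bit hash can collide, the sketch compares the full serialized bytes before treating two documents as the same.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// 64-bit FNV-1a over a byte string (a stand-in for the real digest).
std::uint64_t fnv1a64(const std::string& bytes) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Hypothetical store: digest -> ids of stored documents with that digest.
class DocumentStore {
public:
    // Returns the id of an identical stored document, or stores this one
    // and returns its new id.
    std::size_t intern(const std::string& serialized) {
        auto& bucket = index_[fnv1a64(serialized)];
        for (std::size_t id : bucket) {
            if (docs_[id] == serialized) return id;  // true duplicate
        }
        docs_.push_back(serialized);
        bucket.push_back(docs_.size() - 1);
        return docs_.size() - 1;
    }

    // Number of distinct documents stored so far.
    std::size_t size() const { return docs_.size(); }

private:
    std::vector<std::string> docs_;
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> index_;
};
```

Calling intern twice with the same serialized text returns the same id, so duplicate insertions collapse onto the distinct documents, and a find typically costs one hash plus one byte comparison.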

So, my current implementation involves digesting each hashmap candidate, which entails having to write it out.  (This is necessary to ensure that the encoding is consistent - the sources use inconsistent encodings, and they cannot be preprocessed, as some of them are available via e.g. URLs.)
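
To show why the encoding has to be made consistent before digesting, here is a small sketch. It uses a hypothetical helper, latin1ToUtf8, that is not part of Xerces-C. The same text stored as ISO-8859-1 and as UTF-8 is a different byte sequence, so any byte-level digest of the two would differ until both are brought to a single encoding. In practice the serializer would be expected to do this transcoding, since the DOM holds text as XMLCh internally. The helper below only illustrates the effect.

```cpp
#include <string>

// Transcode ISO-8859-1 bytes to UTF-8. Latin-1 maps byte-for-byte onto
// U+0000..U+00FF, so each input byte becomes one or two UTF-8 bytes.
std::string latin1ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                  // ASCII: unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

For example, "caf\xE9" (Latin-1) and "caf\xC3\xA9" (UTF-8) compare unequal as raw bytes, but the first becomes byte-identical to the second after latin1ToUtf8.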

Re: A xercesc api access for a digest ?

Posted by Ben Griffin <be...@redsnapper.net>.

Jesse, thanks for that.

"Write it out" :- Yes, I meant serialize.  
For our purposes, the problems of canonicalization should already be sorted, as we have done some normalisation of our DOMDocuments,
so I will probably stay with the current approach. However, it's a shame (IMO) that the DOM API cannot return some sort of signature :D

On 6 May 2011, at 14:17, Jesse Pelton wrote:

> XML Digital Signature requires a rigorous solution to the
> canonicalization problem in order to make hashing work.  (See
> http://www.w3.org/TR/2008/REC-xmldsig-core-20080610/ and
> http://www.w3.org/TR/2001/REC-xml-c14n-20010315.)  One implementation is
> Apache Santuario (http://santuario.apache.org/cindex.html).  It might be
> useful.
> 
> If you decide to do your own thing, it's worth reviewing the DSig spec
> to make sure you handle all the cases.
> 
> You'll need to do some sort of serialization in order to do a hash.
> "Write it out" sounds like you mean to write to disk, which is not
> necessary.

RE: A xercesc api access for a digest ?

Posted by Jesse Pelton <js...@PKC.com>.

XML Digital Signature requires a rigorous solution to the
canonicalization problem in order to make hashing work.  (See
http://www.w3.org/TR/2008/REC-xmldsig-core-20080610/ and
http://www.w3.org/TR/2001/REC-xml-c14n-20010315.)  One implementation is
Apache Santuario (http://santuario.apache.org/cindex.html).  It might be
useful.

If you decide to do your own thing, it's worth reviewing the DSig spec
to make sure you handle all the cases.

You'll need to do some sort of serialization in order to do a hash.
"Write it out" sounds like you mean to write to disk, which is not
necessary.
