Posted to dev@daffodil.apache.org by Attila Horvath <at...@gmail.com> on 2021/10/18 17:54:08 UTC

Re: Fwd: FW: DFDL: potential problem

Given the recent discussion re: releases, is there any inclination to
implement a validation strategy for reconstituted data in a lossless
environment?

Thx - Attila

On 2021/09/17 20:11:54, "Beckerle, Mike" <mb...@owlcyberdefense.com> wrote: 
> Apologies for the tardy reply. I missed parts of this thread due to a spam email filter.
> 
> (I learned that MS Outlook 365 is misclassifying some Apache email as junk email.)
> 
> Here's the link to what is proposed for checksum calculations, and it has links to some mock-ups showing how this checksum/crc stuff is supposed to work.
> 
> https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Checksums%2C+CRC%2C+Parity+-+Layering+Enhancements
> 
> I do think this could be used to couple a generic hash into data that is verified at unparse.
> 
> 
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Monday, August 30, 2021 9:50 AM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Re: Fwd: FW: DFDL: potential problem
> 
> Interesting idea.
> 
> I was thinking you could do something like this once we have this new
> feature implemented:
> 
>   <xs:element name="FormatAndChecksum">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:sequence dfdlx:layer="checksum">
>           <xs:element ref="Format" />
>         </xs:sequence>
>         <xs:element name="checksum" type="xs:string"
>           dfdl:inputValueCalc="{ $checksum }" />
>         <xs:sequence>
>           <xs:annotation>
>             <xs:appinfo source="http://www.ogf.org/dfdl/">
>               <dfdl:assert test="{ ./checksum eq $checksum }" />
>             </xs:appinfo>
>           </xs:annotation>
>         </xs:sequence>
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
> 
> So we parse and checksum the entire data format, add the checksum to the
> infoset via input value calc, and then add an assert that the calculated
> checksum matches the value in the infoset.
> 
> On parse, these two should always be the same. But on unparse, it's
> possible they could be different and the assert would fail.
> Unfortunately, this doesn't actually work because asserts are not
> evaluated during unparse.
> 
> This seems like a reasonable use case for asserts during unparse, and I
> imagine there are others, so maybe that's a feature worth considering to
> allow this type of unparse validation.
> 
> 
> 
> 
> On 8/25/21 9:20 AM, Attila Horvath wrote:
> >
> > *Subject:* DFDL: potential problem
> >
> > ALCON
> >
> > re: idea for checksum calculations in DFDL
> > <https://lists.apache.org/thread.html/r85112d45e552a1f5b467406aeeee0f0a4bcaf143372b95c8e72f2669%40%3Cdev.daffodil.apache.org%3E>
> >
> > We may have a potential ‘situation’ as part of our DFDL/Daffodil offering as
> > follows…
> >
> > My DFDL schema development process consists of examining the exit codes of a
> > four (4) part mechanism:
> >
> >  1. DFDL parsing – “Houston, we have a go.”
> >  2. DFDL unparsing – “Houston, we have a go.”
> >  3. *End-to-end source/destination data comparison – “Houston, we have a problem.”*
> >  4. Intermediate xml validation against reconstituted data – “Houston, we have a
> >     go.”
> >
> > I have an *_unintentional_* error in my DFDL schema; unfortunately the
> > data/schema that created this situation is lost. Per above, both parse and
> > unparse execute successfully and xmllint validates Daffodil’s intermediate XML
> > file successfully against the reconstituted/unparsed data as well as against
> > the DFDL [erroneous] schema.
> >
> > However, the source and target data are *_NOT_* congruent. This is one
> > situation I did not anticipate.
> >
> > This means our model and incorporation of Daffodil leaves [albeit] a
> > /possibility/ of an erroneous DFDL schema that will ultimately send data
> > end-to-end, but because the two [gateway] ends do not communicate directly
> > w/ each other, there is no way for the destination gateway to verify whether
> > its data is identical w/ the data received by the source gateway.
> >
> > To address the above, and perhaps along the lines of 'checksum calculations'
> > re: the IPV4 element, what is the collective opinion of adding a SHASUM
> > capability to Daffodil, allowing the parser to optionally ("invisibly")
> > incorporate a SHASUM in the intermediate XML file so that the destination
> > unparser can validate the reconstituted data against the incorporated SHASUM?
> >
> > Perhaps a lame suggestion: could Daffodil optionally insert a comment tag while
> > parsing, identifying it as a Daffodil-inserted shasum comment, which the unparser
> > can identify and use to validate the reconstituted data?
> >
> > Thx in advance,
> >
> > v/r
> >
> > Attila
> >
> >
> 
>

Re: Fwd: FW: DFDL: potential problem

Posted by Mike Beckerle <mb...@apache.org>.

I think your idea of adding an sha256 or other kind of checksum is
implementable in Daffodil 3.2.0 using the new layering features which were
just extended to allow checksum verification and calculation.

It's a more complex calculation, but not unlike the checkDigits example in
daffodil-test.

Except the checkDigits example assumes small data. Your sha hashes might
be applied to larger data objects, and so would need to be implemented in
streaming fashion.

Still doable. That makes it more like the AISPayloadArmoring example, or the
base64 layer that's built into daffodil, except it needs to assign the sha
hash value to a DFDL variable once its value has been computed.
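
To illustrate what "streaming fashion" means here, a minimal sketch follows. It is written in Python only for brevity and is not Daffodil code; an actual layer would be JVM code in a plug-in jar. The function name sha256_stream and the chunk size are assumptions made up for this example. The hash is updated one chunk at a time, so the full data object never has to be held in memory, and the final value only becomes available once the whole stream has been consumed.

```python
import hashlib
import io


def sha256_stream(stream, chunk_size=64 * 1024):
    """Hash a binary stream chunk by chunk (hypothetical helper, not a Daffodil API)."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        digest.update(chunk)
    # Only at this point is the hash value known; this is the moment a layer
    # would assign it to a variable.
    return digest.hexdigest()


# The streamed result equals the one-shot hash over the same bytes.
data = b"example payload " * 100_000
streamed = sha256_stream(io.BytesIO(data), chunk_size=4096)
one_shot = hashlib.sha256(data).hexdigest()
print(streamed == one_shot)
```

The chunked and one-shot digests agree because SHA-256 is defined over the concatenated byte sequence, regardless of how the input is split across update calls.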

I think as of Daffodil 3.2.0 (not yet released, but "real soon now") this
can be a loadable plug-in jar, i.e., doesn't have to be part of daffodil at
all.

I do think if this became important enough to users, I would be happy to
see it added as a supported Daffodil extension (to DFDL). We're supporting
base64 layers, for example, so why not this?

I think the DFDL standard work group would perhaps be less friendly to this
idea as part of DFDL, as it is really quite application specific.

The pushback argument from there would be something like this:
* This is a transformation that has nothing to do with format and nothing
to do with the infoset.
* You're actually ignoring the format when doing an sha hash.
* Hence, why should dealing with this be a DFDL-processor responsibility?
* Such pre/post processors should in principle NOT be part of Daffodil for
proper separation of concerns.

This argument falls apart as soon as the sha hash is not of the entire data
object, but some element within it. That's what the layering extension is
for.
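
A small sketch of why hashing one element needs format knowledge, again in Python purely as an illustration. The record layout (4-byte header, 11-byte payload, 2-byte trailer), the sample bytes, and the helper name element_digest are all hypothetical. A generic pre/post processor can only hash the whole object; hashing just the payload requires knowing where the payload starts and ends, which is exactly what the schema describes.

```python
import hashlib


def element_digest(data: bytes, offset: int, length: int) -> str:
    """SHA-256 of one element's bytes, located by offset and length (hypothetical helper)."""
    return hashlib.sha256(data[offset:offset + length]).hexdigest()


# Hypothetical layout: 4-byte header, 11-byte payload, 2-byte trailer.
HEADER_LEN, PAYLOAD_LEN = 4, 11

source = b"HDR1" + b"hello world" + b"\r\n"
expected = element_digest(source, HEADER_LEN, PAYLOAD_LEN)

# Reconstituted data with the same payload but a differently encoded trailer.
rebuilt_same_payload = b"HDR1" + b"hello world" + b"\n\n"
# Reconstituted data whose payload was altered, e.g. by a schema error.
rebuilt_bad_payload = b"HDR1" + b"hello_world" + b"\r\n"

payload_ok = element_digest(rebuilt_same_payload, HEADER_LEN, PAYLOAD_LEN) == expected
payload_bad = element_digest(rebuilt_bad_payload, HEADER_LEN, PAYLOAD_LEN) == expected
whole_ok = hashlib.sha256(rebuilt_same_payload).hexdigest() == hashlib.sha256(source).hexdigest()

print(payload_ok, payload_bad, whole_ok)
```

In this made-up case the payload hash still matches when only the trailer changes, while a whole-object hash does not, so the two checks answer different questions.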




