Posted to dev@daffodil.apache.org by Mike Beckerle <mb...@apache.org> on 2022/04/14 18:27:23 UTC

idea for helping with "left over data error"

Please comment on this idea.

The problem is that users write a schema and get "left over data" when they
test it. The schema works. The schema is, as far as DFDL and Daffodil are
concerned, correct. It just doesn't express what you intended it to
express. It IS a correct schema, just not for your intended format.


I think Daffodil needs to save the "last failure" purely for the case where
there is left-over data. Daffodil happily ends the parse successfully
but reports that it did not consume all data.


In some applications where you are consuming messages from a network socket
which is a byte stream, this is 100% normal behavior (and no left-over-data
error would or should be issued).


In tests and anything that is "file format" oriented, left-over data is a
real error. So the fact that Daffodil/DFDL says the parse ended normally
without error isn't helping.


In DFDL, a variable-occurrences array, the number-of-occurrences of which
is determined by the data itself, always ends when a parse fails. So long
as maxOccurs has not been reached, the parse attempts another array
element; if that attempt fails, it *suppresses that error*, backs up to the
end of the prior array element (or to the start of the array if there are
no elements at all), *discards the failure information*, and then goes on
to parse "the rest of the schema", meaning the stuff after the array.
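
In rough Scala terms the loop looks like this (a sketch with hypothetical
names like ByteCursor and parseArray, not Daffodil's actual parser
internals):

    final class ByteCursor(val data: Array[Byte]) { var position: Int = 0 }

    case class ParseFailure(bytePos: Int, message: String)

    // Parse up to maxOccurs elements; on the first failure, restore the
    // position and stop. Note the failure value is dropped on the floor.
    def parseArray[A](in: ByteCursor, maxOccurs: Int)
                     (parseOne: ByteCursor => Either[ParseFailure, A]): Vector[A] = {
      val elems = Vector.newBuilder[A]
      var count = 0
      var done = false
      while (!done && count < maxOccurs) {
        val mark = in.position        // end of the prior array element
        parseOne(in) match {
          case Right(a) =>
            elems += a; count += 1
          case Left(_) =>             // the error is suppressed here...
            in.position = mark        // ...the parse backs up...
            done = true               // ...and the failure info is discarded
        }
      }
      elems.result()                  // "the rest of the schema" parses next
    }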


But what if nothing is after the array?


The "suppress the error" and "discard the failure" above,.... those are a
problem, because if the parse ends with left-over data, those are the "last
error before the parse ended", and those *may* be relevant to why all the
data was not consumed.


I think we need to preserve the failure information a bit longer than we
are.


So with that problem in mind here's a possible mechanism to provide better
diagnostics.


Maybe instead of deleting the failure info outright we put it on a queue
of depth N (shallow, like 1 or 2). As we put more failure info on that
queue, whatever it pushes out the other end is discarded, but at the end of
processing you can look back in the parser state, see what the last N
failures were, and hopefully find there the reason for the last array
ending early.


N could be set quite deep for debugging/schema-development, so you can look
back through it and see the backtracking decisions in reverse chronological
order as far as you need.
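
A minimal Scala sketch of that, again with hypothetical names rather than
any real Daffodil API:

    import scala.collection.mutable.ArrayDeque

    // ParseFailure as in the earlier sketch.
    final class RecentFailures(depth: Int) {
      private val buf = ArrayDeque.empty[ParseFailure]

      // Called where the backtracking code currently discards the failure.
      def record(f: ParseFailure): Unit = {
        buf.append(f)
        if (buf.size > depth) buf.removeHead() // oldest falls off the other end
      }

      // Consulted at the end of the parse when reporting left-over data.
      def lastN: Seq[ParseFailure] = buf.toSeq.reverse // most recent first
    }

The left-over-data diagnostic would then include lastN in its message.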


Comments? Variants? Alternatives?

Re: idea for helping with "left over data error"

Posted by Mike Beckerle <mb...@apache.org>.
These counterexamples are interesting.

For the one with a sequence of optional elements, that suggests to me that
perhaps we keep a stack of saved errors. Or, really, we hang them
temporarily on the Infoset data structure.

So the error from the elem2 attempt is kept around until another parse
later in that same sequence succeeds. If we're off the rails, then none of
them will succeed, and quite possibly it is the first saved error in the
sequence (the one from elem2) that matters.

This may have something to do with the whole "potentially trailing"
concept in DFDL. We want to keep the errors, at least temporarily, for
anything that is potentially trailing in a sequence, since such an error
might be indicative of why the sequence did NOT parse more content from
more sequence children.
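
A rough Scala sketch of what I mean, with hypothetical names:

    case class ParseFailure(bytePos: Int, message: String)

    // One of these would hang temporarily off the Infoset node for the
    // enclosing sequence.
    final class SavedSequenceErrors {
      private val saved = scala.collection.mutable.ListBuffer.empty[ParseFailure]

      def childFailed(f: ParseFailure): Unit = saved += f

      // A later sibling in the same sequence parsed successfully, so the
      // saved failures were just benign backtracking; drop them.
      def childSucceeded(): Unit = saved.clear()

      // If the parse ends with left-over data, the FIRST saved failure
      // (the one from elem2) is likely the one that matters.
      def diagnostics: List[ParseFailure] = saved.toList
    }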


Re: idea for helping with "left over data error"

Posted by Steve Lawrence <sl...@apache.org>.
That doesn't seem unreasonable to me, but here are some counterexamples
where I think this approach won't help; maybe something to consider:

1) Imagine a simple schema that parses a single byte with no points of
uncertainty, and the input data is two bytes. In this case, there
will be no parse errors to show the user since everything parsed exactly
as expected, yet there is still left over data. This change won't help
this case at all. But this is maybe trivial and pretty unlikely.

But more general, and maybe more common, is using the wrong length for
a field. This will quickly make things go off the rails, but will not
generate a parse error related to that length. And if we show any
following parse errors, they will only be misleading.

2) Say we have a schema like this:

   <element name="elem1" minOccurs="0" />
   <element name="elem2" minOccurs="0" />
   ...
   <element name="elemN" minOccurs="0" />

And say we fail to parse elem2 because our schema is broken. It's 
optional so we just continue on. And it's likely that everything after 
that is going to fail as well. No big deal, it's all optional. But this 
means we'll have parse errors for every elem after elem2. The one we 
actually care about is waaaay back at the beginning of the parse. But we
don't know that that's where things went off the rails. To make matters
worse, imagine that elem1 was actually not in the data. So we'd get a 
parse error for every element, and only the one for elem2 is actually 
useful. There's just no way we can know that and suggest it to the user.

And like the first case, showing additional parse errors might be
confusing and misleading. In this case, we'll get a slew of parse
errors that's going to be overwhelming. And if we show only the few most 
recent errors, the user will focus all their energy looking at why elemN 
or elemN - 1 are failing to parse, when really the issue happened waaaay 
back at elem2.

I imagine this kind of thing would be pretty common for these left over
data errors. Something fails early on that is the real error, but a 
bunch of optional/PoU things follow it and also fail which leads to left 
over data. And showing one or more parser errors may not help the user 
know which one to focus on, especially since not all parse errors 
signify a problem.


I wonder if improvements to the VS Code debugger would help the most?
With the issue of left over data, we do get an infoset. If we could
visually overlay that over the actual data in the debugger, it would
probably make it very clear where things start going wrong, and focus the
user on the right part of the schema.


Re: idea for helping with "left over data error"

Posted by Mike Beckerle <mb...@apache.org>.
Not unorthodox, Larry. I just used similar ideas in quite a big binary data
schema. I agree with you, though, that this adds too much complexity.

My schema has a mode controlled by a variable called
"captureUnrecognizedData". It defaults to false, but if set to true, then
data that fails to parse gets captured.

The capture happens at a few different places. E.g., if there is a valid
message header, but the message content is unrecognized, you get an
invalidMessage element with details about the unrecognized message ID. If
nothing at all can be recognized except the length field at the start of
the item, then you get an invalidHex element with some hex bytes
corresponding to the length. If even the length is meaningless (e.g., too
big), so you really have no idea what to do, it parses one byte and
creates an invalidByte element.
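
Roughly, the cascade has this shape (a Scala sketch with stub recognizers
standing in for the real format logic; the actual mechanism is DFDL choice
branches, not Scala):

    final class ByteCursor(val data: Array[Byte]) {
      var position: Int = 0
      def readByte(): Byte = { val b = data(position); position += 1; b }
    }

    sealed trait Item
    case class ValidMessage(id: Int) extends Item
    case class InvalidMessage(id: Int) extends Item      // valid header, unknown ID
    case class InvalidHex(bytes: Seq[Byte]) extends Item // only the length was usable
    case class InvalidByte(b: Byte) extends Item         // nothing recognizable at all

    // Stubs standing in for the real recognizers, which live in the schema.
    def tryValidMessage(in: ByteCursor): Option[Item] = None
    def tryInvalidMessage(in: ByteCursor): Option[Item] = None
    def tryInvalidHex(in: ByteCursor): Option[Item] = None

    def parseItem(in: ByteCursor, captureUnrecognizedData: Boolean): Item =
      tryValidMessage(in).getOrElse {
        if (!captureUnrecognizedData)
          sys.error(s"unrecognized data at byte ${in.position}") // the assert case
        tryInvalidMessage(in)                    // header ok, unknown message ID
          .orElse(tryInvalidHex(in))             // only the length field made sense
          .getOrElse(InvalidByte(in.readByte())) // give up: consume one byte
      }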

These "invalid" elements have facet constraints such that they can never be
valid. E.g., then have maxLength="0" when they're never constructed even
without at least 1 byte.

So with the capture mode on, you can feed line noise to this schema, and
parsing will give you a pile of invalid stuff out. Nothing makes it fail.

If you turn off captureUnrecognizedData, then each place that would have
constructed these invalid elements instead hits an assertion that
complains about the problem.

I think this does make the schema unreasonably complex. It's too heroic to
have to do this sort of built-in recovery. Also I'm really unhappy with the
backtracking caused by the assertions when the capture mode is off. The
schema ends up needing more discriminators in subtle locations and
debugging this is hard.

Much too heroic for real use.

The conclusion, I think, is that "making users write the diagnostic
behavior into their schemas" is not a good solution to this 'left over
data' thing.

RE: idea for helping with "left over data error"

Posted by Larry Barber <la...@nteligen.com>.
In some very unorthodox use of DFDL & Daffodil, I needed to ensure that I was able to get XML output even from files that contained extra data after the last piece of parsable data.
I accomplished this by adding a "DataBlob" element that consumed everything up to the end of the file:

    <xs:element name="DataBlob"  type="xs:hexBinary" dfdl:lengthKind="pattern" dfdl:lengthPattern="[\x00-\xFF]*?(?=\xFF+[\x01-\xFE])" dfdl:encoding="ISO-8859-1">
        <xs:annotation>
            <xs:appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:discriminator test="{ dfdl:valueLength(., 'bytes') gt 0 }" />
            </xs:appinfo>
        </xs:annotation>
    </xs:element>

As used in a modification of the sample JPEG schema, I added this element inside the xs:choice inside the "Segment" element:

                                <xs:choice>
                                    <xs:element ref="DataBlob" />
                                    <xs:group ref="Markers" />
                                    <xs:element ref="Datablob" />
                                </xs:choice>

I also made use of similar logic to ensure that a file parse would complete even if the given length of an element was larger than the actual length of data remaining in the file:

    <xs:element name="packet_truncated">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Datablob" 
                            type="xs:hexBinary" 
                            dfdl:lengthKind="pattern" 
                            dfdl:lengthPattern="[\x00-\xFF]+$" 
                            dfdl:encoding="ISO-8859-1"
                            dfdl:outputValueCalc="{xs:hexBinary('00')}" >
                    <xs:annotation>
                        <xs:appinfo source="http://www.ogf.org/dfdl/">
                            <dfdl:discriminator test="{ dfdl:valueLength(., 'bytes') gt 0 }" />
                        </xs:appinfo>
                    </xs:annotation>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

This required layering an xs:choice into each of the elements. As an example, here is the modified SOF element:

          <xs:complexType name="SOF">
 *          <xs:choice>
                <xs:sequence>
                    <xs:element name="Length" type="unsignedint16" dfdl:outputValueCalc="{ 6 + (3 * ../Number_of_Source_Image_Components_in_the_Frame) + 2}"/>
                    <xs:element name="Precision" type="unsignedint8"/>
                    <xs:element name="Number_of_Lines_in_Source_Image" type="unsignedint16"/>
                    <xs:element name="Number_of_Samples_per_Line" type="unsignedint16"/>
                    <xs:element name="Number_of_Source_Image_Components_in_the_Frame" type="unsignedint8" dfdl:outputValueCalc="{ fn:count(../Image_Components_in_Frame/Image_Component) }"/>
                    <xs:element name="Image_Components_in_Frame" dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes" dfdl:length="{3 * ../Number_of_Source_Image_Components_in_the_Frame}">
                        <xs:complexType>
                            <xs:sequence>
                                <xs:element name="Image_Component" maxOccurs="unbounded" dfdl:occursCountKind="implicit">
                                    <xs:complexType>
                                        <xs:sequence>
                                            <xs:element name="Component_Identifier" type="unsignedint8"/>
                                            <xs:element name="Horizontal_Sampling_Factor" type="unsignedint4"/>
                                            <xs:element name="Vertical_Sampling_Factor" type="unsignedint4"/>
                                            <xs:element name="Quantization_Table_Selector" type="unsignedint8"/>
                                        </xs:sequence>
                                    </xs:complexType>
                                </xs:element>
                            </xs:sequence>
                        </xs:complexType>
                    </xs:element>
                </xs:sequence>
  *            <xs:element ref="packet_truncated" />
  *        </xs:choice>
          </xs:complexType>

I have been able to use this technique to create a schema that allows Daffodil to cleanly exit and produce an output XML file with virtually any JPEG file - no matter how badly corrupted it is.
However, if Daffodil were modified to flag an error but still produce the parsed portion of the file, it would allow the schema to remain simpler and easier to read.

