You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2021/11/03 13:43:48 UTC

Parse error: no forward progress <-- what does that mean???

Hi Folks,

My input consists of a series of greetings on new lines. Each greeting contains a label, whitespace, name, newline. Here is a sample input:

Hello Roger
Greeting Sally

The label is either Hello or Greeting
The name is lower and uppercase letters

Below is my DFDL schema. When I run it I get the (unhelpful) error message "Parse error: no forward progress" (what does that mean?)

My schema uses regexes to specify the tokens (label, whitespace, name, newline). I know that there are other ways to design the schema, but I want to do it this way. Can you tell me why I am getting that error message, and how to fix it, please? Extra credit if you can tell me how to get rid of those two dfdl:lengthKind="delimited" attributes.

<xs:element name="greetings" dfdl:lengthKind="delimited">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="greeting" maxOccurs="unbounded" dfdl:lengthKind="delimited" dfdl:occursCountKind="implicit">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="label" type="xs:string"
                            dfdl:lengthPattern="Hello|Greeting"/>
                        <xs:element name="whitespace" type="xs:string"
                            dfdl:lengthPattern="[ \t]+"/>
                        <xs:element name="name" type="xs:string"
                            dfdl:lengthPattern="[a-zA-Z]+"/>
                        <xs:element name="newline" type="xs:string"
                            dfdl:lengthPattern="\n"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

Here is the complete schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
    
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:format
                textBidi="no"
                separatorSuppressionPolicy="trailingEmpty"
                floating="no"
                encodingErrorPolicy="replace"
                outputNewLine="%CR;%LF;"
                leadingSkip="0" 
                trailingSkip="0"
                alignment="1" 
                alignmentUnits="bytes"
                textPadKind="none"
                textTrimKind="none" 
                truncateSpecifiedLengthString="no"
                escapeSchemeRef=""
                representation="text"
                encoding="ASCII"
                separator = ""
                initiator = ""
                terminator = ""
                ignoreCase = "yes"
                sequenceKind="ordered"
                initiatedContent="no"
                fillByte="%SP;"
                lengthUnits="characters"
                lengthKind="pattern"
            />
        </xs:appinfo>
    </xs:annotation>
    
    <xs:element name="greetings" dfdl:lengthKind="delimited">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="greeting" maxOccurs="unbounded" dfdl:lengthKind="delimited" dfdl:occursCountKind="implicit">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="label" type="xs:string"
                                dfdl:lengthPattern="Hello|Greeting"/>
                            <xs:element name="whitespace" type="xs:string"
                                dfdl:lengthPattern="[ \t]+"/>
                            <xs:element name="name" type="xs:string"
                                dfdl:lengthPattern="[a-zA-Z]+"/>
                            <xs:element name="newline" type="xs:string"
                                dfdl:lengthPattern="\n"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    
</xs:schema>

Re: Parse error: no forward progress <-- what does that mean???

Posted by Mike Beckerle <mb...@apache.org>.

I'll add this suggestion to Steve's answer.

Define yourself a simple type named nzpString (for "non-zero pattern-length
string")

<simpleType name="nzpString" dfdl:lengthKind='pattern'>
     <xs:annotation>
       <xs:appinfo source="http://www.ogf.org/dfdl/">
         <dfdl:assert test='{ . ne "" }' />
       </xs:appinfo>
     </xs:annotation>
  <restriction base="xs:string"/>
<simpleType>

Then exclusively use nzpString instead of string with lengthKind pattern.
This discipline will get the lengthKind pattern to work the way you want,
which is that a failure to match the pattern is a parse error that causes
backtracking (or failure).

I think we need to fix the daffodil error message about no forward progress
be sure to say:
1) the error is about the array element
2) the array element that parsed successfully consumed zero data
3) since the array is unbounded maxOccurs, you are in an infinite loop, as
any further attempt to parse will make no forward progress through the
data.

I will create a JIRA ticket for this.

This technique you are using, where you are modeling even the separators,
and terminators as DFDL elements, has a name. The technique is called
"modeling syntax as data".

There are many times when it is appropriate, or necessary. For example, if
there is more than one allowed  separator (as in your case where any
mixture of one or more tabs or spaces are allowed for the 'whitespace'
element) and the specific separator actually conveys meaning, rather than
being arbitrary.

As an example, suppose you are designing your format because you are
worried that the data contains covert data embedded in the 'whitespace '
element as steganography. A schema that uses a delimiter with %WSP+; will
happily match, and discard that whitespace from the infoset. That is, the
specific whitespace will not be part of the parse infoset.

That is fine for canonicalizing the data via parse-unparse, but it won't
give you any ability to analyze the whitespace in an application that is
trying to not only prevent steganography, but understand it, perhaps to try
to understand its source.

By modeling syntax as data, you can have your 'whitespace' element match
one or more tab/space characters, but add a facet constraint that the
maxLength is say, 10 characters. Then any amount of tabs/apaces is
well-formed, but only if the length is less than 11 characters would the
data be considered to be valid.

If you further add dfdl:outputValueCalc='{ " " }' to your whitespace
element, then on unparse the data would be standardized to a single space,
no matter what it was in the input data.

You then have achieved
(a) any amount of whitespace is well-formed, and so can be parsed and fed
to a forensics system to study the possible steganography
(b) any amount of whitespace greater than 10 characters will be invalid so
if the application doing the parse insists on valid data it is defended
from longer, possibly nefarious, whitespace. This leaves only a tiny threat
of a covert channel using these whitespace elements spread across many
lines of data.
(c) if you do an unparse, the whitespace is standardized to 1 space,
preventing any use of even the short 10-char whitespaces for steganography.

Modeling syntax as data is overkill for some applications. But in data
forensics, it is really required.

-mikeb



On Wed, Nov 3, 2021 at 9:28 AM Steve Lawrence <sl...@apache.org> wrote:

> I think the core thing to understand what's going on is that when a
> lengthPattern fails to match anything, it is not considered an error--it
> is considered a zero length element.
>
> So say we had a line like this:
>
>    "Howdy Joe"
>
> The "Hello|Greeting" regex will not match, so the length of the label
> element is considered zero, i.e.:
>
>    <label></label>
>
> We don't consume any data, and there is no error--we will continue
> parsing the next element.
>
> The whitespace regex too wouldn't match (the next character is an 'H',
> not whitespace), and so we would get another zero length element:
>
>    <whitespace></whitespace>
>
> Again without consuming data.
>
> We would then try the name regex, and that regex would match "Howdy", so
> we would get:
>
>    <name>Howdy</name>
>
> Which is clearly wrong. Things have really gone off the rails.
>
> I could continue, but I think you see the point and how this no-match
> behavior of lengthKind="pattern" can lead to issues.
>
> So it's really important to be very careful when using lengthKind
> pattern, and to consider what happens if the pattern doesn't match--it's
> not an error like most people expect.
>
>
> As to how this causes the "no forward progress" error: What's going on
> is Daffodil is successfully parsing those two Hello and Greeting lines
> as you would expect, and has reached the end of the data. But because
> maxOccurs="unbounded", Daffodil is still going to try to parse more
> greetings. But we've reach the end of data, so none of those regexes are
> going to match. An since none of those are errors, all the elements will
> just be zero length. So after we parse all the data, we end up parsing
> another <greeting> element that is completely empty, e.g.:
>
>    <greeting>
>      <label></label>
>      <whitespace></whitespace>
>      <name></name>
>      <newline></newline>
>    </greeting>
>
> So we were able to successfully parse a complete greeting element, but
> consumed no actual data. And in theory, because the <greeting> element
> is an unbounded array, we could just keep doing this over and over again
> forever. But that means we're stuck in an infinite loop of parsing
> greeting elements, but consuming no data. To avoid this infinite loop,
> Daffodil detects this and errors with a message about "no forward
> progress".
>
> The solution for cases like these is usually to add assertions to each
> element requiring them to not have zero length. So if the regex doesn't
> match, then we create an error. For example:
>
>    <xs:element name="label" type="xs:string"
> dfdl:lengthPattern="Hello|Greeting">
>      <xs:annotation>
>        <xs:appinfo source="http://www.ogf.org/dfdl/">
>          <dfdl:assert test="{ . ne '' }" />
>        </xs:appinfo>
>      </xs:annotation>
>    </xs:element>
>
> By adding this assertion to each element, a failure to match the regex
> will create a parse error rather than consuming zero data, so it parsing
> zero length data errors and has the normal backtracking behavior.
>
>
> I want those extra credit points, so to your question about removing the
> dfdl:lengthKind="delimited", one suggestion would be to make the default
> lengthKind="implicit" in the dfdl:format. Then your <greetings> and
> <greeting> elements will have implicit length (i.e. their length will be
> the length of their children) and you won't need to specify the
> lengthKind attribute for those elements. Then create a special simple
> type for pattern length strings, e.g.
>
>    <xs:simpleType name="patternString" dfdl:lengthKind="pattern">
>      <xs:restriction base="xs:string" />
>    </xs:simpleType>
>
> And then use this type for your string elements that need to be pattern
> length, e.g.:
>
>    <xs:element name="foo" type="patternString" dfdl:lengthPattern="...">
>
> So any element that is a patternString type will already have
> lengthKind="pattern" attribute set, and you only need to specify the
> lengthPattern attribute.
>
> - Steve
>
>
>
>
> On 11/3/21 9:43 AM, Roger L Costello wrote:
> > Hi Folks,
> >
> > My input consists of a series of greetings on new lines. Each greeting
> contains a label, whitespace, name, newline. Here is a sample input:
> >
> > Hello Roger
> > Greeting Sally
> >
> > The label is either Hello or Greeting
> > The name is lower and uppercase letters
> >
> > Below is my DFDL schema. When I run it I get the (unhelpful) error
> message "Parse error: no forward progress" (what does that mean?)
> >
> > My schema uses regexes to specify the tokens (label, whitespace, name,
> newline). I know that there are other ways to design the schema, but I want
> to do it this way. Can you tell me why I am getting that error message, and
> how to fix it, please? Extra credit if you can tell me how to get rid of
> those two dfdl:lengthKind="delimited" attributes.
> >
> > <xs:element name="greetings" dfdl:lengthKind="delimited">
> >      <xs:complexType>
> >          <xs:sequence>
> >              <xs:element name="greeting" maxOccurs="unbounded"
> dfdl:lengthKind="delimited" dfdl:occursCountKind="implicit">
> >                  <xs:complexType>
> >                      <xs:sequence>
> >                          <xs:element name="label" type="xs:string"
> >                              dfdl:lengthPattern="Hello|Greeting"/>
> >                          <xs:element name="whitespace" type="xs:string"
> >                              dfdl:lengthPattern="[ \t]+"/>
> >                          <xs:element name="name" type="xs:string"
> >                              dfdl:lengthPattern="[a-zA-Z]+"/>
> >                          <xs:element name="newline" type="xs:string"
> >                              dfdl:lengthPattern="\n"/>
> >                      </xs:sequence>
> >                  </xs:complexType>
> >              </xs:element>
> >          </xs:sequence>
> >      </xs:complexType>
> > </xs:element>
> >
> > Here is the complete schema:
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
> >      xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
> >
> >      <xs:annotation>
> >          <xs:appinfo source="http://www.ogf.org/dfdl/">
> >              <dfdl:format
> >                  textBidi="no"
> >                  separatorSuppressionPolicy="trailingEmpty"
> >                  floating="no"
> >                  encodingErrorPolicy="replace"
> >                  outputNewLine="%CR;%LF;"
> >                  leadingSkip="0"
> >                  trailingSkip="0"
> >                  alignment="1"
> >                  alignmentUnits="bytes"
> >                  textPadKind="none"
> >                  textTrimKind="none"
> >                  truncateSpecifiedLengthString="no"
> >                  escapeSchemeRef=""
> >                  representation="text"
> >                  encoding="ASCII"
> >                  separator = ""
> >                  initiator = ""
> >                  terminator = ""
> >                  ignoreCase = "yes"
> >                  sequenceKind="ordered"
> >                  initiatedContent="no"
> >                  fillByte="%SP;"
> >                  lengthUnits="characters"
> >                  lengthKind="pattern"
> >              />
> >          </xs:appinfo>
> >      </xs:annotation>
> >
> >      <xs:element name="greetings" dfdl:lengthKind="delimited">
> >          <xs:complexType>
> >              <xs:sequence>
> >                  <xs:element name="greeting" maxOccurs="unbounded"
> dfdl:lengthKind="delimited" dfdl:occursCountKind="implicit">
> >                      <xs:complexType>
> >                          <xs:sequence>
> >                              <xs:element name="label" type="xs:string"
> >                                  dfdl:lengthPattern="Hello|Greeting"/>
> >                              <xs:element name="whitespace"
> type="xs:string"
> >                                  dfdl:lengthPattern="[ \t]+"/>
> >                              <xs:element name="name" type="xs:string"
> >                                  dfdl:lengthPattern="[a-zA-Z]+"/>
> >                              <xs:element name="newline" type="xs:string"
> >                                  dfdl:lengthPattern="\n"/>
> >                          </xs:sequence>
> >                      </xs:complexType>
> >                  </xs:element>
> >              </xs:sequence>
> >          </xs:complexType>
> >      </xs:element>
> >
> > </xs:schema>
> >
>
>

Re: Parse error: no forward progress <-- what does that mean???

Posted by Steve Lawrence <sl...@apache.org>.

I think the core thing to understand what's going on is that when a 
lengthPattern fails to match anything, it is not considered an error--it 
is considered a zero length element.

So say we had a line like this:

   "Howdy Joe"

The "Hello|Greeting" regex will not match, so the length of the label 
element is considered zero, i.e.:

   <label></label>

We don't consume any data, and there is no error--we will continue 
parsing the next element.

The whitespace regex too wouldn't match (the next character is an 'H', 
not whitespace), and so we would get another zero length element:

   <whitespace></whitespace>

Again without consuming data.

We would then try the name regex, and that regex would match "Howdy", so 
we would get:

   <name>Howdy</name>

Which is clearly wrong. Things have really gone off the rails.

I could continue, but I think you see the point and how this no-match 
behavior of lengthKind="pattern" can lead to issues.

So it's really important to be very careful when using lengthKind 
pattern, and to consider what happens if the pattern doesn't match--it's 
not an error like most people expect.

As to how this causes the "no forward progress" error: What's going on 
is Daffodil is successfully parsing those two Hello and Greeting lines 
as you would expect, and has reached the end of the data. But because 
maxOccurs="unbounded", Daffodil is still going to try to parse more 
greetings. But we've reach the end of data, so none of those regexes are 
going to match. An since none of those are errors, all the elements will 
just be zero length. So after we parse all the data, we end up parsing 
another <greeting> element that is completely empty, e.g.:

   <greeting>
     <label></label>
     <whitespace></whitespace>
     <name></name>
     <newline></newline>
   </greeting>

So we were able to successfully parse a complete greeting element, but 
consumed no actual data. And in theory, because the <greeting> element 
is an unbounded array, we could just keep doing this over and over again 
forever. But that means we're stuck in an infinite loop of parsing 
greeting elements, but consuming no data. To avoid this infinite loop, 
Daffodil detects this and errors with a message about "no forward progress".

The solution for cases like these is usually to add assertions to each 
element requiring them to not have zero length. So if the regex doesn't 
match, then we create an error. For example:

   <xs:element name="label" type="xs:string" 
dfdl:lengthPattern="Hello|Greeting">
     <xs:annotation>
       <xs:appinfo source="http://www.ogf.org/dfdl/">
         <dfdl:assert test="{ . ne '' }" />
       </xs:appinfo>
     </xs:annotation>
   </xs:element>

By adding this assertion to each element, a failure to match the regex 
will create a parse error rather than consuming zero data, so it parsing 
zero length data errors and has the normal backtracking behavior.

I want those extra credit points, so to your question about removing the 
dfdl:lengthKind="delimited", one suggestion would be to make the default 
lengthKind="implicit" in the dfdl:format. Then your <greetings> and 
<greeting> elements will have implicit length (i.e. their length will be 
the length of their children) and you won't need to specify the 
lengthKind attribute for those elements. Then create a special simple 
type for pattern length strings, e.g.

   <xs:simpleType name="patternString" dfdl:lengthKind="pattern">
     <xs:restriction base="xs:string" />
   </xs:simpleType>

And then use this type for your string elements that need to be pattern 
length, e.g.:

   <xs:element name="foo" type="patternString" dfdl:lengthPattern="...">

So any element that is a patternString type will already have 
lengthKind="pattern" attribute set, and you only need to specify the 
lengthPattern attribute.

- Steve

On 11/3/21 9:43 AM, Roger L Costello wrote:
> Hi Folks,
> 
> My input consists of a series of greetings on new lines. Each greeting contains a label, whitespace, name, newline. Here is a sample input:
> 
> Hello Roger
> Greeting Sally
> 
> The label is either Hello or Greeting
> The name is lower and uppercase letters
> 
> Below is my DFDL schema. When I run it I get the (unhelpful) error message "Parse error: no forward progress" (what does that mean?)
> 
> My schema uses regexes to specify the tokens (label, whitespace, name, newline). I know that there are other ways to design the schema, but I want to do it this way. Can you tell me why I am getting that error message, and how to fix it, please? Extra credit if you can tell me how to get rid of those two dfdl:lengthKind="delimited" attributes.
> 
> <xs:element name="greetings" dfdl:lengthKind="delimited">
>      <xs:complexType>
>          <xs:sequence>
>              <xs:element name="greeting" maxOccurs="unbounded" dfdl:lengthKind="delimited" dfdl:occursCountKind="implicit">
>                  <xs:complexType>
>                      <xs:sequence>
>                          <xs:element name="label" type="xs:string"
>                              dfdl:lengthPattern="Hello|Greeting"/>
>                          <xs:element name="whitespace" type="xs:string"
>                              dfdl:lengthPattern="[ \t]+"/>
>                          <xs:element name="name" type="xs:string"
>                              dfdl:lengthPattern="[a-zA-Z]+"/>
>                          <xs:element name="newline" type="xs:string"
>                              dfdl:lengthPattern="\n"/>
>                      </xs:sequence>
>                  </xs:complexType>
>              </xs:element>
>          </xs:sequence>
>      </xs:complexType>
> </xs:element>
> 
> Here is the complete schema:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>      xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>      
>      <xs:annotation>
>          <xs:appinfo source="http://www.ogf.org/dfdl/">
>              <dfdl:format
>                  textBidi="no"
>                  separatorSuppressionPolicy="trailingEmpty"
>                  floating="no"
>                  encodingErrorPolicy="replace"
>                  outputNewLine="%CR;%LF;"
>                  leadingSkip="0"
>                  trailingSkip="0"
>                  alignment="1"
>                  alignmentUnits="bytes"
>                  textPadKind="none"
>                  textTrimKind="none"
>                  truncateSpecifiedLengthString="no"
>                  escapeSchemeRef=""
>                  representation="text"
>                  encoding="ASCII"
>                  separator = ""
>                  initiator = ""
>                  terminator = ""
>                  ignoreCase = "yes"
>                  sequenceKind="ordered"
>                  initiatedContent="no"
>                  fillByte="%SP;"
>                  lengthUnits="characters"
>                  lengthKind="pattern"
>              />
>          </xs:appinfo>
>      </xs:annotation>
>      
>      <xs:element name="greetings" dfdl:lengthKind="delimited">
>          <xs:complexType>
>              <xs:sequence>
>                  <xs:element name="greeting" maxOccurs="unbounded" dfdl:lengthKind="delimited" dfdl:occursCountKind="implicit">
>                      <xs:complexType>
>                          <xs:sequence>
>                              <xs:element name="label" type="xs:string"
>                                  dfdl:lengthPattern="Hello|Greeting"/>
>                              <xs:element name="whitespace" type="xs:string"
>                                  dfdl:lengthPattern="[ \t]+"/>
>                              <xs:element name="name" type="xs:string"
>                                  dfdl:lengthPattern="[a-zA-Z]+"/>
>                              <xs:element name="newline" type="xs:string"
>                                  dfdl:lengthPattern="\n"/>
>                          </xs:sequence>
>                      </xs:complexType>
>                  </xs:element>
>              </xs:sequence>
>          </xs:complexType>
>      </xs:element>
>      
> </xs:schema>
>