Posted to commits@daffodil.apache.org by "Mike Beckerle (Jira)" <ji...@apache.org> on 2022/08/16 00:53:00 UTC

[jira] [Created] (DAFFODIL-2722) Add new dfdl:lengthKind 'dfdlx:patternMatch'

Mike Beckerle created DAFFODIL-2722:
---------------------------------------

             Summary: Add new dfdl:lengthKind 'dfdlx:patternMatch'
                 Key: DAFFODIL-2722
                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2722
             Project: Daffodil
          Issue Type: New Feature
          Components: Back End, Diagnostics, Front End
    Affects Versions: 3.3.0
            Reporter: Mike Beckerle


I've now run into this problem many times: with lengthKind 'pattern', a failed match is not reported as an error. It just silently yields a length of 0.

I've finally run out of patience with it. 

Consider the idiom used in MIL-STD-2045 and other related standards for variable-length strings with a maximum length. These use a convention where, if the full maximum length is used, no terminator character follows. But if fewer characters than the maximum are used, a DEL character serves as the terminator.

So, consider a zero-length string. This appears in the data stream as just a DEL character.

The standard idiom for a length 20 string would be this:

  
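{code:xml}
<!-- Length is up to 19 non-DEL characters followed by a DEL (lookahead only, not consumed), or exactly 20 characters -->
<xs:element name="value" dfdl:lengthKind="pattern" dfdl:lengthPattern="[^\x7F]{0,19}(?=\x7F)|.{20}">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:maxLength value="20"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
<!-- No terminator (%ES;) when all 20 characters are used, otherwise a DEL -->
<xs:sequence dfdl:terminator="{if (fn:string-length(./value) eq 20) then '%ES;' else '%DEL;'}"/>{code}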
 

Now consider what happens if this is encountered near end of file, where no DEL is found and fewer than 20 characters remain. The data is short.

However, DFDL gives us no way to tell the difference between this and the situation where the data stream did in fact contain just a DEL to terminate a zero-length string.

In both cases we get a successful parse of the element named 'value'. 
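
To make the ambiguity concrete, here is a minimal sketch in Python. It is not Daffodil code: match_length and length_kind_pattern are hypothetical helper names, and length_kind_pattern only models the "no match becomes length 0" behavior described above. Python's re module is assumed to treat this particular regex the same way the DFDL regex engine does.

```python
import re

# Same regex as the dfdl:lengthPattern above: up to 19 non-DEL characters
# followed by a DEL (lookahead, not consumed), or exactly 20 characters.
LENGTH_PATTERN = re.compile(r"[^\x7F]{0,19}(?=\x7F)|.{20}")


def match_length(data):
    """Return the matched length, or None when the pattern does not match."""
    m = LENGTH_PATTERN.match(data)
    return None if m is None else len(m.group(0))


def length_kind_pattern(data):
    """Hypothetical model of the behavior in this issue: no-match becomes 0."""
    n = match_length(data)
    return 0 if n is None else n


zero_length = "\x7F"  # DEL alone: a genuine zero-length string
short_data = "abc"    # end of data: no DEL and fewer than 20 characters

# The regex engine itself can tell the two cases apart ...
print(match_length(zero_length), match_length(short_data))
# ... but once no-match is mapped to length 0, they look identical.
print(length_kind_pattern(zero_length), length_kind_pattern(short_data))
```

The distinction exists at the regex level (a zero-length match versus no match at all). It is only lost when the no-match result is folded into a length of 0.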

However, in the short data case, the terminator will then not be found and a parse error will be issued indicating terminator not found.

This is ok, but really we would get a better diagnostic if the element did not even pattern match successfully because we found no DEL nor 20 characters. 

When you look at the alternatives to improve this, one thing comes to mind:

We add a sequence at the start of the group that carries a dfdl:assert with testKind 'pattern' to detect whether enough data is present to parse the field.

This works, but it matches the same regex TWICE. The first match exists purely so we can tell the no-match case apart from the zero-length match case.

It works, but feels very heroic, as in way too complex. 
{code:xml}
<!-- First match of the regex: fails the parse with a clear message when neither alternative matches -->
<xs:sequence>
  <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
    <dfdl:assert testKind='pattern'
       message="String not found. Neither DEL terminator, nor 20 characters could be parsed."
       testPattern="[^\x7F]{0,19}(?=\x7F)|.{20}"/>
  </xs:appinfo></xs:annotation>
</xs:sequence>
<!-- Second match of the same regex: determines the length of the element -->
<xs:element name="value" dfdl:lengthKind="pattern" dfdl:lengthPattern="[^\x7F]{0,19}(?=\x7F)|.{20}">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:maxLength value="20"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
<xs:sequence dfdl:terminator="{if (fn:string-length(./value) eq 20) then '%ES;' else '%DEL;'}"/>{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)