Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2021/07/12 18:05:46 UTC

How to specify data with two fields, no delimiter, variable length?

Hi Folks,

I have a data field composed of two items.

The values for the first item can be enumerated:

	A
	ABC
	AB
	AC

The value of the second item is any integer from 0 to 999.

So, here is a sample data field:

	A250

How do I parse that using DFDL? I reckon I'm stuck.

/Roger

Re: How to specify data with two fields, no delimiter, variable length?

Posted by Steve Lawrence <sl...@apache.org>.
In cases like these, you need to use dfdl:lengthKind="pattern" and a
regular expression to define the length of the first item.

There are lots of different regexes, depending on what kinds of infosets
you want to allow.

For example, one approach for the first item is a very strict regex that
matches exactly one of the four values, e.g.

  <xs:element name="item" type="xs:string"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A" />

With this approach, the item will have a non-zero length if the data
starts with one of those four values. Otherwise the item will be the
empty string. If you don't want the empty string to be allowed, you need
to add an assert that the length is greater than zero. Also, note that
order in the regex matters: alternatives are tried left to right, so the
longest possibilities must come first. If A were listed first, the data
ABC250 would match only A.
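
As a rough sketch of the assert mentioned above: this assumes the fn
prefix is bound to http://www.w3.org/2005/xpath-functions and the dfdl
and xs prefixes are declared on the schema root. The message text is
only illustrative.

```xml
<xs:element name="item" type="xs:string"
  dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- Fail the parse when the pattern matched nothing, i.e. the
           data did not start with one of the four allowed values.
           Assumes xmlns:fn is declared on the schema root. -->
      <dfdl:assert test="{ fn:string-length(.) gt 0 }"
        message="item must be one of A, AB, AC, ABC" />
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```

With the assert in place, data such as XYZ250 should cause a parse error
rather than an empty item.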

On the other end of the spectrum, you could instead model the first item
to match as many non-digits as possible:

  <xs:element name="item" type="xs:string"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*" />

This will match any of the four allowed values, but will also match
anything else up to the first digit. So this could potentially produce
infosets with an item value of XYZ, for example. In some cases, you
might actually want this--we might consider the data to be "well-formed"
but not "valid". So you still get an infoset, it's just not "valid".
Whereas in the first case, you could only get a valid infoset.
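
One possible way to express "well-formed but not valid" with the
permissive pattern is to add enumeration facets to the element. This is
only a sketch and assumes validation is enabled when parsing. Without
validation, or an explicit dfdl:checkConstraints assert, the facets are
not checked during the parse.

```xml
<xs:element name="item"
  dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*">
  <xs:simpleType>
    <!-- The pattern decides how much data is consumed; the facets
         only decide whether the resulting value counts as valid. -->
    <xs:restriction base="xs:string">
      <xs:enumeration value="A" />
      <xs:enumeration value="AB" />
      <xs:enumeration value="AC" />
      <xs:enumeration value="ABC" />
    </xs:restriction>
  </xs:simpleType>
</xs:element>
```

Under those assumptions, an item value of XYZ would still appear in the
infoset but would be reported as a validation error.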

You'll probably also need to use a pattern length for the numeric item
if there's no delimiter after the number.

So putting it together, and using the second approach for both items,
you might do something like this:

  <xs:sequence>
    <xs:element name="item1" type="xs:string"
      dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*" />
    <xs:element name="item2" type="xs:int"
      dfdl:lengthKind="pattern" dfdl:lengthPattern="[0-9]*" />
  </xs:sequence>

So the first item is a string that parses as many non-digits as possible,
and the second is an int that parses as many digits as possible. Note
that this approach probably should have limits on the regex length in
case the data is bad/malformed. For example, if the data didn't contain
any digits, then item1 would just consume the entire data. So instead of
*, you might want to use a bounded quantifier like "{0,10}" in both
regexes.
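
For illustration, the bounded variant might look like the sketch below.
The limit of 10 is just the example figure from above; a tighter bound
could be chosen based on the longest values actually expected.

```xml
<xs:sequence>
  <!-- At most 10 non-digit characters for the first item -->
  <xs:element name="item1" type="xs:string"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]{0,10}" />
  <!-- At most 10 digits for the second item -->
  <xs:element name="item2" type="xs:int"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="[0-9]{0,10}" />
</xs:sequence>
```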

- Steve
