Posted to dev@daffodil.apache.org by Joshua Adams <ja...@tresys.com> on 2017/11/08 13:07:21 UTC

Packed Decimal lengthKind="delimited"

I have been implementing support for the packed decimal, BCD, and Ibm4690Packed binary formats. While I believe I have implemented the parsers and unparsers for these correctly, I am running into an issue getting the IBM4690-TLOG schema project running.  Both the ACE and SA schemas make use of lengthKind="delimited" with ':' as the separator.  Currently there is no support for delimited binary data in the codebase, as we did not have support for these packed formats and lengthKind="delimited" is only allowed on packed formats according to the spec.

So, I'm guessing I will need to add a binary delimited parser in order to handle this data, as I am assuming that the TextDelimitedParser will not work with binary data?  I just want to verify that I am headed in the right direction before committing a bunch of time to implementing a new delimited parser.

Thanks,

Josh

Re: Packed Decimal lengthKind="delimited"

Posted by Joshua Adams <ja...@tresys.com>.

Thanks for the clarification.  I saw the response from the work group and will move forward with this approach for now, documenting bug tickets and code where appropriate.

Josh

________________________________
From: Mike Beckerle <mb...@tresys.com>
Sent: Wednesday, November 8, 2017 10:47:28 AM
To: dev@daffodil.apache.org
Subject: Re: Packed Decimal lengthKind="delimited"

I sent a request for clarifications to the DFDL workgroup to get the other participants to weigh in.

But I think the answer is going to be that delimited binary is as general as delimited text, and that things like escape schemes are all required because they are not prohibited.

This means we really want to leverage the existing delimiter scanning code.

DFDL actually requires a "byte level scanner". Daffodil currently implements a text character scanner.

DFDL allows one to specify things like

dfdl:terminator="%#rFF;" dfdl:encoding="utf-8"

Which means the terminator is actually byte FF, which isn't a legal character code in utf-8. The FF would screw up the utf-8 decoder/encoder. If you read FF with the UTF-8 decoder, you will either get an error or the unicode replacement character depending on dfdl:encodingErrorPolicy.
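
To make the decoder problem concrete, here is a minimal Python sketch (Python is used only for illustration; it is not Daffodil code). The strict and "replace" error handlers stand in, by analogy only, for the two behaviours that dfdl:encodingErrorPolicy selects between.

```python
# Byte 0xFF never occurs in well-formed UTF-8, so no decoder can map it
# to a character.
raw = b"\xff"

try:
    raw.decode("utf-8")  # strict decoding: comparable to an "error" policy
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# Lenient decoding substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")

print(strict_result)        # error
print(hex(ord(replaced)))   # 0xfffd
```

Either way the scanner never sees a character it could match against the terminator, which is why a character-level scanner cannot honour this terminator.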

But given the above, fundamentally DFDL requires a byte-level scanner.  You can't implement DFDL fully with delimiter scanning consuming the characters from a charset decoder.

Now, I don't think we should rewrite the scanner in Daffodil to fix this. When it gets rewritten for performance or other reasons, that would be the time to improve it to work at the byte level.

Honestly I don't think implementing a byte-level scanner adds any value for the DFDL user community.
Right now, getting TLOG to work, which uses delimited packed decimal, is the driving use case.

So, the technique I suggest is called "reduction to iso-8859-1". That is, a "binary delimited parser" is implemented by way of a "text delimited parser" using encoding iso-8859-1 under the covers.

In this encoding, every byte is a valid single-byte-wide character code. The correspondence to unicode code points is exact. I.e., the byte 0xF3 found in the data becomes unicode character U+00F3, which is "ó" (aka LATIN SMALL LETTER O WITH ACUTE)
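
A quick way to convince yourself of this property is to round-trip all 256 byte values. This is a small Python sketch for illustration only, using the standard "iso-8859-1" codec:

```python
import unicodedata

# Every byte value 0x00..0xFF decodes to the code point with the same number,
# and encoding the resulting string gives back the original bytes.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")

code_points = [ord(ch) for ch in text]
round_trip = text.encode("iso-8859-1")

# The example from the message: byte 0xF3 becomes U+00F3.
example = b"\xf3".decode("iso-8859-1")

print(code_points == list(range(256)))  # True
print(round_trip == all_bytes)          # True
print(hex(ord(example)))                # 0xf3
print(unicodedata.name(example))        # LATIN SMALL LETTER O WITH ACUTE
```

Because the mapping is lossless in both directions, a text scanner running over iso-8859-1 characters is effectively scanning bytes.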

(Btw: I highly recommend this simple utf-8 tool: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi)

You must translate the delimiters from whatever charset encoding they are specified in, to bytes - which are then the iso-8859-1 character codes one is searching for.

For example: if dfdl:encoding="ebcdic-cp-1" dfdl:terminator="$" we must translate the $ from ebcdic to get byte 5B, and then determine the iso-8859-1 character corresponding to 5B which is "[".
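
The translation step can be sketched in a few lines of Python. The function name delimiter_to_latin1 is hypothetical, and Python's "cp037" codec is assumed here as a stand-in for the EBCDIC charset named above; the one-byte-per-character check is an extra assumption of this sketch.

```python
def delimiter_to_latin1(delimiter: str, encoding: str) -> str:
    """Re-express a delimiter as the iso-8859-1 characters of its byte form."""
    # Step 1: delimiter characters -> bytes in the schema's declared encoding.
    raw = delimiter.encode(encoding)
    # This sketch only handles delimiters that encode to one byte per character.
    if len(raw) != len(delimiter):
        raise ValueError("delimiter is not single-byte in " + encoding)
    # Step 2: those bytes -> the iso-8859-1 characters the text scanner looks for.
    return raw.decode("iso-8859-1")


ebcdic_bytes = "$".encode("cp037")
translated = delimiter_to_latin1("$", "cp037")

print(ebcdic_bytes.hex())  # 5b
print(translated)          # [
```

For an ascii-derived encoding the delimiter comes back unchanged, e.g. delimiter_to_latin1(":", "ascii") returns ":".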

Then we artificially, in the implementation, change the encoding to iso-8859-1, the terminator to "[", and textPadKind and textTrimKind to 'none'.

Once we have isolated the iso-8859-1 string, we can convert to bytes and then interpret it as packed or hexBinary data.
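
As a rough end-to-end illustration of that last step, the Python sketch below splits data on a ':' separator through an iso-8859-1 string and then decodes each field as packed decimal. The helper names are hypothetical, str.split stands in for the real delimiter scanner, and the sign-nibble convention (B and D negative, other values from A to F positive) is the common IBM one, assumed here rather than taken from the TLOG schemas.

```python
from typing import List


def split_delimited(data: bytes, separator: str) -> List[bytes]:
    """Scan as iso-8859-1 text, then turn each isolated string back into bytes."""
    text = data.decode("iso-8859-1")
    return [field.encode("iso-8859-1") for field in text.split(separator)]


def unpack_decimal(raw: bytes) -> int:
    """Decode packed decimal: two digit nibbles per byte, sign in the last nibble."""
    if not raw:
        raise ValueError("empty packed decimal field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    if any(d > 9 for d in digits) or sign < 0x0A:
        raise ValueError("invalid packed decimal nibble")
    value = int("".join(str(d) for d in digits) or "0")
    return -value if sign in (0x0B, 0x0D) else value


# Two packed fields, 123 (0x12 0x3C) and -45 (0x04 0x5D), separated by ':'.
data = b"\x12\x3c:\x04\x5d"
fields = split_delimited(data, ":")
values = [unpack_decimal(f) for f in fields]

print(values)  # [123, -45]
```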

I recommend these restrictions to make the implementation as easy as possible for now.

In Daffodil delimited binary should require:

1) delimiters must not contain character class entities
2) all delimiter characters must have single-byte representations in the specified charset encoding
3) dfdl:encoding must not be a runtime expression and must be a byte-aligned encoding (not 7 bit, 6 bit, etc.).
3a) To ensure reasonable diagnostic messages, dfdl:encoding must be a single-byte-wide encoding, and ascii-derived - practically speaking this means Daffodil would allow only us-ascii and iso-8859-1 encodings.
4) escape schemes must not be specified (dfdl:escapeSchemeRef="" or no definition in scope)
5) delimited binary elements must be byte aligned. (Cannot begin on a 4-bit boundary in the middle of a byte)
6) No support for raw byte value entities, i.e., %#rHH; notation.
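
Several of these restrictions can be checked statically from the property values. The Python sketch below is a hypothetical checker, not Daffodil's schema compiler: the function name, the regular expressions, and the message strings are all assumptions. It covers restrictions 1, 2, 3a, 4 and 6; runtime-expression detection (3) and alignment (5) are left out.

```python
import re
from typing import List

ALLOWED_ENCODINGS = {"us-ascii", "iso-8859-1"}            # restriction 3a
CHAR_CLASS_ENTITY = re.compile(r"%(NL|WSP[*+]?|ES);")      # restriction 1
RAW_BYTE_ENTITY = re.compile(r"%#r[0-9A-Fa-f]{2};")        # restriction 6


def check_delimited_binary(encoding: str, delimiter: str,
                           has_escape_scheme: bool = False) -> List[str]:
    """Return a list of restriction violations (empty if none are found)."""
    problems = []
    if CHAR_CLASS_ENTITY.search(delimiter):
        problems.append("delimiter contains a character class entity")
    if RAW_BYTE_ENTITY.search(delimiter):
        problems.append("delimiter contains a raw byte entity")
    if encoding.lower() not in ALLOWED_ENCODINGS:
        problems.append("encoding must be us-ascii or iso-8859-1")
    else:
        try:
            # Restriction 2: each character needs a single-byte representation.
            delimiter.encode(encoding)
        except UnicodeEncodeError:
            problems.append("delimiter character has no single-byte representation")
    if has_escape_scheme:
        problems.append("escape schemes are not supported")    # restriction 4
    return problems


print(check_delimited_binary("us-ascii", ":"))           # []
print(len(check_delimited_binary("utf-8", "%#rFF;")))    # 2
```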

I'd be completely happy with separate JIRA tickets addressing enhancing the implementation to lift any of these restrictions (some certainly exist), but I wouldn't even create them as yet. I'd just write down these restrictions for our Daffodil-specific documentation - our release notes, and in code comments.
