Posted to dev@daffodil.apache.org by Mike Beckerle <mb...@tresys.com> on 2018/10/03 21:15:51 UTC

how hard is encodingErrorPolicy='error' to implement?

Turns out IBM DFDL implements only encodingErrorPolicy='error', and Daffodil only encodingErrorPolicy='replace'.


That means for any data where there are encoding errors the two implementations will not behave the same.

For compatibility testing, this will be problematic.


The I/O layer was recently revised for parsing to use our own decoders.


Not sure anything changed about encoders.


How hard would it be to implement parse-time encodingErrorPolicy='error' in Daffodil, so that a decode error simply raises a parse error?


I know that for unparsing, if we're using Java encoders, implementing encodingErrorPolicy='error' just requires initializing all encoders with malformed-input and unmappable-character error handlers that throw. Catching that throw and converting it to an unparse error is all that is required. This has little or no performance impact, as unparse errors are fatal.
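
A minimal sketch of the unparse-side idea above, assuming plain java.nio encoders. The class name, the UnparseError type, and the helper methods are hypothetical placeholders for illustration, not Daffodil's actual classes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncodeSketch {
    // Hypothetical stand-in for an unparse error.
    static class UnparseError extends RuntimeException {
        UnparseError(String msg, Throwable cause) { super(msg, cause); }
    }

    // Configure the encoder to report (throw) instead of substituting
    // a replacement for malformed or unmappable characters.
    static CharsetEncoder strictEncoder(Charset cs) {
        return cs.newEncoder()
                 .onMalformedInput(CodingErrorAction.REPORT)
                 .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    // Encode the whole string, converting any coding exception
    // into the (hypothetical) unparse error.
    static byte[] encodeOrFail(CharsetEncoder enc, String s) {
        try {
            ByteBuffer bb = enc.encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new UnparseError("encoding error while unparsing", e);
        }
    }

    public static void main(String[] args) {
        CharsetEncoder ascii = strictEncoder(StandardCharsets.US_ASCII);
        System.out.println(encodeOrFail(ascii, "abc").length);
        try {
            encodeOrFail(ascii, "caf\u00e9"); // U+00E9 is not mappable in US-ASCII
        } catch (UnparseError e) {
            System.out.println("unparse error: " + e.getCause());
        }
    }
}
```

Since the error is fatal for the unparse, the try/catch sits outside any hot loop in this sketch, which is consistent with the point about performance.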



Re: how hard is encodingErrorPolicy='error' to implement?

Posted by Mike Beckerle <mb...@tresys.com>.
Yeah, I recall the reason we do replace is that one must not error on lookahead decoding.

What I had done in the old code, and I think I had unit tests for this, is that the old code was careful to return short on a decode error, i.e., you just got delivered the characters up to the error, as if no error had occurred. So the caller has to consume all the characters up to the error and then call again, insisting on more data, before the error is thrown rather than masked as just being part of lookahead. That was the intent, anyway. The notion here is that an I/O call that asks for N bytes of data always has to be prepared to accept fewer than N on return, and in the case of decode errors and encodingErrorPolicy="error", return fewer than N (but at least one) up to the point of the error, and mask the error. If the I/O layer cannot return even one non-error character, then propagate the error as a decode error.
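
The short-read behavior described above could look roughly like the following sketch. It uses a standard java.nio decoder rather than Daffodil's own decoders; the class, the DecodeError type, and decodeUpTo are hypothetical names, and resetting the decoder on every call is a simplification that only suits stateless charsets such as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ShortReadDecodeSketch {
    // Hypothetical stand-in for a decode error surfaced to the caller.
    static class DecodeError extends RuntimeException {
        DecodeError(String msg) { super(msg); }
    }

    static CharsetDecoder strictUtf8() {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    // Decode up to maxChars characters (maxChars >= 1). If a decode error is
    // hit after at least one good character, return the good characters and
    // mask the error. Only when not even one character can be delivered is
    // the error propagated.
    static String decodeUpTo(CharsetDecoder dec, ByteBuffer in, int maxChars) {
        CharBuffer out = CharBuffer.allocate(maxChars);
        dec.reset();
        CoderResult r = dec.decode(in, out, true);
        if (r.isError() && out.position() == 0) {
            throw new DecodeError("decode error at byte offset " + in.position());
        }
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', 'b', then a byte that is never valid in UTF-8, then 'c'.
        ByteBuffer in = ByteBuffer.wrap(new byte[] {'a', 'b', (byte) 0xFF, 'c'});
        CharsetDecoder dec = strictUtf8();
        System.out.println(decodeUpTo(dec, in, 10)); // short read: "ab"
        try {
            decodeUpTo(dec, in, 10); // caller insists on more data
        } catch (DecodeError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The first call returns short without complaint; the error only surfaces on the second call, when the caller has consumed everything before the bad byte and asks for more.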

The DFDL spec isn't specific enough here about the exact requirement. It doesn't make clear that one must not issue spurious errors due to pre-fetching and pre-decoding that happens to run past the end of the text, and so encounters binary data and spurious decode errors. I've sent email to the DFDL workgroup for clarification of this.

It does specify that for asserts/discriminators with testKind='pattern', the regular expression can result in scanning for characters, and that decode errors can occur and are handled as per encodingErrorPolicy, so a regex for such an assert/discriminator must be designed with decode errors in mind. However, if the resulting pattern match is much shorter than what was buffered and pre-decoded (for efficiency reasons), the DFDL spec is again unclear about whether an encoding error should be issued or not.

For other parsing situations, the DFDL spec is specific: if lengthUnits='bytes' and there aren't enough bytes to hold the representation of a character, then on parse the bytes are skipped, and on unparse they're filled with the fillByte. For lengthUnits='characters', such fragments of characters are errors subject to encodingErrorPolicy.
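
As a rough illustration of the lengthUnits='bytes' case only, here is a sketch using java.nio with a fixed-width charset. All names are hypothetical, and the unparse side silently drops characters that do not fit, which is a simplification for the example and not a statement about what the DFDL spec requires.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteLengthFragmentSketch {
    // Parse: decode as many whole characters as the field holds. Passing
    // endOfInput=false means a trailing partial character is left undecoded
    // (underflow) instead of being reported, so those bytes are just skipped.
    static String parseField(byte[] field, Charset cs) throws CharacterCodingException {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        int cap = (int) Math.ceil(field.length * (double) dec.maxCharsPerByte());
        CharBuffer out = CharBuffer.allocate(cap);
        CoderResult r = dec.decode(ByteBuffer.wrap(field), out, false);
        if (r.isError()) r.throwException();
        out.flip();
        return out.toString();
    }

    // Unparse: pre-fill the field with fillByte, then encode over it. Any
    // leftover bytes too small to hold a character keep the fill value.
    static byte[] unparseField(String s, Charset cs, int lengthInBytes, byte fillByte)
            throws CharacterCodingException {
        byte[] field = new byte[lengthInBytes];
        Arrays.fill(field, fillByte);
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CoderResult r = enc.encode(CharBuffer.wrap(s), ByteBuffer.wrap(field), true);
        if (r.isError()) r.throwException();
        return field;
    }

    public static void main(String[] args) throws CharacterCodingException {
        Charset cs = StandardCharsets.UTF_16BE; // 2 bytes per BMP character, no BOM
        // 5-byte field: two whole characters plus a 1-byte fragment.
        System.out.println(parseField(new byte[] {0, 'a', 0, 'b', 0}, cs));
        System.out.println(Arrays.toString(unparseField("abc", cs, 5, (byte) 0x20)));
    }
}
```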



________________________________
From: Steve Lawrence <st...@gmail.com>
Sent: Monday, October 8, 2018 7:26:28 AM
To: dev@daffodil.apache.org; Mike Beckerle
Subject: Re: how hard is encodingErrorPolicy='error' to implement?

BitsCharsetDecoder.scala has a section for handling decode errors, but
has logic commented out and replaced with a NotYetImplemented assertion.
I think it's just a matter of having this section throw an encoding
exception and the caller code handling it appropriately.

There are five callers in InputSourceDataInputStream of the decode()
method, so I suspect those will all need to be updated to handle the
exception and do the right thing, which might just be to let it
bubble up to the parsers.

However, I think there are some subtleties that make decoder scanning
more difficult. For example, delimiter scanning performs lookahead which
I don't think should immediately cause a parser error. I think it should
only cause a parse error when an invalid character is actually read. So
the InputSourceDataInputStreamCharIterator logic probably becomes a bit
more complex to handle lookahead decode errors correctly. I haven't put
too much thought into this though.
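
One way to picture the lookahead point above is an iterator whose peek tolerates decode errors while next does not. This is only a sketch built on java.nio, not the actual InputSourceDataInputStreamCharIterator; all names are hypothetical, it handles BMP characters only, and peek deliberately does not distinguish end of data from a decode error.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;

class LookaheadCharIteratorSketch {
    // Hypothetical stand-in for the error that would become a parse error.
    static class DecodeError extends RuntimeException {
        DecodeError(String msg) { super(msg); }
    }

    private final ByteBuffer in;
    private final CharBuffer one = CharBuffer.allocate(1);
    private final CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    LookaheadCharIteratorSketch(byte[] data) { this.in = ByteBuffer.wrap(data); }

    // Lookahead: returns the next character without consuming it, or -1 if
    // there is none (end of data or undecodable bytes). Never throws.
    int peek() {
        int saved = in.position();
        try {
            return decodeOne(false);
        } finally {
            in.position(saved);
        }
    }

    // Actual read: consumes one character; a decode error surfaces here.
    char next() {
        int c = decodeOne(true);
        if (c < 0) throw new NoSuchElementException("end of data");
        return (char) c;
    }

    private int decodeOne(boolean failOnError) {
        one.clear();
        dec.reset();
        CoderResult r = dec.decode(in, one, true);
        if (one.position() == 1) return one.get(0);
        if (r.isError() && failOnError) {
            throw new DecodeError("decode error at byte offset " + in.position());
        }
        return -1;
    }

    // Delimiter scan: stops at the delimiter without consuming it, so bytes
    // after the delimiter are never decoded by the scan itself.
    static String scanUntil(LookaheadCharIteratorSketch it, char delim) {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = it.peek()) >= 0 && c != delim) sb.append(it.next());
        return sb.toString();
    }

    public static void main(String[] args) {
        // Text "ab," followed by a byte that is not valid UTF-8.
        LookaheadCharIteratorSketch it =
                new LookaheadCharIteratorSketch(new byte[] {'a', 'b', ',', (byte) 0xFF});
        System.out.println(scanUntil(it, ','));
    }
}
```

In this sketch the scan for ',' completes without touching the trailing binary byte; only a later next() that actually tries to read that byte raises the error.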

And then it's a matter of ensuring the parsers that end up decoding
characters also handle that parse error and start to backtrack, since I
think many of them currently just assume a call to an IO function that
decodes characters will always succeed.
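
The backtracking concern can be sketched as follows: a parser that treats a decode failure as a recoverable event, restores the input position, and lets an alternative run. This is a toy built on java.nio; parseText, parseHex, and parseChoice are hypothetical and do not correspond to Daffodil's parser classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class BacktrackOnDecodeErrorSketch {
    // First alternative: consume nBytes and decode them strictly as UTF-8.
    static String parseText(ByteBuffer in, int nBytes) throws CharacterCodingException {
        byte[] raw = new byte[nBytes];
        in.get(raw); // advances the input position
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    // Second alternative: consume nBytes as opaque binary, rendered as hex.
    static String parseHex(ByteBuffer in, int nBytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nBytes; i++) {
            sb.append(String.format("%02x", in.get() & 0xFF));
        }
        return sb.toString();
    }

    // Choice: try text first; on a decode error, back up to where the
    // failed alternative started and try the binary alternative instead.
    static String parseChoice(ByteBuffer in, int nBytes) {
        int start = in.position();
        try {
            return "text:" + parseText(in, nBytes);
        } catch (CharacterCodingException e) {
            in.position(start); // backtrack
            return "hex:" + parseHex(in, nBytes);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseChoice(ByteBuffer.wrap(new byte[] {'h', 'i'}), 2));
        // 0xC3 starts a two-byte UTF-8 sequence, but 0x28 is not a continuation byte.
        System.out.println(parseChoice(ByteBuffer.wrap(new byte[] {(byte) 0xC3, 0x28}), 2));
    }
}
```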

So I don't think it's going to be particularly difficult, but there are
probably some subtleties in some cases, and we really need to inspect
parsers to make sure they are handling it correctly.

I agree the unparsing should not be too difficult for the reasons you've
provided.

- Steve




