Posted to dev@daffodil.apache.org by Steve Lawrence <sl...@apache.org> on 2020/09/11 17:14:00 UTC

Large dfdl:length values on complex types

I recently came across an issue where we have something like this:

  <xs:element name="length" type="xs:int" ... />
  <xs:element name="data"
    dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="field1" ... />
        <xs:element name="field2" ... />
        ...
        <xs:element name="fieldN" ... />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

So we have a length element and a complex data field that uses this
length, and the data field is made up of a bunch of fields.

The issue I came across is related to how we cache bytes in buckets for
backtracking. As we fill up buckets, we currently limit the total cache
size of the buckets to 256MB. So if someone ever parses more than
256MB of data and then tries to backtrack past that, we error. The idea
being that we don't want to keep an infinite cache for potential
backtracking and people should have realized that they went down the
wrong branch much earlier.

Though, a problem occurs with the complex types with a large specified
length like above. When we have the complex type with expression
../length, before trying to parse any of the fields, we read that length
number of bytes into our cache buckets to confirm that that number of
bytes exists. The problem occurs if length is more than 256MB. In this
case, we read length number of bytes, and start removing elements from
the cache once we read more than 256MB.

But once that succeeds and we read length bytes, we then try to start
parsing the fields within the complex type, but we've removed those
early cached bytes, and so we fail with an unhelpful backtracking exception.
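
As a rough illustration of why this fails, here is a minimal sketch of
the bucketing/eviction behavior described above (hypothetical code, not
the actual Daffodil caching code; the bucket and limit sizes are just
placeholders):

  import java.io.InputStream
  import scala.collection.mutable

  class BucketedCacheSketch(
    in: InputStream,
    bucketSize: Int = 8192,
    maxCacheBytes: Long = 256L * 1024 * 1024) {

    private val buckets = mutable.Map[Long, Array[Byte]]() // bucket index -> bytes
    private var firstLiveBucket = 0L // oldest bucket still cached
    private var bytesRead = 0L       // total bytes consumed from the stream

    // Read from the stream until pos is covered, evicting old buckets as needed.
    private def fill(pos: Long): Boolean = {
      while (bytesRead <= pos) {
        val chunk = in.readNBytes(bucketSize) // full bucket, except at end of stream
        if (chunk.isEmpty) return false
        buckets(bytesRead / bucketSize) = chunk
        bytesRead += chunk.length
        while (bytesRead - firstLiveBucket * bucketSize > maxCacheBytes) {
          buckets.remove(firstLiveBucket) // evict the oldest bucket
          firstLiveBucket += 1
        }
      }
      true
    }

    // The kind of up-front check described above: it reads (and may evict!) data.
    def isDefinedForLength(startPos: Long, length: Long): Boolean =
      fill(startPos + length - 1)

    def byteAt(pos: Long): Byte = {
      if (!fill(pos)) throw new RuntimeException(s"not enough data for position $pos")
      val bucket = pos / bucketSize
      if (bucket < firstLiveBucket)
        throw new RuntimeException(s"position $pos was evicted; cannot backtrack")
      buckets(bucket)((pos % bucketSize).toInt)
    }
  }

With something shaped like this, isDefinedForLength for a length larger
than maxCacheBytes succeeds, but byteAt(0) for the first child field
then fails, which is exactly the unhelpful error described above.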

I'm not sure of the right solution here.

Perhaps we shouldn't be throwing away these bytes when dealing with
complex lengths?

Or perhaps we shouldn't even be trying to determine if that many bytes
are available when we have a specified length. Instead, maybe we should
just set the bit limit to make sure we don't parse more than that?
And if eventually something tries to read a byte and there aren't enough
and we hit that limit, only then do we fail. This feels like the right
solution, but wanted to start a discussion to see if maybe there's a
reason we try to read the full length, or maybe there's another alternative?
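
To make the last idea concrete, here is a rough sketch of that lazier
behavior (illustrative names only, not the real Daffodil I/O classes):
nothing is read up front, and the limit and end-of-data are only checked
at the moment a parser actually asks for bytes.

  import java.io.InputStream

  class LimitedReaderSketch(in: InputStream, var bitLimit: Option[Long]) {
    private var bitPos = 0L

    def readByte(): Int = {
      // Fail only when a read would exceed the limit or the available data.
      if (bitLimit.exists(limit => bitPos + 8 > limit))
        throw new RuntimeException(s"parse error: read past the bit limit at bit $bitPos")
      val b = in.read()
      if (b < 0)
        throw new RuntimeException(s"parse error: ran out of data at bit $bitPos")
      bitPos += 8
      b
    }
  }

Under that scheme, the explicit length on the complex type would only
move the bit limit; no data would be read until a child parser needs it.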

Re: Large dfdl:length values on complex types

Posted by Steve Lawrence <sl...@apache.org>.
This was discovered with the current NITF schema, which can have very
large chunks of data. So this is definitely a real-world use case.
Here's the spot in the schema where this occurs:

https://github.com/DFDLSchemas/NITF/blob/master/src/main/resources/com/tresys/nitf/xsd/nitf.dfdl.xsd#L627

In an example large NITF file, the DataLength field is around 570MB, so
this hits the issue of a complex type larger than 256MB. The temporary
workaround is to remove the dfdl:length for this complex type, but that
isn't necessarily correct and doesn't allow dealing with padding in the
data.

I don't think there is currently a tunable limit that limits the size of
complex types. Maybe there is one for simple types? I'm not sure. Or if
there is, it must currently be bigger than the 256MB bucket size limit.
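
If such a tunable existed, the check itself could be pretty simple.
Purely hypothetical sketch (neither this tunable name nor the check
exists today):

  // Hypothetical tunable and check; in a real parser this would be a
  // processing error rather than an exception.
  final case class TunablesSketch(maxSpecifiedLengthBytes: Long = 2L * 1024 * 1024 * 1024)

  def checkSpecifiedLength(lengthInBytes: Long, tunables: TunablesSketch): Unit = {
    if (lengthInBytes < 0 || lengthInBytes > tunables.maxSpecifiedLengthBytes)
      throw new IllegalArgumentException(
        s"dfdl:length of $lengthInBytes bytes is negative or exceeds the " +
        s"configured limit of ${tunables.maxSpecifiedLengthBytes} bytes")
  }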

- Steve


On 9/11/20 1:24 PM, Beckerle, Mike wrote:
> Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.
> 
> Is this a real format, or a test case designed to push the boundaries?
> 
> 
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Friday, September 11, 2020 1:14 PM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Large dfdl:length values on complex types
> 
> I recently came across an issue where we have something like this:
> 
>   <xs:element name="length" type="xs:int" ... />
>   <xs:element name="data"
>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="field1" ... />
>         <xs:element name="field2" ... />
>         ...
>         <xs:element name="fieldN" ... />
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
> 
> So we have a length element and a complex data field that uses this
> length, and the data field is made up of a bunch of fields.
> 
> The issue I came across is related to how we cache bytes in buckets for
> backtracking. As we fill up buckets, we currently limit the total cache
> size of the buckets to 256MB. So if someone ever parses more than
> 256MB of data and then tries to backtrack past that, we error. The idea
> being that we don't want to keep an infinite cache for potential
> backtracking and people should have realized that they went down the
> wrong branch much earlier.
> 
> Though, a problem occurs with the complex types with a large specified
> length like above. When we have the complex type with expression
> ../length, before trying to parse any of the fields, we read that length
> number of bytes into our cache buckets to confirm that that number of
> bytes exists. The problem occurs if length is more than 256MB. In this
> case, we read length number of bytes, and start removing elements from
> the cache once we read more than 256MB.
> 
> But once that succeeds and we read length bytes, we then try to start
> parsing the fields within the complex type, but we've removed those
> early cached bytes, and so we fail with an unhelpful backtracking exception.
> 
> I'm not sure of the right solution here.
> 
> Perhaps we shouldn't be throwing away these bytes when dealing with
> complex lengths?
> 
> Or perhaps we shouldn't even be trying to determine if that many bytes
> are available when we have a specified length. Instead, maybe we should
> just set the bit limit to make sure we don't parse more than that?
> And if eventually something tries to read a byte and there aren't enough
> and we hit that limit, only then do we fail. This feels like the right
> solution, but wanted to start a discussion to see if maybe there's a
> reason we try to read the full length, or maybe there's another alternative?
> 


Re: Large dfdl:length values on complex types

Posted by Steve Lawrence <sl...@apache.org>.
Yep, we would still call withBitLengthLimit which will restrict child
elements to that length. Fortunately, withBitLengthLimit doesn't
actually care if we've read that amount of data yet.
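
Roughly, the pattern is the one below (the real method in the I/O layer
is withBitLengthLimit; the state and signature here are simplified for
illustration): setting the limit is just bookkeeping, so it never
requires the data to have been read yet.

  final class ParseStateSketch(var bitPos0b: Long, var bitLimit0b: Option[Long])

  def withBitLengthLimitSketch[A](state: ParseStateSketch, lengthInBits: Long)(body: => A): A = {
    val savedLimit = state.bitLimit0b
    state.bitLimit0b = Some(state.bitPos0b + lengthInBits) // restrict the children
    try body     // any read past the limit becomes a parse error at that point
    finally state.bitLimit0b = savedLimit // restore the enclosing limit afterwards
  }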

On 9/14/20 12:02 PM, Beckerle, Mike wrote:
> 
> A further observation: we would need to tell the I/O layer about the bounded length of the complex element (which probably has to be in length units of bytes/bits/fixed-width characters, not characters of a variable-width encoding like UTF-8), for purposes of avoiding the children of the complex type just happening to turn out to be larger than the explicit length, and also to someday implement the lengthKind 'endOfParent' for the final child.
> ________________________________
> From: Beckerle, Mike <mb...@owlcyberdefense.com>
> Sent: Monday, September 14, 2020 11:58 AM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
> 
> Ok, so assuming the input source is a pipe/stream kind of thing, what exactly are we hoping to achieve?
> 
> The scenario is that a complex element with an explicit length, and that length is vast, is encountered.
> 
> We want to NOT check in advance that all the bytes are present, and just go about parsing the children, and maybe we'll have enough, or maybe we will not have enough data, but we don't want to grab it all in advance.
> 
> Seems like we just remove the upfront check. Proceed to parse the children.
> 
> We still need to save the value of the explicit length for use at the end of the complex element in case there is a need for the ElementUnused region to be skipped.
> 
> The parser behavior would change: we won't immediately backtrack on short data in this case, but that's probably ok.
> 
> So it's a different backend combinator for explicit length complex types, but seems to me the issue is just that the check is too aggressive, and not consistent with streaming behavior.
> 
> Does that make sense?
> 
> 
> 
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Monday, September 14, 2020 11:34 AM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
> 
> We could have a separate I/O path specific to files. Our current input
> API supports Array[Byte], ByteBuffer, and InputStream as ways to get data.
> 
> For the first two, we create a "ByteBufferInputSource", which doesn't
> actually throw away data since it's all already in an array in memory.
> isDefinedForLength doesn't need to read anything to return an answer
> since it knows the full length of the data.
> 
> For the InputStream, we create a "BucketingInputSource" which is what
> does this caching/discarding stuff, and needs to read/cache the input
> stream, which is where we get into this cache issue.
> 
> We could definitely create a new FileInputSource that could have
> specific logic for a java File/Path and only use this when we have
> length information (which I assume we can't get when dealing with things
> like pipes). We would still have to fall back to the BucketingInputSource
> for those cases.
> 
> But as Brandon points out, it doesn't solve the case when the input data
> is not from a file. And it also requires the user to use a new API
> function to create a File specific InputSource.
> 
> I've created DAFFODIL-2397 to track the API issue.
> 
> On 9/14/20 11:19 AM, Sloane, Brandon wrote:
>> Does that solve the issue? Not all data sources are files. A file centric path could be used as an optimization, but we also need to behave correctly on non-file data sources.
>>
>> Also, if we go this route, we need to be careful about what we consider to be a "file". An anonymous or named pipe are commonly used types of files which do provide a file-length.
>> ________________________________
>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>> Sent: Monday, September 14, 2020 10:40 AM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> Is it possible to create a separate i/o path that is file centric, so that isDefinedForLength can answer without actually accessing any data at all?
>>
>> The notion that there is no tunable access suggests we do need an API change to a factory pattern so that the I/O abstractions that need them have access to tunables. Efficient I/O is just one of those areas where tuning is generally needed.
>>
>> ...mikeb
>>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Monday, September 14, 2020 8:32:34 AM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>>> The issue isn't so much backtracking, as fortracking. At the moment,
>>> we need to read in all of dfdl:length before we begin parsing the
>>> first time, so we get a problem even if we never attempt to
>>> backtrack.
>>
>> This is correct. I think this is the core issue. As best I can come up
>> with, the right fix is the removal (or very selective use) of
>> isDefinedForLength along with parsers handling some new end of stream
>> error. Not sure how big of a change that is or if maybe there's another
>> approach.
>>
>> I've created DAFFODIL-2395 to track this issue. I think this is probably
>> a blocker for the next release.
>>
>> Note that this isn't actually an issue for blobs. The BlobLengthParser
>> reads blobs in chunks with a default size of 4KB. And it only calls
>> isDefinedForLength for individual chunks, so never comes close to the 256MB
>> limit. So large blobs are never an issue. The only issue here is with
>> complex or simple types with a specified length larger than 256MB. A
>> simple type of 256MB seems unlikely, but is very possible for complex types.
>>
>> Also note that the upcoming streaming changes won't help with this
>> issue. That is really just about outputting the infoset while we parse
>> rather than waiting until the end of parse, as well as the use of a SAX
>> API. It won't affect how we handle the data at all.
>>
>> On 9/11/20 3:04 PM, Sloane, Brandon wrote:
>>> We should be careful about making the cache too big due to the memory overhead. I'm not sure where we are with support for streaming parses (beyond the --stream option in the CLI, which assumes multiple distinct documents that have been concatted), so in most cases, a multiple of this overhead would be needed by the infoset itself anyway.
>>>
>>> The reason I bring up blobs is that they bypass the requirement to have a large infoset. Imagine a 4GB document, consisting of a length prefix, 3.99GB worth of binary blobs, and metadata. There is no good reason parsing this document should require anywhere near 4GB memory overhead.
>>>
>>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>>
>>> The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.
>>>
>>>
>>>
>>> ________________________________
>>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>>> Sent: Friday, September 11, 2020 2:51 PM
>>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> We have tunable limits people have to sometimes enlarge. E.g. max size of a regex match is limited.
>>>
>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>>
>>> That said. We really don't want to have this sort of limit hard coded.
>>>
>>> A non-streamable file format that just stores the total length in a header record is pretty common, and people will use those for really big pieces of data these days.
>>>
>>> Can this 256MB just be a tunable constant that people can enlarge?
>>>
>>> From a security perspective, a stored length field is always something that should be sanity checked - if clobbered, it could contain -1 and that could be interpreted as maxInt or something, resulting in a denial-of-service attack. It would be better to get a processing error in that case.
>>>
>>> Can we just check for complex type lengths against a tunable limit, and error out before we even try to read it in?
>>>
>>> ________________________________
>>> From: Sloane, Brandon <bs...@owlcyberdefense.com>
>>> Sent: Friday, September 11, 2020 1:53 PM
>>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> 256MB doesn't strike me as that big. We haven't had filesize limits measured in GB since fat32 and ext1 (depending on settings; even ext4 can have a limit as low as 16GB). All it takes is such a file to have a length prefix, and we can very easily run into the limit. Combined with our support of large binary blobs, it is not unreasonable that someone would want to use DFDL on such a format.
>>>
>>> ________________________________
>>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>>> Sent: Friday, September 11, 2020 1:24 PM
>>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.
>>>
>>> Is this a real format, or a test case designed to push the boundaries?
>>>
>>>
>>> ________________________________
>>> From: Steve Lawrence <sl...@apache.org>
>>> Sent: Friday, September 11, 2020 1:14 PM
>>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>>> Subject: Large dfdl:length values on complex types
>>>
>>> I recently came across an issue where we have something like this:
>>>
>>>   <xs:element name="length" type="xs:int" ... />
>>>   <xs:element name="data"
>>>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>>>     <xs:complexType>
>>>       <xs:sequence>
>>>         <xs:element name="field1" ... />
>>>         <xs:element name="field2" ... />
>>>         ...
>>>         <xs:element name="fieldN" ... />
>>>       </xs:sequence>
>>>     </xs:complexType>
>>>   </xs:element>
>>>
>>> So we have a length element and a complex data field that uses this
>>> length, and the data field is made up of a bunch of fields.
>>>
>>> The issue I came across is related to how we cache bytes in buckets for
>>> backtracking. As we fill up buckets, we currently limit the total cache
>>> size of the buckets to 256MB. So if someone ever parses more than
>>> 256MB of data and then tries to backtrack past that, we error. The idea
>>> being that we don't want to keep an infinite cache for potential
>>> backtracking and people should have realized that they went down the
>>> wrong branch much earlier.
>>>
>>> Though, a problem occurs with the complex types with a large specified
>>> length like above. When we have the complex type with expression
>>> ../length, before trying to parse any of the fields, we read that length
>>> number of bytes into our cache buckets to confirm that that number of
>>> bytes exists. The problem occurs if length is more than 256MB. In this
>>> case, we read length number of bytes, and start removing elements from
>>> the cache once we read more than 256MB.
>>>
>>> But once that succeeds and we read length bytes, we then try to start
>>> parsing the fields within the complex type, but we've removed those
>>> early cached bytes, and so we fail with an unhelpful backtracking exception.
>>>
>>> I'm not sure of the right solution here.
>>>
>>> Perhaps we shouldn't be throwing away these bytes when dealing with
>>> complex lengths?
>>>
>>> Or perhaps we shouldn't even be trying to determine if that many bytes
>>> are available when we have a specified length. Instead, maybe we should
>>> just set the bit limit to make sure we don't parse more than that?
>>> And if eventually something tries to read a byte and there aren't enough
>>> and we hit that limit, only then do we fail. This feels like the right
>>> solution, but wanted to start a discussion to see if maybe there's a
>>> reason we try to read the full length, or maybe there's another alternative?
>>>
>>
>>
> 
> 


Re: Large dfdl:length values on complex types

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
A further observation: we would need to tell the I/O layer about the bounded length of the complex element (which probably has to be in length units of bytes/bits/fixed-width characters, not characters of a variable-width encoding like UTF-8), for purposes of avoiding the children of the complex type just happening to turn out to be larger than the explicit length, and also to someday implement the lengthKind 'endOfParent' for the final child.
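
In other words, the explicit length can only be handed to the I/O layer
as a bit limit when the length units have a fixed width in bits. A
sketch of that conversion (hypothetical helper, not Daffodil code):

  sealed trait LengthUnitsSketch
  case object Bits extends LengthUnitsSketch
  case object Bytes extends LengthUnitsSketch
  // fixedWidthBits is None for variable-width encodings like UTF-8
  final case class Characters(fixedWidthBits: Option[Int]) extends LengthUnitsSketch

  def explicitLengthInBits(length: Long, units: LengthUnitsSketch): Option[Long] = units match {
    case Bits                        => Some(length)
    case Bytes                       => Some(length * 8)
    case Characters(Some(widthBits)) => Some(length * widthBits) // fixed-width encoding
    case Characters(None)            => None // no usable bit limit for the I/O layer
  }
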
________________________________
From: Beckerle, Mike <mb...@owlcyberdefense.com>
Sent: Monday, September 14, 2020 11:58 AM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Large dfdl:length values on complex types

Ok, so assuming the input source is a pipe/stream kind of thing, what exactly are we hoping to achieve?

The scenario is that a complex element with an explicit length, and that length is vast, is encountered.

We want to NOT check in advance that all the bytes are present, and just go about parsing the children, and maybe we'll have enough, or maybe we will not have enough data, but we don't want to grab it all in advance.

Seems like we just remove the upfront check. Proceed to parse the children.

We still need to save the value of the explicit length for use at the end of the complex element in case there is a need for the ElementUnused region to be skipped.

The parser behavior would change: we won't immediately backtrack on short data in this case, but that's probably ok.

So it's a different backend combinator for explicit length complex types, but seems to me the issue is just that the check is too aggressive, and not consistent with streaming behavior.

Does that make sense?



________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Monday, September 14, 2020 11:34 AM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Large dfdl:length values on complex types

We could have a separate I/O path specific to files. Our current input
API supports Array[Byte], ByteBuffer, and InputStream as ways to get data.

For the first two, we create a "ByteBufferInputSource", which doesn't
actually throw away data since it's all already in an array in memory.
isDefinedForLength doesn't need to read anything to return an answer
since it knows the full length of the data.

For the InputStream, we create a "BucketingInputSource" which is what
does this caching/discarding stuff, and needs to read/cache the input
stream, which is where we get into this cache issue.

We could definitely create a new FileInputSource that could have
specific logic for a java File/Path and only use this when we have
length information (which I assume we can't get when dealing with things
like pipes). We would still have to fall back to the BucketingInputSource
for those cases.

But as Brandon points out, it doesn't solve the case when the input data
is not from a file. And it also requires the user to use a new API
function to create a File specific InputSource.

I've created DAFFODIL-2397 to track the API issue.

On 9/14/20 11:19 AM, Sloane, Brandon wrote:
> Does that solve the issue? Not all data sources are files. A file centric path could be used as an optimization, but we also need to behave correctly on non-file data sources.
>
> Also, if we go this route, we need to be careful about what we consider to be a "file". An anonymous or named pipe are commonly used types of files which do provide a file-length.
> ________________________________
> From: Beckerle, Mike <mb...@owlcyberdefense.com>
> Sent: Monday, September 14, 2020 10:40 AM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
>
> Is it possible to create a separate i/o path that is file centric, so that isDefinedForLength can answer without actually accessing any data at all?
>
> The notion that there is no tunable access suggests we do need an API change to a factory pattern so that the I/O abstractions that need them have access to tunables. Efficient I/O is just one of those areas where tuning is generally needed.
>
> ...mikeb
>
>
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Monday, September 14, 2020 8:32:34 AM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
>
>> The issue isn't so much backtracking, as fortracking. At the moment,
>> we need to read in all of dfdl:length before we begin parsing the
>> first time, so we get a problem even if we never attempt to
>> backtrack.
>
> This is correct. I think this is the core issue. As best I can come up
> with, the right fix is the removal (or very selective use) of
> isDefinedForLength along with parsers handling some new end of stream
> error. Not sure how big of a change that is or if maybe there's another
> approach.
>
> I've created DAFFODIL-2395 to track this issue. I think this is probably
> a blocker for the next release.
>
> Note that this isn't actually an issue for blobs. The BlobLengthParser
> reads blobs in chunks with a default size of 4KB. And it only calls
> isDefinedForLength for individual chunks, so never comes close to the 256MB
> limit. So large blobs are never an issue. The only issue here is with
> complex or simple types with a specified length larger than 256MB. A
> simple type of 256MB seems unlikely, but is very possible for complex types.
>
> Also note that the upcoming streaming changes won't help with this
> issue. That is really just about outputting the infoset while we parse
> rather than waiting until the end of parse, as well as the use of a SAX
> API. It won't affect how we handle the data at all.
>
> On 9/11/20 3:04 PM, Sloane, Brandon wrote:
>> We should be careful about making the cache too big due to the memory overhead. I'm not sure where we are with support for streaming parses (beyond the --stream option in the CLI, which assumes multiple distinct documents that have been concatted), so in most cases, a multiple of this overhead would be needed by the infoset itself anyway.
>>
>> The reason I bring up blobs is that they bypass the requirement to have a large infoset. Imagine a 4GB document, consisting of a length prefix, 3.99GB worth of binary blobs, and metadata. There is no good reason parsing this document should require anywhere near 4GB memory overhead.
>>
>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>
>> The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.
>>
>>
>>
>> ________________________________
>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>> Sent: Friday, September 11, 2020 2:51 PM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> We have tunable limits people have to sometimes enlarge. E.g. max size of a regex match is limited.
>>
>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>
>> That said. We really don't want to have this sort of limit hard coded.
>>
>> A non-streamable file format that just stores the total length in a header record is pretty common, and people will use those for really big pieces of data these days.
>>
>> Can this 256MB just be a tunable constant that people can enlarge?
>>
>> From a security perspective, a stored length field is always something that should be sanity checked - if clobbered, it could contain -1 and that could be interpreted as maxInt or something, resulting in a denial-of-service attack. It would be better to get a processing error in that case.
>>
>> Can we just check for complex type lengths against a tunable limit, and error out before we even try to read it in?
>>
>> ________________________________
>> From: Sloane, Brandon <bs...@owlcyberdefense.com>
>> Sent: Friday, September 11, 2020 1:53 PM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> 256MB doesn't strike me as that big. We haven't had filesize limits measured in GB since fat32 and ext1 (depending on settings; even ext4 can have a limit as low as 16GB). All it takes is such a file to have a length prefix, and we can very easily run into the limit. Combined with our support of large binary blobs, it is not unreasonable that someone would want to use DFDL on such a format.
>>
>> ________________________________
>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>> Sent: Friday, September 11, 2020 1:24 PM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.
>>
>> Is this a real format, or a test case designed to push the boundaries?
>>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Friday, September 11, 2020 1:14 PM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Large dfdl:length values on complex types
>>
>> I recently came across an issue where we have something like this:
>>
>>   <xs:element name="length" type="xs:int" ... />
>>   <xs:element name="data"
>>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>>     <xs:complexType>
>>       <xs:sequence>
>>         <xs:element name="field1" ... />
>>         <xs:element name="field2" ... />
>>         ...
>>         <xs:element name="fieldN" ... />
>>       </xs:sequence>
>>     </xs:complexType>
>>   </xs:element>
>>
>> So we have a length element and a complex data field that uses this
>> length, and the data field is made up of a bunch of fields.
>>
>> The issue I came across is related to how we cache bytes in buckets for
>> backtracking. As we fill up buckets, we currently limit the total cache
>> size of the buckets to 256MB. So if someone ever parses more than
>> 256MB of data and then tries to backtrack past that, we error. The idea
>> being that we don't want to keep an infinite cache for potential
>> backtracking and people should have realized that they went down the
>> wrong branch much earlier.
>>
>> Though, a problem occurs with the complex types with a large specified
>> length like above. When we have the complex type with expression
>> ../length, before trying to parse any of the fields, we read that length
>> number of bytes into our cache buckets to confirm that that number of
>> bytes exists. The problem occurs if length is more than 256MB. In this
>> case, we read length number of bytes, and start removing elements from
>> the cache once we read more than 256MB.
>>
>> But once that succeeds and we read length bytes, we then try to start
>> parsing the fields within the complex type, but we've removed those
>> early cached bytes, and so we fail with an unhelpful backtracking exception.
>>
>> I'm not sure of the right solution here.
>>
>> Perhaps we shouldn't be throwing away these bytes when dealing with
>> complex lengths?
>>
>> Or perhaps we shouldn't even be trying to determine if that many bytes
>> are available when we have a specified length. Instead, maybe we should
>> just set the bit limit to make sure we don't parse more than that?
>> And if eventually something tries to read a byte and there aren't enough
>> and we hit that limit, only then do we fail. This feels like the right
>> solution, but wanted to start a discussion to see if maybe there's a
>> reason we try to read the full length, or maybe there's another alternative?
>>
>
>


Re: Large dfdl:length values on complex types

Posted by Steve Lawrence <sl...@apache.org>.
Seems like a reasonable solution to me. It does mean that simple types
(excluding blobs) can't be larger than the bucketing max cache size
(i.e. 256MB right now) but that seems like a more than reasonable
restriction.

There may be other restrictions (e.g. a pattern length probably can't
match more than 256MB), but I suspect those too are also reasonable.

On 9/14/20 11:58 AM, Beckerle, Mike wrote:
> Ok, so assuming the input source is a pipe/stream kind of thing, what exactly are we hoping to achieve?
> 
> The scenario is that a complex element with an explicit length, and that length is vast, is encountered.
> 
> We want to NOT check in advance that all the bytes are present, and just go about parsing the children, and maybe we'll have enough, or maybe we will not have enough data, but we don't want to grab it all in advance.
> 
> Seems like we just remove the upfront check. Proceed to parse the children.
> 
> We still need to save the value of the explicit length for use at the end of the complex element in case there is a need for the ElementUnused region to be skipped.
> 
> The parser behavior would change: we won't immediately backtrack on short data in this case, but that's probably ok.
> 
> So it's a different backend combinator for explicit length complex types, but seems to me the issue is just that the check is too aggressive, and not consistent with streaming behavior.
> 
> Does that make sense?
> 
> 
> 
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Monday, September 14, 2020 11:34 AM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
> 
> We could have a separate I/O path specific to files. Our current input
> API supports Array[Byte], ByteBuffer, and InputStream as ways to get data.
> 
> For the first two, we create a "ByteBufferInputSource", which doesn't
> actually throw away data since it's all already in an array in memory.
> isDefinedForLength doesn't need to read anything to return an answer
> since it knows the full length of the data.
> 
> For the InputStream, we create a "BucketingInputSource" which is what
> does this caching/discarding stuff, and needs to read/cache the input
> stream, which is where we get into this cache issue.
> 
> We could definitely create a new FileInputSource that could have
> specific logic for a java File/Path and only use this when we have
> length information (which I assume we can't get when dealing with things
> like pipes). We would still have to fall back to the BucketingInputSource
> for those cases.
> 
> But as Brandon points out, it doesn't solve the case when the input data
> is not from a file. And it also requires the user to use a new API
> function to create a File specific InputSource.
> 
> I've created DAFFODIL-2397 to track the API issue.
> 
> On 9/14/20 11:19 AM, Sloane, Brandon wrote:
>> Does that solve the issue? Not all data sources are files. A file centric path could be used as an optimization, but we also need to behave correctly on non-file data sources.
>>
>> Also, if we go this route, we need to be careful about what we consider to be a "file". An anonymous or named pipe are commonly used types of files which do provide a file-length.
>> ________________________________
>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>> Sent: Monday, September 14, 2020 10:40 AM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> Is it possible to create a separate i/o path that is file centric, so that isDefinedForLength can answer without actually accessing any data at all?
>>
>> The notion that there is no tunable access suggests we do need an API change to a factory pattern so that the I/O abstractions that need them have access to tunables. Efficient I/O is just one of those areas where tuning is generally needed.
>>
>> ...mikeb
>>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Monday, September 14, 2020 8:32:34 AM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>>> The issue isn't so much backtracking, as fortracking. At the moment,
>>> we need to read in all of dfdl:length before we begin parsing the
>>> first time, so we get a problem even if we never attempt to
>>> backtrack.
>>
>> This is correct. I think this is the core issue. As best I can come up
>> with, the right fix is the removal (or very selective use) of
>> isDefinedForLength along with parsers handling some new end of stream
>> error. Not sure how big of a change that is or if maybe there's another
>> approach.
>>
>> I've created DAFFODIL-2395 to track this issue. I think this is probably
>> a blocker for the next release.
>>
>> Note that this isn't actually an issue for blobs. The BlobLengthParser
>> reads blobs in chunks with a default size of 4KB. And it only calls
>> isDefinedForLength for individual chunks, so never comes close to the 256MB
>> limit. So large blobs are never an issue. The only issue here is with
>> complex or simple types with a specified length larger than 256MB. A
>> simple type of 256MB seems unlikely, but is very possible for complex types.
>>
>> Also note that the upcoming streaming changes won't help with this
>> issue. That is really just about outputting the infoset while we parse
>> rather than waiting until the end of parse, as well as the use of a SAX
>> API. It won't affect how we handle the data at all.
>>
>> On 9/11/20 3:04 PM, Sloane, Brandon wrote:
>>> We should be careful about making the cache too big due to the memory overhead. I'm not sure where we are with support for streaming parses (beyond the --stream option in the CLI, which assumes multiple distinct documents that have been concatted), so in most cases, a multiple of this overhead would be needed by the infoset itself anyway.
>>>
>>> The reason I bring up blobs is that they bypass the requirement to have a large infoset. Imagine a 4GB document, consisting of a length prefix, 3.99GB worth of binary blobs, and metadata. There is no good reason parsing this document should require anywhere near 4GB memory overhead.
>>>
>>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>>
>>> The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.
>>>
>>>
>>>
>>> ________________________________
>>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>>> Sent: Friday, September 11, 2020 2:51 PM
>>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> We have tunable limits people have to sometimes enlarge. E.g. max size of a regex match is limited.
>>>
>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>>
>>> That said. We really don't want to have this sort of limit hard coded.
>>>
>>> A non-streamable file format that just stores the total length in a header record is pretty common, and people will use those for really big pieces of data these days.
>>>
>>> Can this 256MB just be a tunable constant that people can enlarge?
>>>
>>> From a security perspective, a stored length field is always something that should be sanity checked - if clobbered, it could contain -1 and that could be interpreted as maxInt or something, resulting in a denial-of-service attack. It would be better to get a processing error in that case.
>>>
>>> Can we just check for complex type lengths against a tunable limit, and error out before we even try to read it in?
>>>
>>> ________________________________
>>> From: Sloane, Brandon <bs...@owlcyberdefense.com>
>>> Sent: Friday, September 11, 2020 1:53 PM
>>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> 256MB doesn't strike me as that big. We haven't had filesize limits measured in GB since fat32 and ext1 (depending on settings; even ext4 can have a limit as low as 16GB). All it takes is such a file to have a length prefix, and we can very easily run into the limit. Combined with our support of large binary blobs, it is not unreasonable that someone would want to use DFDL on such a format.
>>>
>>> ________________________________
>>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>>> Sent: Friday, September 11, 2020 1:24 PM
>>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.
>>>
>>> Is this a real format, or a test case designed to push the boundaries?
>>>
>>>
>>> ________________________________
>>> From: Steve Lawrence <sl...@apache.org>
>>> Sent: Friday, September 11, 2020 1:14 PM
>>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>>> Subject: Large dfdl:length values on complex types
>>>
>>> I recently came across an issue where we have something like this:
>>>
>>>   <xs:element name="length" type="xs:int" ... />
>>>   <xs:element name="data"
>>>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>>>     <xs:complexType>
>>>       <xs:sequence>
>>>         <xs:element name="field1" ... />
>>>         <xs:element name="field2" ... />
>>>         ...
>>>         <xs:element name="fieldN" ... />
>>>       </xs:sequence>
>>>     </xs:complexType>
>>>   </xs:element>
>>>
>>> So we have a length element and a complex data field that uses this
>>> length, and the data field is made up of a bunch of fields.
>>>
>>> The issue I came across is related to how we cache bytes in buckets for
>>> backtracking. As we fill up buckets, we currently limit the total cache
>>> size of the buckets to 256MB. So if someone ever parses more than
>>> 256MB of data and then tries to backtrack past that, we error. The idea
>>> being that we don't want to keep an infinite cache for potential
>>> backtracking and people should have realized that they went down the
>>> wrong branch much earlier.
>>>
>>> Though, a problem occurs with the complex types with a large specified
>>> length like above. When we have the complex type with expression
>>> ../length, before trying to parse any of the fields, we read that length
>>> number of bytes into our cache buckets to confirm that that number of
>>> bytes exists. The problem occurs if length is more than 256MB. In this
>>> case, we read length number of bytes, and start removing elements from
>>> the cache once we read more than 256MB.
>>>
>>> But once that succeeds and we read length bytes, we then try to start
>>> parsing the fields within the complex type, but we've removed those
>>> early cached bytes, and so we fail with an unhelpful backtracking exception.
>>>
>>> I'm not sure of the right solution here.
>>>
>>> Perhaps we shouldn't be throwing away these bytes when dealing with
>>> complex lengths?
>>>
>>> Or perhaps we shouldn't even be trying to determine if that many bytes
>>> are available when we have a specified length. Instead, maybe we should
>>> just set the bit limit to make sure we don't parse more than that?
>>> And if eventually something tries to read a byte and there aren't enough
>>> and we hit that limit, only then do we fail. This feels like the right
>>> solution, but wanted to start a discussion to see if maybe there's a
>>> reason we try to read the full length, or maybe there's another alternative?
>>>
>>
>>
> 
> 


Re: Large dfdl:length values on complex types

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
Ok, so assuming the input source is a pipe/stream kind of thing, what exactly are we hoping to achieve?

The scenario is that a complex element with an explicit length, and that length is vast, is encountered.

We want to NOT check in advance that all the bytes are present, and just go about parsing the children, and maybe we'll have enough, or maybe we will not have enough data, but we don't want to grab it all in advance.

Seems like we just remove the upfront check. Proceed to parse the children.

We still need to save the value of the explicit length for use at the end of the complex element in case there is a need for the ElementUnused region to be skipped.

The parser behavior would change: we won't immediately backtrack on short data in this case, but that's probably ok.

So it's a different backend combinator for explicit length complex types, but seems to me the issue is just that the check is too aggressive, and not consistent with streaming behavior.

Does that make sense?
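
A sketch of that combinator shape (hypothetical names, not the current
explicit-length combinator): no up-front availability check, the
children parse under a bit limit, and the saved explicit length is only
used afterwards to skip the ElementUnused region.

  final class PStateSketch(var bitPos0b: Long, var bitLimit0b: Option[Long]) {
    // In a real parser this consumes data and becomes a processing error if
    // the data runs out; here it is just position bookkeeping.
    def skipToBit(target: Long): Unit = {
      require(target >= bitPos0b)
      bitPos0b = target
    }
  }

  def parseExplicitLengthComplex(state: PStateSketch, lengthInBits: Long)(
    parseChildren: PStateSketch => Unit): Unit = {
    val endBit = state.bitPos0b + lengthInBits
    val savedLimit = state.bitLimit0b
    state.bitLimit0b = Some(endBit) // restrict the children, but read nothing yet
    try {
      parseChildren(state)
      if (state.bitPos0b < endBit)
        state.skipToBit(endBit) // skip the ElementUnused region at the end
    } finally {
      state.bitLimit0b = savedLimit
    }
  }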



________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Monday, September 14, 2020 11:34 AM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Large dfdl:length values on complex types

We could have a separate I/O path specific to files. Our current input
API supports Array[Byte], ByteBuffer, and InputStream as ways to get data.

For the first two, we create a "ByteBufferInputSource", which doesn't
actually throw away data since it's all already in an array in memory.
isDefinedForLength doesn't need to read anything to return an answer
since it knows the full length of the data.

For the InputStream, we create a "BucketingInputSource" which is what
does this caching/discarding stuff, and needs to read/cache the input
stream, which is where we get into this cache issue.

We could definitely create a new FileInputSource that could have
specific logic for a java File/Path and only use this when we have
length information (which I assume we can't get when dealing with things
like pipes). We would still have to fall back to the BucketingInputSource
for those cases.

But as Brandon points out, it doesn't solve the case when the input data
is not from a file. And it also requires the user to use a new API
function to create a File specific InputSource.

I've created DAFFODIL-2397 to track the API issue.

On 9/14/20 11:19 AM, Sloane, Brandon wrote:
> Does that solve the issue? Not all data sources are files. A file centric path could be used as an optimization, but we also need to behave correctly on non-file data sources.
>
> Also, if we go this route, we need to be careful about what we consider to be a "file". An anonymous or named pipe are commonly used types of files which do provide a file-length.
> ________________________________
> From: Beckerle, Mike <mb...@owlcyberdefense.com>
> Sent: Monday, September 14, 2020 10:40 AM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
>
> Is it possible to create a separate i/o path that is file centric, so that isDefinedForLength can answer without actually accessing any data at all?
>
> The notion that there is no tunable access suggests we do need an API change to a factory pattern so that the I/O abstractions that need them have access to tunables. Efficient I/O is just one of those areas where tuning is generally needed.
>
> ...mikeb
>
>
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Monday, September 14, 2020 8:32:34 AM
> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
>
>> The issue isn't so much backtracking, as fortracking. At the moment,
>> we need to read in all of dfdl:length before we begin parsing the
>> first time, so we get a problem even if we never attempt to
>> backtrack.
>
> This is correct. I think this is the core issue. As best I can come up
> with, the right fix is the removal (or very selective use) of
> isDefinedForLength along with parsers handling some new end of stream
> error. Not sure how big of a change that is or if maybe there's another
> approach.
>
> I've created DAFFODIL-2395 to track this issue. I think this is probably
> a blocker for the next release.
>
> Note that this isn't actually an issue for blobs. The BlobLengthParser
> reads blobs in chunks with a default size of 4KB. And it only calls
> isDefinedForLength for individual chunks, so never comes close to the 256MB
> limit. So large blobs are never an issue. The only issue here is with
> complex or simple types with a specified length larger than 256MB. A
> simple type of 256MB seems unlikely, but is very possible for complex types.
>
> Also note that the upcoming streaming changes won't help with this
> issue. That is really just about outputting the infoset while we parse
> rather than waiting until the end of parse, as well as the use of a SAX
> API. It won't affect how we handle the data at all.
>
> On 9/11/20 3:04 PM, Sloane, Brandon wrote:
>> We should be careful about making the cache too big due to the memory overhead. I'm not sure where we are with support for streaming parses (beyond the --stream option in the CLI, which assumes multiple distinct documents that have been concatted), so in most cases, a multiple of this overhead would be needed by the infoset itself anyway.
>>
>> The reason I bring up blobs is that they bypass the requirement to have a large infoset. Imagine a 4GB document, consisting of a length prefix, 3.99GB worth of binary blobs, and metadata. There is no good reason parsing this document should require anywhere near 4GB memory overhead.
>>
>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>
>> The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.
>>
>>
>>
>> ________________________________
>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>> Sent: Friday, September 11, 2020 2:51 PM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> We have tunable limits people have to sometimes enlarge. E.g. max size of a regex match is limited.
>>
>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>
>> That said. We really don't want to have this sort of limit hard coded.
>>
>> A non-streamable file format that just stores the total length in a header record is pretty common, and people will use those for really big pieces of data these days.
>>
>> Can this 256MB just be a tunable constant that people can enlarge?
>>
>> From a security perspective, a stored length field is always something that should be sanity checked - if clobbered, it could contain -1 and that could be interpreted as maxInt or something, resulting in a denial-of-service attack. It would be better to get a processing error in that case.
>>
>> Can we just check for complex type lengths against a tunable limit, and error out before we even try to read it in?
>>
>> ________________________________
>> From: Sloane, Brandon <bs...@owlcyberdefense.com>
>> Sent: Friday, September 11, 2020 1:53 PM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> 256MB doesn't strike me as that big. We haven't had filesize limits measured in GB since fat32 and ext1 (depending on settings; even ext4 can have a limit as low as 16GB). All it takes is such a file to have a length prefix, and we can very easily run into the limit. Combined with our support of large binary blobs, it is not unreasonable that someone would want to use DFDL on such a format.
>>
>> ________________________________
>> From: Beckerle, Mike <mb...@owlcyberdefense.com>
>> Sent: Friday, September 11, 2020 1:24 PM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.
>>
>> Is this a real format, or a test case designed to push the boundaries?
>>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Friday, September 11, 2020 1:14 PM
>> To: dev@daffodil.apache.org <de...@daffodil.apache.org>
>> Subject: Large dfdl:length values on complex types
>>
>> I recently came across an issue where we have something like this:
>>
>>   <xs:element name="length" type="xs:int" ... />
>>   <xs:element name="data"
>>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>>     <xs:complexType>
>>       <xs:sequence>
>>         <xs:element name="field1" ... />
>>         <xs:element name="field2" ... />
>>         ...
>>         <xs:element name="fieldN" ... />
>>       </xs:sequence>
>>     </xs:complexType>
>>   </xs:element>
>>
>> So we have a length element and a complex data field that uses this
>> length, and the data field is made up of a bunch of fields.
>>
>> The issue I came across is related to how we cache bytes in buckets for
>> backtracking. As we fill up buckets, we currently limit the total cache
>> size of the buckets to 256MB. So if someone ever parses more than
>> 256MB of data and then tries to backtrack past that, we error. The idea
>> being that we don't want to keep an infinite cache for potential
>> backtracking and people should have realized that they went down the
>> wrong branch much earlier.
>>
>> Though, a problem occurs with the complex types with a large specified
>> length like above. When we have the complex type with expression
>> ../length, before trying to parse any of the fields, we read that length
>> number of bytes into our cache buckets to confirm that that number of
>> bytes exists. The problem occurs if length is more than 256MB. In this
>> case, we read length number of bytes, and start removing elements from
>> the cache once we read more than 256MB.
>>
>> But once that succeeds and we read length bytes, we then try to start
>> parsing the fields within the complex type, but we've removed those
>> early cached bytes, and so we fail with an unhelpful backtracking exception.
>>
>> I'm not sure of the right solution here.
>>
>> Perhaps we shouldn't be throwing away these bytes when dealing with
>> complex lengths?
>>
>> Or perhaps we shouldn't even be trying to determine if that many bytes
>> are available when we have a specified length. Instead, maybe we should
>> just set the bit limit to make sure we don't parse more than that?
>> And if eventually something tries to read a byte and there aren't enough
>> and we hit that limit, only then do we fail. This feels like the right
>> solution, but wanted to start a discussion to see if maybe there's a
>> reason we try to read the full length, or maybe there's another alternative?
>>
>
>


Re: Large dfdl:length values on complex types

Posted by Steve Lawrence <sl...@apache.org>.
We could have a separate I/O path specific to files. Our current input
API supports Array[Byte], ByteBuffer, and InputStream as ways to get data.

For the first two, we create a "ByteBufferInputSource", which doesn't
actually throw away data since it's all already in an array in memory.
isDefinedForLength doesn't need to read anything to return an answer
since it knows the full length of the data.

For the InputStream, we create a "BucketingInputSource" which is what
does this caching/discarding stuff, and needs to read/cache the input
stream, which is where we get into this cache issue.

We could definitely create a new FileInputSource that could have
specific logic for a java File/Path and only use this when we have
length information (which I assume we can't get when dealing with things
like pipes). We would still have to fall back to the BucketingInputSource
for those cases.

But as Brandon points out, it doesn't solve the case when the input data
is not from a file. And it also requires the user to use a new API
function to create a File specific InputSource.

I've created DAFFODIL-2397 to track the API issue.
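
For reference, a sketch of how the isDefinedForLength question differs
between those sources (simplified; the class names below are just
placeholders for ByteBufferInputSource, the proposed FileInputSource,
and BucketingInputSource):

  import java.io.InputStream
  import java.nio.file.{Files, Path}

  trait InputSourceSketch {
    def isDefinedForLength(startBytePos: Long, lengthInBytes: Long): Boolean
  }

  // Array[Byte]/ByteBuffer case: total length already known, nothing to read.
  final class InMemorySourceSketch(data: Array[Byte]) extends InputSourceSketch {
    def isDefinedForLength(startBytePos: Long, lengthInBytes: Long): Boolean =
      startBytePos + lengthInBytes <= data.length
  }

  // Proposed file case: answer from file metadata (regular files only; pipes
  // and other special files would still need the bucketing path).
  final class FileSourceSketch(path: Path) extends InputSourceSketch {
    def isDefinedForLength(startBytePos: Long, lengthInBytes: Long): Boolean =
      startBytePos + lengthInBytes <= Files.size(path)
  }

  // InputStream case: the only way to answer is to read ahead, which is where
  // the bucket cache and its 256MB limit come into play.
  final class StreamSourceSketch(in: InputStream) extends InputSourceSketch {
    def isDefinedForLength(startBytePos: Long, lengthInBytes: Long): Boolean = {
      var remaining = startBytePos + lengthInBytes
      val buf = new Array[Byte](64 * 1024)
      var n = 0
      while (remaining > 0 && { n = in.read(buf, 0, math.min(buf.length.toLong, remaining).toInt); n > 0 }) {
        remaining -= n // NOTE: a real source must cache what it reads, not discard it
      }
      remaining <= 0
    }
  }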

On 9/14/20 11:19 AM, Sloane, Brandon wrote:
> Does that solve the issue? Not all data sources are files. A file centric path could be used as an optimization, but we also need to behave correctly on non-file data sources.
> 
> Also, if we go this route, we need to be careful about what we consider to be a "file". An anonymous or named pipe are commonly used types of files which do provide a file-length.


Re: Large dfdl:length values on complex types

Posted by "Sloane, Brandon" <bs...@owlcyberdefense.com>.
Does that solve the issue? Not all data sources are files. A file-centric path could be used as an optimization, but we also need to behave correctly on non-file data sources.

Also, if we go this route, we need to be careful about what we consider to be a "file". Anonymous and named pipes are commonly used types of files which do not provide a usable file length.
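
For example, a length-based fast path would probably need to be gated
on something like the following (a sketch using plain java.nio.file
calls, not actual Daffodil code), falling back to the bucketing path
for anything that is not a regular file:

  import java.nio.file.{Files, Path, Paths}

  object FileLengthCheck {
    // Only trust a length-based fast path for regular files. FIFOs and
    // named pipes are not regular files, and any size they report says
    // nothing about how much data will actually flow through them.
    def usableFileLength(path: Path): Option[Long] =
      if (Files.isRegularFile(path)) Some(Files.size(path))
      else None  // pipe, device, socket, ...: fall back to bucketing

    def main(args: Array[String]): Unit = {
      println(usableFileLength(Paths.get("/etc/hostname")))   // Some(size) on a typical Linux box
      println(usableFileLength(Paths.get("/tmp/some-fifo")))  // None
    }
  }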


Re: Large dfdl:length values on complex types

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
Is it possible to create a separate I/O path that is file-centric, so that isDefinedForLength can answer without actually accessing any data at all?
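
Concretely, such a source could answer from a single up-front size
query and plain arithmetic (a sketch with made-up names, assuming the
input really is a regular file):

  import java.nio.file.{Files, Path}

  // Hypothetical file-backed source: the total length is known from
  // one stat-like call, so availability checks touch no data at all.
  // Reading and caching only happen when bytes are actually parsed.
  final class SketchFileInputSource(path: Path) {
    private val totalBytes: Long = Files.size(path)
    private var position: Long = 0L  // current parse position in bytes

    def isDefinedForLength(nBytes: Long): Boolean =
      position + nBytes <= totalBytes

    def advance(nBytes: Long): Unit = { position += nBytes }
  }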

The notion that there is no tunable access suggests we do need an API change to a factory pattern, so that the I/O abstractions that need tunables have access to them. Efficient I/O is just one of those areas where tuning is generally needed.

...mikeb




Re: Large dfdl:length values on complex types

Posted by Steve Lawrence <sl...@apache.org>.
> The issue isn't so much backtracking, as fortracking. At the moment, 
> we need to read in all of dfdl:length before we begin parsing the 
> first time, so we get a problem even if we never attempt to
> backtrack.

This is correct. I think this is the core issue. As best I can tell,
the right fix is the removal (or very selective use) of
isDefinedForLength, along with parsers handling some new end-of-stream
error. Not sure how big of a change that is, or if maybe there's
another approach.

I've created DAFFODIL-2395 to track this issue. I think this is probably
a blocker for the next release.

Note that this isn't actually an issue for blobs. The BlobLengthParser
reads blobs in chunks with a default size of 4KB, and it only calls
isDefinedForLength for individual chunks, so it never comes close to
the 256MB limit. So large blobs are never an issue. The only issue here
is with complex or simple types with a specified length larger than
256MB. A simple type of 256MB seems unlikely, but such a length is very
possible for complex types.

Also note that the upcoming streaming changes won't help with this
issue. That is really just about outputting the infoset while we parse
rather than waiting until the end of parse, as well as the use of a SAX
API. It won't affect how we handle the data at all.
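
Roughly, the difference between the two approaches looks like this
(hypothetical names; only the 4KB chunk size comes from the
description above). The chunked style never asks for more than one
small chunk at a time, and a short read becomes a processing error at
the point where the data actually runs out:

  import java.io.{InputStream, OutputStream}

  object ChunkedRead {
    val ChunkSize = 4096  // 4KB, like the default blob chunk size

    // Copy exactly `totalLength` bytes in small chunks. There is no
    // up-front "are all totalLength bytes available?" check, so
    // nothing ever needs to be cached beyond one chunk.
    def copyExactly(in: InputStream, out: OutputStream,
                    totalLength: Long): Unit = {
      val buf = new Array[Byte](ChunkSize)
      var remaining = totalLength
      while (remaining > 0) {
        val want = math.min(remaining, ChunkSize.toLong).toInt
        val got = in.read(buf, 0, want)
        if (got <= 0)
          // In a parser this would surface as a processing error
          // ("expected N more bytes, hit end of data").
          throw new RuntimeException(
            s"end of data with $remaining bytes still expected")
        out.write(buf, 0, got)
        remaining -= got
      }
    }
  }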



Re: Large dfdl:length values on complex types

Posted by "Sloane, Brandon" <bs...@owlcyberdefense.com>.
We should be careful about making the cache too big due to the memory overhead. I'm not sure where we are with support for streaming parses (beyond the --stream option in the CLI, which assumes multiple distinct documents that have been concatenated), so in most cases, a multiple of this overhead would be needed by the infoset itself anyway.

The reason I bring up blobs is that they bypass the requirement to have a large infoset. Imagine a 4GB document, consisting of a length prefix, 3.99GB worth of binary blobs, and metadata. There is no good reason parsing this document should require anywhere near 4GB memory overhead.

> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.

The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.




Re: Large dfdl:length values on complex types

Posted by Steve Lawrence <sl...@apache.org>.
This currently can't be changed via a tunable because the
InputSourceDataInputStreams don't have access to tunables. They are
created by the user, and users don't have access to tunables. We could
make them available, or perhaps change it so the DataProcessor becomes
a factory for creating InputSourceDataInputStreams, in which case we
could pass in tunables.

These limits are currently only available as parameters to the
BucketingInputSource (the class that handles this caching behavior), but
they aren't exposed via the JAPI/SAPI, so right now there's really no
way to change them using the documented API.

So this limit is essentially hard coded right now.

Another option would be to just make the above changes so this limit
could be changed.

But I think perhaps the biggest issue is that if you want a very large
specified length complex type, we currently read that entire complex
type into memory, even if all the individual components that make up
that type are small. So this limits streamability with large specified
length complex types.
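
The factory idea would amount to something like this (purely a sketch
of the API shape with made-up names, not the actual SAPI/JAPI): the
data processor already holds the tunables, so any input source it
hands out can be sized from them instead of from a hard-coded
constant.

  import java.io.InputStream

  final case class SketchTunables(maxCacheBytes: Long = 256L << 20)

  // Bucketing/caching logic elided; the point is that the cache limit
  // is injected rather than hard-coded.
  final class SketchInputSource(in: InputStream, maxCacheBytes: Long)

  final class SketchDataProcessor(tunables: SketchTunables) {
    // Today users construct the input source themselves; with a
    // factory, the processor supplies the tunables:
    def newInputSource(in: InputStream): SketchInputSource =
      new SketchInputSource(in, tunables.maxCacheBytes)
  }

  // Usage sketch:
  //   val dp  = new SketchDataProcessor(SketchTunables(1L << 31))
  //   val src = dp.newInputSource(someStream)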



Re: Large dfdl:length values on complex types

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
We have tunable limits that people sometimes have to enlarge, e.g. the max size of a regex match is limited.

The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.

That said, we really don't want to have this sort of limit hard-coded.

A non-streamable file format that just stores the total length in a header record is pretty common, and people will use those for really big pieces of data these days.

Can this 256MB just be a tunable constant that people can enlarge?

From a security perspective, a stored length field is always something that should be sanity checked - if clobbered, it could contain -1 and that could be interpreted as maxInt or something, resulting in a denial-of-service attack. It would be better to get a processing error in that case.

Can we just check for complex type lengths against a tunable limit, and error out before we even try to read it in?
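
That check could be as simple as something like this (a sketch; the
tunable and error type are made up), run right after evaluating the
dfdl:length expression and before any attempt to read the data:

  object LengthSanityCheck {
    final case class ProcessingError(msg: String) extends Exception(msg)

    // maxSpecifiedLengthBytes stands in for a hypothetical tunable. A
    // clobbered length (negative, or absurdly large) becomes an
    // ordinary processing error instead of a huge read or an unhelpful
    // backtracking failure.
    def checkSpecifiedLength(lengthBytes: Long,
                             maxSpecifiedLengthBytes: Long): Unit = {
      if (lengthBytes < 0)
        throw ProcessingError(
          s"dfdl:length evaluated to a negative value: $lengthBytes")
      if (lengthBytes > maxSpecifiedLengthBytes)
        throw ProcessingError(
          s"dfdl:length of $lengthBytes bytes exceeds the limit of " +
          s"$maxSpecifiedLengthBytes bytes")
    }
  }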


Re: Large dfdl:length values on complex types

Posted by "Sloane, Brandon" <bs...@owlcyberdefense.com>.
256MB doesn't strike me as that big. We haven't had filesize limits measured in GB since FAT32 and ext1 (depending on settings; even ext4 can have a limit as low as 16GB). All it takes is such a file to have a length prefix, and we can very easily run into the limit. Combined with our support of large binary blobs, it is not unreasonable that someone would want to use DFDL on such a format.


Re: Large dfdl:length values on complex types

Posted by Steve Lawrence <sl...@apache.org>.
I think blobs are implemented so that they don't have this limit. I know
we've tested with blobs larger than 256MB and they work fine. But I
think this is probably an issue for any complex type or simple type
greater than 256MB. I'm not sure if there really are going to be simple
types greater than 256MB in practice, since using a blob is probably the
right thing. But complex types can definitely have a specified length
greater than 256MB, which as you point out is why 256MB is way too small
and why I bring this up. We definitely need some kind of solution. Maybe
our cache is too small, or maybe we need to redesign some behavior.



Re: Large dfdl:length values on complex types

Posted by "Sloane, Brandon" <bs...@owlcyberdefense.com>.
(Resending after updating my subscribed address)

256MB doesn't strike me as that big. We haven't had filesize limits measured in GB since FAT32 and ext1 (depending on settings; even ext4 can have a limit as low as 16GB). All it takes is such a file to have a length prefix, and we can very easily run into the limit. Combined with our support of large binary blobs, it is not unreasonable that someone would want to use DFDL on such a format.
________________________________
From: Beckerle, Mike <mb...@owlcyberdefense.com>
Sent: Friday, September 11, 2020 1:24 PM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Large dfdl:length values on complex types

Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.

Is this a real format, or a test case designed to push the boundaries?


________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Friday, September 11, 2020 1:14 PM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Large dfdl:length values on complex types

I recently came across an issue where we have something like this:

  <xs:element name="length" type="xs:int" ... />
  <xs:element name="data"
    dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="field1" ... />
        <xs:element name="field2" ... />
        ...
        <xs:element name="fieldN" ... />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

So we have a length element and a complex data field that uses this
length, and the data field is made up of a bunch of fields.

The issue I come across is related to how we cache bytes in buckets for
backtracking. As we fill up buckets, we currently limit the total
cache size of the buckets to 256MB. So if someone ever parses more than
256MB of data and then tries to backtrack past that, we error. The idea
being that we don't want to keep an infinite cache for potential
backtracking and people should have realized that they went down the
wrong branch much earlier.

Though, a problem occurs with the complex types with a large specified
length like above. When we have the complex type with expression
../length, before trying to parse any of the fields, we read that length
number of bytes into our cache buckets to confirm that that number of
bytes exists. The problem occurs if length is more than 256MB. In this
case, we read length number of bytes, and start removing elements from
the cache once we read more than 256MB.

But once that succeeds and we read length bytes, we then try to start
parsing the fields within the complex type, but we've removed those
early cached bytes, and so we fail with an unhelpful backtracking exception.

I'm not sure of the right solution here.

Perhaps we shouldn't be throwing away these bytes when dealing with
complex lengths?

Or perhaps we shouldn't even be trying to determine if that many bytes
are available when we have a specified length. Instead, maybe we should
just set the bit limit to make sure we don't parse more than that?
And if eventually something tries to read a byte and there aren't enough
and we hit that limit, only then do we fail. This feels like the right
solution, but wanted to start a discussion to see if maybe there's a
reason we try to read the full length, or maybe there's another alternative?

Re: Large dfdl:length values on complex types

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.

Is this a real format, or a test case designed to push the boundaries?


________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Friday, September 11, 2020 1:14 PM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Large dfdl:length values on complex types

I recently came across an issue where we have something like this:

  <xs:element name="length" type="xs:int" ... />
  <xs:element name="data"
    dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="field1" ... />
        <xs:element name="field2" ... />
        ...
        <xs:element name="fieldN" ... />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

So we have a length element and a complex data field that uses this
length, and the data field is made up of a bunch of fields.

The issue I come across is related to how we cache bytes in buckets for
backtracking. As we fill up buckets, we currently limit the total
cache size of the buckets to 256MB. So if someone ever parses more than
256MB of data and then tries to backtrack past that, we error. The idea
being that we don't want to keep an infinite cache for potential
backtracking and people should have realized that they went down the
wrong branch much earlier.

Though, a problem occurs with the complex types with a large specified
length like above. When we have the complex type with expression
../length, before trying to parse any of the fields, we read that length
number of bytes into our cache buckets to confirm that that number of
bytes exists. The problem occurs if length is more than 256MB. In this
case, we read length number of bytes, and start removing elements from
the cache once we read more than 256MB.

But once that succeeds and we read length bytes, we then try to start
parsing the fields within the complex type, but we've removed those
early cached bytes, and so we fail with an unhelpful backtracking exception.

I'm not sure of the right solution here.

Perhaps we shouldn't be throwing away these bytes when dealing with
complex lengths?

Or perhaps we shouldn't even be trying to determine if that many bytes
are available when we have a specified length. Instead, maybe we should
just set the bit limit to make sure we don't parse more than that?
And if eventually something tries to read a byte and there aren't enough
and we hit that limit, only then do we fail. This feels like the right
solution, but wanted to start a discussion to see if maybe there's a
reason we try to read the full length, or maybe there's another alternative?

Re: Large dfdl:length values on complex types

Posted by Steve Lawrence <sl...@apache.org>.
A good example of this and what I'm suggesting is in
SpecifiedLengthParserBase. This parser is created when there's a
dfdl:lengthKind="explicit" property and a dfdl:length property. The way
this parser works is that it first figures out what the specified length is:

  val maybeNBits = getBitLength(pState)

In the case described above, this returns a length corresponding to more
than 256MB of data. We then check whether that many bits are actually
available in the input stream:

  if (!dis.isDefinedForLength(nBits)) {
    PENotEnoughBits(pState, nBits, dis.remainingBits)
    return
  }

This isDefinedForLength function is what reads nBits bits from the
underlying input stream and caches them. The problem here is that if nBits
is more than 256MB worth of data, then this function also starts throwing
away bytes to keep the cache small. So even though we know the data is
defined for that length, we won't actually have the first chunk of bytes
available anymore.

After that, we try to actually parse a simple/complex type with the
following:

  val isLimitOk: Boolean = dis.withBitLengthLimit(nBits) {
    eParser.parse1(pState)
  }

So we set a bit limit of nBits and then parse whatever we're parsing
(e.g. a complex or simple type). The bit limit prevents the child parser
from ever parsing beyond nBits, and because isDefinedForLength succeeded,
these parsers can assume that the appropriate bits will be available.

The second idea I mentioned was to essentially get rid of the
isDefinedForLength check, but still keep the withBitLengthLimit call.
This means the child eParser.parse() call can no longer assume that bits
are available, and it will have to handle something like an end-of-stream
error and create the same error as if isDefinedForLength had failed. This
adds complications because the eParser can be lots of different parsers.
It could be a simple type parser, a complex type parser, a
sequence-of-many-types parser, etc.
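
To make that concrete, here is a rough, self-contained toy sketch of the
idea (none of these names are real Daffodil API; it only illustrates
bounding the child parse with a length limit and failing lazily at the
read that actually runs out of data):

  import java.io.{ByteArrayInputStream, EOFException, InputStream}

  final class ToyInputStream(in: InputStream) {
    private var remainingLimit: Long = Long.MaxValue

    def readByte(): Int = {
      if (remainingLimit <= 0)
        throw new EOFException("attempted to read past the length limit")
      val b = in.read()
      if (b < 0)
        throw new EOFException("ran out of data before the specified length")
      remainingLimit -= 1
      b
    }

    // Analogous in spirit to withBitLengthLimit: run body with a tighter
    // limit, then restore the outer limit minus whatever body consumed.
    def withLengthLimit[A](n: Long)(body: => A): A = {
      val outer = remainingLimit
      val inner = math.min(outer, n)
      remainingLimit = inner
      try body
      finally {
        val consumed = inner - remainingLimit
        remainingLimit = outer - consumed
      }
    }
  }

  object LazyLengthDemo extends App {
    val dis = new ToyInputStream(new ByteArrayInputStream(Array[Byte](1, 2, 3)))
    // The "specified length" claims 10 bytes but only 3 exist. There is no
    // up-front availability check; the failure surfaces at the fourth read
    // as an ordinary not-enough-data error, not a backtracking error.
    val result =
      try {
        dis.withLengthLimit(10) { (1 to 10).map(_ => dis.readByte()) }
        "parsed ok"
      } catch {
        case e: EOFException => s"parse error: ${e.getMessage}"
      }
    println(result)
  }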

I think this would be a pretty large change to the DataInputStream API.
Right now, because we always check isDefinedForLength, we can always
assume calls like getSignedLong will succeed. If we remove the
isDefinedForLength check, calls to DataInputStream functions can no
longer make assumptions about having enough data, and must handle
EndOfStream.
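
As a purely hypothetical illustration of that contract change (these names
are made up for the sketch, not taken from the actual DataInputStream API),
every read would have to be able to report a missing-data condition itself,
e.g. by returning an Option:

  // Hypothetical sketch only, not Daffodil's real API.
  trait ToyDataInputStream {
    // Today callers rely on a prior isDefinedForLength check, so a read can
    // simply return the value. Without that check, every read must be able
    // to say "not enough data".
    def getSignedLong(nBits: Int): Option[Long]
  }

  object ToyCaller {
    // Every caller picks up a new obligation: handle the end-of-stream case.
    def parseInt32(dis: ToyDataInputStream): Either[String, Long] =
      dis.getSignedLong(32) match {
        case Some(v) => Right(v)
        case None    => Left("not enough bits for a 32-bit integer")
      }
  }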

It also changes how the bucketing/caching works in the
BucketingInputSourceDataInputStream, since isDefinedForLength is what
fills the buckets. Instead, that would need to happen on reads from the
underlying stream. Not a big change, but a change nonetheless.
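
For what it's worth, here's a rough self-contained sketch of that shape
(the class here is invented and vastly simplified compared to the real
bucketing code; it only shows buckets being filled by reads from the
underlying stream, with the same 256MB eviction policy):

  import java.io.InputStream
  import scala.collection.mutable

  final class ToyBucketingSource(
    in: InputStream,
    bucketSize: Int = 1 << 13,               // 8KB buckets
    maxCacheBytes: Long = 256L * 1024 * 1024 // 256MB cache cap
  ) {
    private val buckets = mutable.Queue[Array[Byte]]()
    private var cachedBytes = 0L

    // Pull the next bucket from the underlying stream and cache it, evicting
    // the oldest buckets once the cache exceeds its cap.
    def readBucket(): Option[Array[Byte]] = {
      val buf = new Array[Byte](bucketSize)
      val n = in.read(buf)
      if (n < 0) None
      else {
        val bucket = java.util.Arrays.copyOf(buf, n)
        buckets.enqueue(bucket)
        cachedBytes += n
        while (cachedBytes > maxCacheBytes && buckets.nonEmpty) {
          cachedBytes -= buckets.dequeue().length
        }
        Some(bucket)
      }
    }
  }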

It doesn't seem unreasonable to only call isDefinedForLength for smaller
lengths so we can still quickly pre-fill the cache, but the DataInputStream
and callers of its functions still need to be changed because they can no
longer assume that it's been called. A sketch of that compromise is below.
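
As a sketch of that compromise (the constant and helper names here are
illustrative stand-ins, not existing tunables or parser code), the eager
check could simply be gated on the cache cap:

  // Illustrative only; the 256MB figure and these names are stand-ins.
  object EagerCheckPolicy {
    val maxCacheBits: Long = 256L * 1024 * 1024 * 8

    // Pre-check availability only when the whole specified length fits in
    // the cache; larger lengths rely on the bit limit plus lazy
    // end-of-data errors.
    def shouldCheckEagerly(nBits: Long): Boolean = nBits <= maxCacheBits
  }

  // In the specified-length parser, the up-front check would then become
  // something like:
  //
  //   if (EagerCheckPolicy.shouldCheckEagerly(nBits) &&
  //       !dis.isDefinedForLength(nBits)) {
  //     PENotEnoughBits(pState, nBits, dis.remainingBits)
  //     return
  //   }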



On 9/11/20 2:40 PM, Interrante, John A (GE Research, US) wrote:
> Steve,
> 
> Can you elaborate on what you mean by setting the bit limit?  The DFDL specification has too many occurrences of "length" and "limit" for me to easily grasp what you propose to do, although setting a limit of how many bits to read for the current complex element and failing only once we find out there aren't enough bits OR we hit that limit makes sense to me.
> 
> I searched the Daffodil codebase and found that both "bitLimit0b" and "bitLimit1b" are used in the codebase.  Seems like you plan to tie into that somehow instead of trying to read that many bits into your cache buckets simply to confirm as early as possible that the given number of bits exists.  That sounds reasonable for explicit lengths that are larger than what you want to cache ahead of time (256MB).
> 
> I could envision you setting both a bit limit and reading bytes into the cache buckets when the explicit length is smaller than 256MB, but skipping the read-ahead check whenever the explicit length is greater than 256MB and relying on the bit limit to produce the right exception later.
> 
> John
> 
> -----Original Message-----
> From: Steve Lawrence <sl...@apache.org> 
> Sent: Friday, September 11, 2020 1:14 PM
> To: dev@daffodil.apache.org
> Subject: EXT: Large dfdl:length values on complex types
> 
> I recently came across an issue where we have something like this:
> 
>   <xs:element name="length" type="xs:int" ... />
>   <xs:element name="data"
>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="field1" ... />
>         <xs:element name="field2" ... />
>         ...
>         <xs:element name="fieldN" ... />
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
> 
> So we have a length element and a complex data field that uses this length, and the data field is made up of a bunch of fields.
> 
> The issue I come across is related to how we cache bytes in buckets for backtracking. As we fill up buckets, we currently limit the total cache size of the buckets to 256MB. So if someone ever parses more than 256MB of data and then tries to backtrack past that, we error. The idea being that we don't want to keep an infinite cache for potential backtracking and people should have realized that they went down the wrong branch much earlier.
> 
> Though, a problem occurs with the complex types with a large specified length like above. When we have the complex type with expression ../length, before trying to parse any of the fields, we read that length number of bytes into our cache buckets to confirm that that number of bytes exists. The problem occurs if length is more than 256MB. In this case, we read length number of bytes, and start removing elements from the cache once we read more than 256MB.
> 
> But once that succeeds and we read length bytes, we then try to start parsing the fields within the complex type, but we've removed those early cached bytes, and so we fail with an unhelpful backtracking exception.
> 
> I'm not sure of the right solution here.
> 
> Perhaps we shouldn't be throwing away these bytes when dealing with complex lengths?
> 
> Or perhaps we shouldn't even be trying to determine if that many bytes are available when we have a specified length. Instead, maybe we should just set the bit limit to make sure we don't parse more than that?
> And if eventually something tries to read a byte and there aren't enough and we hit that limit, only then do we fail. This feels like the right solution, but wanted to start a discussion to see if maybe there's a reason we try to read the full length, or maybe there's another alternative?
> 


RE: Large dfdl:length values on complex types

Posted by "Interrante, John A (GE Research, US)" <in...@research.ge.com>.
Steve,

Can you elaborate on what you mean by setting the bit limit?  The DFDL specification has too many occurrences of "length" and "limit" for me to easily grasp what you propose to do, although setting a limit of how many bits to read for the current complex element and failing only once we find out there aren't enough bits OR we hit that limit makes sense to me.

I searched the Daffodil codebase and found that both "bitLimit0b" and "bitLimit1b" are used in the codebase.  Seems like you plan to tie into that somehow instead of trying to read that many bits into your cache buckets simply to confirm as early as possible that the given number of bits exists.  That sounds reasonable for explicit lengths that are larger than what you want to cache ahead of time (256MB).

I could envision you setting both a bit limit and reading bytes into the cache buckets when the explicit length is smaller than 256MB, but skipping the read-ahead check whenever the explicit length is greater than 256MB and relying on the bit limit to produce the right exception later.

John

-----Original Message-----
From: Steve Lawrence <sl...@apache.org> 
Sent: Friday, September 11, 2020 1:14 PM
To: dev@daffodil.apache.org
Subject: EXT: Large dfdl:length values on complex types

I recently came across an issue where we have something like this:

  <xs:element name="length" type="xs:int" ... />
  <xs:element name="data"
    dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="field1" ... />
        <xs:element name="field2" ... />
        ...
        <xs:element name="fieldN" ... />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

So we have a length element and a complex data field that uses this length, and the data field is made up of a bunch of fields.

The issue I come across is related to how we cache bytes in buckets for backtracking. As we fill up buckets, we currently limit the total cache size of the buckets to 256MB. So if someone ever parses more than 256MB of data and then tries to backtrack past that, we error. The idea being that we don't want to keep an infinite cache for potential backtracking and people should have realized that they went down the wrong branch much earlier.

Though, a problem occurs with the complex types with a large specified length like above. When we have the complex type with expression ../length, before trying to parse any of the fields, we read that length number of bytes into our cache buckets to confirm that that number of bytes exists. The problem occurs if length is more than 256MB. In this case, we read length number of bytes, and start removing elements from the cache once we read more than 256MB.

But once that succeeds and we read length bytes, we then try to start parsing the fields within the complex type, but we've removed those early cached bytes, and so we fail with an unhelpful backtracking exception.

I'm not sure of the right solution here.

Perhaps we shouldn't be throwing away these bytes when dealing with complex lengths?

Or perhaps we shouldn't even be trying to determine if that many bytes are available when we have a specified length. Instead, maybe we should just set the bit limit to make sure we don't parse more than that?
And if eventually something tries to read a byte and there aren't enough and we hit that limit, only then do we fail. This feels like the right solution, but wanted to start a discussion to see if maybe there's a reason we try to read the full length, or maybe there's another alternative?