Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2018/11/13 13:22:17 UTC

I think DFDL (Daffodil) should be able to handle input files up to 2^64 in size

Hello DFDL community,

I think Daffodil should handle input files up to 2^64 in size.

2^64 = 18,446,744,073,709,551,616 (20 digits)

Why that number? Here's why:

The number 2^64 is:

The number of distinct values representable in a single word on a 64-bit processor. Or, the number of values representable in a doubleword on a 32-bit processor. Or, the number of values representable in a quadword on a 16-bit processor, such as the original x86 processors.

The number of distinct values a long variable can hold in the Java and C# programming languages.

The number of distinct values an Int64 or QWord variable can hold in the Pascal programming language.

The total number of IPv6 addresses generally given to a single LAN or subnet.
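
The figures above are easy to check with arbitrary-precision integers. The following is only a small illustrative sketch in Python (the variable names are made up for this example). It also shows one caveat: a signed 64-bit type such as a Java long has 2^64 distinct values, but its largest value is 2^63 - 1.

```python
# Check the size and digit count of 2^64 with Python's big integers.
n = 2 ** 64
print(f"{n:,}")        # 18,446,744,073,709,551,616
print(len(str(n)))     # 20

# A signed 64-bit integer spans -2^63 .. 2^63 - 1: 2^64 distinct values,
# but the largest offset it can hold is 2^63 - 1.
signed_min, signed_max = -(2 ** 63), 2 ** 63 - 1
print(signed_max - signed_min + 1 == n)   # True
print(f"{signed_max:,}")                  # 9,223,372,036,854,775,807
```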

Thoughts?

https://www.quora.com/How-big-is-2-power-64 

/Roger

Re: I think DFDL (Daffodil) should be able to handle input files up to 2^64 in size

Posted by Steve Lawrence <sl...@apache.org>.
Supporting much larger files is something we hope to be able to do
in the future. In fact, Daffodil 2.2.0 removed most of the technical
limitations, so theoretically we can support reading in arbitrarily
large files. The real issue is that Daffodil requires lots
and lots of memory, which makes it difficult to support large files in
practice. The two main memory limitations:

1) The way Daffodil 2.2.0 works is it reads data into relatively small
buckets and throws away buckets of data that it no longer needs.
However, if a schema is written in such a way that it may backtrack, we
cannot throw away some buckets. So we end up holding on to data and
using up memory just in case we need that data later. For large files,
this can eat up a lot of memory. Designing schemas to limit backtracking
can help with this, but may not be possible in some cases.
Alternatively, we could add a feature that throws away old buckets after
they have gone unused for a while, on the assumption that they won't be
backtracked to even if it could theoretically happen, and if it does
happen then we error.

2) A large issue with memory usage is the size of the infoset, which can
easily explode in size when compared to the input data. The planned
solution here is to support SAX style streaming, so that we could send
out SAX events and throw away parts of the infoset we don't need
(similar to how we throw away buckets of data). We have some plans for
how we can do this, and it is on the future roadmap. But it's a pretty
large change with some technical complexities that we haven't had a
chance to fully work through yet.
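
To make point 1 concrete, here is a minimal sketch of the bucket idea in Python. It is a toy model written for this note, not Daffodil's actual implementation: the class BucketedInput, its methods, and the 4-byte bucket size are all hypothetical. It shows why an outstanding backtrack point pins old buckets in memory, and how they can be released once the parser commits.

```python
import io

class BucketedInput:
    """Toy model: fixed-size buckets, released once no mark can reach them."""

    def __init__(self, stream, bucket_size=4):
        self.stream = stream
        self.bucket_size = bucket_size
        self.buckets = {}   # bucket index -> bytes still held in memory
        self.loaded = 0     # number of buckets read from the stream so far
        self.pos = 0        # current read position in bytes
        self.marks = []     # positions a backtrack could return to

    def _load_through(self, index):
        while self.loaded <= index:
            self.buckets[self.loaded] = self.stream.read(self.bucket_size)
            self.loaded += 1

    def read_byte(self):
        index, offset = divmod(self.pos, self.bucket_size)
        self._load_through(index)
        data = self.buckets[index]
        if offset >= len(data):
            return None     # end of input
        self.pos += 1
        self._release()
        return data[offset]

    def mark(self):
        self.marks.append(self.pos)

    def reset(self):
        self.pos = self.marks.pop()   # backtrack to the most recent mark

    def discard_mark(self):
        self.marks.pop()              # commit: this point can no longer be revisited
        self._release()

    def _release(self):
        # Keep every bucket at or after the earliest position we might revisit.
        earliest = min(self.marks + [self.pos])
        keep_from = earliest // self.bucket_size
        for index in [i for i in self.buckets if i < keep_from]:
            del self.buckets[index]

src = BucketedInput(io.BytesIO(bytes(range(16))))
src.mark()                            # a possible backtrack point at offset 0
for _ in range(10):
    src.read_byte()
held_with_mark = len(src.buckets)     # buckets 0, 1 and 2 are all pinned
src.discard_mark()                    # commit, so older buckets can go
held_after_commit = len(src.buckets)
print(held_with_mark, held_after_commit)   # 3 1
```

In this model the memory held grows with the distance between the oldest outstanding mark and the current position, which is why a schema that rarely commits becomes expensive on large inputs.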

That said, if you have a message that is really just the repetition of a
whole bunch of small messages, you can use the new --stream feature. See
this mailing list post that talks about how streaming works:

https://lists.apache.org/thread.html/4e9026a21583b7aa2c5112cf6b6ad8135ec7526e6d24b80e2ae99494@%3Cusers.daffodil.apache.org%3E

Streaming essentially allows Daffodil to throw away buckets and the
infoset after each individual stream parse completes, rather than
holding on to everything until the one large parse completes.
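
The general idea of parsing one small message at a time can be sketched as follows. This is only an illustration in Python, not how Daffodil's --stream option is implemented: the function iter_messages and the 2-byte length-prefixed format are assumptions invented for the example.

```python
import io
import struct

def iter_messages(stream):
    """Yield one message at a time from a stream of length-prefixed messages.

    The 2-byte big-endian length prefix is a made-up format for this sketch.
    """
    while True:
        header = stream.read(2)
        if not header:
            return                       # clean end of input
        if len(header) < 2:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(">H", header)
        body = stream.read(length)
        if len(body) < length:
            raise ValueError("truncated message")
        yield body                       # handled by the caller, then freed

payload = b"".join(
    struct.pack(">H", len(m)) + m for m in (b"alpha", b"be", b"gamma")
)
sizes = [len(message) for message in iter_messages(io.BytesIO(payload))]
print(sizes)   # [5, 2, 5]
```

Because each message is yielded and then dropped, peak memory in this sketch depends on the largest single message rather than on the total size of the input.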

- Steve

