Posted to dev@daffodil.apache.org by "Michael Beckerle (Jira)" <ji...@apache.org> on 2019/08/28 16:50:00 UTC

[jira] [Commented] (DAFFODIL-2194) buffered data output stream has a chunk limit of 2GB

    [ https://issues.apache.org/jira/browse/DAFFODIL-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917916#comment-16917916 ] 

Michael Beckerle commented on DAFFODIL-2194:
--------------------------------------------

I don't think auto-splitting DirectOrBufferedDataOutputStream whenever a buffered instance grows near the 2 GByte maximum is, by itself, a good fix.

It will work and will allow larger blobs to be processed, but it doesn't keep the blob data out of the java heap. We would get past the 2 GByte per-object size limit, but the java heap size limit would be the next thing we hit.
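To make the trade-off concrete, the auto-splitting approach amounts to roughly the following. This is an illustrative sketch only, not the actual DirectOrBufferedDataOutputStream code, and the class and method names here are made up. Java arrays are int-indexed, which is why a single ByteArrayOutputStream tops out near Integer.MAX_VALUE (about 2 GB):

    import java.io.ByteArrayOutputStream
    import scala.collection.mutable.ArrayBuffer

    // Hypothetical sketch: split into a new heap-backed chunk before any single
    // ByteArrayOutputStream reaches the ~2 GB array-size ceiling.
    class SplittingBufferedOutput(maxChunkSize: Int = Int.MaxValue - 8) {
      private val chunks = ArrayBuffer(new ByteArrayOutputStream())

      def write(bytes: Array[Byte]): Unit = {
        var pos = 0
        while (pos < bytes.length) {
          val current = chunks.last
          val room = maxChunkSize - current.size()
          if (room == 0) {
            chunks += new ByteArrayOutputStream() // start a new chunk, still on the heap
          } else {
            val n = math.min(room, bytes.length - pos)
            current.write(bytes, pos, n)
            pos += n
          }
        }
      }

      // Total buffered length can exceed 2 GB even though no single chunk does,
      // but every byte is still held in java heap memory.
      def totalLength: Long = chunks.map(_.size().toLong).sum

      def writeTo(out: java.io.OutputStream): Unit = chunks.foreach(_.writeTo(out))
    }

The splitting removes the per-object ceiling, but a 64 GByte blob written this way still needs 64 GBytes of heap.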

We want to be able to parse image/video files that are much larger than we would ever want the java heap to grow, e.g., a 64 GByte NITF file parsed in a JVM with a 5 GByte heap.

That means DirectOrBufferedDataOutputStream would need a specific blob-indirect flavor. When you unparse a blob, rather than streaming it into a buffered instance and copying all of its bytes into one or more java heap objects of size <= 2GB, you would just defer the blob and record a pointer to the file (adding its length to the accumulated length so the dfdl:valueLength function still works). Only when the streams collapse together would the blob file's contents be read and streamed into the direct output stream.
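In rough terms, the blob-indirect idea could look something like the sketch below. Again, the names and structure are invented for illustration and are not Daffodil's actual unparser classes; it just shows a buffered stream that records either in-heap bytes or a reference to a blob file, counts both toward the accumulated length, and only reads the blob from disk (through a small fixed-size chunk) when the buffers collapse into the direct stream:

    import java.io.OutputStream
    import java.nio.file.{Files, Path}

    sealed trait BufferedPart { def lengthInBytes: Long }
    final case class HeapBytes(bytes: Array[Byte]) extends BufferedPart {
      def lengthInBytes: Long = bytes.length.toLong
    }
    final case class BlobFileRef(path: Path) extends BufferedPart {
      def lengthInBytes: Long = Files.size(path) // length known without reading the data
    }

    // Hypothetical "blob-indirect" buffered stream sketch.
    final class BlobAwareBufferedStream {
      private var parts = Vector.empty[BufferedPart]

      def writeBytes(bytes: Array[Byte]): Unit = parts :+= HeapBytes(bytes)

      // Unparsing a blob records only a pointer to the file; its length still
      // counts toward the accumulated length used for dfdl:valueLength.
      def writeBlob(path: Path): Unit = parts :+= BlobFileRef(path)

      def accumulatedLength: Long = parts.map(_.lengthInBytes).sum

      // Called when this buffered stream collapses into the direct output stream.
      def deliverTo(direct: OutputStream): Unit = parts.foreach {
        case HeapBytes(bytes) => direct.write(bytes)
        case BlobFileRef(path) =>
          val in = Files.newInputStream(path)
          try {
            val chunk = new Array[Byte](64 * 1024)
            var n = in.read(chunk)
            while (n >= 0) {
              if (n > 0) direct.write(chunk, 0, n)
              n = in.read(chunk)
            }
          } finally in.close()
      }
    }

With something along these lines, only a small fixed-size copy buffer is ever on the heap for the blob, regardless of how large the blob file is.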

 

> buffered data output stream has a chunk limit of 2GB
> ----------------------------------------------------
>
>                 Key: DAFFODIL-2194
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2194
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Back End
>            Reporter: Steve Lawrence
>            Assignee: Steve Lawrence
>            Priority: Major
>             Fix For: 2.5.0
>
>
> A buffered data output stream is backed by a growable ByteArrayOutputStream, which can only grow to 2GB in size. So if we ever try to write more than 2GB to a buffered output stream during unparse (very possible with large blobs), we'll get an OutOfMemoryError.
> One potential solution is to be aware of the size of a ByteArrayOutputStream when buffering output and automatically create a split when it gets to 2GB in size. This will still require a ton of memory since we're buffering these in memory, but we'll at least be able to unparse more than 2GB of contiguous data.
> Note that we should still be able to unparse more than 2GB of data in total, as long as there is no single buffer that's more than 2GB.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)