Posted to dev@daffodil.apache.org by Mike Beckerle <mb...@tresys.com> on 2018/09/04 19:18:02 UTC

Re: BLOB thoughts

Hi Russ,


So you do have the option to include the contents in the infoset today. Just model the object as a hexBinary type. The entire blob contents will appear as a byte array in the infoset, rendered as a hex string if you output to XML/JSON.
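
For illustration, the hex rendering itself is trivial (plain Java, not the Daffodil API; HexFormat needs Java 17+):

import java.util.HexFormat;

public class HexDemo {
    public static void main(String[] args) {
        // a hexBinary element's bytes show up in XML output as a hex string
        byte[] blob = { (byte) 0xCA, (byte) 0xFE, 0x00, 0x42 };
        System.out.println(HexFormat.of().withUpperCase().formatHex(blob)); // CAFE0042
    }
}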


The overhead of this, and the limit it places on the size of images you can process (the maximum size of a Java byte array), is the problem here. We could lift the size limit by making the representation an array of byte arrays, but you still have the overall process memory-size issue, and the overhead of copying all that data into the address space of a JVM... for no purpose.
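
To make that tradeoff concrete, here's a rough sketch of the array-of-byte-arrays representation (hypothetical names, plain Java) - note that it still copies everything onto the JVM heap:

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class ChunkedBytes {
    private static final int CHUNK = 1 << 26; // 64 MiB per chunk (arbitrary)
    private final List<byte[]> chunks = new ArrayList<>();

    static ChunkedBytes readAll(InputStream in) throws IOException {
        ChunkedBytes cb = new ChunkedBytes();
        byte[] buf = in.readNBytes(CHUNK);
        while (buf.length > 0) {       // zero-length read means EOF
            cb.chunks.add(buf);
            buf = in.readNBytes(CHUNK);
        }
        return cb;
    }

    byte get(long index) { // random access across chunk boundaries
        return chunks.get((int) (index / CHUNK))[(int) (index % CHUNK)];
    }
}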


So really this is only about *not* doing that, either because you have no interest in the data, just the surrounding metadata, or because you want to avoid forcing anything that doesn't actually need the blob data to read it.


If the format supports it, I'm hoping the implementation would skip the content either using a seek or the memory-mapped tricks you mention below.
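
Something like this sketch (hypothetical names, plain NIO) for the case where the length is known up front:

import java.io.IOException;
import java.nio.channels.SeekableByteChannel;

final class BlobSkipper {
    // When the blob's length is known in advance, record where it starts
    // and seek past it - no bytes are read or copied.
    static long[] skipBlob(SeekableByteChannel ch, long blobLength) throws IOException {
        long start = ch.position();
        ch.position(start + blobLength);
        return new long[] { start, blobLength }; // (offset, length) for the infoset
    }
}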


Alas, some formats have big blobs in them and do NOT store the length in advance - rather, they depend on a delimiter that cannot occur within the binary data. In that case some sort of find-the-end fast scanner, one that does not store what it scans past, would be optimal.
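
A sketch of such a scanner (hypothetical, byte-at-a-time for clarity - a real one would read in blocks): it retains only a delimiter-sized window of recent bytes, never the blob content itself:

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class DelimiterScanner {
    // Returns the blob length (bytes before the delimiter),
    // or -1 if the stream ends before the delimiter is found.
    static long findEnd(InputStream in, byte[] delim) throws IOException {
        byte[] window = new byte[delim.length];
        long count = 0;
        int b;
        while ((b = in.read()) != -1) {
            // slide the window left by one byte and append the new one
            System.arraycopy(window, 1, window, 0, delim.length - 1);
            window[delim.length - 1] = (byte) b;
            count++;
            if (count >= delim.length && Arrays.equals(window, delim)) {
                return count - delim.length; // length of the blob proper
            }
        }
        return -1;
    }
}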


One alternative idea is to just provide some expression language functions, e.g., daf:currentPosition('bytes'), which could be used to put the seek position of a blob and its length into the infoset as numeric values without including the blob contents. Then a subsequent processing pass can use those to access the "blob" contents directly from the original file via mmap or seek. So a blob would show up in XML like:


<blob>
  <startPos>898987</startPos>
  <len>989289892939</len>
</blob>
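
A subsequent pass could then pull the bytes straight out of the original file, e.g. with a sketch like this (hypothetical names; note FileChannel.map caps each window at 2GB, so bigger blobs would need multiple windows):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class BlobFetcher {
    // Given the <startPos>/<len> recorded in the infoset, fetch the blob
    // bytes directly from the original file in a later pass.
    static ByteBuffer fetch(Path original, long startPos, long len) throws IOException {
        try (FileChannel ch = FileChannel.open(original, StandardOpenOption.READ)) {
            // map() avoids copying the region onto the heap at all;
            // len must fit in an int for a single mapped window
            return ch.map(FileChannel.MapMode.READ_ONLY, startPos, len);
        }
    }
}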


I mean, to some extent this is enough. It works so long as the blob is at the top level of the data, not nested inside some compressed region. Still, that would suffice for the important use cases we have today.


-mike


________________________________
From: Russ Williams <ru...@reciprocity.com>
Sent: Thursday, August 30, 2018 5:45:52 PM
To: dev@daffodil.apache.org
Subject: Re: BLOB thoughts

Looks interesting! Only had time to skim it briefly, but I’m unsure about the BLOB handle/URI.

The handle requirements in the image pipeline section (durable, opaque, completely defined, permanent/canonicalised) are good, though there’s a clear tension between the infoset containing everything needed to unparse the file and yet not containing the BLOBs themselves. I don’t think there’s a way around that without explicitly including the BLOBs’ contents in the infoset, so a decision will have to be made on whether the output of parsing is a collection of “file” objects (the infoset/XML plus N BLOBs) or still a single infoset/XML containing references to the original file (so you can no longer delete it after parsing). I’d probably lean towards the latter, personally, but there are arguments for either, or even both.

Actually, there’s probably a third possibility, which is to include the BLOB contents in the infoset/XML, with the arbitrary bytes encoded as something like base64. That keeps to the single-infoset-contains-everything pattern, allows the original file to be deleted after parsing, and doesn’t clutter up the filesystem with strangely-named temp files… but it also defeats most of the point of using BLOBs. It makes sense if the reason for using a BLOB is that the data is opaque rather than huge, though.
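
For what it’s worth, the round trip itself is trivial in plain Java - the cost is the roughly 33% base64 inflation on top of holding the whole blob in memory:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] blob = "arbitrary opaque bytes".getBytes(StandardCharsets.UTF_8);
        // encode: safe to embed as element text in XML
        String encoded = Base64.getEncoder().encodeToString(blob);
        System.out.println(encoded);
        // decode: recovers the original bytes exactly
        byte[] roundTripped = Base64.getDecoder().decode(encoded);
        assert java.util.Arrays.equals(blob, roundTripped);
    }
}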

For all the talk about BLOBs/files, there are a couple of other options - both memory-mapped files and seeking on the input stream make it possible to avoid reading the contents of the BLOB at all. That might not matter for a 10MB image, but it’s damn sure going to count if you’re trying to parse a 60GB UHD Blu-ray. This fits the “avoid copying” implementation concern.
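
E.g. (a sketch, hypothetical names): FileChannel.transferTo can hand a blob-sized region to an output and let the OS shuttle the bytes, never pulling them through the JVM heap:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class BlobCopier {
    // Copy a blob region from the source file to an output channel
    // without buffering it in user space.
    static void copyRegion(Path src, long offset, long len, WritableByteChannel out)
            throws IOException {
        try (FileChannel ch = FileChannel.open(src, StandardOpenOption.READ)) {
            long done = 0;
            while (done < len) {
                long n = ch.transferTo(offset + done, len - done, out);
                if (n <= 0) break; // defensive: stop if no progress (EOF)
                done += n;
            }
        }
    }
}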

I’m not sure the SAX-style parser case is really all that different from the image pipeline. If the BLOB definitions fit the requirements for reproducibility/etc, there’s no reason why the BLOB parameters couldn’t be “saved for later” as an opaque handle - like a super-powered version of ftell() - so the BLOB wouldn’t have to be consumed at the moment it’s encountered. For a trivial case where the structure is maintained but you’re swapping the word “foo” for “bar” it won’t matter, but anything using some local storage - such as sorting a list of elements before output - would benefit from being able to defer reading the BLOB data until it’s actually being written out.
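
That handle could be as little as (source, offset, length) - a hypothetical sketch (Java 16+ records; the (int) cast caps a single read at 2GB):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// An opaque, immutable handle capturing everything needed to re-read
// the blob later, so a SAX-style consumer can reorder events and only
// touch the blob bytes at write-out time.
record BlobHandle(Path source, long offset, long length) {
    ByteBuffer read() throws IOException { // deferred until actually needed
        try (FileChannel ch = FileChannel.open(source, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) length);
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) break; // EOF before the expected length
                pos += n;
            }
            return buf.flip();
        }
    }
}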

Gotta say I don’t particularly like the look of the BLOB URIs. Feels like it’s assuming too much about the implementation, and is unnecessarily limiting. Also, it’s not a URI as there’s no scheme… don’t know if that would be better as a new blob:// thing or if a standard file:// URI should be allowable within <BLOB> tags or something else entirely. It also seems kinda weird that it’s in the text of the tag, rather than as an attribute of a singleton tag.

Sorry, this turned out to be quite a bit longer than I expected ^_^;;

Cheers,
—
Russ

PS - this email client *really* hates the word “infoset” and keeps wanting to turn it into “infest”...

> On 30 Aug 2018, at 19:34, Mike Beckerle <mb...@tresys.com> wrote:
>
> I put some thought into the design of the BLOB feature we need to add to Daffodil for image file parsing.
>
>
> I think it's actually rather easily done using layering. I.e., a BLOB is parsed/unparsed via a layer with dfdl:layerTransform="BLOB".
>
>
> This layer is actually a fake-out that produces the bytes corresponding to the URI that identifies the BLOB file.
>
>
> See the wiki page on BLOB here: https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+BLOB+Objects+-+Binary+Large+Objects
>
>
> All comments are welcome here or on the wiki page.