You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by GitBox <gi...@apache.org> on 2021/01/27 21:51:51 UTC

[GitHub] [incubator-daffodil] mbeckerle commented on pull request #480: WIP: First cut at a Zip layer transform

mbeckerle commented on pull request #480:
URL: https://github.com/apache/incubator-daffodil/pull/480#issuecomment-768600775


   Responding to @bsloane1650 
   
   > Some high level comments:
   > 
   >     1. It feels like we should ship schema for our self defined ZIP header type, so the user schema can be something like:
   >
   Ok. Probably it should live in daffodil-lib src/main/resources/org...../dfdlZipEntryHeader.dfdl.xsd where it can just be included.
   My use case would want to modify this however, as the data part of specific zip entries might want to be parsed based on some other schema element.  I intend to look at the file extension of the files, and if they are ".bin" parse them according to a specific format. Probably that sort of entry parsing could be handled the same way we do message_size and payload in other header schemas, the user would just provide their own implementation of the payload part which would supersede the built-in one via classpath. 
   
   > 
   > ```
   > <xs:element name="zipEntry" maxOccurs="unbounded" dfdl:occursCountKind="implicit">
   >   <xs:complexType>
   >     <xs:sequence>
   >       <xs:element name="header" type="dfdlx:ZipHeader"/>
   >        ...
   >     </xs:sequence>
   >   </xs:complexType>
   > </xs:element>
   > ```
   > 
   > In addition to being generally easier for the user, this should also allow us to make changes with a bit more backwards compatibility.
   > 
   > It also feels wrong that we are passing the metadata in-band in the first place; but I don't see any way around that without redesigning DFDL somewhat.
   > 
   >     1. Silently dropping deleted entries seems like a limitation to me as well. From a DFDL interface standpoint supporting deleted entries is straightforward: just add a boolean field to the entry header indicating if the file was deleted. From an implementation standpoint, this probably means we would need to roll our own ZIP implementation.
   >        Not something we need to address, but worth explicitly listing as a limitation.
   > 
   Yeah, zip is actually quite complex in this regard. There can be dead-space in the file. To modify the contents of a zip, you just write a new trailing directory record to the the data, and that can put the files anywhere, as it stores the offsets to them.  Zip files can also be split over multiple volumes, in which case the earlier volumes can quite literally not be interpreted at all until you get the final volume. Entries can be marked deleted explicitly, but they needn't be actually. The trailing directory record can just not indicate that anything appears in the data at what was previously an entry. That "deletes" the entry. 
   
   >     2. Similar to (2), we silently drop dead space from the ZIP file. Probably less even less important than (2), but still a limitation if anyone wants to use this for forensic purposes.
   
   This is definitely the nature of the limitation. Not usable for forensics on the zip data at all really. 
   > 
   >     3. If I understand the ZIP format correctly, there is a global comment in the that applies to the entire ZIP which we do not expose. We may want to include a global ZIP header before the first entry header.
   
   I will look into this. Nothing i've seen requires anything specific to be first in the file. The global directory, which is last in the data, says where everything goes. 
   > 
   >     4. A maliciously crafted ZIP file might be able to cause all sorts of interesting issues.
   
   This is a real concern. Many of the use cases for Daffodil are cybersecurity related. So use of this zip layerTransform we inherit its bugs relative to security. 
   
   OTOH, parse/unparse of a zip using this layerTransform removes all dead space in the zip file, removes all deleted entries, etc. So it reduces the attack surface for anything downstream that takes in the data. At least from data exfiltration perspective this removes a bunch of covert channels. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org