Posted to issues@nifi.apache.org by "threeplanetssoftware (Jira)" <ji...@apache.org> on 2022/10/14 23:56:00 UTC

[jira] [Created] (NIFI-10654) UnpackContent Processor Doesn't Support Multi-part Files

threeplanetssoftware created NIFI-10654:
-------------------------------------------

             Summary: UnpackContent Processor Doesn't Support Multi-part Files
                 Key: NIFI-10654
                 URL: https://issues.apache.org/jira/browse/NIFI-10654
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
    Affects Versions: 1.18.0
            Reporter: threeplanetssoftware
         Attachments: encrypted.zip, unencrypted-1.zip

I'm filing this as a bug because this behavior generally works in other zip implementations but not in the UnpackContent processor. I can understand an argument for making this an improvement ticket and will happily go that route if the maintainers choose to change it.

I am trying to deal with large (dozens of GB), split zip files that are password protected. The zip file is split to allow for better downloading of the parts, rather than having to wait for one 45GB file to download successfully. A multipart zip file can't be handled piecemeal, so picking up each part as a FlowFile and routing it into the UnpackContent processor won't work.

I first tried concatenating the parts manually, to at least make sure that would work, but UnpackContent still refused with this error: "UnpackContent[id=snipped] Unable to unpack FlowFile[filename=license-3.zip] because it does not appear to have any entries; routing to failure." Meanwhile, unzip opened the archive successfully, although it gave a warning about the multiple parts being put into the same file.

I also tried this with an unencrypted split zip that I reconstructed and UnpackContent failed with this error: "UnpackContent[id=d834e7ae-0183-1000-cfe8-0806ea6d348d] Unable to unpack FlowFile[filename=unencrypted_reconstructed-2.zip]; routing to failure: org.apache.nifi.processor.exception.ProcessException: IOException thrown from UnpackContent[id=d834e7ae-0183-1000-cfe8-0806ea6d348d]: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: Unsupported feature splitting used in archive.
- Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: Unsupported feature splitting used in archive."
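
For illustration, here is a minimal Python sketch of how a reader might recognize a reconstructed split archive from its first bytes. It assumes the archive starts with the split/spanned marker 0x08074b50 that the PKWARE ZIP APPNOTE describes for the first segment of a split archive. Whether this is the exact check that produces the exception above is an assumption, not something confirmed in this ticket, and the function name is hypothetical.

```python
import io

# Signatures from the PKWARE ZIP APPNOTE, stored little-endian on disk.
SPLIT_MARKER = b"PK\x07\x08"       # 0x08074b50: start of the first segment of a split archive
LOCAL_FILE_HEADER = b"PK\x03\x04"  # 0x04034b50: start of an ordinary, unsplit archive


def starts_with_split_marker(stream) -> bool:
    """Return True if the binary stream begins with the split/spanned marker.

    Hypothetical helper for illustration; it reads (and consumes) 4 bytes.
    """
    return stream.read(4) == SPLIT_MARKER


if __name__ == "__main__":
    # A reconstructed split archive: marker first, then the first local file header.
    reconstructed = io.BytesIO(SPLIT_MARKER + LOCAL_FILE_HEADER + b"...")
    print(starts_with_split_marker(reconstructed))
```

A streaming reader that sees this marker can either reject the archive or skip the 4 bytes and continue with the local file header that follows; the second option is roughly what the bug request below asks for.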

{*}Bug Request{*}: Can UnpackContent be fixed to support unpacking multipart zip files, encrypted or not, that have been put back together in the proper order prior to arrival at that processor?

{*}Potential Feature Request{*}: Can UnpackContent (or some other processor) be made to ingest multiple parts of a zip file and cat them together in the right order in one FlowFile? The first file should be obvious from zip headers and the last file should also be obvious from zip footers. Everything in the middle should be sorted numerically by file extension. This would allow me to use the InvokeHTTP processor to fetch these large files and have them get joined together and forwarded on once the entire thing was built.
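
The ordering and joining described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: it assumes the parts follow the Info-ZIP naming convention (`name.z01`, `name.z02`, ..., with the final part named `name.zip`), it orders purely by file extension rather than inspecting zip headers or footers, and the function names `order_split_parts` and `concatenate_parts` are hypothetical. It is not how UnpackContent or any existing NiFi processor works.

```python
import re
import shutil

# Matches numbered part extensions such as ".z01" or ".z12" (assumed naming).
_PART_EXT = re.compile(r"\.z(\d+)$", re.IGNORECASE)


def order_split_parts(names):
    """Return part names ordered .z01, .z02, ... numerically, then the .zip part last."""
    numbered = []
    final = []
    for name in names:
        match = _PART_EXT.search(name)
        if match:
            numbered.append((int(match.group(1)), name))
        elif name.lower().endswith(".zip"):
            final.append(name)
        else:
            raise ValueError(f"not a recognized split-zip part: {name}")
    if len(final) != 1:
        raise ValueError("expected exactly one .zip part (the last segment)")
    numbered.sort()  # numeric sort, so .z10 comes after .z09 and .z02
    return [name for _, name in numbered] + final


def concatenate_parts(paths, destination):
    """Write the given parts, in the order supplied, into one file."""
    with open(destination, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    print(order_split_parts(["big.zip", "big.z10", "big.z02", "big.z01"]))
```

Sorting on the parsed integer rather than the raw string keeps `.z10` after `.z09`. A real processor would also need to hold parts until the whole set has arrived and decide what to do if one is missing.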

{*}Reproduction{*}: I'm attaching two much smaller files to this ticket. Both were made by concatenating a few files from the latest NiFi release to get enough data to split, then splitting on a Linux machine with a command such as: `zip --split-size 64k unencrypted.zip LICENSE`. `unencrypted.zip` is a multipart zip without a password that was concatenated back together in the proper order. `encrypted.zip` was made from the same file with the same command, but with the addition of the password "password" (no quotes). Running `unzip` on either of these produces the correct file (LICENSE, md5sum: 108db6ee2249df0e1c7df85216e1b883).

In NiFi, I have a GetFile processor pick the file up explicitly by name and send it directly to UnpackContent with the mime type set to zip. For the encrypted file, I also set the password.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)