Posted to issues@hop.apache.org by "thadguidry (via GitHub)" <gi...@apache.org> on 2023/02/05 04:02:04 UTC

[GitHub] [hop] thadguidry opened a new issue, #2235: [Bug]: unzip is extremely slow with large 1.2GB single zip with over 155,000 small JSON files

thadguidry opened a new issue, #2235:
URL: https://github.com/apache/hop/issues/2235

   ### Apache Hop version?
   
   2.3.0
   
   ### Java version?
   
   17.0.4.1
   
   ### Operating system
   
   Windows
   
   ### What happened?
   
   Wire up a simple workflow of
   Start -> unzip -> Success
   
   The .zip file is read from disk, and the file contents are written correctly to the target folder on the same disk.
   The disk is a RAID 1 array with 7200 RPM HDDs.
   
   The single large 1.2GB .zip file has taken over 4 hours to unzip so far...and I am still waiting.
   By contrast, on a different machine with a similar HDD configuration, 64-bit 7-Zip completed the extraction in about 40 minutes.
   
   Looking at the code, I wonder whether the `if file exists` = SKIP option is a likely suspect. My goal is to go as fast as possible: I don't care whether files already exist or not across over 155,000 files, I just want to blast through and overwrite them no matter what. So maybe I should have chosen `if file exists` = OVERWRITE? I don't know which settings would make zip file unpacking or the HopVfs streaming faster. Maybe the unzip dialog should have a setting that lets me use more memory or cores?
   
   My other suspicion is the buffering and the small amount of memory used while unpacking.
   In Task Manager on Windows, I can see Hop stayed at 50% CPU utilization across all 8 of my i7-9700k cores, with memory usage at 2.5GB.
   
   Can we do better?  Of course.  A general strategy perhaps:
   
   1. How can the Hop VFS architecture unzip faster and optionally use more of the hardware?
   2. For now, perhaps warn users about extracting a .zip file that contains many files, with a note to consider other unzip tooling when dealing with large, deep zip files that need to be extracted very fast?
   3. With a final goal of eventually making Hop's own unzip faster?
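
To separate the cost of the existence check and buffering from the cost of the disk itself, a standalone baseline can help. The following is a minimal sketch using only the JDK's `java.util.zip`, not HopVfs and not Hop's actual unzip action; the class name `UnzipBaseline`, the 64 KiB read buffer, and the directory cache are assumptions made for illustration. It overwrites targets unconditionally instead of checking whether each file exists.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for timing a plain JDK extraction; not part of Hop.
class UnzipBaseline {

    /** Extracts all entries into targetDir, overwriting existing files. Returns the number of files written. */
    static int extractAll(Path zipFile, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Set<Path> createdDirs = new HashSet<>(); // avoids re-creating the same parent directory per file
        int count = 0;
        // 64 KiB read buffer: an assumed value, tune as needed.
        try (ZipInputStream zin = new ZipInputStream(
                new BufferedInputStream(Files.newInputStream(zipFile), 1 << 16))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                if (!out.startsWith(root)) {
                    // Guard against entries such as "../x" escaping the target folder.
                    throw new IOException("Entry escapes target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                    createdDirs.add(out);
                    continue;
                }
                Path parent = out.getParent();
                if (createdDirs.add(parent)) {
                    Files.createDirectories(parent);
                }
                // Default open options are CREATE + TRUNCATE_EXISTING, so an
                // existing file is overwritten without a separate existence check.
                try (OutputStream os = Files.newOutputStream(out)) {
                    zin.transferTo(os); // copies only the current entry
                }
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        int n = extractAll(Path.of(args[0]), Path.of(args[1]));
        System.out.printf("Extracted %d files in %d ms%n", n, (System.nanoTime() - start) / 1_000_000);
    }
}
```

Running this against the same .zip and target disk gives a number to compare with both the Hop action and 7-Zip; if it is also slow, the bottleneck is more likely the many small writes on the HDDs than the `if file exists` setting.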
   
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: VFS


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hop] thadguidry commented on issue #2235: [Bug]: unzip is extremely slow with large 1.2GB single zip with over 175,000 small JSON files

Posted by "thadguidry (via GitHub)" <gi...@apache.org>.
thadguidry commented on issue #2235:
URL: https://github.com/apache/hop/issues/2235#issuecomment-1417054802

   The file is [1 big 1.2GB submissions.zip file from sec.gov website](https://www.sec.gov/Archives/edgar/daily-index/bulkdata/submissions.zip), and it actually seems to contain a total of 834,176 JSON files according to the Zip metadata on my Windows 11 system.
   ![image](https://user-images.githubusercontent.com/986438/216807917-1e82f946-5eed-4b85-b765-b5db740d6f36.png)
   




[GitHub] [hop] hansva commented on issue #2235: [Bug]: unzip is extremely slow with large 1.2GB single zip with over 175,000 small JSON files

Posted by "hansva (via GitHub)" <gi...@apache.org>.
hansva commented on issue #2235:
URL: https://github.com/apache/hop/issues/2235#issuecomment-1417014318

   Are these files nested inside that zip, or is it one big file with no subfolders?
   I'm asking so that I can make a reproduction. One thing I can say is that Hop uses only 1 thread per action/transform, which might also be a limiting factor for the unzip process.
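
To test whether the single thread is the limiting factor, the extraction can be spread over several threads outside of Hop. This is only a sketch under stated assumptions: `ParallelUnzip` and its `threads` parameter are hypothetical names, it uses the JDK's `java.util.zip.ZipFile` rather than HopVfs, and each worker opens its own `ZipFile` handle so that no handle is shared between threads. Empty directory entries are skipped for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for experimenting with multi-threaded extraction; not part of Hop.
class ParallelUnzip {

    /** Extracts all file entries using the given number of worker threads. Returns the number of files written. */
    static int extractAll(Path zipFile, Path targetDir, int threads) throws Exception {
        Path root = targetDir.toAbsolutePath().normalize();

        // Read the entry names once; workers then take every threads-th name.
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zipFile.toFile())) {
            zf.stream().filter(e -> !e.isDirectory()).forEach(e -> names.add(e.getName()));
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                parts.add(pool.submit(() -> extractSlice(zipFile, root, names, offset, threads)));
            }
            int total = 0;
            for (Future<Integer> part : parts) {
                total += part.get(); // rethrows any worker failure as ExecutionException
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // Extracts entries offset, offset + step, offset + 2 * step, ... with a private ZipFile handle.
    private static int extractSlice(Path zipFile, Path root, List<String> names, int offset, int step)
            throws IOException {
        int count = 0;
        try (ZipFile zf = new ZipFile(zipFile.toFile())) {
            for (int i = offset; i < names.size(); i += step) {
                ZipEntry entry = zf.getEntry(names.get(i));
                Path out = root.resolve(entry.getName()).normalize();
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target dir: " + entry.getName());
                }
                Files.createDirectories(out.getParent());
                try (InputStream in = zf.getInputStream(entry);
                     OutputStream os = Files.newOutputStream(out)) {
                    in.transferTo(os); // overwrites any existing file
                }
                count++;
            }
        }
        return count;
    }
}
```

Whether this helps depends on where the time goes: more threads can speed up decompression, but on seek-bound 7200 RPM disks the many small writes may dominate, in which case extra threads could make little difference or even hurt.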

