You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@poi.apache.org by bu...@apache.org on 2016/07/11 14:24:08 UTC

[Bug 59841] New: ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

            Bug ID: 59841
           Summary: ZipInputStreamZipEntrySource should use temp files
                    instead of ByteArrayOutputStream
           Product: POI
           Version: 3.14-FINAL
          Hardware: PC
                OS: Mac OS X 10.1
            Status: NEW
          Severity: normal
          Priority: P2
         Component: POI Overall
          Assignee: dev@poi.apache.org
          Reporter: fanningpj@yahoo.com

I have a use case where I have a worksheet with a large amount of data.
I'm using XSSFEventBasedExcelExtractor to try to extract the text from the
workbook and am hoping to use as little memory as possible.
Most of the code base seems pretty memory efficient but the putting the
worksheet data into a ByteArrayOutputStream in the
ZipInputStreamZipEntrySource.FakeZipEntry causes high memory consumption.
POI has temp file support via DefaultTempFileCreationStrategy.
Would it be possible to consider using temp files in this code base, even it
was just for data beyond a certain size?

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #10 from PJ Fanning <fa...@yahoo.com> ---
Thanks Andreas for your analysis. Your summary is exactly what I am looking to
achieve. I understand the Shared String Table could be another memory issue but
I'm wondering if at least the FakeZipEntry memory consumption could be
addressed by my patch set? I've added a basic test case to the pull request. My
custom strategy class would be similar to the one in the test case except that
it has its own built-in AES encryption.
This approach is based on the pre-existing TempFile creation strategy class in
POI.
Of course, the solutions you are suggesting are also well worth considering.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #16 from PJ Fanning <fa...@yahoo.com> ---
ReadOnlySharedStringsTable readFrom(InputStream is) does a check on
is.available and for the CipherInputStream, this seems to be zero.
I suspect that we don't need the is.available check.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #15 from Andreas Beeker <ki...@apache.org> ---
I've added a call to XSSFEventBasedExcelExtractor and also receive that error
...
I see what I can do ...

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #4 from PJ Fanning <fa...@yahoo.com> ---
Using temp files for the fake zip entries makes a big difference in memory
usage in the use case I have. The https://github.com/apache/poi/pull/34 change
leaves the default behaviour as it is but allows user to provide customised
overrides to the behaviour.
Would it be possible to consider reopening this issue and I will proceed to
extend the test coverage?
I think it is a legitimate concern not wanting to put unencrypted xlsx files
onto disk in order to read them efficiently.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

Nick Burch <ap...@gagravarr.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|NEW                         |RESOLVED

--- Comment #1 from Nick Burch <ap...@gagravarr.org> ---
To minimise memory use, open the XLSX file using a real File object, not an
InputStream, see
http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream . That
should avoid all in-memory buffering of the xml parts at the OPC / Zip level

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #13 from Andreas Beeker <ki...@apache.org> ---
Created attachment 34046
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34046&action=edit
new OPCPackage.open(ZipEntrySource) method / custom encrypted temp zip

As described above, the OPCPackage needed to be extended to provide the
functionality of providing a custom encrypted temp file, which can be streamed
to POI

At first I've started to use packages like zip4j, winzipaes or commons-compress
(which doesn't provide encryption support), but as the resulting temp doesn't
need to be outside of the user process, I've downgraded it to a simple custom
enc-/decryption routine.

Also as noted above, I haven't put any effort into the the temp file
generation, as this is simply a user code issue - so given your described use
case, I don't see a need to push the logic into ZipInputStreamZipEntrySource &
Co. and rather supply an example in the unit tests for further reference, i.e.
I'm reluctant to make things more complicated ...

Please validate that approach and if it's ok I'll commit it.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #14 from PJ Fanning <fa...@yahoo.com> ---
Thanks Andreas. I tried out your patch code on 2 xlsx files and got the same
NullPointerException for both.

    at
org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable.getEntryAt(ReadOnlySharedStringsTable.java:184)
    at
org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:353)
    at
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)

I'm using the OPCPackage in new XSSFEventBasedExcelExtractor.
I can try to put together a test case to demonstrate the issue I'm seeing.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #11 from Javen O'Neal <on...@apache.org> ---
(In reply to Andreas Beeker from comment #8)
> - the extraction should be on a file and not a stream

And another requirement, since there usually tends to be a trade off between
memory consumption and execution speed:
If using a file is slower, we must keep the stream variant so users  can choose
based on what best meets their needs. Having a method that operates on an
OutputStream and letting the user pass in a the implementation: BAOS or a FOS,
and an InputStream with the choice of implementation including BAOS, FIS,
ZipInputStream, EncryptedZipInputStream, EncryptedZipFileInputStream, etc).

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #18 from PJ Fanning <fa...@yahoo.com> ---
I spotted that problem with the available check too and removing it fixes my
test case.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #9 from Andreas Beeker <ki...@apache.org> ---
So we could either
- add a OPCPackage.open(ZipFile) method, which would skip the functionality of
ZipSecureFile
- or pass a FilterInputStream-wrapper/decorator down to ThresholdInputStream
which is used decrypt/deobfuscate the raw zip bytes - of course the encrypting
part seems to be not trivial then, if not a simple bytewise XOR is used ...

(again this all assumes, we are working on a random accessible file and not on
a stream)

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #8 from Andreas Beeker <ki...@apache.org> ---
Sorry - I don't understand your FakeEntry approach ... and even Nick beat me on
this, I'd like to add my two cents ...

So lets summarize:
- you have a hugh encrypted .xlsx
- and want to use eventmodel for text extraction
- the extraction should be on a file and not a stream
- it's not a problem to a have temporary file, but it should be encrypted

My/further insights:
- the shared string table is loaded in full into the memory, so this might be a
caveat
- you need to provide a custom ZipFile implementation with standard AES or
custom encryption support which is used by POI - maybe "Apache Commons
Compress" can be used for that ... and of course we need to adapt the
OPCPackage handling for that

So currently it looks like the following approach for me:
- read the ole2 container
- copy the encrypted stream into a encrypted zip with a session key
- provide the zip to an adapted ZIP-/OPCPackage
- use XSSFReader with that OPCPackage

The key is, that we need to change the ZIPPackage to support a custom ZipFile!
So writing the ZipFile is not an issue and can be handled without changing POI,
but reading is the interesting part.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

Andreas Beeker <ki...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #34046|0                           |1
        is obsolete|                            |

--- Comment #17 from Andreas Beeker <ki...@apache.org> ---
Created attachment 34047
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34047&action=edit
new OPCPackage.open(ZipEntrySource) method / custom encrypted temp zip

There's an error in handling InputStreams via available() in reading shared
strings tables, i.e. assuming EOF on available() == 0 is simply wrong.

I've fixed it for this use case, but on a quick search, this wrong assumption
is also on various other places.

So check if it works for you and I open a new ticket for fixing the other
invocations.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #5 from Nick Burch <ap...@gagravarr.org> ---
(In reply to PJ Fanning from comment #4)
> I think it is a legitimate concern not wanting to put unencrypted xlsx files
> onto disk in order to read them efficiently.

I'm not sure how your patch helps with this - all it does is only put parts of
the unencrypted data on disk at a time, but the unencrypted data is still on
disk in the temp files!

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

Andreas Beeker <ki...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |59893

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

Nick Burch <ap...@gagravarr.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |---

--- Comment #7 from Nick Burch <ap...@gagravarr.org> ---
Ah, that makes more sense now :)

Re-opening so that someone can review the patch, and apply if appropriate /
give further comments if not!

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

PJ Fanning <fa...@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fanningpj@yahoo.com

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #6 from PJ Fanning <fa...@yahoo.com> ---
I have a class that encrypts the temp files (not part of patch). With the patch
I can create my own custom FakeZipEntryStrategy class that uses this temp file
util under the hood.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #3 from PJ Fanning <fa...@yahoo.com> ---
I have created https://github.com/apache/poi/pull/34 - if this approach is
acceptable, I can add extra test coverage.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #2 from PJ Fanning <fa...@yahoo.com> ---
Thanks Nick for the quick response. In my case, the xlsx file is password
protected and I'm using POIFSFileSystem to decrypt it. I don't want to store
the unencrypted xlsx on the file system.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

Andreas Beeker <ki...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Hardware|PC                          |All
            Version|3.14-FINAL                  |3.15-dev
         Resolution|---                         |FIXED
                 OS|Mac OS X 10.1               |All
             Status|REOPENED                    |RESOLVED

--- Comment #19 from Andreas Beeker <ki...@apache.org> ---
Thank you for testing.
patch applied via r1753003

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 59841] ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=59841

--- Comment #12 from Andreas Beeker <ki...@apache.org> ---
(In reply to PJ Fanning from comment #10)
> This approach is based on the pre-existing TempFile creation strategy class
> in POI.

Although I understand, that a temp file needs to be created, this can be done
outside of POI and therefore user code can do whatever temp creation they want
to do - so this is not my main scope as mentioned above, i.e. "writing the
ZipFile is not an issue".

I'll now try to do a test implementation with commons-compress which doesn't
extend java.util.zip.ZipFile, therefore I need a interface to a delegate which
forwards the zip operations to the underlying zip implementation. This
interface will be referenced via OPCPackage.open(interface).

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org