Posted to dev@poi.apache.org by bu...@apache.org on 2017/01/10 00:20:06 UTC

[Bug 60567] New: XSSFReader caused OutOfMemoryError when reading a large excel file in HDFS as inputStream

https://bz.apache.org/bugzilla/show_bug.cgi?id=60567

            Bug ID: 60567
           Summary: XSSFReader caused OutOfMemoryError when reading a
                    large excel file in HDFS as inputStream
           Product: POI
           Version: 3.14-FINAL
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: XSSF
          Assignee: dev@poi.apache.org
          Reporter: dejian.tu@oracle.com
                CC: dev@poi.apache.org, pcllau@gmail.com
        Depends on: 57842
  Target Milestone: ---

My project uses the POI library to read an Excel file stored in HDFS. The API I
use is shown below:
==================
// inputStream is obtained from an HDFS path, because OPCPackage cannot
// resolve an HDFS path directly.
XSSFReader xssfReader = new XSSFReader(OPCPackage.open(inputStream));
==================

The Excel file has around 1,000,000 rows of simple data (columns like name, id,
address, etc.), and the file size is around 140 MB. When I run my project, the
process consumes about 3.25 GB of memory, which is far larger than the file
itself.

As far as I know, XSSFReader uses much less memory when reading from a String
path or a File than when reading from an InputStream. In my case, however, the
Excel file lives in HDFS, so we cannot pass the HDFS path to XSSFReader
directly.
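
As a point of reference, a minimal sketch of one possible workaround (the
"hdfsInput" stream, the class name, and the method name are illustrative, not
from this report): stage the HDFS stream to a local temporary file once and
open that file instead, since the File-based open does not buffer the whole
package in memory.
==================
// Sketch only: stage the HDFS stream to a local temp file so the package can
// be opened from a File instead of an InputStream. "hdfsInput" is assumed to
// be the stream obtained from HDFS.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;

public class HdfsXlsxStaging {
    public static void readViaTempFile(InputStream hdfsInput) throws Exception {
        // Copy the remote stream to local disk once; delete the temp file on JVM exit.
        Path tmp = Files.createTempFile("hdfs-xlsx-", ".xlsx");
        tmp.toFile().deleteOnExit();
        Files.copy(hdfsInput, tmp, StandardCopyOption.REPLACE_EXISTING);

        // Opening from a File lets OPCPackage read the zip lazily from disk
        // rather than holding the whole package in memory.
        OPCPackage pkg = OPCPackage.open(tmp.toFile());
        try {
            XSSFReader xssfReader = new XSSFReader(pkg);
            // ... iterate xssfReader.getSheetsData() here ...
        } finally {
            pkg.revert();
        }
    }
}
==================
The staging copy costs local disk space, but it trades that for not holding the
zip in the heap.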

Could you please help fix the issue that XSSFReader uses much more memory when
reading from an InputStream?

Thank you.


Referenced Bugs:

https://bz.apache.org/bugzilla/show_bug.cgi?id=57842
[Bug 57842] Using POI 3.9 API memory consumed reading an xlsx file is not
released back to the operating system after completion


[Bug 60567] XSSFReader caused OutOfMemoryError when reading a large excel file in HDFS as inputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60567

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|NEW                         |RESOLVED
           Severity|normal                      |enhancement

--- Comment #2 from Javen O'Neal <on...@apache.org> ---
(In reply to Dejian Tu from comment #0)
> As far as I know, XSSFReader uses much less memory when reading from a String
> path or a File than when reading from an InputStream. In my case, however,
> the Excel file lives in HDFS, so we cannot pass the HDFS path to XSSFReader
> directly.

This sounds like a point for discussion on the mailing list (and perhaps Stack
Overflow or another community, to get suggestions on how to write a program
that can deal with data stored on a distributed file system); it is not a bug
without convincing evidence and a patch.



[Bug 60567] XSSFReader caused OutOfMemoryError when reading a large excel file in HDFS as inputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60567

Dejian Tu <de...@oracle.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|57842                       |


Referenced Bugs:

https://bz.apache.org/bugzilla/show_bug.cgi?id=57842
[Bug 57842] Using POI 3.9 API memory consumed reading an xlsx file is not
released back to the operating system after completion


Re: [Bug 60567] XSSFReader caused OutOfMemoryError when reading a large excel file in HDFS as inputStream

Posted by "pj.fanning" <fa...@yahoo.com>.
OPCPackage.open(file) is more memory-efficient than
OPCPackage.open(inputStream); the Javadoc on the InputStream overload makes
this explicit:

	/**
	 * Open a package.
	 *
	 * Note - uses quite a bit more memory than {@link #open(String)}, which
	 * doesn't need to hold the whole zip file in memory, and can take advantage
	 * of native methods
	 *
	 * @param in
	 *            The InputStream to read the package from
	 * @return A PackageBase object
	 */
	public static OPCPackage open(InputStream in) throws InvalidFormatException,
			IOException {
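
As a minimal usage sketch of the file-path overload (the path shown is
illustrative, not from this thread), opened read-only and released with
revert():

import java.io.InputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.xssf.eventusermodel.XSSFReader;

public class OpenFromPath {
    public static void main(String[] args) throws Exception {
        // open(String, PackageAccess) reads the zip lazily from disk instead of
        // buffering it the way open(InputStream) does.
        OPCPackage pkg = OPCPackage.open("/local/path/big-workbook.xlsx", PackageAccess.READ);
        try {
            XSSFReader reader = new XSSFReader(pkg);
            try (InputStream workbookXml = reader.getWorkbookData()) {
                System.out.println("workbook part opened: " + (workbookXml != null));
            }
        } finally {
            pkg.revert(); // read-only: discard instead of saving on close
        }
    }
}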





[Bug 60567] XSSFReader caused OutOfMemoryError when reading a large excel file in HDFS as inputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60567

chenchanghan <ch...@huawei.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|3.14-FINAL                  |3.15-FINAL
                 OS|Linux                       |Windows 7

--- Comment #3 from chenchanghan <ch...@huawei.com> ---
I ran into the same problem: 700,000 rows, file size 139 MB.
Using OPCPackage.open(inputStream) results in an OutOfMemoryError, but
OPCPackage.open(filePath) runs fine.



[Bug 60567] XSSFReader caused OutOfMemoryError when reading a large excel file in HDFS as inputStream

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60567

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|XSSFReader caused           |XSSFReader caused
                   |OutOfMemoryError when       |OutOfMemoryError when
                   |reading a lerge excel file  |reading a large excel file
                   |in HDFS as inputStream      |in HDFS as inputStream
           Severity|blocker                     |normal

--- Comment #1 from Javen O'Neal <on...@apache.org> ---
1,000,000 rows is massive. That's nearly the maximum number of rows (1,048,576)
allowed per the file format specification.

A 140 MB file is also massive. Keep in mind that an .xlsx file is a zip archive
of XML parts, and I would expect 90-95% compression for these files. Unzip it
on your hard drive to see how much disk space is consumed when you expand it;
it should be in the neighborhood of 1-3 GB.
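
A quick way to get that figure without actually extracting anything, as a
minimal sketch (the path is illustrative), is to sum the uncompressed entry
sizes recorded in the zip directory:

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class XlsxExpandedSize {
    public static void main(String[] args) throws Exception {
        long total = 0;
        try (ZipFile zip = new ZipFile("/local/path/big-workbook.xlsx")) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                total += Math.max(0, e.getSize()); // getSize() is -1 if unknown
            }
        }
        System.out.printf("uncompressed payload: %.2f GB%n",
                total / (1024.0 * 1024 * 1024));
    }
}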

You're also opening the file via an input stream, which has some memory
overhead.

Therefore, 3.25 GB of memory consumption is reasonable in this case,
considering input stream overhead, memory alignment, garbage collection,
temporary files for unzipping, references held to files in the unzipped
directory structure, and the XML trees built for the minimum set of parts
XSSFReader needs.

If you have any suggestions and could contribute a patch towards lowering
XSSFReader's memory footprint, we'd greatly appreciate the help.
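
For anyone hitting this limit in the meantime, a minimal sketch of the
event-model approach, following the pattern of POI's XLSX2CSV example (the path
and the empty handler bodies are illustrative), streams each sheet's XML
through SAX so only one row's worth of data is held in memory at a time:

import java.io.InputStream;
import java.util.Iterator;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class StreamingXlsxRead {
    public static void main(String[] args) throws Exception {
        OPCPackage pkg = OPCPackage.open("/local/path/big-workbook.xlsx", PackageAccess.READ);
        try {
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XSSFReader reader = new XSSFReader(pkg);
            StylesTable styles = reader.getStylesTable();

            // The handler receives one cell at a time; nothing per-sheet is retained.
            SheetContentsHandler rows = new SheetContentsHandler() {
                public void startRow(int rowNum) { }
                public void endRow(int rowNum) { }
                public void cell(String ref, String value, XSSFComment comment) {
                    // process one cell, e.g. write it to your own sink
                }
                public void headerFooter(String text, boolean isHeader, String tagName) { }
            };

            Iterator<InputStream> sheets = reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheetXml = sheets.next()) {
                    XMLReader parser = XMLReaderFactory.createXMLReader();
                    parser.setContentHandler(new XSSFSheetXMLHandler(styles, strings, rows, false));
                    parser.parse(new InputSource(sheetXml));
                }
            }
        } finally {
            pkg.revert();
        }
    }
}

This still benefits from staging the file locally first (as discussed earlier
in the thread); the XLSX2CSV example shipped with POI shows the same pattern in
full.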
