You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@poi.apache.org by bu...@apache.org on 2017/08/02 11:08:05 UTC

[Bug 61371] New: 20^20 hidden cells in a .xlsx file causes heap memory full exception

https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

            Bug ID: 61371
           Summary: 20^20 hidden cells in a .xlsx file causes heap memory
                    full exception
           Product: POI
           Version: unspecified
          Hardware: PC
                OS: Mac OS X 10.1
            Status: NEW
          Severity: major
          Priority: P2
         Component: XSSF
          Assignee: dev@poi.apache.org
          Reporter: lakshminathanlaky@gmail.com
  Target Milestone: ---

Created attachment 35196
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35196&action=edit
File which causes heap memory out

An .xlsx file has only one cell filled but all other cells are hidden. When
tried to read the file using " workbook = new
org.apache.poi.xssf.usermodel.XSSFWorkbook(in);" this throws an Heap memory
full exception.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #9 from Lakshminathan <la...@gmail.com> ---
Created attachment 35199
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35199&action=edit
Information about the bug

> Lakshminathan, please verify whether you're able to read
> https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.
> xlsx in your application. - yes,its working
> 
> Which JVM are you using and what is your max heap size? details attached
> What version of POI are you using?3.16
> Is there anything else in your application that could be consuming a
> substantial part of the heap? No

Further details attached.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #4 from Greg Woolsey <gw...@apache.org> ---
To clarify:

1. even empty hidden cells are still cells, and as such require object
allocations when read using the XSSFWorkbook API.  no space is allocated if
they are truly missing from the workbook - not even defined in the OOXML file.

2. OOXML files are ZIP packages containing mostly XML files.  Cells defined but
empty compress extremely well, so a 9k file can actually unzip to hundreds of
megabytes.  Unzip your sample file and see just how big it really is.  This
will give you a good rough start for estimating the heap space required to read
it using the XSSFWorkbook API.  I'd say double the unzipped size as a good
first guess.

If the result turns out to be too big for the available heap, try using the
streaming API as mentioned by PJ Fanning in comment 3.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Nick Burch <ap...@gagravarr.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Nick Burch <ap...@gagravarr.org> ---
How big a heap did you give to Apache POI? Most JVM default heaps are way way
too small these days

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #7 from Javen O'Neal <on...@apache.org> ---
(In reply to Javen O'Neal from comment #5)
> Lakshminathan, please verify whether you're able to read
> https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.
> xlsx in your application.
> 
> Which JVM are you using and what is your max heap size?
> What version of POI are you using?
> Is there anything else in your application that could be consuming a
> substantial part of the heap?

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Lakshminathan <la...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #6 from Lakshminathan <la...@gmail.com> ---
@Team, Thank you for the response. Please find below the requested information

https://docs.google.com/document/d/1Lf1FEeQDillIrT4fTz-u9nKiUp7s9_60RwTrEApZ4jk/edit?usp=sharing

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #5 from Javen O'Neal <on...@apache.org> ---
The attached workbook, attachment 35196, is actually quite small uncompressed
and doesn't describe many cells.

xl/worksheets/sheet1.xml:
<cols>
  <col min="1" max="1" width="9.140625" customWidth="1"/>
  <col min="2" max="16384" width="9.140625" hidden="1"/>
</cols>
<sheetData>
  <row r="1" spans="1:1" x14ac:dyDescent="0.25">
    <c r="A1" t="s">
      <v>0</v>
    </c>
  </row>
</sheetData>

sheet2 and 3 are even smaller.

The entire file expands to 28 KB on disk, and I didn't see anything suspicious
in sharedStrings or elsewhere that would cause this workbook to misbehave.

Lakshminathan, please verify whether you're able to read
https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.xlsx
in your application.

Which JVM are you using and what is your max heap size?
What version of POI are you using?
Is there anything else in your application that could be consuming a
substantial part of the heap?

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Lakshminathan <la...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #3 from PJ Fanning <fa...@yahoo.com> ---
The XSSF model loads all the data in the xlsx file, even if it is hidden, The
XSSF cell interface will let you know which cells are hidden and which are not.
In theory, the XSSF workbook code where we load the xlsx data could have an
option to ignore hidden cells or hidden sheets. This would be an enhancement
and I'm not sure how useful this feature is.
You could also try the XSSFReader which let's you read the xlsx file in a
streaming way.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #8 from Javen O'Neal <on...@apache.org> ---
Please add supporting information as a comment or an attachment. External
dependencies may break or be inaccessible to some users.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
         Resolution|---                         |WORKSFORME

--- Comment #11 from Dominik Stadler <do...@gmx.at> ---
There was no indication of an actual "bug" in Apache POI, memory usage for a
file with 1 million rows is expected to be considerable, so unless there is
more evidence of too much memory usage, we are resolving this issue, please
reopen with more information if you still think there is something wrong in
Apache POI itself.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Lakshminathan <la...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #2 from Lakshminathan <la...@gmail.com> ---
The file size as you can see from the attachment is just 9Kb . It has only one
cell with data. But since there are 2^20 hidden cells Apache POI allocates more
heap space causing heap space full exception. I think this can be handled only
by Apache POI because file size is the only information we get even before
reading the file.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #10 from Javen O'Neal <on...@apache.org> ---
attachment 35199 references a different file that was uploaded to Google Sheets
which is 2.75 MB instead of 8 KB in size.
Unzipping the 2.75 MB file expands to 27.9 MB (90% compression ratio is typical
for workbooks that do not contain embedded objects).
xl/worksheets/sheet1.xml is 27.9 MB (leaving a few KB for the rest of the
extracted files).

Inspecting the XML, I see that the workbook defines 1 cell at A1 and 1 million
rows:
<sheetFormatPr defaultColWidth="0" defaultRowHeight="15" zeroHeight="1"/>
<cols>
  <col min="1" max="1" width="9.140625" customWidth="1"/>
  <col min="2" max="16384" width="9.140625" hidden="1"/>
</cols>
<sheetData>
  <row r="1" spans="1:1">
    <c r="A1" t="s"><v>0</v></c>
  </row>
  <row r="2" spans="1:1" hidden="1"/>
  <row r="3" spans="1:1" hidden="1"/>
  ...
  <row r="16" spans="1:1" hidden="1"/>
  <row r="17" hidden="1"/>
  ...
  <row r="1048556" hidden="1"/>
  <row r="1048557" hidden="1"/>
</sheetData>

POI uses a TreeMap<int rowNumber, XSSFRow row> to store these rows for fast
random and sequential access, at the cost of some memory. Oracle Java 8 docs
state that it implements TreeMap using a self-balancing pointer-based red-black
tree. This should be fine.

To test that your JVM can handle 1 million items in a TreeMap, do something
like the following:
private static final Random rand = new Random();

private Object createFakeRow(int nbytes) {
    byte[] row = new byte[nbytes];
    rand.nextBytes(row);
    return row;
}

SortedMap<Integer, Object> rows = new TreeMap<>();
for (int r=1; r<=1048557; r++) {
    // create an object that consumes 10 KB of RAM in place of a real XSSFRow
    rows.put(i, createFakeRow(10*1024));
}

If that works without issue, then the next focus is on what XMLBeans is doing
as it's reading Sheet1.xml.
You could look at how POI unzips the file into memory and reads each XML file
into an XML DOM using XMLBeans and how much extra memory is consumed by the CT
classes.
Before we blame XMLBeans, we'd have to fairly compare it with JAXB.

Here's a Google Drive link to this file since the file exceeds the file size
limit of bugzilla:
https://drive.google.com/file/d/0B2v9cndcBwIWeURpNXVrSjJYbDg/view

Based on your reported 9 GB of RAM, the amortized size of each XSSFRow would be
roughly 10 KB.

To make sure this really isn't a bug in POI's handling of hidden rows or
columns, we'd need to test for OOM on a workbook with the same number of rows
defined but all of them visible.

I think you have enough ideas here to try to figure out where the problem is,
which is needed before a potential fix can be written.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org