You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@poi.apache.org by Alex Geller <ag...@4js.com> on 2011/05/04 18:31:21 UTC

Re: HSSF and XSSF memory usage, some numbers

Hi,

I would like to give some feedback after implementing a solution based on
the "BigGridDemo".
The good news is that it is fast enough, uses little memory and handles all
the decorational items we were worrying about (CellStyles, MergedRegions and
Images), just fine. A .xlsx file with 300,000 rows and 33 columns is
generated in 50 seconds using the default heap space of 64 MB.

The BigGridDemo should be improved to replace the "sheetData" element with
the real data rather then than replacing the complete file because
MergedRegions, Image references and other information such as margins are
otherwise lost on overwriting the file.

Since we need to continue supporting HSSF as well as XSSF we implemented an
API compatible streaming version of XSSF called SXSSF ("S" for streaming) so
that we wouldn't have to change the code and could have identical code for
both file formats (Apart from the "Workbook" construction of course (change
from "new XSSFWorkbook" to "new SXSSFWorkbook")).

The bad news is, that we worry a little about maintaining the code. There
are a number of reasons for this which I would like to state:

We need to change our code each time a new method is added to the interfaces
"Workbook", "Sheet", "Row" or "Cell" since we implemented those from scratch
(Workbook and Sheet possibly could have been subclassed instead making a
composition).

There is a principle fragility in the solution that is caused by the fact
that the BigGridDemo strategy makes assumptions on inner workings of POI
that could change from one version to another. What if for example the
XSSFCell class gets a new pakage method XSSFCell.setCellSpan(int rowspan,int
colspan) and the public method XSSFSheet.addMergedRegion() is implemented on
top of this? Since the BigGridDemo strategy doesn't create any rows or cells
then the merged regions would stop working after this change.

We have some open points which don't bother us for the moment because we are
using only string, numeric and date cell values. We are not using formulas,
rich text, hyperlinks and comments but we may in the future:

What cell types and values have to be written into the "c" elements of the
"sheetData" structure for the cell types BLANK, FORMULA and ERROR? How is
rich text written?

How are comments and hyperlinks written into the file?

What are the exact semantics of Sheet.shiftRows()? Why is the code in
HSSFSheet complicated? Is it perhaps because of formulas that are cell
relative that need to be recomputed in the new location?

What should the return value for Row.getZeroHeight() be if
Row.setZeroHeight() has not been called previously?

How are cell formulas parsed? How is the type determined? What is the
lifecycle of formulas? What is the precomputed value and when is is
precomputed? What is the difference between "precomputed" and "cached" and
when are values cached?

All in all, we are quite happy now but we would very much prefer if you
agreed to maintain this code instead of us, for the reasons mentioned above.
Maybe there are other users with similar problems that would appreciate a
transparent API compatible version of the BigGridDemo solution too.

Thanks again,

Alex

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4370377.html
Sent from the POI - User mailing list archive at Nabble.com.

Re: HSSF and XSSF memory usage, some numbers

Posted by Alex Geller <ag...@4js.com>.

Hi,
as requested I added a patch via bugzilla
(https://issues.apache.org/bugzilla/show_bug.cgi?id=51160) to start the
process.
Thanks again,
Alex

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4375903.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org

Re: HSSF and XSSF memory usage, some numbers

Posted by Alex Geller <ag...@4js.com>.

OK, very good, I will upload the code to bugzilla. This might take a day or
two because I need to set up the required build environment (you use ant I
hope?) and I need to modify the sources (I can use subclassing instead of
composition when the code is a part of the poi package. This will shrink the
code hugely because a large number of proxy functions will no longer be
needed.).

Just to be sure, here a more detailed description of what exactly I have
done:

The BigGridDemo can be described as follows:

Create a XSSFWorkbook and from it any number of XSSFSheets via
Workbook.createSheet().

You can call any method on both the Workbook and the sheet besides
Sheet.createRow().

Create XML files using the class SpreadsheetWriter that contain the rows and
cells of the data.

Call Workbook.write() to save the document to a temporary file. Then replace
the sheets files contained in this file with the XML documents containing
the rows and cells of the data to create the final result file.

This scheme was improved so that one may call Sheet.createRow() and
Row.createCell() too and one may call any methods on the classes Row and
Cell.

In analogy to the description above this scheme can be desribed as follows:

Create a SXSSFWorkbook and from it any number of SXSSFSheets via
Workbook.createSheet()

You can call any method on both the Workbook and the sheet.

Create any number of SXSSFRows via Sheet.createRow()

You can call any method on the created Rows.

There is a limitation on how many rows can be accessed at one time in
memory. The default is 5000 rows. When the 5001th row is created then the
first row is flushed to the disk. The value of 5000 can be changed to some
other value via a call to SXSSFSheet.setRandomAccessWindowSize(int value).
Setting a value of -1 will disable the automatic flushing and flushing can
be done manually by calling SXSSFSheet.flushRows(int remaining).

Create any number of SXSSFCells via Row.createCell()

You can call any method on the created Cells.

Call Workbook.write() to save the document to the final result file. The
patching is done transparently.

Note that there is no imposed order (apart from requiring Workbook.write()
to be called at the end) on the calling sequence of functions. Just like on
the normal API one can switch between workbooks, sheets, rows and cells
freely as long as the rows are still in the random access window.

Regards,

Alex

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4372383.html
Sent from the POI - User mailing list archive at Nabble.com.

Re: HSSF and XSSF memory usage, some numbers

Posted by Nick Burch <ni...@alfresco.com>.

On Wed, 4 May 2011, Alex Geller wrote:
> The good news is that it is fast enough, uses little memory and handles 
> all the decorational items we were worrying about (CellStyles, 
> MergedRegions and Images), just fine. A .xlsx file with 300,000 rows and 
> 33 columns is generated in 50 seconds using the default heap space of 64 
> MB.

Great news

> The BigGridDemo should be improved to replace the "sheetData" element 
> with the real data rather then than replacing the complete file because 
> MergedRegions, Image references and other information such as margins 
> are otherwise lost on overwriting the file.

I think (as you hint later) that the demo is probably approaching the 
point where it can be re-done as a streaming api similar to the read one, 
rather than an example you write your code into.

> Since we need to continue supporting HSSF as well as XSSF we implemented 
> an API compatible streaming version of XSSF called SXSSF ("S" for 
> streaming) so that we wouldn't have to change the code and could have 
> identical code for both file formats (Apart from the "Workbook" 
> construction of course (change from "new XSSFWorkbook" to "new 
> SXSSFWorkbook")).

Interesting idea. I'd probably say that an event model writing code would 
be the first thing to nail down, then possibly an event backed workbook 
implementation would be next after that. If you've done much of that then 
it could help!

If you're happy, what I'd suggest is you post your code to a new bug in 
bugzilla, then we all work together to get the code into svn

> We have some open points which don't bother us for the moment because we 
> are using only string, numeric and date cell values. We are not using 
> formulas, rich text, hyperlinks and comments but we may in the future:
>
> What cell types and values have to be written into the "c" elements of 
> the "sheetData" structure for the cell types BLANK, FORMULA and ERROR? 
> How is rich text written?

The quick way to check is just to look at some example files. Full details 
are in the microsoft specifications, which should be linked from the POI 
website.

> How are comments and hyperlinks written into the file?

With a fair bit of xml around them... Take a look at some example files to 
see. It's quite fiddly, the xssf usermodel code should show you the 
details too

> What are the exact semantics of Sheet.shiftRows()? Why is the code in 
> HSSFSheet complicated? Is it perhaps because of formulas that are cell 
> relative that need to be recomputed in the new location?

HSSF is often more complicated than XSSF, as more of the work needs to be 
done and more things need to be kept in sync. Formulas are just one bit

> What should the return value for Row.getZeroHeight() be if 
> Row.setZeroHeight() has not been called previously?

I'd suggest you crib off what xssf does?

> How are cell formulas parsed? How is the type determined? What is the 
> lifecycle of formulas? What is the precomputed value and when is is 
> precomputed? What is the difference between "precomputed" and "cached" 
> and when are values cached?

XSSF is much simpler than HSSF for this. Formulas are stored as text, and 
the cached/precomputed (same thing) value goes inline with them. You 
normally set those values at the end of writing the cells, using the 
formula evaluator. However, as you don't have all the cells in memory at 
once, any references will be very tricky. The best bet might be to set the 
force calculation flag, skip writing the cached values, and let excel do 
them on next load

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org