Posted to user@poi.apache.org by Alex Geller <ag...@4js.com> on 2011/04/19 10:33:42 UTC

HSSF and XSSF memory usage, some numbers

Hi,


we are experiencing problems with the memory consumption of HSSF and, in
particular, of XSSF documents, which is far too large for our application.
We'd appreciate someone taking the time to look into this. Perhaps we're
just doing something the wrong way.


Our application has the following constraints:

We are generating Excel spreadsheets as yet another output option for a
reporting tool similar to Jasper or BIRT. The data arrives in a streaming
manner.

Our users produce large documents and we can't tell whether that is silly or 
not. We don't know how big the largest documents will be but the report we
were given produces a sheet of 150,000 rows x 33 cells.

We are asked to use XSSF over HSSF in order to get all the data in a single
sheet and not have to create spillover sheets every 65536 rows. 

We need different styles, colspan, rowspan, etc. because the output is
supposed to resemble the layout of the report as closely as possible. This
keeps us from using the CSV trick. For the same reason, we suspect that the
XML zip injection trick (see http://www.realdevelopers.com/blog/code/excel
"Streaming xlsx files"), which can also be found on this forum,
cannot be applied either.
Is this assumption correct? The XML for the data looks straightforward, but
what about other issues like cell styles?

Our documents are square and contain no empty rows or columns.



We ran some tests with HSSF and XSSF. The test document contains a matrix
of cells, each cell containing a "double" value. No cell styles are set.
To measure the memory consumption we ran the test program with varying 
column numbers (8, 16 and 33) and with as many rows as possible before 
running into an out-of-memory exception. We ran the test against the default 
heap space (-Xmx64m) and against 256 MB and 1.1 GB which is the limit 
for Windows.


The relevant part of the code is as follows (the entire program is 104 lines
long and can be posted on request):


import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.FileOutputStream;
import java.io.IOException;

public class PoiTest
{
    public static void main(String[] args)
    {
        if(args.length!=4) usage("need four command arguments");

        Workbook workBook=createWorkbook(args[0]);
        boolean isHType=workBook instanceof HSSFWorkbook;

        int rows=parseInt(args[1],"Failed to parse rows value as integer");
        int cols=parseInt(args[2],"Failed to parse cols value as integer");
        boolean saveFile=parseInt(args[3],"Failed to parse saveFile value as integer")!=0;

        Sheet sheet=workBook.createSheet("Main Sheet");
        int sheetNo=0;
        int rowIndexInSheet=0;
        double value=0;
        for(int rowIndex=0;rowIndex<rows;rowIndex++)
        {
            if(isHType&&sheetNo!=rowIndex/0x10000)
            {
                sheet=workBook.createSheet("Spillover from sheet "+(++sheetNo));
                rowIndexInSheet=0;
            }

            Row row=sheet.createRow(rowIndexInSheet);
            for(int colIndex=0;colIndex<cols;colIndex++)
            {
                Cell cell=row.createCell(colIndex);
                cell.setCellValue(value++);
            }
            rowIndexInSheet++;
        }
        // ... saving to disk (when saveFile is set) and the helper methods
        // usage(), createWorkbook() and parseInt() are in the full program.
    }
}


The good news is that processing time and memory consumption grow linearly
with the document size. The bad news is that the memory requirement in
particular is much too high for us.
The values are:


HSSF: 77 bytes/cell

XSSF: about 630 bytes/cell



Translated into rows and cells this means that with a heap space of 256 MB,
one can produce 101,000 rows using HSSF but only 12,300 rows
using XSSF. Using XSSF we can't even get over the 65,536-row limit with the
maximum of 1.1 GB heap space.
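For reference, the row counts above follow from simple division; here is a back-of-the-envelope sketch (our own illustration, ignoring fixed JVM and library overhead, which is why it slightly overestimates the measured numbers):

```java
public class HeapCapacity {
    // Rough rows-that-fit estimate: heapBytes / (bytesPerCell * columns).
    static long maxRows(long heapBytes, int bytesPerCell, int columns) {
        return heapBytes / ((long) bytesPerCell * columns);
    }

    public static void main(String[] args) {
        long heap = 256L * 1024 * 1024; // -Xmx256m
        System.out.println(maxRows(heap, 77, 33));  // HSSF: ~105,000 (measured: 101,000)
        System.out.println(maxRows(heap, 630, 33)); // XSSF: ~12,900 (measured: 12,300)
    }
}
```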



In order to understand the numbers, we wrote a "naive" model using 
TreeMaps for both rows and cells and got figures similar to the HSSF 
values. By replacing the TreeMap in the rows with an array, the memory
consumption dropped to 37 bytes per cell.
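To illustrate the difference between the two storage strategies, here are hypothetical simplified versions of a keyed and a dense cell store (illustrative class names, not our actual test code):

```java
import java.util.TreeMap;

// Keyed store: every cell costs a boxed key, a boxed value and a
// TreeMap.Entry with left/right/parent references.
class MapRow {
    final TreeMap<Integer, Double> cells = new TreeMap<>();
    void set(int col, double v) { cells.put(col, v); }
    double get(int col) { return cells.get(col); }
}

// Dense store: 8 bytes per cell, no per-entry objects at all.
class ArrayRow {
    final double[] cells;
    ArrayRow(int width) { cells = new double[width]; }
    void set(int col, double v) { cells[col] = v; }
    double get(int col) { return cells[col]; }
}
```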
We naively assumed that a cell needed only two member variables, a 
reference to the row (or sheet) and a reference to the value object. We 
looked at the class HSSFCell to see what member variables were actually 
used and found the following:




There are references to both the sheet (_sheet) and the workbook
(_book). Isn't it possible to remove _book and implement getBoundWorkbook()
as getSheet().getWorkbook()?


The values are apparently stored in _record based on _cellType to
cater for the different data types (double, date, string, ...). Why not get
rid of the type field and query the record for the type (getCellType() {
return _record.getCellType(); })? The case of setting a style before a
value can be handled by assigning a "type only" value.


It seems that the member variable _stringValue is used to store string 
values. Couldn't this be stored in _record?


The member variable _comment apparently stores a cell comment.
Assuming that on average there are more values than comments, one could
surely find a more efficient storage strategy. As an example, one could
introduce extra value types so that for every cell record type there is a
commented and a non-commented version (e.g. DoubleCellValueRecord,
CommentedDoubleCellValueRecord).


Looking at the storage method used in the rows (HSSFRow) to store the cells,
there is also potential for simple memory optimization. Currently, the cells
are stored in a vector that grows its capacity by doubling, starting with an
initial size of 5. A spreadsheet of 81 columns and 400,000 rows therefore
wastes 79 slots per row (79*400,000, about 32 million entries). Keeping a
list of the row widths seen so far would make the allocation much faster and
avoid the waste. Aren't most sheets square, so that the list would have only
one entry?
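The waste from doubling growth can be sketched like this (illustrative only; the actual HSSFRow internals differ):

```java
public class GrowthWaste {
    // Capacity reached by doubling from an initial size until it holds n elements.
    static int doublingCapacity(int initial, int n) {
        int cap = initial;
        while (cap < n) cap *= 2;
        return cap;
    }

    public static void main(String[] args) {
        int cols = 81;
        int cap = doublingCapacity(5, cols);          // 5 -> 10 -> 20 -> 40 -> 80 -> 160
        System.out.println(cap - cols);               // 79 wasted slots per row
        System.out.println((cap - cols) * 400_000L);  // ~32 million wasted slots overall
    }
}
```

Pre-sizing each row's store to the width of the previous row would leave zero waste on a square sheet.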




All these remarks are made based on a quick glance at the code so 
there might be a very good explanation why things need to be the way 
they are.


Regarding XSSF, it seems that there is a more basic problem. Can an all-purpose
(xmlbeans) model be as efficient as a custom model? Can the memory
consumption realistically be lowered from the current 630 bytes/cell to 37
bytes/cell without significant loss of performance (which isn't great to
begin with)? An optimized model can make use of the knowledge that sheets
are commonly square, that they tend to be taller than wide, that there is
usually a lot more data than comments, that the cell style tends to be the
same in larger rectangular areas, etc. Can the all-purpose model ever have
the same efficiency as a model taking these issues into account?
Wouldn’t it likely treat comments the same way as values and manage cell
styles in a wasteful way since it doesn't know anything about the typical
use of styles?


A solution that would perhaps solve the problem would be to have a 
common in-memory model for both HSSF and XSSF and just have two separate 
serializers for the different formats. We would appreciate the possibility
to define a custom model since the general model needs to be efficient for
random access and we don’t need that at all in our application. Instead, the
model is written left-to-right, top-to-bottom and it is accessed for reading
only at the very end for the purpose of saving the document to disk. A model
designed for this purpose only can be implemented very efficiently.
 

Find all the results of the tests in the table below (all tests were done
with version 3.7):



#Space tests

#time java  -Xmx64m  PoiTest        HSSF   25200 33 0 #  77 byte/cell, 104 %
#time java  -Xmx64m  NaiveModelTest NAIVE  24310 33 0 #  80 byte/cell, 100 %
#time java  -Xmx64m  PoiTest        XSSF    3050 33 0 # 636 byte/cell,  12 %

#time java -Xmx256m  PoiTest        HSSF  101000 33 0 #  76 byte/cell, 104 %
#time java -Xmx256m  NaiveModelTest NAIVE  97300 33 0 #  80 byte/cell, 100 %
#time java -Xmx256m  NaiveModelTest ARRAY 210000 33 0 #  37 byte/cell, 216 % !
#time java -Xmx256m  PoiTest        XSSF   12300 33 0 # 631 byte/cell,  13 %

#time java -Xmx256m  PoiTest        HSSF  192500 16 0 #  83 byte/cell, 100 %
#time java -Xmx256m  NaiveModelTest NAIVE 193000 16 0 #  83 byte/cell, 100 %
#time java -Xmx256m  PoiTest        XSSF   25000 16 0 # 640 byte/cell,  13 %

#time java -Xmx256m  PoiTest        HSSF  336000  8 0 #  95 byte/cell,  93 %
#time java -Xmx256m  NaiveModelTest NAIVE 361000  8 0 #  83 byte/cell, 100 %
#time java -Xmx256m  PoiTest        XSSF   48000 8 0  # 640 byte/cell,  13 %

#time java -Xmx1100m PoiTest        HSSF  434000 33 0 #  77 byte/cell, 104 %
#time java -Xmx1100m NaiveModelTest NAIVE 417000 33 0 #  80 byte/cell, 100 %
#time java -Xmx1100m PoiTest        XSSF   53100 33 0 # 628 byte/cell,  13 %

#Speed tests (test document generation in memory without saving to disk)

#time java -Xmx1100m PoiTest        HSSF  380000 33 0 # (16s),  784,000 cells/s, 113 %
#time java -Xmx1100m NaiveModelTest NAIVE 380000 33 0 # (18s),  667,000 cells/s, 100 %
#time java -Xmx1100m NaiveModelTest ARRAY 380000 33 0 # ( 7s), 1,791,000 cells/s, 269 %
#time java -Xmx1100m PoiTest        XSSF   50000 33 0 # (36s),   45,800 cells/s,   7 % (5.8 % of HSSF performance) !

#Speed tests including saving to the disk

#time java -Xmx256m  PoiTest        HSSF  50000 33 1 # ( 13s, 31   MB), 127,000 cells/s, 19  byte/cell on disk
#time java -Xmx256m  NaiveModelTest NAIVE 50000 33 1 # ( 45s, 59   MB),  37,000 cells/s, 35  byte/cell on disk
#time java -Xmx256m  NaiveModelTest ARRAY 50000 33 1 # ( 25s, 42   MB),  66,000 cells/s, 25  byte/cell on disk
#time java -Xmx1100m PoiTest        XSSF  40000 33 1 # (110s,  4.6 MB),  12,000 cells/s,  3.5 byte/cell on disk !



I’m looking forward to your comments. Thanks a lot for your time,

Alex



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4312784.html
Sent from the POI - User mailing list archive at Nabble.com.

Re: HSSF and XSSF memory usage, some numbers

Posted by naresh <d....@gmail.com>.
Hi Nick,

I have used "BigGridDemo" from examples for writing data into ".xlsx" files.

I have tested with below sets of data:

1. Rows(<=480000) X Columns(155): Written successfully, and I can open the
Excel file. File size: 257MB.
2. Rows(>490000) X Columns(155): Written successfully, but I cannot open the
Excel file. File size: 525MB. Below is the error message it shows:
Excel found unreadable content in 'big-grid.xlsx'. Do you want to recover
the contents of this workbook? If you trust the source of this workbook,
click Yes.

Do you have any idea what could be the problem?
I don't understand whether this problem is with Excel 2007, my machine
configuration, or with BigGridDemo itself.

Please help.

Thanks & Regards,
Naresh.D

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4447474.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: HSSF and XSSF memory usage, some numbers

Posted by Alex Geller <ag...@4js.com>.
Hi,
as requested I added a patch via bugzilla
(https://issues.apache.org/bugzilla/show_bug.cgi?id=51160) to start the
process.
Thanks again,
Alex

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4375903.html
Sent from the POI - User mailing list archive at Nabble.com.



Re: HSSF and XSSF memory usage, some numbers

Posted by Alex Geller <ag...@4js.com>.
OK, very good, I will upload the code to Bugzilla. This might take a day or
two because I need to set up the required build environment (you use Ant, I
hope?) and I need to modify the sources (I can use subclassing instead of
composition once the code is part of the POI package; this will shrink the
code considerably because a large number of proxy functions will no longer
be needed).


Just to be sure, here is a more detailed description of what exactly I have
done:
 

The BigGridDemo can be described as follows:



Create a XSSFWorkbook and from it any number of XSSFSheets via
Workbook.createSheet().


You can call any method on both the Workbook and the sheet except
Sheet.createRow().


Create XML files using the class SpreadsheetWriter that contain the rows and
cells of the data.


Call Workbook.write() to save the document to a temporary file. Then replace
the sheet files contained in this file with the XML documents containing
the rows and cells of the data to create the final result file.



This scheme was improved so that one may also call Sheet.createRow() and
Row.createCell(), and one may call any methods on the classes Row and
Cell.


In analogy to the description above, this scheme can be described as follows:



Create a SXSSFWorkbook and from it any number of SXSSFSheets via
Workbook.createSheet()


You can call any method on both the Workbook and the sheet.


Create any number of SXSSFRows via Sheet.createRow()


You can call any method on the created Rows.


There is a limitation on how many rows can be accessed in memory at one
time. The default is 5000 rows. When the 5001st row is created, the first
row is flushed to disk. The value of 5000 can be changed to some other value
via a call to SXSSFSheet.setRandomAccessWindowSize(int value). Setting a
value of -1 disables the automatic flushing; flushing can then be done
manually by calling SXSSFSheet.flushRows(int remaining).


Create any number of SXSSFCells via Row.createCell()


You can call any method on the created Cells.


Call Workbook.write() to save the document to the final result file. The
patching is done transparently.


Note that there is no imposed order (apart from requiring Workbook.write()
to be called at the end) on the calling sequence of functions. Just like in
the normal API, one can switch between workbooks, sheets, rows and cells
freely as long as the rows are still in the random access window.
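Typical usage of the scheme just described would look roughly like this (a sketch; the SXSSF classes were only a proposed patch at the time of this thread, and the package path shown is an assumption based on how the classes are named):

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SxssfSketch {
    // The same cell value the PoiTest program writes: a running counter.
    static double valueFor(int row, int col) { return row * 33.0 + col; }

    public static void main(String[] args) throws Exception {
        Workbook wb = new SXSSFWorkbook(); // drop-in for "new XSSFWorkbook()"
        Sheet sheet = wb.createSheet("Main Sheet");
        for (int r = 0; r < 300_000; r++) {
            // Rows older than the random access window have already been
            // flushed to disk and can no longer be touched.
            Row row = sheet.createRow(r);
            for (int c = 0; c < 33; c++) {
                Cell cell = row.createCell(c);
                cell.setCellValue(valueFor(r, c));
            }
        }
        FileOutputStream out = new FileOutputStream("big.xlsx");
        wb.write(out); // temporary sheet data is stitched into the final .xlsx here
        out.close();
    }
}
```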


Regards,

Alex



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4372383.html
Sent from the POI - User mailing list archive at Nabble.com.

Re: HSSF and XSSF memory usage, some numbers

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 4 May 2011, Alex Geller wrote:
> The good news is that it is fast enough, uses little memory and handles 
> all the decorational items we were worrying about (CellStyles, 
> MergedRegions and Images), just fine. A .xlsx file with 300,000 rows and 
> 33 columns is generated in 50 seconds using the default heap space of 64 
> MB.

Great news

> The BigGridDemo should be improved to replace the "sheetData" element 
> with the real data rather than replacing the complete file, because 
> MergedRegions, Image references and other information such as margins 
> are otherwise lost on overwriting the file.

I think (as you hint later) that the demo is probably approaching the 
point where it can be re-done as a streaming api similar to the read one, 
rather than an example you write your code into.

> Since we need to continue supporting HSSF as well as XSSF we implemented 
> an API compatible streaming version of XSSF called SXSSF ("S" for 
> streaming) so that we wouldn't have to change the code and could have 
> identical code for both file formats (Apart from the "Workbook" 
> construction of course (change from "new XSSFWorkbook" to "new 
> SXSSFWorkbook")).

Interesting idea. I'd probably say that an event model writing code would 
be the first thing to nail down, then possibly an event backed workbook 
implementation would be next after that. If you've done much of that then 
it could help!


If you're happy, what I'd suggest is you post your code to a new bug in 
bugzilla, then we all work together to get the code into svn


> We have some open points which don't bother us for the moment because we 
> are using only string, numeric and date cell values. We are not using 
> formulas, rich text, hyperlinks and comments but we may in the future:
>
> What cell types and values have to be written into the "c" elements of 
> the "sheetData" structure for the cell types BLANK, FORMULA and ERROR? 
> How is rich text written?

The quick way to check is just to look at some example files. Full details 
are in the microsoft specifications, which should be linked from the POI 
website.

> How are comments and hyperlinks written into the file?

With a fair bit of xml around them... Take a look at some example files to 
see. It's quite fiddly, the xssf usermodel code should show you the 
details too

> What are the exact semantics of Sheet.shiftRows()? Why is the code in 
> HSSFSheet complicated? Is it perhaps because of formulas that are cell 
> relative that need to be recomputed in the new location?

HSSF is often more complicated than XSSF, as more of the work needs to be 
done and more things need to be kept in sync. Formulas are just one bit

> What should the return value for Row.getZeroHeight() be if 
> Row.setZeroHeight() has not been called previously?

I'd suggest you crib off what xssf does?

> How are cell formulas parsed? How is the type determined? What is the 
> lifecycle of formulas? What is the precomputed value and when is it 
> precomputed? What is the difference between "precomputed" and "cached" 
> and when are values cached?

XSSF is much simpler than HSSF for this. Formulas are stored as text, and 
the cached/precomputed (same thing) value goes inline with them. You 
normally set those values at the end of writing the cells, using the 
formula evaluator. However, as you don't have all the cells in memory at 
once, any references will be very tricky. The best bet might be to set the 
force calculation flag, skip writing the cached values, and let excel do 
them on next load
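In code, that suggestion looks roughly like this (a sketch; sumFormula is a hypothetical helper, and the availability of the force-recalculation setter on the common Sheet interface in POI 3.7 is an assumption based on later releases):

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ForceRecalcSketch {
    // Formulas are stored as plain text; this builds the one we write below.
    static String sumFormula(int lastRow) { return "SUM(A1:A" + lastRow + ")"; }

    public static void main(String[] args) {
        Workbook wb = new XSSFWorkbook();
        Sheet sheet = wb.createSheet("Formulas");
        for (int r = 0; r < 10; r++) {
            sheet.createRow(r).createCell(0).setCellValue(r + 1.0);
        }
        Cell total = sheet.createRow(10).createCell(0);
        total.setCellFormula(sumFormula(10)); // no cached value is written here
        // Skip evaluating cached values and let Excel recalculate on next load:
        sheet.setForceFormulaRecalculation(true);
    }
}
```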

Nick



Re: HSSF and XSSF memory usage, some numbers

Posted by Alex Geller <ag...@4js.com>.
Hi,

I would like to give some feedback after implementing a solution based on
the "BigGridDemo".
The good news is that it is fast enough, uses little memory and handles all
the decorative items we were worrying about (CellStyles, MergedRegions and
Images) just fine. A .xlsx file with 300,000 rows and 33 columns is
generated in 50 seconds using the default heap space of 64 MB.



The BigGridDemo should be improved to replace the "sheetData" element with
the real data rather than replacing the complete file, because
MergedRegions, Image references and other information such as margins are
otherwise lost on overwriting the file.


Since we need to continue supporting HSSF as well as XSSF we implemented an
API compatible streaming version of XSSF called SXSSF ("S" for streaming) so
that we wouldn't have to change the code and could have identical code for
both file formats (Apart from the "Workbook" construction of course (change
from "new XSSFWorkbook" to "new SXSSFWorkbook")).


The bad news is that we worry a little about maintaining the code. There
are a number of reasons for this, which I would like to state:

    
We need to change our code each time a new method is added to the interfaces
"Workbook", "Sheet", "Row" or "Cell", since we implemented those from scratch
(Workbook and Sheet could possibly have been subclassed instead of using
composition).
    
    
There is a fundamental fragility in the solution, caused by the fact that
the BigGridDemo strategy makes assumptions about inner workings of POI that
could change from one version to another. What if, for example, the
XSSFCell class gets a new package-private method XSSFCell.setCellSpan(int
rowspan, int colspan) and the public method XSSFSheet.addMergedRegion() is
implemented on top of it? Since the BigGridDemo strategy doesn't create any
rows or cells, merged regions would stop working after this change.
    



We have some open points which don't bother us for the moment because we are
using only string, numeric and date cell values. We are not using formulas,
rich text, hyperlinks and comments but we may in the future:

    
What cell types and values have to be written into the "c" elements of the
"sheetData" structure for the cell types BLANK, FORMULA and ERROR? How is
rich text written?
    
    
How are comments and hyperlinks written into the file?
    
    
What are the exact semantics of Sheet.shiftRows()? Why is the code in
HSSFSheet complicated? Is it perhaps because of formulas that are cell
relative that need to be recomputed in the new location?
    
    
What should the return value for Row.getZeroHeight() be if
Row.setZeroHeight() has not been called previously?
    
    
How are cell formulas parsed? How is the type determined? What is the
lifecycle of formulas? What is the precomputed value and when is it
precomputed? What is the difference between "precomputed" and "cached" and
when are values cached?
    



All in all, we are quite happy now but we would very much prefer if you
agreed to maintain this code instead of us, for the reasons mentioned above.
Maybe there are other users with similar problems that would appreciate a
transparent API compatible version of the BigGridDemo solution too.

Thanks again,

Alex
 


--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4370377.html
Sent from the POI - User mailing list archive at Nabble.com.

Re: HSSF and XSSF memory usage, some numbers

Posted by Alex Geller <ag...@4js.com>.
Thanks for the really quick reply. I am having a deeper look at the
BigGridDemo now. It looks good at first glance. I was worried about the
cell styles, but that seems to be taken care of. The next thing I need to
validate is that merged regions can be done too. If we don't encounter any
other difficulties then this is indeed a workable option.
Regards,
Alex

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4313202.html
Sent from the POI - User mailing list archive at Nabble.com.



Re: HSSF and XSSF memory usage, some numbers

Posted by naresh <d....@gmail.com>.
Hi Nick,

I am very happy to see a solution for the big-data problem when writing
with POI.

I have not yet completely evaluated BigGridDemo for all my requirements.

What I am looking for is that, in my project, I want to write to xlsx files
but the data size can range from very small (10 x 4 cells) to large
(1,000,000 x 200 cells).
I want to know whether BigGridDemo is suitable only for large data or
whether it can be used for small data as well, with optimal memory use and
performance (time to write).

I am very glad to see your reply about my query.

Thanks,
Naresh

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4409080.html
Sent from the POI - User mailing list archive at Nabble.com.



Re: HSSF and XSSF memory usage, some numbers

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 19 Apr 2011, Alex Geller wrote:
> We need different styles, colspan, rowspan, etc. because the output is
> supposed to resemble the layout of the report as closely as possible. This
> keeps us from using the csv trick. For the same reason, we suspect that the
> XML zip injection trick (see  http://www.realdevelopers.com/blog/code/excel
> Streaming xlsx files ) that can also be found on this forum
> cannot be applied either.
> Is this assumption correct? The XML for the data looks straightforward but
> what about other issues like cell styles?

Take a look at the BigGridDemo - it may do what you need. The idea 
there is to use the friendly UserModel code to generate fiddly bits like 
fonts, styles, formatting etc. Then generate the data with low-level 
XML streaming, and merge the two.

This should hopefully let you generate a fairly rich file, with lots of 
data, without using much memory
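The approach Nick describes can be condensed to the following sketch, modeled loosely on the BigGridDemo example (file names, the cell-value scheme and the inlined XML are simplified illustrations; styles and shared strings are omitted):

```java
import java.io.*;
import java.util.Enumeration;
import java.util.zip.*;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class BigGridSketch {
    // A single numeric cell in SpreadsheetML, e.g. <c r="A1"><v>0</v></c>.
    static String cellXml(String ref, long v) {
        return "<c r=\"" + ref + "\"><v>" + v + "</v></c>";
    }

    public static void main(String[] args) throws Exception {
        // 1. Build a "template" with the usermodel: styles, fonts, sheet skeleton.
        XSSFWorkbook wb = new XSSFWorkbook();
        XSSFSheet sheet = wb.createSheet("Big Grid");
        String sheetRef = sheet.getPackagePart().getPartName().getName().substring(1);
        try (FileOutputStream os = new FileOutputStream("template.xlsx")) {
            wb.write(os);
        }

        // 2. Stream the sheet XML with plain IO -- no per-cell objects retained.
        File sheetXml = File.createTempFile("sheet", ".xml");
        try (Writer w = new OutputStreamWriter(new FileOutputStream(sheetXml), "UTF-8")) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            w.write("<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\"><sheetData>");
            for (int r = 0; r < 150_000; r++) {
                w.write("<row r=\"" + (r + 1) + "\">");
                for (int c = 0; c < 33; c++) {
                    String ref = new CellReference(r, c).formatAsString();
                    w.write(cellXml(ref, (long) r * 33 + c));
                }
                w.write("</row>");
            }
            w.write("</sheetData></worksheet>");
        }

        // 3. Copy the template zip, swapping in the streamed sheet part.
        try (ZipFile zip = new ZipFile("template.xlsx");
             ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("big-grid.xlsx"))) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                zos.putNextEntry(new ZipEntry(e.getName()));
                InputStream in = e.getName().equals(sheetRef)
                        ? new FileInputStream(sheetXml) : zip.getInputStream(e);
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) > 0; ) zos.write(buf, 0, n);
                in.close();
            }
        }
    }
}
```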


> Translated into rows and cells this means that with a heap space of 256 MB,
> one can produce 101,000 rows using HSSF and only 12,300 rows
> using XSSF. Using XSSF we can't even get over the 65,536-row limit with the
> maximum of 1.1 GB heap space.

The usual answer for XSSF is either to use something along the lines of the 
BigGridDemo style, or just bump up your heap size (8gb memory modules 
usually cost something like half a day's billable rate for a programmer, 
so you can buy a lot of memory for the price of someone optimising the 
code...)

> There are references to both the sheet (_sheet) and the workbook 
> (_book). Isn't it possible to remove _book and implement getBoundWorkbook() 
> as getSheet().getWorkbook()?

Possibly. Are you able to use your test rig to check the performance 
impact of this?

> The values are apparently stored in _record based on _cellType to cater 
> for the different data types (double, date, string, ...). Why not get rid 
> of the type field and query the record for the type (getCellType() { 
> return _record.getCellType(); })? The case of setting a style before a 
> value can be handled by assigning a "type only" value.

I think we've generally gone for the simplest option. If you can see how 
this'd work and would save memory, please send in a patch and we'll look 
at applying it

> It seems that the member variable _stringValue is used to store string
> values. Couldn't this be stored in _record?

We need to store the parsed form somewhere. Wouldn't it be the same memory 
use no matter if we stored it against the cell or the cell's record?

> The member variable _comment apparently stores a cell comment.

Finding a cell's comment is a bit tricky, so we cache it once we locate 
it.

> Assuming that on average there are more values than comments one could
> surely find a more efficient storage strategy. As an example one could
> introduce extra value types so that for every cell record type there is a
> commented and a non-commented version (e.g. DoubleCellValueRecord,
> CommentedDoubleCellValueRecord).

Hmm, that doesn't look very clean to me. One thing that we could do is 
push the cache down into the sheet, since that's where the records are 
stored. If we used a map there to cache the comments once created from 
records, that'd probably help with the memory footprint, wouldn't it? 
Assuming so, please send in a patch and I'll review + apply.

> Looking at the storage method used in the rows (HSSFRow) to store the 
> cells, there is also potential for simple memory optimization. 
> Currently, the cells are stored in a vector that grows the capacity by 
> doubling, starting with an initial size of 5. A spreadsheet of 81 
> columns and 400,000 rows wastes 79 slots per row (79*400,000 entries).

I'd be tempted to switch this to just using an ArrayList, instead of 
handling it ourselves. We could probably also do something smart with the 
sizing of the row when reading in, because we can probably figure out then 
how many cells we have. Would that help for your case? If so, please 
either send in a patch, or give me a shout and I'd be happy to tackle that

> Keeping a list of the row widths seen so far can make the allocation 
> much faster and avoid the waste. Aren't most sheets square so that the 
> list would have only one entry?

Not sure where that logic is to check, but if you'd like to send in a 
patch I'll happily review it, or point me at the code and I'll look and 
comment :)


> Regarding XSSF it seems that there is a more basic problem. Can an all 
> purpose (xmlbeans) model be as efficient as a custom model?

Almost certainly not. It's a hell of a lot quicker to code though!

> Can the memory consumption realistically be lowered from the current 630 
> bytes/cell to 37 bytes/cell without significant loss of performance 
> (which isn't great to begin with)?

It's not impossible that something specific and lightweight could be coded 
up for a few hot bits, though I've never tried it. The issue is that at 
the moment, most of the people volunteering their time to work on POI 
can't spend as long as they'd like working on POI. The resource that's 
short is programmer time, and for us it isn't memory (there are 
workarounds like BigGridDemo that work well enough)

If the XSSF memory is a problem for you, and if you have some programmer 
time to throw at it, we'd love for you to help! However so far everyone 
who's hit problems has either switched to BigGridDemo, or thrown a grand 
at their favourite server manufacturer and bought 16gb of memory to make 
the problem go away...

> A solution that would perhaps solve the problem would be to have a
> common in-memory model for both HSSF and XSSF and just have two separate
> serializers for the different formats.

The two formats probably aren't quite close enough for this. We've got 
common interfaces, but the code underneath is different enough that it 
needs different logic. Some bits are common, see the concrete classes in 
org.apache.poi.ss.usermodel and ss.util for those, but the rest currently 
needs to differ


Otherwise, thanks for doing all this checking! And if you have some time 
to help work on solutions, we'd love for you to help and send in patches 
to improve things :)

Cheers
Nick