Posted to dev@poi.apache.org by Chris Nokleberg <ch...@sixlegs.com> on 2003/07/29 20:30:26 UTC

Reducing HSSF memory use

I've been poking around a little bit into the 3.0 branch. One issue that
pops out is that the parallel arrays in ValueRecordsAggregate are
sparse. This makes iteration slower (have to skip empty cells) and uses
more memory. In addition, all of the arrays must be the same size, even
though some are inherently sparser (e.g. formulaptgs).

One possibility is to eliminate totally empty rows and columns from each
of the data sets. Separate row and column index arrays would be needed
for each data set to map into the value arrays. Random access would use
linear search on the index arrays, followed by a simple lookup into the
value array.
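To make the index-array idea concrete, here is a minimal sketch (all names hypothetical, not actual POI classes): one sorted array holds the row numbers that actually contain data, and a parallel array holds only the non-empty values, so random access is a linear scan of the index followed by a direct lookup.

```java
// Sketch of the proposed sparse storage: a sorted row-index array
// parallel to a dense value array, with no slots for empty cells.
public class SparseColumn {
    private final int[] rows;      // sorted row numbers that actually hold a value
    private final double[] values; // parallel value array, no empty slots

    public SparseColumn(int[] rows, double[] values) {
        this.rows = rows;
        this.values = values;
    }

    // Random access: linear search on the index array, then a simple lookup
    // into the value array.
    public Double get(int row) {
        for (int i = 0; i < rows.length; i++) {
            if (rows[i] == row) return values[i];
            if (rows[i] > row) break; // rows are sorted; the cell is empty
        }
        return null; // empty cell, which consumed no storage at all
    }
}
```

Iteration never has to skip empty cells, because empty cells are simply not stored.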

For even better performance and memory use, I would recommend moving
away from the current sheet.getRow(i).getCell(j) model. That model
forces a certain level of object creation which is not always
necessary. Something like a JDBC ResultSet is more appropriate for
managing large amounts of data, where you reuse the same object, and
just the underlying values change as you iterate. Example:

  // Hypothetical API sketch: ColumnIterator implements Cell and walks
  // column 3, reusing a single object as it advances.
  Cell cell = new ColumnIterator(sheet, 3);
  while (cell.next()) {
      if (cell.getType() == Cell.STRING) {
          System.out.println(cell.getValue());
      }
  }

In a read-only, low-memory scenario, this type of API would allow you to
get rid of *all* the data storage in ValueRecordsAggregate. The
iterators could advance over the underlying RandomAccessFile directly,
doing the necessary conversions on the fly. For modifying or writing
sheets, something like ValueRecordsAggregate will still be necessary.
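The read-only case can be illustrated with a toy cursor (hypothetical names throughout; a byte stream stands in here for the RandomAccessFile so the sketch is self-contained): the cursor re-populates its own fields from each record as it advances, so no per-cell objects are ever created.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Toy reusable cursor over a stream of fixed-size cell records
// (4-byte row + 8-byte value); conversions happen on the fly.
public class CellCursor {
    private final DataInputStream in;
    private int row;
    private double value;

    public CellCursor(byte[] records) {
        this.in = new DataInputStream(new ByteArrayInputStream(records));
    }

    // Advance to the next record, overwriting this cursor's fields in place.
    public boolean next() {
        try {
            if (in.available() < 12) return false;
            row = in.readInt();
            value = in.readDouble();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public int getRow()      { return row; }
    public double getValue() { return value; }
}
```

The same loop shape works whether the bytes come from memory or, as suggested above, straight off the underlying file.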

Chris

Re: Reducing HSSF memory use

Posted by Chris Nokleberg <ch...@sixlegs.com>.
On Tue, Jul 29, 2003 at 02:48:29PM -0400, Andrew C. Oliver wrote:
> Do you think the tradeoff would be worth it in a read-write scenario?  I'm
> leaning towards completing this level of refactoring before going
> another round.  Though if you want to take a go at it, PLEASE do so.

I may do a little technology demo to test the feasibility...I don't have
enough knowledge of POI internals yet to make it worthwhile to try
patching the real thing.

Our messages just passed each other, but in summary I do think that in a
read-write scenario it is still useful. If you end up modifying every
cell, of course, you will not save any memory, but I doubt that is very
common.

> There are still several places left where we can remove object creation.  I
> was going for the least radical and invasive and figured we'd iterate from
> there.

Sure. I am just concerned that the wrong high-level API may affect how much
performance you can end up squeezing out in the long run.

> We actually have an API for read-only which will become more efficient soon.
> It's a reactor pattern which allows you to specify *what* types of data
> you're interested in.  I plan to add more granularity ("only interested in
> rows x-y or columns i-n", etc).  I think this is actually more efficient
> than a cursor approach, though please attempt to persuade me otherwise.

I do see some value in the reactor pattern. Actually, for our PowerPoint
thing we set up a JXPath-like navigator so you could write an XPath
expression to pull out the exact set of records you wanted, e.g. all
text record strings containing "Chris" that are within a yellow
rectangle.

But taking a flexible query and generating an optimized access path is
not trivial. XSLT is very fast, but only in theory :-) 

With an iterator-based API it is relatively simple to get good
performance. The simple reactor implementation of just blowing through
all records and filtering out the ones you don't want is not going to be
faster. And you can also use the iterator for writing... maybe "iterator"
isn't the best name; it is more of a "window":

  Cell cell = new Cell();
  cell.setSheet(sheet1);
  cell.setRow(3);
  cell.setColumn(4);
  cell.setValue("Hello");
  cell.setColumn(5);
  cell.setValue("World");
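
A toy version of that "window" object might look like the following (all names hypothetical; a HashMap stands in for the sheet's real record storage): the single reusable object writes through to the backing store at whatever position it currently points to.

```java
import java.util.Map;

// Toy "window" cell: one reusable object positioned over a backing store.
public class CellWindow {
    private final Map<String, Object> sheet; // stand-in for the sheet's storage
    private int row, column;

    public CellWindow(Map<String, Object> sheet) { this.sheet = sheet; }

    public void setRow(int row)       { this.row = row; }
    public void setColumn(int column) { this.column = column; }

    // Writing through the window updates the sheet at the current position;
    // the window itself is reused for every cell it visits.
    public void setValue(Object value) { sheet.put(row + "," + column, value); }
    public Object getValue()           { return sheet.get(row + "," + column); }
}
```

Moving the window is just a field assignment, so visiting a million cells allocates one object, not a million.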
  
Chris

Re: Reducing HSSF memory use

Posted by Chris Nokleberg <ch...@sixlegs.com>.
On Tue, Jul 29, 2003 at 11:30:26AM -0700, Chris Nokleberg wrote:
> For even better performance and memory use, I would recommend moving
> away from the current sheet.getRow(i).getCell(j) model. That model
> forces a certain level of object creation which is not always
> necessary. Something like a JDBC ResultSet is more appropriate for
> managing large amounts of data, where you reuse the same object, and
> just the underlying values change as you iterate. Example:

I was just browsing the archives and came across this message:
http://article.gmane.org/gmane.comp.jakarta.poi.devel/4878

FWIW, this describes what I have in mind almost exactly. It is useful
even for modifying files, since the unmodified records don't have to be
read in. Essentially you would be using a ValueRecordsAggregate layer
for modified or newly created objects, and unchanged objects would be
left in the underlying source layer.
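A minimal sketch of that two-layer arrangement (hypothetical names; maps stand in for the real aggregate and the on-disk source): reads consult the overlay of modified records first and fall back to the untouched base layer, so unchanged records never need to be read in and duplicated.

```java
import java.util.HashMap;
import java.util.Map;

// Toy overlay: modified/new records live in a small front layer,
// everything else stays in the unchanged underlying source.
public class OverlaySheet {
    private final Map<Integer, String> base;    // stand-in for records still on disk
    private final Map<Integer, String> overlay = new HashMap<>(); // modified cells

    public OverlaySheet(Map<Integer, String> base) { this.base = base; }

    public void set(int cell, String value) { overlay.put(cell, value); }

    // Reads prefer the overlay; untouched cells never leave the base layer.
    public String get(int cell) {
        String v = overlay.get(cell);
        return v != null ? v : base.get(cell);
    }
}
```

Writing the file out would walk both layers, taking each record from the overlay when present and streaming it from the source otherwise.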

Chris



Re: Reducing HSSF memory use

Posted by "Andrew C. Oliver" <ac...@apache.org>.
Do you think the tradeoff would be worth it in a read-write scenario?  I'm
leaning towards completing this level of refactoring before going
another round.  Though if you want to take a go at it, PLEASE do so.

There are still several places left where we can remove object creation.  I
was going for the least radical and invasive and figured we'd iterate from
there.

We actually have an API for read-only which will become more efficient soon.
It's a reactor pattern which allows you to specify *what* types of data
you're interested in.  I plan to add more granularity ("only interested in
rows x-y or columns i-n", etc).  I think this is actually more efficient
than a cursor approach, though please attempt to persuade me otherwise.
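
The reactor idea can be sketched generically (hypothetical names; this is not the actual POI event API): callers register interest by record type, and the reader makes one pass over the records, dispatching only the matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy reactor: declare *what* you are interested in; the reader pushes it at you.
public class RecordReactor {
    private final List<Class<?>> types = new ArrayList<>();
    private final List<Consumer<Object>> listeners = new ArrayList<>();

    // Register a listener for one record type.
    public void listen(Class<?> type, Consumer<Object> listener) {
        types.add(type);
        listeners.add(listener);
    }

    // Single pass over all records, dispatching each to interested listeners.
    public void process(List<Object> records) {
        for (Object record : records) {
            for (int i = 0; i < types.size(); i++) {
                if (types.get(i).isInstance(record)) listeners.get(i).accept(record);
            }
        }
    }
}
```

Finer granularity ("only rows x-y") would just mean richer predicates than a type check in the dispatch loop.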

-Andy

PS:

The Andy Queue (just so everyone asking for things from me knows where they
stand):

1. A SuperLink proposal (today/tomorrow)
2. Finish the JBoss XDoclet training that I'm working on (today/tomorrow)
3. Performance/etc testing Chris's POIFS2 (sometime in the next few days)
4. Finish JBossMail SMTP service
5. Work on resolving unit test errors in 3.0 / Work on 2.0 bugs



-- 
Andrew C. Oliver
http://www.superlinksoftware.com/poi.jsp
Custom enhancements and Commercial Implementation for Jakarta POI

http://jakarta.apache.org/poi
For Java and Excel, Got POI?