Posted to user@poi.apache.org by Daniel Noll <da...@nuix.com> on 2008/01/22 23:22:25 UTC

HSSF: Middle-ground API for reading an Excel spreadsheet

Hi all.

I was wondering if anyone had experimented with doing lazy parsing via the 
eventusermodel interface.  I've had an attempt at it myself but am running 
into various troubles.

The first one which is really problematic is that once I get a FormulaRecord, 
I can't find a way to convert that into the formula string.  Thankfully 
getting the value result is relatively simple.

Have the HSSF developers considered making an API half way between usermodel 
and eventusermodel, which can return HSSFCell instances one at a time without 
instantiating the entire spreadsheet?  It would be a really nice thing for 
saving memory.  (Although an implementation of the records which doesn't 
create copies of everything in memory would probably solve the memory 
problems almost as well.)

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 12 Feb 2008, Daniel Noll wrote:
>  - The file loaded from disk is merely one big ByteBuffer. (easy)
>
>  - A block in the file would be a ByteBuffer created as a subset over the
>    larger file ByteBuffer (easy, Java allows for this already)

This looks like it might be a little bit of work. It looks to me like most 
of the block creation/reading is done on the input stream one block at a 
time, with eof checks etc in there. So, I guess we'd need to change to 
just reading the whole lot into some sort of growable byte array, wrap 
that as a ByteBuffer, then change the block code to work on that.
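
A minimal sketch of that read-then-wrap step, using only the JDK. The class and method names are hypothetical and not part of POI. Note that toByteArray() makes one copy of the accumulated bytes, so peak usage is briefly about twice the file size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical helper, not POI API.
final class StreamToBuffer {
    // Drains the stream into a growable array, then wraps the result
    // as a read-only buffer that block code could take views over.
    static ByteBuffer readFully(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ByteBuffer.wrap(out.toByteArray()).asReadOnlyBuffer();
    }
}
```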

>  - A document would be a ByteBuffer created as a composite ByteBuffer over
>    the blocks which make it up (slightly less easy, requires custom
>    ByteBuffer subclass to be written but such a thing will be a useful
>    utility and probably should be in Commons if not the JRE itself.)

I guess we could do the blocks first, then have them return byte arrays to 
maintain current behaviour. Then, we write the new bytebuffer stuff, and I 
guess finally we tweak things like RecordInputStream

> Of course if someone writes to a document it's a different story.  You 
> would need to create a new ByteBuffer so as not to damage the original 
> file (unless you design it to write to the original file -- probably 
> harder.)

Currently, we just dump it all into a fresh output stream. I guess we 
could keep going with that, or possibly dump into a fresh ByteBuffer, 
which we can also pass into an output stream if wanted?
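
One way the "fresh ByteBuffer, then an output stream" idea could look, as a sketch with hypothetical names (nothing here is existing POI code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper, not POI API.
final class BufferToStream {
    // Writes the buffer's remaining bytes to the stream without
    // disturbing the caller's position or limit.
    static void writeTo(ByteBuffer source, OutputStream out) {
        ByteBuffer view = source.duplicate();
        WritableByteChannel channel = Channels.newChannel(out);
        try {
            while (view.hasRemaining()) {
                channel.write(view);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```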

Nick



Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Posted by Daniel Noll <da...@nuix.com>.
On Saturday 09 February 2008 05:37:12 Nick Burch wrote:
> I've been doing some reading up on ByteBuffer, and was wondering:
>
> On Mon, 4 Feb 2008, Daniel Noll wrote:
> >   1. Lower memory usage due to not keeping a byte[] copy of all data at
> > the POIFS level.
>
> How would this work? Surely we'll still need to read all the bytes that
> make up the whole poifs stream, then pass those into our underlying
> ByteBuffer? I couldn't figure out a way to do it without processing all
> the input stream at least once, since most of them won't support zipping
> about to different places
>
> >   2. If you don't ask for a DocumentInputStream for a given Document, the
> >      bytes don't even get read.  If you open a stream for a given
> > Document and only read the first part, the rest of the bytes don't even
> > get read.
>
> Again, not sure about that. I can see how we could possibly use a
> ByteBuffer to ensure we always use the same set of bytes in all the bits
> of poifs (and on up as required), but surely we'll still need to save the
> bytes of each DocumentInputStream, otherwise they'll be gone?

I don't follow.  Here's what I was thinking in more detail:

At the POIFS level:

  - The file loaded from disk is merely one big ByteBuffer. (easy)

  - A block in the file would be a ByteBuffer created as a subset over the
    larger file ByteBuffer (easy, Java allows for this already)

  - A document would be a ByteBuffer created as a composite ByteBuffer over
    the blocks which make it up (slightly less easy, requires custom
    ByteBuffer subclass to be written but such a thing will be a useful
    utility and probably should be in Commons if not the JRE itself.)

  - A new kind of DocumentInputStream is created which creates a fresh copy
    of the ByteBuffer state and uses that to implement an InputStream. (easy)

With this, even if callers read every input stream, it will use only slightly 
more memory than what they store themselves.  The main memory usage at the 
POIFS level would be the storage of which block offsets make up which 
documents, and the directory tree information.
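
The layering described above can be sketched with plain java.nio. This is an illustration under assumptions, not POI code: all names are hypothetical, block 0 is taken to start at offset 0 (a real OLE2 file has a header before the first block), and the document length is assumed to be a whole number of blocks. Because application code cannot subclass java.nio.ByteBuffer (its constructors are not accessible), the sketch models the "composite" as an ordered list of slices rather than a ByteBuffer subclass.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a document as an ordered list of block views.
final class BlockBackedDocument {
    private final List<ByteBuffer> blocks = new ArrayList<>();

    // file: the whole file; blockIndexes: the blocks of this document, in order.
    BlockBackedDocument(ByteBuffer file, int blockSize, int[] blockIndexes) {
        for (int index : blockIndexes) {
            ByteBuffer view = file.duplicate();
            view.position(index * blockSize);
            view.limit((index + 1) * blockSize);
            blocks.add(view.slice()); // shares the file's bytes, no copy
        }
    }

    // Each stream gets its own cursors, so streams do not interfere.
    DocumentStream openStream() {
        return new DocumentStream(blocks);
    }

    static final class DocumentStream extends InputStream {
        private final List<ByteBuffer> cursors = new ArrayList<>();
        private int current = 0;

        DocumentStream(List<ByteBuffer> blocks) {
            for (ByteBuffer block : blocks) {
                cursors.add(block.duplicate()); // fresh position, same bytes
            }
        }

        @Override
        public int read() {
            while (current < cursors.size()) {
                ByteBuffer cursor = cursors.get(current);
                if (cursor.hasRemaining()) {
                    return cursor.get() & 0xFF;
                }
                current++; // this block is exhausted, move to the next one
            }
            return -1;
        }
    }
}
```

The only per-document state is the list of views and, per open stream, one cursor per block, which matches the point that the main cost at this level is the block bookkeeping rather than the data.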

Of course if someone writes to a document it's a different story.  You would 
need to create a new ByteBuffer so as not to damage the original file (unless 
you design it to write to the original file -- probably harder.)

> > Of course the main beef I have with ByteBuffer is that it is limited to
> > Integer.MAX_VALUE size, but I guess with OLE2 this isn't, in practice,
> > going to be reached.  I imagine the maximum size for an OLE2 document is
> > somewhat lower, although I don't actually know.
>
> Nor do I, but I have a feeling it could well be 2gb too. Surely we have
> that 2gb limit already though, since we're reading the poifs data into a
> byte array, which has the same restriction?

True enough.

Daniel



Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Posted by Nick Burch <ni...@torchbox.com>.
I've been doing some reading up on ByteBuffer, and was wondering:

On Mon, 4 Feb 2008, Daniel Noll wrote:
>   1. Lower memory usage due to not keeping a byte[] copy of all data at the
>      POIFS level.

How would this work? Surely we'll still need to read all the bytes that
make up the whole poifs stream, then pass those into our underlying
ByteBuffer? I couldn't figure out a way to do it without processing all
the input stream at least once, since most of them won't support zipping
about to different places

>   2. If you don't ask for a DocumentInputStream for a given Document, the
>      bytes don't even get read.  If you open a stream for a given Document and
>      only read the first part, the rest of the bytes don't even get read.

Again, not sure about that. I can see how we could possibly use a
ByteBuffer to ensure we always use the same set of bytes in all the bits
of poifs (and on up as required), but surely we'll still need to save the
bytes of each DocumentInputStream, otherwise they'll be gone?

> Of course the main beef I have with ByteBuffer is that it is limited to
> Integer.MAX_VALUE size, but I guess with OLE2 this isn't, in practice,
> going to be reached.  I imagine the maximum size for an OLE2 document is
> somewhat lower, although I don't actually know.

Nor do I, but I have a feeling it could well be 2gb too. Surely we have
that 2gb limit already though, since we're reading the poifs data into a
byte array, which has the same restriction?


If we can get some memory savings without too much work by switching to
nio / bytebuffer stuff, I am keen to do it. I'm just struggling, almost
certainly due to being new to it all, to see how it'll deliver much of a
saving just yet. Do please educate me :)

Nick



Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Posted by Daniel Noll <da...@nuix.com>.
On Friday 01 February 2008 05:10:30 Avik Sengupta wrote:
> We've looked at an NIO based POIFS earlier, which is not simple
> (relatively), but doable, but doesn't help at all ...

It's true that it's not simple, I made an attempt to do it once before but 
failed.

I wouldn't say that it doesn't help at all though.

  1. Lower memory usage due to not keeping a byte[] copy of all data at the
     POIFS level.
  2. If you don't ask for a DocumentInputStream for a given Document, the
     bytes don't even get read.  If you open a stream for a given Document and
     only read the first part, the rest of the bytes don't even get read.
  3. Not everyone is reading OLE2 documents from a File in the first place.

All three of these benefits apply *even if* the changes don't cascade into 
HSSF and the other libraries which sit on top of POIFS.

Of course the main beef I have with ByteBuffer is that it is limited to 
Integer.MAX_VALUE size, but I guess with OLE2 this isn't, in practice, going 
to be reached.  I imagine the maximum size for an OLE2 document is somewhat 
lower, although I don't actually know.

Daniel



Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Posted by Avik Sengupta <av...@lab49.com>.
On Thursday 31 January 2008 17:37:02 Nick Burch wrote:
> On Tue, 29 Jan 2008, Daniel Noll wrote:
> >> Is your formula related eventusermodel code in a format suitable for

... snip ...

>
> > And as far as POIFS keeping a copy, yes... POIFS is full of issues like
> > that. For instance, even if all you need to read is the CLSID, you still
> > have to read the entire file.  If POIFSFileSystem could construct from a
> > ByteBuffer and not take unnecessary copies, it could speed things up
> > dramatically for that situation... but ultimately that would need to
> > propagate to the whole framework for it to really show benefits.
>
> Do feel free to submit patches for that sort of thing :)
>
> I haven't played with ByteBuffer before, so do feel free to suggest how it
> might help + point at code examples / patches that show it
>

We've looked at an NIO based POIFS earlier, which is not simple (relatively), 
but doable, but doesn't help at all ... as you say, it needs to propagate up 
to HSSF, which will be a significant amount of work....





RE: XLS files with no header

Posted by Marwan Gedeon <ma...@zaradoustra.com>.
I attached the file, which is Excel 2.1. It seems to use the BIFF 2.0 format,
for which no documentation seems to be available anywhere online.
I just need to pull out the data in there, but first POI complains
about the headers. Any way to skip that part and just extract the data
would be awesome.
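
The "read" value in the earlier error message can be turned back into the first eight bytes of the file, which gives a hint about the format. The sketch below uses only the JDK and a hypothetical class name. For 4503629692403721 it yields 09 00 04 00 07 00 10 00, which reads like a record header with id 0x0009 and length 4. To my knowledge 0x0009 is the BOF record id used by BIFF2, which would be consistent with an Excel 2.x file, but treat that interpretation as an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not POI API.
final class SignatureDecoder {
    // Renders a long as the eight little-endian bytes it was read from.
    static String littleEndianHex(long value) {
        byte[] bytes = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

Calling littleEndianHex(-2226271756974174256L) on the "expected" value gives D0 CF 11 E0 A1 B1 1A E1, the usual OLE2 magic bytes.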

-----Original Message-----
From: Nick Burch [mailto:nick@torchbox.com] 
Sent: Tuesday, February 05, 2008 7:29 PM
To: POI Users List
Subject: RE: XLS files with no header

On Tue, 5 Feb 2008, Marwan Gedeon wrote:
> The file I'm unable to read is an excel 2.1 file, which is really old.

Wow, that is old

> But POI as I understand does not support this, any easy way to make it 
> support this format, since this format is still actively used by some 
> carriers for sending invoices to their customers?

Depends what you need to do with the file. Just get some simple numeric 
data out? Get formulas out? Get formatting out?

Many of the more complex records will certainly have changed, but you 
might be able to bodge something to work just with the numeric records. 
Try using the eventusermodel code (it's much simpler), and disable all the 
records in RecordFactory except NumberRecord. If that works, you'll have 
the cell numeric values, and you can add in other records as needed

Nick


RE: XLS files with no header

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 5 Feb 2008, Marwan Gedeon wrote:
> The file I'm unable to read is an excel 2.1 file, which is really old.

Wow, that is old

> But POI as I understand does not support this, any easy way to make it 
> support this format, since this format is still actively used by some 
> carriers for sending invoices to their customers?

Depends what you need to do with the file. Just get some simple numeric 
data out? Get formulas out? Get formatting out?

Many of the more complex records will certainly have changed, but you 
might be able to bodge something to work just with the numeric records. 
Try using the eventusermodel code (it's much simpler), and disable all the 
records in RecordFactory except NumberRecord. If that works, you'll have 
the cell numeric values, and you can add in other records as needed
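
To illustrate the "numeric records only" idea without depending on POI, here is a sketch that walks a raw record stream and keeps just the NUMBER values. Everything in it is an assumption for illustration: the class name is hypothetical, and the record id 0x0203 and payload layout (row, column, format index, then an 8-byte double) are the ones I believe apply to the BIFF5/BIFF8 NUMBER record. An Excel 2.x file may use different ids and layouts, so this would need adjusting for such a file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner, not POI API.
final class NumberRecordScanner {
    static final int NUMBER_SID = 0x0203; // assumed BIFF5/BIFF8 NUMBER id
    static final int VALUE_OFFSET = 6;    // after row, column, format index

    // Walks [id][length][payload] records and collects NUMBER values,
    // skipping every other record type unparsed.
    static List<Double> numbers(ByteBuffer recordStream) {
        ByteBuffer in = recordStream.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        List<Double> values = new ArrayList<>();
        while (in.remaining() >= 4) {
            int sid = in.getShort() & 0xFFFF;
            int length = in.getShort() & 0xFFFF;
            if (length > in.remaining()) {
                break; // truncated record, stop rather than guess
            }
            if (sid == NUMBER_SID && length >= VALUE_OFFSET + 8) {
                values.add(in.getDouble(in.position() + VALUE_OFFSET));
            }
            in.position(in.position() + length);
        }
        return values;
    }
}
```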

Nick



RE: XLS files with no header

Posted by Marwan Gedeon <ma...@zaradoustra.com>.
The file I'm unable to read is an excel 2.1 file, which is really old. I
figured that out after removing the extension, opening it in excel, then
trying to save, and having Excel prompting if I want to save the 2.1 format
or not. 
But POI as I understand does not support this, any easy way to make it
support this format, since this format is still actively used by some
carriers for sending invoices to their customers?

-----Original Message-----
From: Nick Burch [mailto:nick@torchbox.com] 
Sent: Friday, February 01, 2008 3:42 PM
To: POI Users List
Subject: Re: XLS files with no header

On Thu, 31 Jan 2008, Marwan Gedeon wrote:
> I'm running through constraints in the format of an Excel file I have at
> hand, as it's being downloaded from a carrier directly.  My application
> needs to read the excel file as is without preopening in Excel, then convert
> it to CSV. POI fails to open it with the error:
>
> java.io.IOException: Invalid header signature; read 4503629692403721,
> expected -2226271756974174256

This error means that your file isn't a valid OLE2 document

One thing you could try doing is saving the file, and looking at it. 
Perhaps it's not in excel format after all, but really something else?

If it is an excel file, but without the normal OLE2 wrapper (rare and odd, 
but not un-heard of) you'll need to wrap it up as OLE2 before passing to 
HSSF. Check the list archives for the appropriate few lines of POIFS code 
to call.

Nick



Re: XLS files with no header

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 31 Jan 2008, Marwan Gedeon wrote:
> I'm running through constraints in the format of an Excel file I have at
> hand, as it's being downloaded from a carrier directly.  My application
> needs to read the excel file as is without preopening in Excel, then convert
> it to CSV. POI fails to open it with the error:
>
> java.io.IOException: Invalid header signature; read 4503629692403721,
> expected -2226271756974174256

This error means that your file isn't a valid OLE2 document

One thing you could try doing is saving the file, and looking at it. 
Perhaps it's not in excel format after all, but really something else?

If it is an excel file, but without the normal OLE2 wrapper (rare and odd, 
but not un-heard of) you'll need to wrap it up as OLE2 before passing to 
HSSF. Check the list archives for the appropriate few lines of POIFS code 
to call.
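
A quick way to check a file before handing it to POIFS is to compare its first eight bytes against the OLE2 magic number. The sketch below uses only the JDK, and the class and method names are hypothetical. Read as a little-endian long, the magic bytes give -2226271756974174256, the "expected" value in the error above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not POI API.
final class Ole2Signature {
    // The eight magic bytes at the start of an OLE2 compound document.
    static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Interprets the first eight bytes as a little-endian long.
    static long readSignature(byte[] header) {
        return ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    }

    static boolean looksLikeOle2(byte[] header) {
        return header.length >= 8 && readSignature(header) == readSignature(MAGIC);
    }
}
```

A file that fails this check is either not an Excel file at all or a bare workbook stream without the OLE2 container, which are the two cases described above.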

Nick



XLS files with no header

Posted by Marwan Gedeon <ma...@zaradoustra.com>.
I'm running through constraints in the format of an Excel file I have at
hand, as it's being downloaded from a carrier directly.  My application
needs to read the excel file as is without preopening in Excel, then convert
it to CSV. POI fails to open it with the error:

java.io.IOException: Invalid header signature; read 4503629692403721,
expected -2226271756974174256
      at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:100)
      at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:84)
      at com.cme.billtools.ExcelReader.main(ExcelReader.java:36)


I have noticed many threads on the net mentioning that the headers can be
set through the contenttype, but I do not have control over the carrier's
website to do that. 
So my other alternative is to preprocess the Excel file in java to insert
headers, then save it, and reopen it with POI.  However, I do not see any
information on doing that through the API docs. Particularly, I do not know
how to manipulate the different blocks. 
If anyone has some insight on this, it would be greatly appreciated. 

--Marwan





Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 29 Jan 2008, Daniel Noll wrote:
>> Is your formula related eventusermodel code in a format suitable for 
>> contributing back? It'd be handy to be able to put something in svn 
>> that would make dealing with the formula stuff much simpler. I'd be 
>> happy to spend a bit of time tidying it up / writing tests for it, if 
>> you could contribute it?
>
> If I ever figure out how to handle it, I probably would contribute it 
> back because it would mean changes to how shared formulas work.  At the 
> moment as you say, it does require a Workbook.  At the moment I don't 
> have a Workbook to work with.  Maybe I can store off the first however 
> many records and then create the Workbook from those -- I haven't tried 
> so I don't know what happens if you feed in a list of records without 
> the ones which make up the rest of the file.

I think you might be able to get away with that. If not, shout and we can 
tweak things.

If it gets you close, then we should probably come up with something like 
a WorkbookRecordSource interface, which model.Workbook implements. Tweak 
the formula code to use those instead, then it's easier for you to pass in 
the records that matter. Let us know if that looks like being worth doing.


> Memory is indeed cheap, but unless you have the luxury of a 64-bit JVM, 
> there is an upper limit of somewhere around 1.4GB, sometimes less. 
> This would normally be nearly 2GB but Windows allocates some DLLs in 
> weird positions on some systems, and Sun insist on allocating a 
> contiguous block of memory for the heap which sometimes causes a huge 
> unusable memory hole above that.

Have you tried tweaking your windows box to use a 1gb/3gb split, instead 
of the usual 2gb/2gb one? Might help out in the absence of a 64 bit jvm / 
a licence for a non-hobbled 32 bit version of windows.
http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx


> In actual fact for us, something closer to RecordInputStream would be 
> even better, where we can just say nextRecord() and have it return a 
> properly constructed Record.  Then we have control over the loop, which 
> is ideal when you need to return a Reader.

Does the newly added org.apache.poi.hssf.eventusermodel.HSSFRecordStream 
look roughly like what you need? I've converted the existing 
eventusermodel code to use it under the hood, so it ought to behave 
pretty much the same, except with pull instead of push.


> As far as the records keeping a copy, could they not instead keep an 
> offset and a reference to the original buffer?  Then if someone calls a 
> setter, it would need to create a new buffer, set the offset to 0 and 
> copy the data before doing the actual set.

In many cases, they only keep the parsed data in memory, and not the 
source bytes. That's certainly one of the advantages of the (not so) new 
RecordInputStream method

> And as far as POIFS keeping a copy, yes... POIFS is full of issues like 
> that. For instance, even if all you need to read is the CLSID, you still 
> have to read the entire file.  If POIFSFileSystem could construct from a 
> ByteBuffer and not take unnecessary copies, it could speed things up 
> dramatically for that situation... but ultimately that would need to 
> propagate to the whole framework for it to really show benefits.

Do feel free to submit patches for that sort of thing :)

I haven't played with ByteBuffer before, so do feel free to suggest how it 
might help + point at code examples / patches that show it

Nick



Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Posted by Daniel Noll <da...@nuix.com>.
On Friday 25 January 2008 02:38:14 Nick Burch wrote:
> I did a bit, the core of which is now in svn as
> MissingRecordAwareHSSFListener

I discovered that, it's a great help for handling the blank cells, line 
endings and so forth while iterating through the cells.

> Is your formula related eventusermodel code in a format suitable for
> contributing back? It'd be handy to be able to put something in svn that
> would make dealing with the formula stuff much simpler. I'd be happy to
> spend a bit of time tidying it up / writing tests for it, if you could
> contribute it?

If I ever figure out how to handle it, I probably would contribute it back 
because it would mean changes to how shared formulas work.  At the moment as 
you say, it does require a Workbook.  At the moment I don't have a Workbook 
to work with.  Maybe I can store off the first however many records and then 
create the Workbook from those -- I haven't tried so I don't know what 
happens if you feed in a list of records without the ones which make up the 
rest of the file.

> I think there was some talk a few years back, but nothing really came of
> it. The problem is that it'd take a large amount of programmer time, and
> memory seems to be fairly cheap.

Memory is indeed cheap, but unless you have the luxury of a 64-bit JVM, there 
is an upper limit of somewhere around 1.4GB, sometimes less.  This would 
normally be nearly 2GB but Windows allocates some DLLs in weird positions on 
some systems, and Sun insist on allocating a contiguous block of memory for 
the heap which sometimes causes a huge unusable memory hole above that.

"Normal" spreadsheets, where the number of cells isn't excessive, are not 
really a problem.  The problem is where some spreadsheet does have thousands 
of rows and/or dozens of columns.  Usually these will cause an OOME, but 
allocation which gets close to an OOME without causing one is actually more 
dangerous (some other thread suffers, too bad if it's something really 
important.)

> I'm not sure how that'd work though. If we don't hold the contents of the
> records in memory, then how are we going to be able to do anything with
> them? (Maybe I'm missing something in your suggestion though)

To convert an Excel spreadsheet to text (or another format), all you need to 
do is for each cell, store a text version somewhere (in a StringBuilder, in a 
temp file, etc.)  If you don't need to modify a cell then there is no reason 
to have it in memory.

In actual fact for us, something closer to RecordInputStream would be even 
better, where we can just say nextRecord() and have it return a properly 
constructed Record.  Then we have control over the loop, which is ideal when 
you need to return a Reader.
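
To make the pull-style idea concrete, here is a sketch of a reader where the caller owns the loop. It is not POI's RecordInputStream: the names are hypothetical, it returns raw id/payload pairs rather than properly constructed Record objects, and it assumes the stream is a plain sequence of little-endian [id][length][payload] records.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical pull-style reader, not POI API.
final class RawRecordReader {
    static final class RawRecord {
        final int sid;
        final byte[] data;

        RawRecord(int sid, byte[] data) {
            this.sid = sid;
            this.data = data;
        }
    }

    private final ByteBuffer in;

    RawRecordReader(ByteBuffer recordStream) {
        this.in = recordStream.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    // Returns the next record, or null once the stream is exhausted.
    RawRecord nextRecord() {
        if (in.remaining() < 4) {
            return null;
        }
        int sid = in.getShort() & 0xFFFF;
        int length = in.getShort() & 0xFFFF;
        if (length > in.remaining()) {
            throw new IllegalStateException("Truncated record 0x" + Integer.toHexString(sid));
        }
        byte[] data = new byte[length];
        in.get(data);
        return new RawRecord(sid, data);
    }
}
```

A Reader implementation could call nextRecord() only when its own buffer runs dry, which is the control over the loop that a push-style listener does not give.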

> My hunch is that we'll have a peak use of somewhere around 3-5 times the
> size of the excel file in memory, except for very small files. There'll be
> one copy of the file in poifs, another in hssf, then each record will take
> a copy as it parses itself.

There was one 40MB file which hit the 1GB memory limit.  It turns out the file 
had a huge number of cells per row, but opening the file showed most of them 
to be empty (they probably had styles or something on them which prompted 
HSSF to store something about it.)

Underlying issue here is that even if a cell doesn't exist, sometimes there is 
still memory allocated for it.  HSSFRow stores the cells in an array which 
means holes in the middle are still allocated a small amount of space.  And 
every HSSFCell holds references to many things.  All these eat up memory when 
you have a spreadsheet with a huge number of cells.

As far as the records keeping a copy, could they not instead keep an offset 
and a reference to the original buffer?  Then if someone calls a setter, it 
would need to create a new buffer, set the offset to 0 and copy the data 
before doing the actual set.
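
A sketch of that offset-plus-reference idea, with the copy deferred until the first setter call. The class and its accessors are hypothetical, not how POI records are written. Fields are addressed by byte offset within the record and read as little-endian unsigned shorts.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical copy-on-write record, not POI API.
final class CopyOnWriteRecord {
    private ByteBuffer buffer;   // shared file buffer until the first write
    private int offset;          // where this record's bytes start in buffer
    private final int length;
    private boolean privateCopy;

    CopyOnWriteRecord(ByteBuffer shared, int offset, int length) {
        this.buffer = shared.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        this.offset = offset;
        this.length = length;
    }

    int getUShort(int field) {
        return buffer.getShort(offset + field) & 0xFFFF;
    }

    void setUShort(int field, int value) {
        detach();
        buffer.putShort(offset + field, (short) value);
    }

    // Copies this record's bytes into a buffer of its own, once,
    // so that writes never touch the original file data.
    private void detach() {
        if (privateCopy) {
            return;
        }
        ByteBuffer copy = ByteBuffer.allocate(length).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < length; i++) {
            copy.put(i, buffer.get(offset + i));
        }
        buffer = copy;
        offset = 0;
        privateCopy = true;
    }
}
```

Records that are only ever read cost one small object each and no copied bytes.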

And as far as POIFS keeping a copy, yes... POIFS is full of issues like that.  
For instance, even if all you need to read is the CLSID, you still have to 
read the entire file.  If POIFSFileSystem could construct from a ByteBuffer 
and not take unnecessary copies, it could speed things up dramatically for 
that situation... but ultimately that would need to propagate to the whole 
framework for it to really show benefits.

Daniel



Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Posted by Nick Burch <ni...@torchbox.com>.
On Wed, 23 Jan 2008, Daniel Noll wrote:
> I was wondering if anyone had experimented with doing lazy parsing via 
> the eventusermodel interface.  I've had an attempt at it myself but am 
> running into various troubles.

I did a bit, the core of which is now in svn as 
MissingRecordAwareHSSFListener

> The first one which is really problematic is that once I get a 
> FormulaRecord, I can't find a way to convert that into the formula 
> string.  Thankfully getting the value result is relatively simple.

Formulas are surprisingly tricky. They're stored as a series of ptgs, and 
turning them back into strings is quite hard. Then you have the fun of 
shared formulas, so you'll have to track all the formulas to be able to 
resolve those. Comes a point that you're holding so many records that you 
might as well just give in and use usermodel :/


If you have a fairly simple formula, then you can probably turn them into 
strings without needing a hssf.model.Workbook, using 
hssf.model.FormulaParser. However, there are some ptgs that need the 
workbook to turn into strings, so you might have problems with those.

Is your formula related eventusermodel code in a format suitable for 
contributing back? It'd be handy to be able to put something in svn that 
would make dealing with the formula stuff much simpler. I'd be happy to 
spend a bit of time tidying it up / writing tests for it, if you could 
contribute it?


> Have the HSSF developers considered making an API half way between 
> usermodel and eventusermodel, which can return HSSFCell instances one at 
> a time without instantiating the entire spreadsheet?  It would be a 
> really nice thing for saving memory.

I think there was some talk a few years back, but nothing really came of 
it. The problem is that it'd take a large amount of programmer time, and 
memory seems to be fairly cheap.

(From my perspective, I can buy a staggering amount of memory for all my 
production servers for a couple of days billable rate. I suspect that 
that holds for many of the other poi developers, so in the absence of 
external sponsorship, I can't see it being a great priority for anyone. 
Alas I think most of us have larger poi 'itches' than memory)


> (Although an implementation of the records which doesn't create copies 
> of everything in memory would probably solve the memory problems almost 
> as well.)

I'm not sure how that'd work though. If we don't hold the contents of the 
records in memory, then how are we going to be able to do anything with 
them? (Maybe I'm missing something in your suggestion though)


My hunch is that we'll have a peak use of somewhere around 3-5 times the 
size of the excel file in memory, except for very small files. There'll be 
one copy of the file in poifs, another in hssf, then each record will take 
a copy as it parses itself.

Does anyone have a good memory profiling tool? While I can't see us 
re-architecting poi any time soon (unless someone wants to sponsor it...), 
if there are a few quick wins then I'm sure we can sort those. If someone 
could spot where most of the memory does go, or any points in processing 
when we use very large amounts of memory for a short spell, that'd be 
helpful to know

Nick
