You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Ashok Natesan <AN...@ariba.com> on 2004/02/26 19:40:12 UTC

RE: Problem generating a large file when following the POIFSstand ard

I can attest to the fact that POI does not do very well with large XLS file
generation.  As part of a larger server,  I was running out of memory (128M
- but server took about 1/2 of that) when generating XLS files around 20k
lines (with just a couple of cells per line).  You can test this out with a
simple test case that just creates additional lines.  I had to switch back
to generating plain CSVs rather than using POI in my app as a result of
this.

BTW, since Excel is a great environment to do quick and very flexible
analysis, large XLS data sheets/reports is fairly common esp. in many
enterprise apps where the data comes from a database.  So, I think
addressing this memory consumption would add a lot of value to many
real-world usages of POI.

I can think of 2 capabilities that would have been useful for my needs -
they might benefit others and the POI project (I am not very familiar either
with XLS internal formats or the POI internal code structure - so these
suggestions may be infeasible to actually implement).

1) When you are appending to an existing file, provide an appendToEnd
function that will not force re-reading entire file into memory & allow
addition of records at the end.  This would probably help many apps that
generate large data sheets - the typical access pattern here is add to the
end (& not random access)
2) Lazy and smart loading/rewrite of records as asked for by program.

Regards

Ashok Natesan

-----Original Message-----
From: Andrew C. Oliver [mailto:acoliver@apache.org] 
Sent: Thursday, February 26, 2004 10:05 AM
To: POI Users List
Subject: Re: Problem generating a large file when following the
POIFSstandard

It will load very very slowly.  We also don't have support yet for
"relative" rows and cells in HSSF.  Originally when we wrote the first few
iterations we thought "No one will generate a sheet that big".  I'd like to
see some testing done on the maximum valid XLS file, but I've not done it in
awhile.

Also to make that big of a file with POI 2.0 you'll find we'll consume an
oppressive amount of memory.  The garbage collector will run wild.  3.0
should answer some of that once it gets under way.
-- 
Andrew C. Oliver
http://www.superlinksoftware.com/poi.jsp
Custom enhancements and Commercial Implementation for Jakarta POI

http://jakarta.apache.org/poi
For Java and Excel, Got POI?

> From: Jeff Blackwell <jb...@tenacityinc.com>
> Reply-To: "POI Users List" <po...@jakarta.apache.org>
> Date: Thu, 26 Feb 2004 08:48:08 -0800 (PST)
> To: POI Users List <po...@jakarta.apache.org>
> Subject: RE: Problem generating a large file when following the
POIFSstandard
> 
> I haven't used POI very much.  Rather new to the
> POI/HSSF game, but been around Java for a while.  Kais
> seems to mention memory issues with writing large
> files.  Are there any gotcha's that I should be aware
> of?  Anything specific to avoid?  I may have need to
> go into the 20-40 mb range.  I know, it's insane to
> stuff that much rot in a sheet - but it honestly may
> need to be done in this case.  Again, I'm just
> investigating, and need to know what to be aware of,
> before jumping into it with both feet.  Any light you
> can shed on it would be much appreciated...
> 
> Thanks in advance,
> 
> Jeff 
> 
> 
> --- Kais Dukes <kd...@kaisdukes.com> wrote:
>> Dear Michael
>> 
>> I can offer my answers to these questions, with the
>> disclaimer that these
>> are based on my own observations.
>> 
>> From my use of POIFS:
>> 
>> (1) POIFS seems to have no trouble with large files,
>> in terms of XBATs.
>> Memory is another question.
>> (2) You are right that XBAT is not the usual name,
>> in Microsoft speak, and
>> reading standard docs on OLE2, it is the DIF.
>> (3) I have noticed these issues. In the
>> implementations I have written for
>> OLE2 structured storage, it appears that the
>> Microsoft implementation will
>> correctly read your OLE2 files regardless of how you
>> distribute BAT or XBAT
>> blocks (as long as there correctly pointed to).
>> (4) The POIFS implementation doesn't seem to hard
>> code sector sizes, these
>> have been abstacted (e.g. contants). It may be the
>> case that switching these
>> would result in the code still working, although I
>> have not tested.
>> (5) Related to the issue of sector sizes, the next
>> version of Windows
>> (Longhorn) will have much better support sector
>> sizes other than 512 bytes,
>> so I would hope a future POIFS system to deal with
>> this correctly.
>> 
>> -----Original Message-----
>> From: Michael Zalewski
>> [mailto:zalewski@optonline.net]
>> Sent: 26 February 2004 14:43
>> To: POI Users List
>> Subject: RE: Problem generating a large file when
>> following the
>> POIFSstandard
>> 
>> 
>> I don't mean to add fuel to the fire. But I have
>> some questions:
>> 
>> 1. Brandon pointed out that a C++ implementation
>> based on POIFS has trouble
>> writing files larger than 6.8 MB, where XBATs must
>> be created. Do we know
>> that POIFS creates these large files correctly? Has
>> anyone used HSSF to make
>> a XLS larger than 7 MB? How much heap space was
>> required?
>> 
>> 2. For that matter, where does the term XBAT come
>> from? I thought it was
>> called DIF (Double Indirect FAT).
>> 
>> 3. There are some other differences that I notice in
>> the way POIFS handles
>> things versus COMPOBJ.DLL. The most notable is that
>> HSSF does not put
>> something called the 'Storage Class ID' into the
>> root entry or the OLE Doc
>> file header. But Excel doesn't seem to care. POIFS
>> has no support for
>> 'CompObj' stream, which is required for embedding
>> (and is used for Word and
>> Project even when the files are not embedded - but
>> again the application
>> doesn't seem to care if the stream is present or not
>> when the document is
>> read from a file). POIFS cannot create a document
>> stream of 0 length (it's a
>> known bug). And finally, POIFS puts the directory
>> structure in block 0
>> whereas Office always puts the document stream into
>> block 0. None of these
>> differences should account for Brandon's problems (I
>> don't think).
>> 
>> 4. Does POIFS support any block size other than 512
>> bytes? Not sure, but I
>> think Microsoft Map uses a larger sector size. Since
>> POIFSConstants.BIG_BLOCK_SIZE is statically defined
>> (instead of interpreting
>> the value at offset 30 from the beginning of the
>> file. This value is (I
>> guess) ignored by POIFS, but should be interpreted
>> as the base 2 logarithm
>> of the large sector size (so it should be 9 for 512
>> bytes used by Microsoft
>> Office, and I think 10 for 1024 bytes on some files
>> produced by Microsoft
>> Map).
>> 
>> If Brandon's C++ derivative supports the variant
>> block sizes, perhaps he
>> could try using a 1024 block size. Doing this should
>> allow you to go to over
>> 12 MB without the need for XBATs (or DIFs). I would
>> really like to know if
>> that works - but I seriously wonder if he would
>> report back anyway.
>> 
>> Note: I am not at all certain about Microsoft Map.
>> Just had reports of
>> someone using POIFS to read that file, and
>> everything was 'one block off'.
>> When (if) I ever follow up on it, I will report
>> results back to this list
>> (unless I get viciously shouted down).
>> 
>> 
>> 
>> 
> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> poi-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail:
>> poi-user-help@jakarta.apache.org
>> 
>> 
>> 
> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> poi-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail:
>> poi-user-help@jakarta.apache.org
>> 
>> 
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: poi-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-user-help@jakarta.apache.org


Re: Problem generating a large file when following the POIFSstand ard

Posted by "Andrew C. Oliver" <ac...@apache.org>.
> BTW, since Excel is a great environment to do quick and very flexible
> analysis, large XLS data sheets/reports is fairly common esp. in many
> enterprise apps where the data comes from a database.  So, I think
> addressing this memory consumption would add a lot of value to many
> real-world usages of POI.

HSSF 3.0 is still in the early stages and makes great strides towards this
goal.  By my measurements we cut consumption by an average of 2/3 (meaning
only 1/3 of the existing footprint is taken) for large sheets.

> 1) When you are appending to an existing file, provide an appendToEnd
> function that will not force re-reading entire file into memory & allow
> addition of records at the end.  This would probably help many apps that
> generate large data sheets - the typical access pattern here is add to the
> end (& not random access)

The structure of OLE 2 CDF doesn't allow this.  We could do it but Excel
couldn't read it then :-)

> 2) Lazy and smart loading/rewrite of records as asked for by program.

We have some work going on to this effect but it may or may not make 3.0.

Ultimately my goal will be far less to accomplish this use case and more to
accomplish it in a follow up open source project which will use POI (it was
actually in the back of my mind when I founded POI) called Reports On JBoss.
I've not yet begun the discussion on it but a forum exists:
http://jboss.org/index.html?module=bb&op=viewforum&f=187.  Note that this
isn't a "fork" of POI just the reporting project I've been talking about
starting for years that POI will plug into.  I'll continue to be active on
the project (and moreso actually) once development gets under way, but my
requirements will be driven from there.  Anyone who is interested in
scalable analytical reporting for the enterprise is certainly welcome to
join me and help kick off the design phase.

Its still a few months out before I devote my full attention to it due to
the Mail Services project which I'm pretty focused on at the moment.

Thanks,

-Andy

-- 
Andrew C. Oliver
http://www.superlinksoftware.com/poi.jsp
Custom enhancements and Commercial Implementation for Jakarta POI

http://jakarta.apache.org/poi
For Java and Excel, Got POI?

The views expressed in this email are those of the author and are almost
definitely not shared by the Apache Software Foundation, its board or its
general membership.  In fact they probably most definitively disagree with
everything espoused in the above email.

> From: Ashok Natesan <AN...@ariba.com>
> Reply-To: "POI Users List" <po...@jakarta.apache.org>
> Date: Thu, 26 Feb 2004 10:40:12 -0800
> To: 'POI Users List' <po...@jakarta.apache.org>
> Subject: RE: Problem generating a large file when following the POIFSstand ard
> 
> I can attest to the fact that POI does not do very well with large XLS file
> generation.  As part of a larger server,  I was running out of memory (128M
> - but server took about 1/2 of that) when generating XLS files around 20k
> lines (with just a couple of cells per line).  You can test this out with a
> simple test case that just creates additional lines.  I had to switch back
> to generating plain CSVs rather than using POI in my app as a result of
> this.
> 
> BTW, since Excel is a great environment to do quick and very flexible
> analysis, large XLS data sheets/reports is fairly common esp. in many
> enterprise apps where the data comes from a database.  So, I think
> addressing this memory consumption would add a lot of value to many
> real-world usages of POI.
> 
> I can think of 2 capabilities that would have been useful for my needs -
> they might benefit others and the POI project (I am not very familiar either
> with XLS internal formats or the POI internal code structure - so these
> suggestions may be infeasible to actually implement).
> 
> 1) When you are appending to an existing file, provide an appendToEnd
> function that will not force re-reading entire file into memory & allow
> addition of records at the end.  This would probably help many apps that
> generate large data sheets - the typical access pattern here is add to the
> end (& not random access)
> 2) Lazy and smart loading/rewrite of records as asked for by program.
> 
> Regards
> 
> Ashok Natesan
> 
> -----Original Message-----
> From: Andrew C. Oliver [mailto:acoliver@apache.org]
> Sent: Thursday, February 26, 2004 10:05 AM
> To: POI Users List
> Subject: Re: Problem generating a large file when following the
> POIFSstandard
> 
> It will load very very slowly.  We also don't have support yet for
> "relative" rows and cells in HSSF.  Originally when we wrote the first few
> iterations we thought "No one will generate a sheet that big".  I'd like to
> see some testing done on the maximum valid XLS file, but I've not done it in
> awhile.
> 
> Also to make that big of a file with POI 2.0 you'll find we'll consume an
> oppressive amount of memory.  The garbage collector will run wild.  3.0
> should answer some of that once it gets under way.
> -- 
> Andrew C. Oliver
> http://www.superlinksoftware.com/poi.jsp
> Custom enhancements and Commercial Implementation for Jakarta POI
> 
> http://jakarta.apache.org/poi
> For Java and Excel, Got POI?
> 
>> From: Jeff Blackwell <jb...@tenacityinc.com>
>> Reply-To: "POI Users List" <po...@jakarta.apache.org>
>> Date: Thu, 26 Feb 2004 08:48:08 -0800 (PST)
>> To: POI Users List <po...@jakarta.apache.org>
>> Subject: RE: Problem generating a large file when following the
> POIFSstandard
>> 
>> I haven't used POI very much.  Rather new to the
>> POI/HSSF game, but been around Java for a while.  Kais
>> seems to mention memory issues with writing large
>> files.  Are there any gotcha's that I should be aware
>> of?  Anything specific to avoid?  I may have need to
>> go into the 20-40 mb range.  I know, it's insane to
>> stuff that much rot in a sheet - but it honestly may
>> need to be done in this case.  Again, I'm just
>> investigating, and need to know what to be aware of,
>> before jumping into it with both feet.  Any light you
>> can shed on it would be much appreciated...
>> 
>> Thanks in advance,
>> 
>> Jeff 
>> 
>> 
>> --- Kais Dukes <kd...@kaisdukes.com> wrote:
>>> Dear Michael
>>> 
>>> I can offer my answers to these questions, with the
>>> disclaimer that these
>>> are based on my own observations.
>>> 
>>> From my use of POIFS:
>>> 
>>> (1) POIFS seems to have no trouble with large files,
>>> in terms of XBATs.
>>> Memory is another question.
>>> (2) You are right that XBAT is not the usual name,
>>> in Microsoft speak, and
>>> reading standard docs on OLE2, it is the DIF.
>>> (3) I have noticed these issues. In the
>>> implementations I have written for
>>> OLE2 structured storage, it appears that the
>>> Microsoft implementation will
>>> correctly read your OLE2 files regardless of how you
>>> distribute BAT or XBAT
>>> blocks (as long as there correctly pointed to).
>>> (4) The POIFS implementation doesn't seem to hard
>>> code sector sizes, these
>>> have been abstacted (e.g. contants). It may be the
>>> case that switching these
>>> would result in the code still working, although I
>>> have not tested.
>>> (5) Related to the issue of sector sizes, the next
>>> version of Windows
>>> (Longhorn) will have much better support sector
>>> sizes other than 512 bytes,
>>> so I would hope a future POIFS system to deal with
>>> this correctly.
>>> 
>>> -----Original Message-----
>>> From: Michael Zalewski
>>> [mailto:zalewski@optonline.net]
>>> Sent: 26 February 2004 14:43
>>> To: POI Users List
>>> Subject: RE: Problem generating a large file when
>>> following the
>>> POIFSstandard
>>> 
>>> 
>>> I don't mean to add fuel to the fire. But I have
>>> some questions:
>>> 
>>> 1. Brandon pointed out that a C++ implementation
>>> based on POIFS has trouble
>>> writing files larger than 6.8 MB, where XBATs must
>>> be created. Do we know
>>> that POIFS creates these large files correctly? Has
>>> anyone used HSSF to make
>>> a XLS larger than 7 MB? How much heap space was
>>> required?
>>> 
>>> 2. For that matter, where does the term XBAT come
>>> from? I thought it was
>>> called DIF (Double Indirect FAT).
>>> 
>>> 3. There are some other differences that I notice in
>>> the way POIFS handles
>>> things versus COMPOBJ.DLL. The most notable is that
>>> HSSF does not put
>>> something called the 'Storage Class ID' into the
>>> root entry or the OLE Doc
>>> file header. But Excel doesn't seem to care. POIFS
>>> has no support for
>>> 'CompObj' stream, which is required for embedding
>>> (and is used for Word and
>>> Project even when the files are not embedded - but
>>> again the application
>>> doesn't seem to care if the stream is present or not
>>> when the document is
>>> read from a file). POIFS cannot create a document
>>> stream of 0 length (it's a
>>> known bug). And finally, POIFS puts the directory
>>> structure in block 0
>>> whereas Office always puts the document stream into
>>> block 0. None of these
>>> differences should account for Brandon's problems (I
>>> don't think).
>>> 
>>> 4. Does POIFS support any block size other than 512
>>> bytes? Not sure, but I
>>> think Microsoft Map uses a larger sector size. Since
>>> POIFSConstants.BIG_BLOCK_SIZE is statically defined
>>> (instead of interpreting
>>> the value at offset 30 from the beginning of the
>>> file. This value is (I
>>> guess) ignored by POIFS, but should be interpreted
>>> as the base 2 logarithm
>>> of the large sector size (so it should be 9 for 512
>>> bytes used by Microsoft
>>> Office, and I think 10 for 1024 bytes on some files
>>> produced by Microsoft
>>> Map).
>>> 
>>> If Brandon's C++ derivative supports the variant
>>> block sizes, perhaps he
>>> could try using a 1024 block size. Doing this should
>>> allow you to go to over
>>> 12 MB without the need for XBATs (or DIFs). I would
>>> really like to know if
>>> that works - but I seriously wonder if he would
>>> report back anyway.
>>> 
>>> Note: I am not at all certain about Microsoft Map.
>>> Just had reports of
>>> someone using POIFS to read that file, and
>>> everything was 'one block off'.
>>> When (if) I ever follow up on it, I will report
>>> results back to this list
>>> (unless I get viciously shouted down).
>>> 
>>> 
>>> 
>>> 
>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> poi-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail:
>>> poi-user-help@jakarta.apache.org
>>> 
>>> 
>>> 
>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> poi-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail:
>>> poi-user-help@jakarta.apache.org
>>> 
>>> 
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: poi-user-help@jakarta.apache.org
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: poi-user-help@jakarta.apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: poi-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-user-help@jakarta.apache.org