Posted to user@poi.apache.org by Daniel Noll <da...@nuix.com.au> on 2005/06/15 02:01:31 UTC

Reading only the bit of the file which we need

I've just noticed that, even with the "eventfilesystem" framework, POI 
seems to read the entire file anyway, even if I only ask for one 
section.

I got a hint of this because we only read the properties out of the 
files, and the time taken is O(n) in the size of the file.

I tracked this down to POIFSReader.read(InputStream), which creates a 
RawDataBlockList over the input stream, which then reads every chunk of 
the file into a 512-byte RawDataBlock.  After all this, it sets up the 
Block Allocation Table to link the chunks together.

Question 1: what advantages do I still get by using the event filesystem 
if it's going to read the entire file regardless?
Question 2: do other people buffer the input stream which is sent to 
POI?  If it's reading in 512-byte chunks (far lower than the block size 
of the vast majority of storage devices) then I guess it would make 
sense to buffer it in larger chunks on the way in.  Before reading this 
code I was working under the assumption that POI was reading in 
reasonable chunk sizes.
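
To put a rough number on the buffering question, here is a minimal sketch 
that counts how many read calls reach the underlying source when a stream is 
consumed in 512-byte requests, with and without a BufferedInputStream in 
front of it.  CountingInputStream and BlockReadDemo are hypothetical names 
made up for this illustration, the 64 KB buffer size is an arbitrary choice, 
and an in-memory stream stands in for the file.  None of this is POI code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: counts the read calls that reach the wrapped stream. */
class CountingInputStream extends FilterInputStream {
    int reads = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        reads++;
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        reads++;
        return super.read(b, off, len);
    }
}

class BlockReadDemo {
    static final int BLOCK_SIZE = 512;

    /** Consumes the stream in 512-byte requests and returns the number of blocks seen. */
    static int readAllBlocks(InputStream in) throws IOException {
        byte[] block = new byte[BLOCK_SIZE];
        int blocks = 0;
        while (true) {
            int filled = 0;
            while (filled < BLOCK_SIZE) {
                int n = in.read(block, filled, BLOCK_SIZE - filled);
                if (n < 0) break;          // end of stream
                filled += n;
            }
            if (filled == 0) break;        // nothing left to read
            blocks++;
            if (filled < BLOCK_SIZE) break; // short final block
        }
        return blocks;
    }

    /** Returns how many read calls hit the source for a "file" of the given size. */
    static int underlyingReads(int fileSize, boolean buffered) throws IOException {
        // Assumption: an in-memory stream stands in for a FileInputStream.
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[fileSize]));
        InputStream in = buffered ? new BufferedInputStream(source, 64 * 1024) : source;
        readAllBlocks(in);
        return source.reads;
    }

    public static void main(String[] args) throws IOException {
        int size = 1024 * 1024;
        System.out.println("unbuffered reads: " + underlyingReads(size, false));
        System.out.println("buffered reads:   " + underlyingReads(size, true));
    }
}
```

With a real file each unbuffered read would normally be a separate call into 
the operating system, so the gap between the two counts is the part that 
buffering could save.  It does not change the fact that the whole file is 
still read.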

Are there any plans to replace this implementation with something more 
efficient?  Right now I'm thinking that it would make sense to create a 
MappedByteBuffer over the whole file and then create windows over that 
buffer and then glue them back together in the right order... only 
reading the data when it's finally asked for.
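
A minimal sketch of what I mean, using only java.nio (JDK 1.4 and later; the 
syntax below assumes a current JDK).  MappedBlockChain is a hypothetical 
class, not part of POI, and it makes simplifying assumptions: block 0 starts 
at offset 0 (a real OLE2 file has a header first), blocks are 512 bytes, and 
the chain of block indices is handed in by the caller rather than being read 
from the Block Allocation Table.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical sketch: windows over a memory-mapped file, glued together in chain order. */
class MappedBlockChain {
    static final int BLOCK_SIZE = 512;
    private final ByteBuffer mapped;

    MappedBlockChain(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            // Pages are typically loaded by the OS only when first touched.
            mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Returns a window over one block; the bytes are shared with the mapping, not copied. */
    ByteBuffer window(int blockIndex) {
        ByteBuffer view = mapped.duplicate();
        view.position(blockIndex * BLOCK_SIZE);
        view.limit(Math.min(view.capacity(), (blockIndex + 1) * BLOCK_SIZE));
        return view.slice();
    }

    /** Copies the blocks named in the chain, in that order, into one array. */
    byte[] readChain(int[] chain) {
        ByteBuffer out = ByteBuffer.allocate(chain.length * BLOCK_SIZE);
        for (int blockIndex : chain) {
            out.put(window(blockIndex));
        }
        return Arrays.copyOf(out.array(), out.position());
    }

    public static void main(String[] args) throws IOException {
        // Build a four-block sample file where every byte holds its block index.
        Path file = Files.createTempFile("blocks", ".bin");
        byte[] data = new byte[4 * BLOCK_SIZE];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i / BLOCK_SIZE);
        }
        Files.write(file, data);

        byte[] stream = new MappedBlockChain(file).readChain(new int[] {2, 0, 3});
        System.out.println(stream.length + " bytes, block order: "
                + stream[0] + " " + stream[BLOCK_SIZE] + " " + stream[2 * BLOCK_SIZE]);
    }
}
```

Only the blocks named in the chain are touched, so reading one small stream 
out of a large file should not need to pull in the rest of it.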

Daniel

-- 
Daniel Noll

NUIX Pty Ltd
Level 8, 143 York Street, Sydney 2000
Phone: (02) 9283 9010
Fax:   (02) 9283 9020

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/


Re: Reading only the bit of the file which we need

Posted by ac...@apache.org.
Daniel Noll wrote:
> andy@superlinksoftware.com wrote:
> 
>>> Are there any plans to replace this implementation with something 
>>> more efficient?  Right now 
>>> I'm thinking that it would make sense to create a MappedByteBuffer 
>>> over the whole file and then create windows over that buffer and then 
>>> glue them back together in the right order... only reading the data 
>>> when it's finally asked for.
>>>
>>
>> This is under way presently.  Note that it really wasn't possible to 
>> do this in JDK 1.22 (original POI target JVM when we started) with our 
>> use case (streamin->streamout).  Now it is.
> 
> 
> If RandomAccessFile were used in the first place, I wonder whether it 
> would have worked on JDK 1.22 anyway.  RawDataBlock even has a nice 
> framework which would allow the blocks to be cached. :-)
> 
> Daniel
> 

It's not quite that simple...but okay.  An early prototype 
version of POI (that lived on my drive) tried to use the JDK 1.22 
RandomAccessFile...it was too ungodly slow to seek and rewind and stuff 
to even consider.  Moreover, check the "since" on MappedByteBuffer (1.4). 
Anyhow, it wasn't feasible to do this in Java at the time. 
We're working on it now.  So you'll be able to read/write/modify xls 
files without loading them entirely into memory and shrink heap utilization 
by something like a factor of 10...big deal :-)  I figured out how to get 
GCC to dump me out the assembler and tune it by hand so I can link it back 
in.  Now if only Java were open source :-P

-Andy
-- 
Andrew C. Oliver
SuperLink Software, Inc.

Java to Excel using POI
http://www.superlinksoftware.com/services/poi
Commercial support including features added/implemented, bugs fixed.



Re: Reading only the bit of the file which we need

Posted by Daniel Noll <da...@nuix.com.au>.
andy@superlinksoftware.com wrote:

>> Are there any plans to replace this implementation with something 
>> more efficient?  Right now 
>> I'm thinking that it would make sense to create a MappedByteBuffer 
>> over the whole file and then create windows over that buffer and then 
>> glue them back together in the right order... only reading the data 
>> when it's finally asked for.
>>
>
> This is under way presently.  Note that it really wasn't possible to 
> do this in JDK 1.22 (original POI target JVM when we started) with our 
> use case (streamin->streamout).  Now it is.

If RandomAccessFile were used in the first place, I wonder whether it 
would have worked on JDK 1.22 anyway.  RawDataBlock even has a nice 
framework which would allow the blocks to be cached. :-)
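
Roughly what I have in mind, as a sketch only: a RandomAccessFile that seeks 
to a block on demand and keeps recently used blocks in a small cache.  
CachedBlockFile is a hypothetical class, not POI's RawDataBlock, and it 
assumes 512-byte blocks starting at offset 0.  It is written against a 
current JDK for brevity (the LinkedHashMap used for the cache did not exist 
in the old JDKs under discussion), so it says nothing about how fast seeking 
was back then.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: on-demand block reads with a least-recently-used cache. */
class CachedBlockFile implements AutoCloseable {
    static final int BLOCK_SIZE = 512;

    private final RandomAccessFile file;
    private final Map<Integer, byte[]> cache;
    int diskReads = 0; // number of blocks actually fetched from the file

    CachedBlockFile(String path, final int maxCachedBlocks) throws IOException {
        file = new RandomAccessFile(path, "r");
        // An access-ordered LinkedHashMap drops the least recently used block
        // once the cache grows past maxCachedBlocks.
        cache = new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > maxCachedBlocks;
            }
        };
    }

    /** Returns the requested block, reading it from the file only on a cache miss. */
    byte[] block(int index) throws IOException {
        byte[] data = cache.get(index);
        if (data == null) {
            data = new byte[BLOCK_SIZE];
            file.seek((long) index * BLOCK_SIZE);
            file.readFully(data); // throws EOFException if the block is incomplete
            diskReads++;
            cache.put(index, data);
        }
        return data;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

A caller would open it with something like new CachedBlockFile(path, 64) and 
ask for block(n) as the allocation table is followed, so only the blocks 
belonging to the requested section are ever read.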

Daniel

-- 
Daniel Noll

NUIX Pty Ltd
Level 8, 143 York Street, Sydney 2000
Phone: (02) 9283 9010
Fax:   (02) 9283 9020



Re: Reading only the bit of the file which we need

Posted by an...@superlinksoftware.com.
Daniel Noll wrote:
> I've just noticed that, even with the "eventfilesystem" framework, POI 
> seems to read the entire file anyway, even if I only ask for one 
> section.
> 
> I got a hint of this because we only read the properties out of the 
> files, and the time taken is O(n) in the size of the file.
> 
> I tracked this down to POIFSReader.read(InputStream), which creates a 
> RawDataBlockList over the input stream, which then reads every chunk of 
> the file into a 512-byte RawDataBlock.  After all this, it sets up the 
> Block Allocation Table to link the chunks together.
> 
> Question 1: what advantages do I still get by using the event filesystem 
> if it's going to read the entire file regardless?
> Question 2: do other people buffer the input stream which is sent to 
> POI?  If it's reading in 512-byte chunks (far lower than the block size 
> of the vast majority of storage devices) then I guess it would make 
> sense to buffer it in larger chunks on the way in.  Before reading this 
> code I was working under the assumption that POI was reading in 
> reasonable chunk sizes.
> 

The event filesystem still uses far less memory...benchmark it yourself if 
you don't believe it.

> Are there any plans to replace this implementation with something more 
> efficient?  Right now I'm thinking that it would make sense to create a 
> MappedByteBuffer over the whole file and then create windows over that 
> buffer and then glue them back together in the right order... only 
> reading the data when it's finally asked for.
> 

This is under way presently.  Note that it really wasn't possible to do 
this in JDK 1.22 (original POI target JVM when we started) with our use 
case (streamin->streamout).  Now it is.

-Andy

> Daniel
> 

