You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@poi.apache.org by Mike Serra <ms...@iwu.edu> on 2005/03/24 18:11:29 UTC

HSSF cannot open files that contain many strings

Hello again to the POI world,
  I have been having an ongoing problem with HSSF's ability to load an 
.xls file containing
only strings.  A 500kb file filled only with strings will not load, but 
it doesn't throw an exception or run out ram either.  The process sits 
there taking up CPU time and slowly nibbling at system ram, and the file 
might take hours to load (I haven't bothered to wait that long).

In the past, I thought that POI was simply not able to load large files, 
but I have since discovered that it can load enormous files, as long as 
they contain only numeric data.  The strings are the problem.  I would 
be very grateful if anyone has an idea what causes this.

Thank you,
Mike S.

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: Holy COW! It worked (I think)

Posted by Sanjiv Jivan <sa...@gmail.com>.

When is the next release of POI planned. I'd like to pick up this change.

Thanks,
Sanjiv

PS : I have build locally based on the CVS but since our app is going into 
production in a month, I'd would like to be using an official release/patch 
version.

On 3/25/05, Michael Zalewski <za...@optonline.net> wrote:
> 
> When I wrote this comment, I thought after I posted -- This is ridiculous.
> No sane person is going to try to rewrite BinaryTree like I suggested. 
> Such
> a change must surely destroy every other dependant structure, which 
> includes
> the file structure itself. Only a raving lunatic would actually spend time
> on this.
> 
> So I tried :)
> 
> Result was memory requirements decreased by about 1/3. (My file consists 
> of
> only Strings, and the strings are short. I suspect most use cases would 
> not
> see such an improvement in memory requirements).
> 
> However, time to load my 65,000 unique string workbook decreased by a 
> factor
> of of almost 10 (from over 5 minutes to about 30 sec). The strange
> phenomenon with the CPU going idle happened briefly for less than 3 sec, 
> and
> only one time.
> 
> Here is part of the code (to make my idea more clear)
> 
> private static final class Node
> implements Map.Entry
> {
> private Comparable _dataKey; // instead of Comparable[] _data
> private Comparable _dataData;
> private Node _leftKey; // instead of Node[] _left
> private Node _leftData;
> private Node _rightKey; // instead of Node[] _right
> private Node _rightData;
> private Node _parentKey; // instead of Node[] _parent
> private Node _parentData;
> private boolean _blackKey; // instead of Boolean[] _black
> private boolean _blackData;
> private int _hashcode;
> private boolean _calculated_hashcode;
> 
> /**
> * Make a new cell with given key and value, and with null
> * links, and black (true) colors.
> *
> * @param key
> * @param value
> */
> 
> Node(final Comparable key, final Comparable value)
> {
> _dataKey = key; // much shorter ctor
> _dataData = value; // does not create any arrays
> }
> 
> I'll put this into Bugzilla soon to start the discussion. But I should 
> point
> out that I have run practically no tests. Gotta find the tests first.
> 
> Are there any tests?
> 
>

Re: Holy COW! It worked (I think)

Posted by an...@superlinksoftware.com.

I'd like to see a patch version of this that ALSO handles rich string 
format!  We still don't do that properly.
The BinaryTree was actually a stop gap to de-duplicate strings both when 
they weren't really duplicates (thus loosing
rich string formatting) and when they were added (and thus quite 
possibly were dupes).

This looks like good work.

Michael Zalewski wrote:

>When I wrote this comment, I thought after I posted -- This is ridiculous.
>No sane person is going to try to rewrite BinaryTree like I suggested. Such
>a change must surely destroy every other dependant structure, which includes
>the file structure itself. Only a raving lunatic would actually spend time
>on this.
>
>So I tried :)
>
>Result was memory requirements decreased by about 1/3. (My file consists of
>only Strings, and the strings are short. I suspect most use cases would not
>see such an improvement in memory requirements).
>
>However, time to load my 65,000 unique string workbook decreased by a factor
>of of almost 10 (from over 5 minutes to about 30 sec). The strange
>phenomenon with the CPU going idle happened briefly for less than 3 sec, and
>only one time.
>
>Here is part of the code (to make my idea more clear)
>
>    private static final class Node
>        implements Map.Entry
>    {
>        private Comparable   _dataKey; // instead of Comparable[] _data
>        private Comparable   _dataData;
>        private Node         _leftKey; // instead of Node[] _left
>        private Node         _leftData;
>        private Node         _rightKey; // instead of Node[] _right
>        private Node         _rightData;
>        private Node         _parentKey; // instead of Node[] _parent
>        private Node         _parentData;
>        private boolean      _blackKey; // instead of Boolean[] _black
>        private boolean      _blackData;
>        private int          _hashcode;
>        private boolean      _calculated_hashcode;
>
>        /**
>         * Make a new cell with given key and value, and with null
>         * links, and black (true) colors.
>         *
>         * @param key
>         * @param value
>         */
>
>        Node(final Comparable key, final Comparable value)
>        {
>            _dataKey = key;     // much shorter ctor
>            _dataData = value;  // does not create any arrays
>        }
>
>I'll put this into Bugzilla soon to start the discussion. But I should point
>out that I have run practically no tests. Gotta find the tests first.
>
>Are there any tests?
>
>-----Original Message-----
>From: Michael Zalewski [mailto:zalewski@optonline.net] 
>Sent: Friday, March 25, 2005 12:45 PM
>To: 'POI Users List'
>Subject: RE: HSSF cannot open files that contain many strings
>
>My own thought is that there are just too gosh darn many objects. (Gosh darn
>many objects => gosh darn long time to process).
>
>The SST table gets deserialized into a humongous double binary tree
>structure, (org.apache.poi.util.BinaryTree) which is actually indexed by
>both the index of the string and the value of the string. So this means that
>there are at least 10 objects created per String
>
>1) The String structure (type org.apache.poi.hssf.record.UnicodeString)
>2) The String value itself (contained as a field in type UnicodeString)
>3) The Integer value (which indexes the String). It's an Integer object
>instead of a primitive, so it can implement Comparable and be one of the
>keys in the double indexed tree structure
>4) The Node object (of the tree, which has a reference to both the String
>value and the Integer value)
>5) One or more LabelSST records which contain an index into the tree.
>
>If you look inside org.apache.poi.util.BinaryTree, you can see that each
>node of the binary tree (there is one node for each string) contains five
>array objects in addition to the ones I listed above.
>
>This means that my file of 65,000 unique strings will end up creating
>650,000 objects to represent those strings when deserialized. I'm probably
>missing some objects in this analysis, so my guess is that my 65,000 string
>spreadsheet required over a million java objects.
>
>You can get rid of 5 of these objects with a simple refactoring of
>BinaryTree -- replace each of the 5 arrays with 2 fields (replace the 5
>arrays with 10 primitive fields).
>
>
>
>-----Original Message-----
>From: Danny Mui [mailto:danny@muibros.com] 
>Sent: Friday, March 25, 2005 11:50 AM
>To: POI Users List
>Subject: Re: HSSF cannot open files that contain many strings
>
>I'm curious about the CPU utilization issues and why it takes so gosh 
>darn long!  Wonder what a profiler will say about loading a file as 
>you've described.
>
>It shouldn't be too difficult to adjust the way the SST's are 
>written/loaded to validate/invalidate this problem/fix.
>
>Michael Zalewski wrote:
>  
>
>>Ummm...
>>
>>Yes I think I might have identified an issue with POI and a large number
>>    
>>
>of
>  
>
>>strings. And I was looking at it partly in response to Mike's problem.
>>
>>But I don't think the issue I found is the root problem. It might explain
>>why large files generated in POI HSSF would not open correctly in Excel.
>>    
>>
>In
>  
>
>>fact, I couldn't find any problem with the way POI handles things. At this
>>point, I would say that what I have identified is just a difference in the
>>way Excel writes a file with more than 1024 strings, and the way the same
>>file is written from POI.
>>
>>I have tried reading a 3 MB Excel file which contains 65,000 unique
>>    
>>
>strings,
>  
>
>>130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
>>instead of 5 hours). I have a 2 Ghz Pentium laptop, with 1 GB RAM. I did
>>    
>>
>not
>  
>
>>increase the JVM heap size (so it was 128 MB).
>>
>>I did see one thing which I don't understand. I was debugging the
>>application in Eclipse, and many times during the load, the CPU
>>    
>>
>utilization
>  
>
>>went down to nearly zero for several seconds at a time. But after 15 to 30
>>seconds, it would pick up again and run for another 15 to 30 seconds at
>>100%. Toward the end of the run (when HSSFSheet creation was nearly
>>complete), the idle periods got longer. I am certain that the idle
>>    
>>
>intervals
>  
>
>>I observed were when the JVM was garbage collecting. I don't understand
>>    
>>
>why
>  
>
>>Windows showed 0% CPU Utilization during this time.
>>
>>-----Original Message-----
>>From: Danny Mui [mailto:danny@muibros.com] 
>>Sent: Thursday, March 24, 2005 2:27 PM
>>To: POI Users List
>>Subject: Re: HSSF cannot open files that contain many strings
>>
>>Mike Z has identitifed an issue with HSSF handling a bunch of unique 
>>strings (dev list).  Once that is taken care of, I have a suspicion your 
>>issue will be addressed as well.
>>
>>Can you go into bugzilla and provide your excel file as a validation 
>>point as well? I can't find an existing bug with this issue so it would 
>>help facilitate testing once the coding is complete.
>>
>>As for timeframe, I'll dedicate sometime in April and May as I'll be 
>>trekking around Europe and need something to do while sipping coffee ;D
>>
>>
>>
>>Mike Serra wrote:
>>
>>    
>>
>>>Hello again to the POI world,
>>>I have been having an ongoing problem with HSSF's ability to load an 
>>>.xls file containing
>>>only strings.  A 500kb file filled only with strings will not load, but 
>>>it doesn't throw an exception or run out ram either.  The process sits 
>>>there taking up CPU time and slowly nibbling at system ram, and the file 
>>>might take hours to load (I haven't bothered to wait that long).
>>>
>>>In the past, I thought that POI was simply not able to load large files, 
>>>but I have since discovered that it can load enormous files, as long as 
>>>they contain only numeric data.  The strings are the problem.  I would 
>>>be very grateful if anyone has an idea what causes this.
>>>
>>>Thank you,
>>>Mike S.
>>>
>>>      
>>>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
>Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
>The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
>Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
>The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
>.
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: Holy COW! It worked (I think)

Posted by an...@superlinksoftware.com.

hehe, Mike if you want to go at it take a look at the performance branch 
before we reverted it where I ditched all NumericRecords, LabelSST and
String based records for parallel object arrays.  I just never could 
finish it because at the time the other features were too much of a 
moving target.  At
that time I decresed memory requirements of a mostly numeric file by 
about 1/3.

-Andy

Glen Stampoultzis wrote:

>
> Nice one.  You clearly know the Excel file format very well.   There 
> are lots of places where we create too many objects.  Looks like 
> that's one of them.
>
> There are some tests around string handling and SST but it would be 
> nice to have some for the new code you've written.
>
> Some of our existing tests against head are unfortunately still broken.
>
> Regards,
>
> Glen
>
>
> Michael Zalewski wrote:
>
>> When I wrote this comment, I thought after I posted -- This is 
>> ridiculous.
>> No sane person is going to try to rewrite BinaryTree like I 
>> suggested. Such
>> a change must surely destroy every other dependant structure, which 
>> includes
>> the file structure itself. Only a raving lunatic would actually spend 
>> time
>> on this.
>>
>> So I tried :)
>>
>> Result was memory requirements decreased by about 1/3. (My file 
>> consists of
>> only Strings, and the strings are short. I suspect most use cases 
>> would not
>> see such an improvement in memory requirements).
>>
>> However, time to load my 65,000 unique string workbook decreased by a 
>> factor
>> of of almost 10 (from over 5 minutes to about 30 sec). The strange
>> phenomenon with the CPU going idle happened briefly for less than 3 
>> sec, and
>> only one time.
>>
>> Here is part of the code (to make my idea more clear)
>>
>>    private static final class Node
>>        implements Map.Entry
>>    {
>>        private Comparable   _dataKey; // instead of Comparable[] _data
>>        private Comparable   _dataData;
>>        private Node         _leftKey; // instead of Node[] _left
>>        private Node         _leftData;
>>        private Node         _rightKey; // instead of Node[] _right
>>        private Node         _rightData;
>>        private Node         _parentKey; // instead of Node[] _parent
>>        private Node         _parentData;
>>        private boolean      _blackKey; // instead of Boolean[] _black
>>        private boolean      _blackData;
>>        private int          _hashcode;
>>        private boolean      _calculated_hashcode;
>>
>>        /**
>>         * Make a new cell with given key and value, and with null
>>         * links, and black (true) colors.
>>         *
>>         * @param key
>>         * @param value
>>         */
>>
>>        Node(final Comparable key, final Comparable value)
>>        {
>>            _dataKey = key;     // much shorter ctor
>>            _dataData = value;  // does not create any arrays
>>        }
>>
>> I'll put this into Bugzilla soon to start the discussion. But I 
>> should point
>> out that I have run practically no tests. Gotta find the tests first.
>>
>> Are there any tests?
>>
>> -----Original Message-----
>> From: Michael Zalewski [mailto:zalewski@optonline.net] Sent: Friday, 
>> March 25, 2005 12:45 PM
>> To: 'POI Users List'
>> Subject: RE: HSSF cannot open files that contain many strings
>>
>> My own thought is that there are just too gosh darn many objects. 
>> (Gosh darn
>> many objects => gosh darn long time to process).
>>
>> The SST table gets deserialized into a humongous double binary tree
>> structure, (org.apache.poi.util.BinaryTree) which is actually indexed by
>> both the index of the string and the value of the string. So this 
>> means that
>> there are at least 10 objects created per String
>>
>> 1) The String structure (type org.apache.poi.hssf.record.UnicodeString)
>> 2) The String value itself (contained as a field in type UnicodeString)
>> 3) The Integer value (which indexes the String). It's an Integer object
>> instead of a primitive, so it can implement Comparable and be one of the
>> keys in the double indexed tree structure
>> 4) The Node object (of the tree, which has a reference to both the 
>> String
>> value and the Integer value)
>> 5) One or more LabelSST records which contain an index into the tree.
>>
>> If you look inside org.apache.poi.util.BinaryTree, you can see that each
>> node of the binary tree (there is one node for each string) contains 
>> five
>> array objects in addition to the ones I listed above.
>>
>> This means that my file of 65,000 unique strings will end up creating
>> 650,000 objects to represent those strings when deserialized. I'm 
>> probably
>> missing some objects in this analysis, so my guess is that my 65,000 
>> string
>> spreadsheet required over a million java objects.
>>
>> You can get rid of 5 of these objects with a simple refactoring of
>> BinaryTree -- replace each of the 5 arrays with 2 fields (replace the 5
>> arrays with 10 primitive fields).
>>
>>
>>
>> -----Original Message-----
>> From: Danny Mui [mailto:danny@muibros.com] Sent: Friday, March 25, 
>> 2005 11:50 AM
>> To: POI Users List
>> Subject: Re: HSSF cannot open files that contain many strings
>>
>> I'm curious about the CPU utilization issues and why it takes so gosh 
>> darn long!  Wonder what a profiler will say about loading a file as 
>> you've described.
>>
>> It shouldn't be too difficult to adjust the way the SST's are 
>> written/loaded to validate/invalidate this problem/fix.
>>
>> Michael Zalewski wrote:
>>  
>>
>>> Ummm...
>>>
>>> Yes I think I might have identified an issue with POI and a large 
>>> number
>>>   
>>
>> of
>>  
>>
>>> strings. And I was looking at it partly in response to Mike's problem.
>>>
>>> But I don't think the issue I found is the root problem. It might 
>>> explain
>>> why large files generated in POI HSSF would not open correctly in 
>>> Excel.
>>>   
>>
>> In
>>  
>>
>>> fact, I couldn't find any problem with the way POI handles things. 
>>> At this
>>> point, I would say that what I have identified is just a difference 
>>> in the
>>> way Excel writes a file with more than 1024 strings, and the way the 
>>> same
>>> file is written from POI.
>>>
>>> I have tried reading a 3 MB Excel file which contains 65,000 unique
>>>   
>>
>> strings,
>>  
>>
>>> 130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
>>> instead of 5 hours). I have a 2 Ghz Pentium laptop, with 1 GB RAM. I 
>>> did
>>>   
>>
>> not
>>  
>>
>>> increase the JVM heap size (so it was 128 MB).
>>>
>>> I did see one thing which I don't understand. I was debugging the
>>> application in Eclipse, and many times during the load, the CPU
>>>   
>>
>> utilization
>>  
>>
>>> went down to nearly zero for several seconds at a time. But after 15 
>>> to 30
>>> seconds, it would pick up again and run for another 15 to 30 seconds at
>>> 100%. Toward the end of the run (when HSSFSheet creation was nearly
>>> complete), the idle periods got longer. I am certain that the idle
>>>   
>>
>> intervals
>>  
>>
>>> I observed were when the JVM was garbage collecting. I don't understand
>>>   
>>
>> why
>>  
>>
>>> Windows showed 0% CPU Utilization during this time.
>>>
>>> -----Original Message-----
>>> From: Danny Mui [mailto:danny@muibros.com] Sent: Thursday, March 24, 
>>> 2005 2:27 PM
>>> To: POI Users List
>>> Subject: Re: HSSF cannot open files that contain many strings
>>>
>>> Mike Z has identitifed an issue with HSSF handling a bunch of unique 
>>> strings (dev list).  Once that is taken care of, I have a suspicion 
>>> your issue will be addressed as well.
>>>
>>> Can you go into bugzilla and provide your excel file as a validation 
>>> point as well? I can't find an existing bug with this issue so it 
>>> would help facilitate testing once the coding is complete.
>>>
>>> As for timeframe, I'll dedicate sometime in April and May as I'll be 
>>> trekking around Europe and need something to do while sipping coffee ;D
>>>
>>>
>>>
>>> Mike Serra wrote:
>>>
>>>   
>>>
>>>> Hello again to the POI world,
>>>> I have been having an ongoing problem with HSSF's ability to load 
>>>> an .xls file containing
>>>> only strings.  A 500kb file filled only with strings will not load, 
>>>> but it doesn't throw an exception or run out ram either.  The 
>>>> process sits there taking up CPU time and slowly nibbling at system 
>>>> ram, and the file might take hours to load (I haven't bothered to 
>>>> wait that long).
>>>>
>>>> In the past, I thought that POI was simply not able to load large 
>>>> files, but I have since discovered that it can load enormous files, 
>>>> as long as they contain only numeric data.  The strings are the 
>>>> problem.  I would be very grateful if anyone has an idea what 
>>>> causes this.
>>>>
>>>> Thank you,
>>>> Mike S.
>>>>
>>>>     
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
>> Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
>> The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
>> Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
>> The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
>>
>>
>>  
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
> The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
> .
>


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: Holy COW! It worked (I think)

Posted by Glen Stampoultzis <gs...@iinet.net.au>.

Nice one.  You clearly know the Excel file format very well.   There are 
lots of places where we create too many objects.  Looks like that's one 
of them.

There are some tests around string handling and SST but it would be nice 
to have some for the new code you've written.

Some of our existing tests against head are unfortunately still broken.

Regards,

Glen


Michael Zalewski wrote:

>When I wrote this comment, I thought after I posted -- This is ridiculous.
>No sane person is going to try to rewrite BinaryTree like I suggested. Such
>a change must surely destroy every other dependant structure, which includes
>the file structure itself. Only a raving lunatic would actually spend time
>on this.
>
>So I tried :)
>
>Result was memory requirements decreased by about 1/3. (My file consists of
>only Strings, and the strings are short. I suspect most use cases would not
>see such an improvement in memory requirements).
>
>However, time to load my 65,000 unique string workbook decreased by a factor
>of of almost 10 (from over 5 minutes to about 30 sec). The strange
>phenomenon with the CPU going idle happened briefly for less than 3 sec, and
>only one time.
>
>Here is part of the code (to make my idea more clear)
>
>    private static final class Node
>        implements Map.Entry
>    {
>        private Comparable   _dataKey; // instead of Comparable[] _data
>        private Comparable   _dataData;
>        private Node         _leftKey; // instead of Node[] _left
>        private Node         _leftData;
>        private Node         _rightKey; // instead of Node[] _right
>        private Node         _rightData;
>        private Node         _parentKey; // instead of Node[] _parent
>        private Node         _parentData;
>        private boolean      _blackKey; // instead of Boolean[] _black
>        private boolean      _blackData;
>        private int          _hashcode;
>        private boolean      _calculated_hashcode;
>
>        /**
>         * Make a new cell with given key and value, and with null
>         * links, and black (true) colors.
>         *
>         * @param key
>         * @param value
>         */
>
>        Node(final Comparable key, final Comparable value)
>        {
>            _dataKey = key;     // much shorter ctor
>            _dataData = value;  // does not create any arrays
>        }
>
>I'll put this into Bugzilla soon to start the discussion. But I should point
>out that I have run practically no tests. Gotta find the tests first.
>
>Are there any tests?
>
>-----Original Message-----
>From: Michael Zalewski [mailto:zalewski@optonline.net] 
>Sent: Friday, March 25, 2005 12:45 PM
>To: 'POI Users List'
>Subject: RE: HSSF cannot open files that contain many strings
>
>My own thought is that there are just too gosh darn many objects. (Gosh darn
>many objects => gosh darn long time to process).
>
>The SST table gets deserialized into a humongous double binary tree
>structure, (org.apache.poi.util.BinaryTree) which is actually indexed by
>both the index of the string and the value of the string. So this means that
>there are at least 10 objects created per String
>
>1) The String structure (type org.apache.poi.hssf.record.UnicodeString)
>2) The String value itself (contained as a field in type UnicodeString)
>3) The Integer value (which indexes the String). It's an Integer object
>instead of a primitive, so it can implement Comparable and be one of the
>keys in the double indexed tree structure
>4) The Node object (of the tree, which has a reference to both the String
>value and the Integer value)
>5) One or more LabelSST records which contain an index into the tree.
>
>If you look inside org.apache.poi.util.BinaryTree, you can see that each
>node of the binary tree (there is one node for each string) contains five
>array objects in addition to the ones I listed above.
>
>This means that my file of 65,000 unique strings will end up creating
>650,000 objects to represent those strings when deserialized. I'm probably
>missing some objects in this analysis, so my guess is that my 65,000 string
>spreadsheet required over a million java objects.
>
>You can get rid of 5 of these objects with a simple refactoring of
>BinaryTree -- replace each of the 5 arrays with 2 fields (replace the 5
>arrays with 10 primitive fields).
>
>
>
>-----Original Message-----
>From: Danny Mui [mailto:danny@muibros.com] 
>Sent: Friday, March 25, 2005 11:50 AM
>To: POI Users List
>Subject: Re: HSSF cannot open files that contain many strings
>
>I'm curious about the CPU utilization issues and why it takes so gosh 
>darn long!  Wonder what a profiler will say about loading a file as 
>you've described.
>
>It shouldn't be too difficult to adjust the way the SST's are 
>written/loaded to validate/invalidate this problem/fix.
>
>Michael Zalewski wrote:
>  
>
>>Ummm...
>>
>>Yes I think I might have identified an issue with POI and a large number
>>    
>>
>of
>  
>
>>strings. And I was looking at it partly in response to Mike's problem.
>>
>>But I don't think the issue I found is the root problem. It might explain
>>why large files generated in POI HSSF would not open correctly in Excel.
>>    
>>
>In
>  
>
>>fact, I couldn't find any problem with the way POI handles things. At this
>>point, I would say that what I have identified is just a difference in the
>>way Excel writes a file with more than 1024 strings, and the way the same
>>file is written from POI.
>>
>>I have tried reading a 3 MB Excel file which contains 65,000 unique
>>    
>>
>strings,
>  
>
>>130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
>>instead of 5 hours). I have a 2 Ghz Pentium laptop, with 1 GB RAM. I did
>>    
>>
>not
>  
>
>>increase the JVM heap size (so it was 128 MB).
>>
>>I did see one thing which I don't understand. I was debugging the
>>application in Eclipse, and many times during the load, the CPU
>>    
>>
>utilization
>  
>
>>went down to nearly zero for several seconds at a time. But after 15 to 30
>>seconds, it would pick up again and run for another 15 to 30 seconds at
>>100%. Toward the end of the run (when HSSFSheet creation was nearly
>>complete), the idle periods got longer. I am certain that the idle
>>    
>>
>intervals
>  
>
>>I observed were when the JVM was garbage collecting. I don't understand
>>    
>>
>why
>  
>
>>Windows showed 0% CPU Utilization during this time.
>>
>>-----Original Message-----
>>From: Danny Mui [mailto:danny@muibros.com] 
>>Sent: Thursday, March 24, 2005 2:27 PM
>>To: POI Users List
>>Subject: Re: HSSF cannot open files that contain many strings
>>
>>Mike Z has identitifed an issue with HSSF handling a bunch of unique 
>>strings (dev list).  Once that is taken care of, I have a suspicion your 
>>issue will be addressed as well.
>>
>>Can you go into bugzilla and provide your excel file as a validation 
>>point as well? I can't find an existing bug with this issue so it would 
>>help facilitate testing once the coding is complete.
>>
>>As for timeframe, I'll dedicate sometime in April and May as I'll be 
>>trekking around Europe and need something to do while sipping coffee ;D
>>
>>
>>
>>Mike Serra wrote:
>>
>>    
>>
>>>Hello again to the POI world,
>>>I have been having an ongoing problem with HSSF's ability to load an 
>>>.xls file containing
>>>only strings.  A 500kb file filled only with strings will not load, but 
>>>it doesn't throw an exception or run out ram either.  The process sits 
>>>there taking up CPU time and slowly nibbling at system ram, and the file 
>>>might take hours to load (I haven't bothered to wait that long).
>>>
>>>In the past, I thought that POI was simply not able to load large files, 
>>>but I have since discovered that it can load enormous files, as long as 
>>>they contain only numeric data.  The strings are the problem.  I would 
>>>be very grateful if anyone has an idea what causes this.
>>>
>>>Thank you,
>>>Mike S.
>>>
>>>      
>>>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
>Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
>The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
>Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
>The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
>
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Holy COW! It worked (I think)

Posted by Michael Zalewski <za...@optonline.net>.

When I wrote this comment, I thought after I posted -- This is ridiculous.
No sane person is going to try to rewrite BinaryTree like I suggested. Such
a change must surely destroy every other dependant structure, which includes
the file structure itself. Only a raving lunatic would actually spend time
on this.

So I tried :)

Result was memory requirements decreased by about 1/3. (My file consists of
only Strings, and the strings are short. I suspect most use cases would not
see such an improvement in memory requirements).

However, time to load my 65,000 unique string workbook decreased by a factor
of of almost 10 (from over 5 minutes to about 30 sec). The strange
phenomenon with the CPU going idle happened briefly for less than 3 sec, and
only one time.

Here is part of the code (to make my idea more clear)

    private static final class Node
        implements Map.Entry
    {
        private Comparable   _dataKey; // instead of Comparable[] _data
        private Comparable   _dataData;
        private Node         _leftKey; // instead of Node[] _left
        private Node         _leftData;
        private Node         _rightKey; // instead of Node[] _right
        private Node         _rightData;
        private Node         _parentKey; // instead of Node[] _parent
        private Node         _parentData;
        private boolean      _blackKey; // instead of Boolean[] _black
        private boolean      _blackData;
        private int          _hashcode;
        private boolean      _calculated_hashcode;

        /**
         * Make a new cell with given key and value, and with null
         * links, and black (true) colors.
         *
         * @param key
         * @param value
         */

        Node(final Comparable key, final Comparable value)
        {
            _dataKey = key;     // much shorter ctor
            _dataData = value;  // does not create any arrays
        }

I'll put this into Bugzilla soon to start the discussion. But I should point
out that I have run practically no tests. Gotta find the tests first.

Are there any tests?

-----Original Message-----
From: Michael Zalewski [mailto:zalewski@optonline.net] 
Sent: Friday, March 25, 2005 12:45 PM
To: 'POI Users List'
Subject: RE: HSSF cannot open files that contain many strings

My own thought is that there are just too gosh darn many objects. (Gosh darn
many objects => gosh darn long time to process).

The SST table gets deserialized into a humongous double binary tree
structure, (org.apache.poi.util.BinaryTree) which is actually indexed by
both the index of the string and the value of the string. So this means that
there are at least 10 objects created per String

1) The String structure (type org.apache.poi.hssf.record.UnicodeString)
2) The String value itself (contained as a field in type UnicodeString)
3) The Integer value (which indexes the String). It's an Integer object
instead of a primitive, so it can implement Comparable and be one of the
keys in the double indexed tree structure
4) The Node object (of the tree, which has a reference to both the String
value and the Integer value)
5) One or more LabelSST records which contain an index into the tree.

If you look inside org.apache.poi.util.BinaryTree, you can see that each
node of the binary tree (there is one node for each string) contains five
array objects in addition to the ones I listed above.

This means that my file of 65,000 unique strings will end up creating
650,000 objects to represent those strings when deserialized. I'm probably
missing some objects in this analysis, so my guess is that my 65,000 string
spreadsheet required over a million java objects.

You can get rid of 5 of these objects with a simple refactoring of
BinaryTree -- replace each of the 5 arrays with 2 fields (replace the 5
arrays with 10 primitive fields).

-----Original Message-----
From: Danny Mui [mailto:danny@muibros.com] 
Sent: Friday, March 25, 2005 11:50 AM
To: POI Users List
Subject: Re: HSSF cannot open files that contain many strings

I'm curious about the CPU utilization issues and why it takes so gosh 
darn long!  Wonder what a profiler will say about loading a file as 
you've described.

It shouldn't be too difficult to adjust the way the SST's are 
written/loaded to validate/invalidate this problem/fix.

Michael Zalewski wrote:
> Ummm...
> 
> Yes I think I might have identified an issue with POI and a large number
of
> strings. And I was looking at it partly in response to Mike's problem.
> 
> But I don't think the issue I found is the root problem. It might explain
> why large files generated in POI HSSF would not open correctly in Excel.
In
> fact, I couldn't find any problem with the way POI handles things. At this
> point, I would say that what I have identified is just a difference in the
> way Excel writes a file with more than 1024 strings, and the way the same
> file is written from POI.
> 
> I have tried reading a 3 MB Excel file which contains 65,000 unique
strings,
> 130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
> instead of 5 hours). I have a 2 Ghz Pentium laptop, with 1 GB RAM. I did
not
> increase the JVM heap size (so it was 128 MB).
> 
> I did see one thing which I don't understand. I was debugging the
> application in Eclipse, and many times during the load, the CPU
utilization
> went down to nearly zero for several seconds at a time. But after 15 to 30
> seconds, it would pick up again and run for another 15 to 30 seconds at
> 100%. Toward the end of the run (when HSSFSheet creation was nearly
> complete), the idle periods got longer. I am certain that the idle
intervals
> I observed were when the JVM was garbage collecting. I don't understand
why
> Windows showed 0% CPU Utilization during this time.
> 
> -----Original Message-----
> From: Danny Mui [mailto:danny@muibros.com] 
> Sent: Thursday, March 24, 2005 2:27 PM
> To: POI Users List
> Subject: Re: HSSF cannot open files that contain many strings
> 
> Mike Z has identitifed an issue with HSSF handling a bunch of unique 
> strings (dev list).  Once that is taken care of, I have a suspicion your 
> issue will be addressed as well.
> 
> Can you go into bugzilla and provide your excel file as a validation 
> point as well? I can't find an existing bug with this issue so it would 
> help facilitate testing once the coding is complete.
> 
> As for timeframe, I'll dedicate sometime in April and May as I'll be 
> trekking around Europe and need something to do while sipping coffee ;D
> 
> 
> 
> Mike Serra wrote:
> 
>>Hello again to the POI world,
>> I have been having an ongoing problem with HSSF's ability to load an 
>>.xls file containing
>>only strings.  A 500kb file filled only with strings will not load, but 
>>it doesn't throw an exception or run out ram either.  The process sits 
>>there taking up CPU time and slowly nibbling at system ram, and the file 
>>might take hours to load (I haven't bothered to wait that long).
>>
>>In the past, I thought that POI was simply not able to load large files, 
>>but I have since discovered that it can load enormous files, as long as 
>>they contain only numeric data.  The strings are the problem.  I would 
>>be very grateful if anyone has an idea what causes this.
>>
>>Thank you,
>>Mike S.
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

RE: HSSF cannot open files that contain many strings

Posted by Michael Zalewski <za...@optonline.net>.

My own thought is that there are just too gosh darn many objects. (Gosh darn
many objects => gosh darn long time to process).

The SST table gets deserialized into a humongous double binary tree
structure, (org.apache.poi.util.BinaryTree) which is actually indexed by
both the index of the string and the value of the string. So this means that
there are at least 10 objects created per String

1) The String structure (type org.apache.poi.hssf.record.UnicodeString)
2) The String value itself (contained as a field in type UnicodeString)
3) The Integer value (which indexes the String). It's an Integer object
instead of a primitive, so it can implement Comparable and be one of the
keys in the double indexed tree structure
4) The Node object (of the tree, which has a reference to both the String
value and the Integer value)
5) One or more LabelSST records which contain an index into the tree.

If you look inside org.apache.poi.util.BinaryTree, you can see that each
node of the binary tree (there is one node for each string) contains five
array objects in addition to the ones I listed above.

This means that my file of 65,000 unique strings will end up creating
650,000 objects to represent those strings when deserialized. I'm probably
missing some objects in this analysis, so my guess is that my 65,000 string
spreadsheet required over a million java objects.

You can get rid of 5 of these objects with a simple refactoring of
BinaryTree -- replace each of the 5 arrays with 2 fields (replace the 5
arrays with 10 primitive fields).

-----Original Message-----
From: Danny Mui [mailto:danny@muibros.com] 
Sent: Friday, March 25, 2005 11:50 AM
To: POI Users List
Subject: Re: HSSF cannot open files that contain many strings

I'm curious about the CPU utilization issues and why it takes so gosh 
darn long!  Wonder what a profiler will say about loading a file as 
you've described.

It shouldn't be too difficult to adjust the way the SST's are 
written/loaded to validate/invalidate this problem/fix.

Michael Zalewski wrote:
> Ummm...
> 
> Yes I think I might have identified an issue with POI and a large number
of
> strings. And I was looking at it partly in response to Mike's problem.
> 
> But I don't think the issue I found is the root problem. It might explain
> why large files generated in POI HSSF would not open correctly in Excel.
In
> fact, I couldn't find any problem with the way POI handles things. At this
> point, I would say that what I have identified is just a difference in the
> way Excel writes a file with more than 1024 strings, and the way the same
> file is written from POI.
> 
> I have tried reading a 3 MB Excel file which contains 65,000 unique
strings,
> 130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
> instead of 5 hours). I have a 2 Ghz Pentium laptop, with 1 GB RAM. I did
not
> increase the JVM heap size (so it was 128 MB).
> 
> I did see one thing which I don't understand. I was debugging the
> application in Eclipse, and many times during the load, the CPU
utilization
> went down to nearly zero for several seconds at a time. But after 15 to 30
> seconds, it would pick up again and run for another 15 to 30 seconds at
> 100%. Toward the end of the run (when HSSFSheet creation was nearly
> complete), the idle periods got longer. I am certain that the idle
intervals
> I observed were when the JVM was garbage collecting. I don't understand
why
> Windows showed 0% CPU Utilization during this time.
> 
> -----Original Message-----
> From: Danny Mui [mailto:danny@muibros.com] 
> Sent: Thursday, March 24, 2005 2:27 PM
> To: POI Users List
> Subject: Re: HSSF cannot open files that contain many strings
> 
> Mike Z has identitifed an issue with HSSF handling a bunch of unique 
> strings (dev list).  Once that is taken care of, I have a suspicion your 
> issue will be addressed as well.
> 
> Can you go into bugzilla and provide your excel file as a validation 
> point as well? I can't find an existing bug with this issue so it would 
> help facilitate testing once the coding is complete.
> 
> As for timeframe, I'll dedicate sometime in April and May as I'll be 
> trekking around Europe and need something to do while sipping coffee ;D
> 
> 
> 
> Mike Serra wrote:
> 
>>Hello again to the POI world,
>> I have been having an ongoing problem with HSSF's ability to load an 
>>.xls file containing
>>only strings.  A 500kb file filled only with strings will not load, but 
>>it doesn't throw an exception or run out ram either.  The process sits 
>>there taking up CPU time and slowly nibbling at system ram, and the file 
>>might take hours to load (I haven't bothered to wait that long).
>>
>>In the past, I thought that POI was simply not able to load large files, 
>>but I have since discovered that it can load enormous files, as long as 
>>they contain only numeric data.  The strings are the problem.  I would 
>>be very grateful if anyone has an idea what causes this.
>>
>>Thank you,
>>Mike S.
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: HSSF cannot open files that contain many strings

Posted by Danny Mui <da...@muibros.com>.

I'm curious about the CPU utilization issues and why it takes so gosh 
darn long!  Wonder what a profiler will say about loading a file as 
you've described.

It shouldn't be too difficult to adjust the way the SST's are 
written/loaded to validate/invalidate this problem/fix.

Michael Zalewski wrote:
> Ummm...
> 
> Yes I think I might have identified an issue with POI and a large number of
> strings. And I was looking at it partly in response to Mike's problem.
> 
> But I don't think the issue I found is the root problem. It might explain
> why large files generated in POI HSSF would not open correctly in Excel. In
> fact, I couldn't find any problem with the way POI handles things. At this
> point, I would say that what I have identified is just a difference in the
> way Excel writes a file with more than 1024 strings, and the way the same
> file is written from POI.
> 
> I have tried reading a 3 MB Excel file which contains 65,000 unique strings,
> 130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
> instead of 5 hours). I have a 2 Ghz Pentium laptop, with 1 GB RAM. I did not
> increase the JVM heap size (so it was 128 MB).
> 
> I did see one thing which I don't understand. I was debugging the
> application in Eclipse, and many times during the load, the CPU utilization
> went down to nearly zero for several seconds at a time. But after 15 to 30
> seconds, it would pick up again and run for another 15 to 30 seconds at
> 100%. Toward the end of the run (when HSSFSheet creation was nearly
> complete), the idle periods got longer. I am certain that the idle intervals
> I observed were when the JVM was garbage collecting. I don't understand why
> Windows showed 0% CPU Utilization during this time.
> 
> -----Original Message-----
> From: Danny Mui [mailto:danny@muibros.com] 
> Sent: Thursday, March 24, 2005 2:27 PM
> To: POI Users List
> Subject: Re: HSSF cannot open files that contain many strings
> 
> Mike Z has identitifed an issue with HSSF handling a bunch of unique 
> strings (dev list).  Once that is taken care of, I have a suspicion your 
> issue will be addressed as well.
> 
> Can you go into bugzilla and provide your excel file as a validation 
> point as well? I can't find an existing bug with this issue so it would 
> help facilitate testing once the coding is complete.
> 
> As for timeframe, I'll dedicate sometime in April and May as I'll be 
> trekking around Europe and need something to do while sipping coffee ;D
> 
> 
> 
> Mike Serra wrote:
> 
>>Hello again to the POI world,
>> I have been having an ongoing problem with HSSF's ability to load an 
>>.xls file containing
>>only strings.  A 500kb file filled only with strings will not load, but 
>>it doesn't throw an exception or run out ram either.  The process sits 
>>there taking up CPU time and slowly nibbling at system ram, and the file 
>>might take hours to load (I haven't bothered to wait that long).
>>
>>In the past, I thought that POI was simply not able to load large files, 
>>but I have since discovered that it can load enormous files, as long as 
>>they contain only numeric data.  The strings are the problem.  I would 
>>be very grateful if anyone has an idea what causes this.
>>
>>Thank you,
>>Mike S.
>>
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
> The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

RE: HSSF cannot open files that contain many strings

Posted by Michael Zalewski <za...@optonline.net>.

Ummm...

Yes I think I might have identified an issue with POI and a large number of
strings. And I was looking at it partly in response to Mike's problem.

But I don't think the issue I found is the root problem. It might explain
why large files generated in POI HSSF would not open correctly in Excel. In
fact, I couldn't find any problem with the way POI handles things. At this
point, I would say that what I have identified is just a difference in the
way Excel writes a file with more than 1024 strings, and the way the same
file is written from POI.

I have tried reading a 3 MB Excel file which contains 65,000 unique strings,
130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
instead of 5 hours). I have a 2 Ghz Pentium laptop, with 1 GB RAM. I did not
increase the JVM heap size (so it was 128 MB).

I did see one thing which I don't understand. I was debugging the
application in Eclipse, and many times during the load, the CPU utilization
went down to nearly zero for several seconds at a time. But after 15 to 30
seconds, it would pick up again and run for another 15 to 30 seconds at
100%. Toward the end of the run (when HSSFSheet creation was nearly
complete), the idle periods got longer. I am certain that the idle intervals
I observed were when the JVM was garbage collecting. I don't understand why
Windows showed 0% CPU Utilization during this time.

-----Original Message-----
From: Danny Mui [mailto:danny@muibros.com] 
Sent: Thursday, March 24, 2005 2:27 PM
To: POI Users List
Subject: Re: HSSF cannot open files that contain many strings

Mike Z has identitifed an issue with HSSF handling a bunch of unique 
strings (dev list).  Once that is taken care of, I have a suspicion your 
issue will be addressed as well.

Can you go into bugzilla and provide your excel file as a validation 
point as well? I can't find an existing bug with this issue so it would 
help facilitate testing once the coding is complete.

As for timeframe, I'll dedicate sometime in April and May as I'll be 
trekking around Europe and need something to do while sipping coffee ;D

Mike Serra wrote:
> 
> Hello again to the POI world,
>  I have been having an ongoing problem with HSSF's ability to load an 
> .xls file containing
> only strings.  A 500kb file filled only with strings will not load, but 
> it doesn't throw an exception or run out ram either.  The process sits 
> there taking up CPU time and slowly nibbling at system ram, and the file 
> might take hours to load (I haven't bothered to wait that long).
> 
> In the past, I thought that POI was simply not able to load large files, 
> but I have since discovered that it can load enormous files, as long as 
> they contain only numeric data.  The strings are the problem.  I would 
> be very grateful if anyone has an idea what causes this.
> 
> Thank you,
> Mike S.
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: HSSF cannot open files that contain many strings

Posted by Danny Mui <da...@muibros.com>.

Mike Z has identitifed an issue with HSSF handling a bunch of unique 
strings (dev list).  Once that is taken care of, I have a suspicion your 
issue will be addressed as well.

Can you go into bugzilla and provide your excel file as a validation 
point as well? I can't find an existing bug with this issue so it would 
help facilitate testing once the coding is complete.

As for timeframe, I'll dedicate sometime in April and May as I'll be 
trekking around Europe and need something to do while sipping coffee ;D



Mike Serra wrote:
> 
> Hello again to the POI world,
>  I have been having an ongoing problem with HSSF's ability to load an 
> .xls file containing
> only strings.  A 500kb file filled only with strings will not load, but 
> it doesn't throw an exception or run out ram either.  The process sits 
> there taking up CPU time and slowly nibbling at system ram, and the file 
> might take hours to load (I haven't bothered to wait that long).
> 
> In the past, I thought that POI was simply not able to load large files, 
> but I have since discovered that it can load enormous files, as long as 
> they contain only numeric data.  The strings are the problem.  I would 
> be very grateful if anyone has an idea what causes this.
> 
> Thank you,
> Mike S.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
> The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/