You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@poi.apache.org by Br...@sungard.com on 2010/06/01 19:04:01 UTC
RE: Performance Question with CTSheetDataImpl.java

Yegor,

Thanks for making the change to the column ordering check as well.
Every little bit helps.


I like your idea about improving the write performance by ensuring that rows are added in the correct order.  But I do have a concern with regard to how that will affect performance when creating a new sheet that has over 65,000 rows.  I think the idea is worth looking at, so long as we can make the row addition less than O(log n) time.  Then I would say that idea might be worth exploring further.


I also agree with you that the manipulation of low level DOM objects is a complexity that may cause long term maint problems, and will gernerally make it harder for POI to upgrade the xmlBeans package in the future.



Bryce Alcock * Performance Architect * SunGard *
Asset Arena * 377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
www.sungard.com/assetarena



-----Original Message-----
From: Yegor Kozlov [mailto:yegor@dinom.ru]
Sent: Sat 5/29/2010 6:21 AM
To: POI Developers List
Subject: Re: Performance Question with CTSheetDataImpl.java
 
On 5/27/10 12:49 AM, Bryce.Alcock@sungard.com wrote:
> Couple of thoughts:
>
> 1.  I did a test with "real" data in a controlled environment, and
> Identified that in the real world condition this change did help
> significantly for large sheets that did not require re-ordering.  12 Min
> processing time dropped to 4ish Minutes.
>
>    

I made a similar improvement in XSSFRow#onDocumentWrite(), this method 
always re-assigned the array of CTCell even if the cells were ordered.
The performance boost is not so big as we got from the optimization of 
XSSFSheet#write. The number of cells is usually small and the call of 
CTRow#setCArray is not that expensive, the improvement should be 
significant with 'wide' sheets having more than 100 columns.

The fix was committed in r949375
> 2.  I have been giving it a bit more thought, and would like to through
> the idea of "block" re-order comparison for a further improvements where
> the data does not match.  The basic idea is that because the rows are
> maintained in a linked list in xml beans, can we on the poi side of
> things look at only comparing blocks of data and doing some of the list
> manipulation as needed, but skip blocks where we know or can predict
> that the underlying list is ordered correctly.
>
> I am just floating this idea to the list.  I will not be able to dig
> into it for a bit...
> I would love to hear thoughts?
>
>    
I don't quite understand. Are you suggesting to manipulate low-level DOM 
objects and swap only those pieces that are not sorted?
In theory you can do that using XmlCursor but the code will be tricky 
and harder to maintain. Also, the performance gain is not evident to me.

An alternative approach is to improve XSSFSheet#createRow so that the 
array of CTRow beans is kept in a sorted state. Current implementation 
appends new rows, this is why we need the trick with re-ordering to 
ensure that the resulting XML is valid. If we do this optimization then 
the re-ordering trick is extra and can be removed. The performance of 
XSSFSheet#createRow can be slower (random insertions in LinkedList!) but 
the performance of XSSFSheet#write will be definitely faster.

Yegor
> Bryce Alcock * Performance Architect * SunGard * Asset Arena *
> 377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
> www.sungard.com/assetarena
>
> Think before you print
>
>
>
> -----Original Message-----
> From: Bryce.Alcock@sungard.com [mailto:Bryce.Alcock@sungard.com]
> Sent: Monday, May 24, 2010 10:06 AM
> To: yegor@dinom.ru; dev@poi.apache.org
> Subject: RE: Performance Question with CTSheetDataImpl.java
>
> The performance improvement is significant for my simple "Laboratory"
> test.
> See attached PNG file showing Yegor's code vs. Poi before the change.
>
>
> Yegor's Code Changes Inplace:
> Note that the AVG Time per Row does not change (O(1) per row leads to
> O(n) over all)
>
> File Name       	Rows	Total 	AVG
> 				Time 	Mills/
> 				Millis	Row	
> ./20000_true.xls  	20000	1964	0.1	
> ./24000_true.xls  	24000	2019	0.08	
> ./28800_true.xls  	28800	2624	0.09	
> ./34560_true.xls  	34560	3012	0.09	
> ./41472_true.xls  	41472	3409	0.08	
> ./49766_true.xls  	49766	4617	0.09	
> ./59719_true.xls  	59719	6162	0.1	
> ./71662_true.xls  	71662	5705	0.08	
> ./85994_true.xls  	85994	7487	0.09	
> ./103192_true.xls 	103192	9060	0.09	
> ./123830_true.xls 	123830	11875	0.1	
> ./148596_true.xls 	148596	13591	0.09	
>
>
> Original Poi (pre row order check)
> Notice that the AVG Mills/Row increases linearly with rows
> (O(n) which leads to O(n^2) as seen in the Total Time)
>
> File Name       	Rows	Total 	AVG
> 				Time 	Mills/
> 				Millis	Row	
> ./20000_true.xls  	20000	4655	0.23
> ./24000_true.xls  	24000	6398	0.27
> ./28800_true.xls  	28800	7626	0.26
> ./34560_true.xls  	34560	10178	0.29
> ./41472_true.xls  	41472	17647	0.43
> ./49766_true.xls  	49766	24645	0.5
> ./59719_true.xls  	59719	38765	0.65
> ./71662_true.xls  	71662	63009	0.88
> ./85994_true.xls  	85994	98448	1.14
> ./103192_true.xls 	103192	144313	1.4
> ./123830_true.xls 	123830	217893	1.76
> ./148596_true.xls 	148596	320606	2.16
>
>
>
> Thanks for your effort's Yegor.
> I will test this in the "wild" with real data soon.
>
>
> Bryce Alcock * Performance Architect * SunGard * Asset Arena * 377 E.
> Butterfield Road Suite 800, Lombard, IL 60148 * Tel 630-986-3006 *
> Confidential Fax 630-515-1908 * www.sungard.com/assetarena
>
>
>
> -----Original Message-----
> From: Yegor Kozlov [mailto:yegor@dinom.ru]
> Sent: Mon 5/24/2010 12:58 AM
> To: POI Developers List
> Cc: Alcock, Bryce
> Subject: Re: Performance Question with CTSheetDataImpl.java
>
> the purpose of calling CTSheetData#setRowArray is to ensure that the
> array of CTRow beans in CTSheetData is ordered. The point is that we
> don't always need this operation, in many cases rows are already ordered
> and re-assigning them to CTSheetData is an unnecessary expensive
> operation.
>
> Compare two use cases:
>
> case 1: rows are written in increasing order
>
> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(1);
> sheet.createRow(2); sheet.createRow(3); workbook.write(out); //no need
> to re-order rows before writing
>
>
> case 2: row are created in random order
>
> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(3);
> sheet.createRow(2); sheet.createRow(1); workbook.write(out); //need to
> re-order rows before writing
>
> In the first case the call of CTSheetData#setRowArray is extra because
> the state of CTSheetData match the logical model. In the second case we
> do need to call CTSheetData#setRowArray to ensure that rows are written
> in ascending order.
>
> So, the performance of XSSFSheet#write depends on how rows were added:
> if rows are added in strict ascending order then XSSFSheet#write is
> fast. If the order of rows is random or rows were shifted then
> XSSFSheet#write involves CTSheetData#setRowArray and becomes more
> expensive.
>
> I committed this improvement in r947542. Please try the latest build
> from trunk.
>
> Below are two my tests, the only difference is that in the first test
> rows are written in ascending order and in the second test in
> descending.
>
>       public static void main(String[] args) throws Exception {
>           long t1 = System.currentTimeMillis();
>
>           XSSFWorkbook wb = new XSSFWorkbook();
>           XSSFSheet sh = wb.createSheet();
>           for (int i = 0; i<  Short.MAX_VALUE; i++) {
>               XSSFRow row = sh.createRow(i);
>               for (int j = 0; j<  5; j++) {
>                   XSSFCell cell = row.createCell(j);
>                   cell.setCellValue(i*j);
>               }
>           }
>           long t2 = System.currentTimeMillis();
>           System.out.println("generate: " + (t2-t1) + " ms");
>
>           FileOutputStream out = new FileOutputStream(new
> File("/temp/rows.xlsx"));
>           wb.write(out);
>           out.close();
>           long t3 = System.currentTimeMillis();
>           System.out.println("write: " + (t3-t2) + " ms");
>       }
>
>
> generate: 5488 ms
> write: 2507 ms
>
>       public static void main(String[] args) throws Exception {
>           long t1 = System.currentTimeMillis();
>
>           XSSFWorkbook wb = new XSSFWorkbook();
>           XSSFSheet sh = wb.createSheet();
>           for (int i = Short.MAX_VALUE; i>  0; i--) {
>               XSSFRow row = sh.createRow(i);
>               for (int j = 0; j<  5; j++) {
>                   XSSFCell cell = row.createCell(j);
>                   cell.setCellValue(i*j);
>               }
>           }
>           long t2 = System.currentTimeMillis();
>           System.out.println("generate: " + (t2-t1) + " ms");
>
>           FileOutputStream out = new FileOutputStream(new
> File("/temp/rows.xlsx"));
>           wb.write(out); //will involve CTSheetData#setRowArray
>           out.close();
>           long t3 = System.currentTimeMillis();
>           System.out.println("write: " + (t3-t2) + " ms");
>       }
>
> generate: 9291 ms
> write: 94888 ms
>
> The first test is almost 40X faster!
>
> Regards,
> Yegor
>
>
>
>    
>> Continuing to chip away at CTSheetDataImpl's performance issues.
>> However, the problematic code is unlikely to be changed, I have a
>>      
> discussion going on the XMLBeans website, and as it turns out that code
> is most likely not going to change, or will not change easily.
>    
>>
>> With that in mind, I started taking a close look at the POI side of
>>      
> things and very specifically the problematic code from a POI
> perspective.
>    
>>
>> At approximately line 2368 of XSSFSheet.java
>>
>> the following line is the suspect line:
>> I have modified the code to break 1 line into 3  the slow line is the
>>      
> last one.
>    
>>           CTRow[] lctRow = new CTRow[rArray.size()];
>> 	lctRow = rArray.toArray(lctRow);
>>           sheetData.setRowArray(lctRow);
>>
>>
>>
>> My Question is:  in looking at XSSFSheet.java
>> are there any assumption that we can make about the state of sheetData
>>      
> (A CTSheetData object)
>    
>> that would allow us to either overwrite the default
>>      
> setRowArray(lctRow) method,
>    
>> or side step this call all together by calling a method that is not
>>      
> generic, but instead much
>    
>> more targeted to the function and process that is currently going on.
>>
>>
>>
>>
>>
>>
>>
>>
>> Here is the full call in context...Take from my modified version of
>>      
> XSSFSheet.java
>    
>>
>>       protected void write(OutputStream out) throws IOException {
>>
>>           if(worksheet.getColsArray().length == 1) {
>>               CTCols col = worksheet.getColsArray(0);
>>               CTCol[] cols = col.getColArray();
>>               if(cols.length == 0) {
>>                   worksheet.setColsArray(null);
>>               }
>>           }
>>
>>           // Now re-generate our CTHyperlinks, if needed
>>           if(hyperlinks.size()>   0) {
>>               if(worksheet.getHyperlinks() == null) {
>>                   worksheet.addNewHyperlinks();
>>               }
>>               CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
>>               for(int i=0; i<ctHls.length; i++) {
>>                   // If our sheet has hyperlinks, have them add
>>                   //  any relationships that they might need
>>                   XSSFHyperlink hyperlink = hyperlinks.get(i);
>>                   hyperlink.generateRelationIfNeeded(getPackagePart());
>>                   // Now grab their underling object
>>                   ctHls[i] = hyperlink.getCTHyperlink();
>>               }
>>               worksheet.getHyperlinks().setHyperlinkArray(ctHls);
>>           }
>>
>>           CTSheetData sheetData = worksheet.getSheetData();
>>           ArrayList<CTRow>   rArray = new ArrayList<CTRow>(rows.size());
>>           for(XSSFRow row : rows.values()){
>>               row.onDocumentWrite();
>>               rArray.add(row.getCTRow());
>>           }
>> 		long startTimeSheetData = System.currentTimeMillis();
>> 		CTRow[] lctRow = new CTRow[rArray.size()];
>> 		lctRow = rArray.toArray(lctRow);
>>           sheetData.setRowArray(lctRow);
>> 		System.out.println("Sheet Time : " +
>>      
> (System.currentTimeMillis() - startTimeSheetData));
>    
>>
>>           XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
>>           xmlOptions.setSaveSyntheticDocumentElement(new
>>      
> QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
>    
>>           Map<String, String>   map = new HashMap<String, String>();
>>           map.put(STRelationshipId.type.getName().getNamespaceURI(),
>>      
> "r");
>    
>>           xmlOptions.setSaveSuggestedPrefixes(map);
>> 		long startTimeWorkSheetSave =
>>      
> System.currentTimeMillis();
>    
>>           worksheet.save(out, xmlOptions);
>> 		System.out.println("Worksheet save Time : " +
>>      
> (System.currentTimeMillis() - startTimeWorkSheetSave));
>    
>>       }
>>
>>
>>
>> Best Regards
>> Bryce Alcock
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
>> For additional commands, e-mail: dev-help@poi.apache.org
>>
>>
>>
>>      
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
>
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org