You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by "Murphy, Mark" <mu...@metalexmfg.com> on 2015/09/21 15:12:48 UTC

Fill and Border efficiency for XSSF format

POI 3.12

In looking at the styles.xml file generated by POI, It appears that a lot of borders and fills are created that are not used. Maybe POI needs a method to trim off those unused fills and borders on save. This method would also be able to normalize the fills and borders to remove duplicates and system colors (i.e. fgColor = 64 or bgColor = 65). I notice that Excel does this when it saves a spreadsheet created by POI. Here is an example of the fills, first as created by POI, second as saved by Excel (same spreadsheet):

POI 3.12
       <fills count="9">
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="darkGray" />
              </fill>
              <fill>
                     <patternFill patternType="none">
                           <fgColor indexed="64" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="none">
                           <fgColor indexed="64" />
                           <bgColor indexed="65" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill>
                           <fgColor indexed="64" />
                           <bgColor indexed="65" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="none">
                           <fgColor indexed="44" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="none">
                           <fgColor indexed="44" />
                           <bgColor indexed="65" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="solid">
                           <fgColor indexed="44" />
                           <bgColor indexed="65" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="none">
                           <bgColor indexed="65" />
                     </patternFill>
              </fill>
       </fills>

Excel 2010 Save As (same spreadsheet)
                <fills count="4">
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="gray125" />
              </fill>
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="solid">
                           <fgColor indexed="44" />
                     </patternFill>
              </fill>
       </fills>

Note, not even Excel was perfect as it left a duplicate fill (0, and 2). Borders was even more dramatic POI generated 94, Excel saved 13. And now that I look at it, POI generated 42 styles while Excel kept 23. I understand that you are keeping all the fills borders and styles as they are specified, and re-using them as you get to a final setting, but normalizing and trimming seem to be what you are missing. That can also be applied to fonts and formats.

Some additional thoughts, based on not yet looking at all the code. I can see three steps to collecting all the garbage: Normalization, Duplicate removal, Orphan removal. Normalization would involve setting default values, and removing system colors. On the fill above, the POI generated fill would look like this after normalization:

       <fills count="9">
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="darkGray" />
              </fill>
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="none">
                           <fgColor indexed="44" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="none">
                           <fgColor indexed="44" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="solid">
                           <fgColor indexed="44" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="none" />
              </fill>
       </fills>

Duplicate removal would remove the many duplicates left by normalization. This would have to be carried back to the styles themselves and potentially to individual cells in the spreadsheet that contain fill id's. After duplicate removal you would have something like the following in the fills:

       <fills count="4">
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="darkGray" />
              </fill>
              <fill>
                     <patternFill patternType="none">
                           <fgColor indexed="44" />
                     </patternFill>
              </fill>
              <fill>
                     <patternFill patternType="solid">
                           <fgColor indexed="44" />
                     </patternFill>
              </fill>
       </fills>

And I just happen to know that my spreadsheet never uses fill "none" - 44, so that could d be stripped out during orphan removal to bring the fill to:

       <fills count="3">
              <fill>
                     <patternFill patternType="none" />
              </fill>
              <fill>
                     <patternFill patternType="darkGray" />
              </fill>
              <fill>
                     <patternFill patternType="solid">
                           <fgColor indexed="44" />
                     </patternFill>
              </fill>
       </fills>

I suspect that the same process could be used for borders, fonts, formats and styles where the order of operation would be normalization and duplicate removal for fills, borders, fonts, formats, then styles. This followed by orphan removal for styles then fills, borders, fonts, and formats.

Re: Fill and Border efficiency for XSSF format

Posted by Javen O'Neal <ja...@gmail.com>.
If you're using a HSSFWorkbook, HSSFOptimiser [1] might be helpful.
Unfortunately, a XSSFOptimiser or generic SpreadsheetOptimiser
counterpart hasn't been written yet.

[1] https://poi.apache.org/apidocs/org/apache/poi/hssf/usermodel/HSSFOptimiser.html
Javen O'Neal


On Tue, Sep 22, 2015 at 1:54 PM, Javen O'Neal <ja...@gmail.com> wrote:
> This problem isn't limited to just fill and border cell styles, but
> cell styles in general. I have a similar problem when creating a
> workbook with many (identical) styles
> Along the same note, creating a new workbook, then calling:
>
> for (int i=0; i<1000; i++) {
>   wb.createCellStyle()
> }
>
> And saving the workbook produces a workbook with 1000 styles, even
> though the styles are all duplicates (easy to check) and never used (a
> little more difficult/expensive to check). Opening, saving, and
> closing the 1000 style workbook in Excel will consolidate the styles.
> I believe the style consolidation is done when the file is saved, but
> it's possible it's done when the file is opened in Excel. I'd need to
> write some VBA code to verify that
>
> In my POI application, I want to take a cell and change just the data
> format of the cell style (either 0.00 or 0.000)
> Let's say I'm starting with a workbook where all numbers are formatted
> as 0.00 and I want to format a few columns as 0.000.
> cell.getCellStyle().setDataFormat(format) will change all cells with
> the same style to 0.000, so I'm left with a workbook with all 0.000.
> cell.setCellStyle(cell.getCellStyle().clone().setDataFormat(format))
> will change the data format for just the one cell, but creates a new
> style for every cell.
> What I'm doing is temporarily changing the cell style with the new
> data format, checking if there is another cell style in the styles
> table that matches the modified style, and if so setting the cell's
> style to the matching cell style, otherwise clone the cell style.
> Finally, revert the cell style back to the original data format. It
> looks something like this:
>
> CellStyle originalStyle = cell.getCellStyle();
> DataFormat originalDataFormat = originalStyle.getDataFormat();
> // temporarily modify originalStyle to be the desired style
> originalStyle.setDataFormat(newDataFormat);
> try {
>     // search for existing styles that match the desired style
>     CellStyle matchingStyle = null;
>     for (CellStyle style : stylesSource) {
>         // styles match and do not refer to the same style
>         if (originalStyle.equals(style) && originalStyle != style) {
>             matchingStyle = style;
>             break;
>         }
>     }
>     // only create a new style if there are no styles that match the
> desired style
>     if (matchingStyle != null) {
>         cell.setCellStyle(matchingStyle);
>     } else {
>         cell.setCellStyle(originalStyle.clone());
>     }
> } finally {
>     // revert originalStyle back to what it was before
>     originalStyle.setDataFormat(originalDataFormat);
> }
>
> This consolidation is done when the styles are created rather than
> when the workbook is written to file.
> This seems like something that many users write into their own POI
> applications if they modify cell styles and are concerned about file
> size or memory limitations, but I'm not sure how to standardize this
> as part of the POI API.
>
> I agree that calling a method to consolidate existing cell styles
> should be called explicitly, as it might invalidate any existing
> CellStyle variables the user has access to. Might want to only make
> this change when transforming the in-memory objects into XML to write
> to file, and leave the in-memory objects unmodified as a result of
> this consolidation. Thus, either the workbook holds a
> "consolidateCellStylesOnWrite" state or the write method is passed an
> optional boolean consolidateCellStyles parameter. The former seems
> cleaner, though adds state into the workbook.
>
> I'd be happy to help you write a fix.
>
>
> On Tue, Sep 22, 2015 at 12:24 PM, Dominik Stadler
> <do...@gmx.at> wrote:
>> Hi,
>>
>> I think the main reason this is not done currently is that nobody
>> spent enough time to do this cleanly.
>>
>> Not sure if comparing the single items of CellStyle would be a good
>> idea, as this is likely to change at some point with new features
>> being added, so you would probably compare the XML representation, but
>> this can quickly become a performance problem if done for each
>> save-operation, so I would opt for something like this being an
>> optional operation that the user of POI has to call. Many users care
>> more about runtime of the application than the size of the resulting
>> document!
>>
>> If you are interesting in working on something like this, please
>> create a bug in our bugtracker and post patches there fore
>> review/inclusion.
>>
>> Dominik.
>>
>> On Mon, Sep 21, 2015 at 3:12 PM, Murphy, Mark <mu...@metalexmfg.com> wrote:
>>> POI 3.12
>>>
>>> In looking at the styles.xml file generated by POI, It appears that a lot of borders and fills are created that are not used. Maybe POI needs a method to trim off those unused fills and borders on save. This method would also be able to normalize the fills and borders to remove duplicates and system colors (i.e. fgColor = 64 or bgColor = 65). I notice that Excel does this when it saves a spreadsheet created by POI. Here is an example of the fills, first as created by POI, second as saved by Excel (same spreadsheet):
>>>
>>> POI 3.12
>>>        <fills count="9">
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="darkGray" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none">
>>>                            <fgColor indexed="64" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none">
>>>                            <fgColor indexed="64" />
>>>                            <bgColor indexed="65" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill>
>>>                            <fgColor indexed="64" />
>>>                            <bgColor indexed="65" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none">
>>>                            <fgColor indexed="44" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none">
>>>                            <fgColor indexed="44" />
>>>                            <bgColor indexed="65" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="solid">
>>>                            <fgColor indexed="44" />
>>>                            <bgColor indexed="65" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none">
>>>                            <bgColor indexed="65" />
>>>                      </patternFill>
>>>               </fill>
>>>        </fills>
>>>
>>> Excel 2010 Save As (same spreadsheet)
>>>                 <fills count="4">
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="gray125" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="solid">
>>>                            <fgColor indexed="44" />
>>>                      </patternFill>
>>>               </fill>
>>>        </fills>
>>>
>>> Note, not even Excel was perfect as it left a duplicate fill (0, and 2). Borders was even more dramatic POI generated 94, Excel saved 13. And now that I look at it, POI generated 42 styles while Excel kept 23. I understand that you are keeping all the fills borders and styles as they are specified, and re-using them as you get to a final setting, but normalizing and trimming seem to be what you are missing. That can also be applied to fonts and formats.
>>>
>>> Some additional thoughts, based on not yet looking at all the code. I can see three steps to collecting all the garbage: Normalization, Duplicate removal, Orphan removal. Normalization would involve setting default values, and removing system colors. On the fill above, the POI generated fill would look like this after normalization:
>>>
>>>        <fills count="9">
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="darkGray" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none">
>>>                            <fgColor indexed="44" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none">
>>>                            <fgColor indexed="44" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="solid">
>>>                            <fgColor indexed="44" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>        </fills>
>>>
>>> Duplicate removal would remove the many duplicates left by normalization. This would have to be carried back to the styles themselves and potentially to individual cells in the spreadsheet that contain fill id's. After duplicate removal you would have something like the following in the fills:
>>>
>>>        <fills count="4">
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="darkGray" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="none">
>>>                            <fgColor indexed="44" />
>>>                      </patternFill>
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="solid">
>>>                            <fgColor indexed="44" />
>>>                      </patternFill>
>>>               </fill>
>>>        </fills>
>>>
>>> And I just happen to know that my spreadsheet never uses fill "none" - 44, so that could d be stripped out during orphan removal to bring the fill to:
>>>
>>>        <fills count="3">
>>>               <fill>
>>>                      <patternFill patternType="none" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="darkGray" />
>>>               </fill>
>>>               <fill>
>>>                      <patternFill patternType="solid">
>>>                            <fgColor indexed="44" />
>>>                      </patternFill>
>>>               </fill>
>>>        </fills>
>>>
>>> I suspect that the same process could be used for borders, fonts, formats and styles where the order of operation would be normalization and duplicate removal for fills, borders, fonts, formats, then styles. This followed by orphan removal for styles then fills, borders, fonts, and formats.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>> For additional commands, e-mail: user-help@poi.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Fill and Border efficiency for XSSF format

Posted by Javen O'Neal <ja...@gmail.com>.
This problem isn't limited to just fill and border cell styles, but
cell styles in general. I have a similar problem when creating a
workbook with many (identical) styles
Along the same note, creating a new workbook, then calling:

for (int i=0; i<1000; i++) {
  wb.createCellStyle()
}

And saving the workbook produces a workbook with 1000 styles, even
though the styles are all duplicates (easy to check) and never used (a
little more difficult/expensive to check). Opening, saving, and
closing the 1000 style workbook in Excel will consolidate the styles.
I believe the style consolidation is done when the file is saved, but
it's possible it's done when the file is opened in Excel. I'd need to
write some VBA code to verify that

In my POI application, I want to take a cell and change just the data
format of the cell style (either 0.00 or 0.000)
Let's say I'm starting with a workbook where all numbers are formatted
as 0.00 and I want to format a few columns as 0.000.
cell.getCellStyle().setDataFormat(format) will change all cells with
the same style to 0.000, so I'm left with a workbook with all 0.000.
cell.setCellStyle(cell.getCellStyle().clone().setDataFormat(format))
will change the data format for just the one cell, but creates a new
style for every cell.
What I'm doing is temporarily changing the cell style with the new
data format, checking if there is another cell style in the styles
table that matches the modified style, and if so setting the cell's
style to the matching cell style, otherwise clone the cell style.
Finally, revert the cell style back to the original data format. It
looks something like this:

CellStyle originalStyle = cell.getCellStyle();
DataFormat originalDataFormat = originalStyle.getDataFormat();
// temporarily modify originalStyle to be the desired style
originalStyle.setDataFormat(newDataFormat);
try {
    // search for existing styles that match the desired style
    CellStyle matchingStyle = null;
    for (CellStyle style : stylesSource) {
        // styles match and do not refer to the same style
        if (originalStyle.equals(style) && originalStyle != style) {
            matchingStyle = style;
            break;
        }
    }
    // only create a new style if there are no styles that match the
desired style
    if (matchingStyle != null) {
        cell.setCellStyle(matchingStyle);
    } else {
        cell.setCellStyle(originalStyle.clone());
    }
} finally {
    // revert originalStyle back to what it was before
    originalStyle.setDataFormat(originalDataFormat);
}

This consolidation is done when the styles are created rather than
when the workbook is written to file.
This seems like something that many users write into their own POI
applications if they modify cell styles and are concerned about file
size or memory limitations, but I'm not sure how to standardize this
as part of the POI API.

I agree that calling a method to consolidate existing cell styles
should be called explicitly, as it might invalidate any existing
CellStyle variables the user has access to. Might want to only make
this change when transforming the in-memory objects into XML to write
to file, and leave the in-memory objects unmodified as a result of
this consolidation. Thus, either the workbook holds a
"consolidateCellStylesOnWrite" state or the write method is passed an
optional boolean consolidateCellStyles parameter. The former seems
cleaner, though adds state into the workbook.

I'd be happy to help you write a fix.


On Tue, Sep 22, 2015 at 12:24 PM, Dominik Stadler
<do...@gmx.at> wrote:
> Hi,
>
> I think the main reason this is not done currently is that nobody
> spent enough time to do this cleanly.
>
> Not sure if comparing the single items of CellStyle would be a good
> idea, as this is likely to change at some point with new features
> being added, so you would probably compare the XML representation, but
> this can quickly become a performance problem if done for each
> save-operation, so I would opt for something like this being an
> optional operation that the user of POI has to call. Many users care
> more about runtime of the application than the size of the resulting
> document!
>
> If you are interesting in working on something like this, please
> create a bug in our bugtracker and post patches there fore
> review/inclusion.
>
> Dominik.
>
> On Mon, Sep 21, 2015 at 3:12 PM, Murphy, Mark <mu...@metalexmfg.com> wrote:
>> POI 3.12
>>
>> In looking at the styles.xml file generated by POI, It appears that a lot of borders and fills are created that are not used. Maybe POI needs a method to trim off those unused fills and borders on save. This method would also be able to normalize the fills and borders to remove duplicates and system colors (i.e. fgColor = 64 or bgColor = 65). I notice that Excel does this when it saves a spreadsheet created by POI. Here is an example of the fills, first as created by POI, second as saved by Excel (same spreadsheet):
>>
>> POI 3.12
>>        <fills count="9">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="darkGray" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="64" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="64" />
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill>
>>                            <fgColor indexed="64" />
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>        </fills>
>>
>> Excel 2010 Save As (same spreadsheet)
>>                 <fills count="4">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="gray125" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>        </fills>
>>
>> Note, not even Excel was perfect as it left a duplicate fill (0, and 2). Borders was even more dramatic POI generated 94, Excel saved 13. And now that I look at it, POI generated 42 styles while Excel kept 23. I understand that you are keeping all the fills borders and styles as they are specified, and re-using them as you get to a final setting, but normalizing and trimming seem to be what you are missing. That can also be applied to fonts and formats.
>>
>> Some additional thoughts, based on not yet looking at all the code. I can see three steps to collecting all the garbage: Normalization, Duplicate removal, Orphan removal. Normalization would involve setting default values, and removing system colors. On the fill above, the POI generated fill would look like this after normalization:
>>
>>        <fills count="9">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="darkGray" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>        </fills>
>>
>> Duplicate removal would remove the many duplicates left by normalization. This would have to be carried back to the styles themselves and potentially to individual cells in the spreadsheet that contain fill id's. After duplicate removal you would have something like the following in the fills:
>>
>>        <fills count="4">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="darkGray" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>        </fills>
>>
>> And I just happen to know that my spreadsheet never uses fill "none" - 44, so that could d be stripped out during orphan removal to bring the fill to:
>>
>>        <fills count="3">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="darkGray" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>        </fills>
>>
>> I suspect that the same process could be used for borders, fonts, formats and styles where the order of operation would be normalization and duplicate removal for fills, borders, fonts, formats, then styles. This followed by orphan removal for styles then fills, borders, fonts, and formats.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Fill and Border efficiency for XSSF format

Posted by Dominik Stadler <do...@gmx.at>.
Hi,

I think the main reason this is not done currently is that nobody
spent enough time to do this cleanly.

Not sure if comparing the single items of CellStyle would be a good
idea, as this is likely to change at some point with new features
being added, so you would probably compare the XML representation, but
this can quickly become a performance problem if done for each
save-operation, so I would opt for something like this being an
optional operation that the user of POI has to call. Many users care
more about runtime of the application than the size of the resulting
document!

If you are interesting in working on something like this, please
create a bug in our bugtracker and post patches there fore
review/inclusion.

Dominik.

On Mon, Sep 21, 2015 at 3:12 PM, Murphy, Mark <mu...@metalexmfg.com> wrote:
> POI 3.12
>
> In looking at the styles.xml file generated by POI, It appears that a lot of borders and fills are created that are not used. Maybe POI needs a method to trim off those unused fills and borders on save. This method would also be able to normalize the fills and borders to remove duplicates and system colors (i.e. fgColor = 64 or bgColor = 65). I notice that Excel does this when it saves a spreadsheet created by POI. Here is an example of the fills, first as created by POI, second as saved by Excel (same spreadsheet):
>
> POI 3.12
>        <fills count="9">
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="darkGray" />
>               </fill>
>               <fill>
>                      <patternFill patternType="none">
>                            <fgColor indexed="64" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="none">
>                            <fgColor indexed="64" />
>                            <bgColor indexed="65" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill>
>                            <fgColor indexed="64" />
>                            <bgColor indexed="65" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="none">
>                            <fgColor indexed="44" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="none">
>                            <fgColor indexed="44" />
>                            <bgColor indexed="65" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="solid">
>                            <fgColor indexed="44" />
>                            <bgColor indexed="65" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="none">
>                            <bgColor indexed="65" />
>                      </patternFill>
>               </fill>
>        </fills>
>
> Excel 2010 Save As (same spreadsheet)
>                 <fills count="4">
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="gray125" />
>               </fill>
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="solid">
>                            <fgColor indexed="44" />
>                      </patternFill>
>               </fill>
>        </fills>
>
> Note, not even Excel was perfect as it left a duplicate fill (0, and 2). Borders was even more dramatic POI generated 94, Excel saved 13. And now that I look at it, POI generated 42 styles while Excel kept 23. I understand that you are keeping all the fills borders and styles as they are specified, and re-using them as you get to a final setting, but normalizing and trimming seem to be what you are missing. That can also be applied to fonts and formats.
>
> Some additional thoughts, based on not yet looking at all the code. I can see three steps to collecting all the garbage: Normalization, Duplicate removal, Orphan removal. Normalization would involve setting default values, and removing system colors. On the fill above, the POI generated fill would look like this after normalization:
>
>        <fills count="9">
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="darkGray" />
>               </fill>
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="none">
>                            <fgColor indexed="44" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="none">
>                            <fgColor indexed="44" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="solid">
>                            <fgColor indexed="44" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>        </fills>
>
> Duplicate removal would remove the many duplicates left by normalization. This would have to be carried back to the styles themselves and potentially to individual cells in the spreadsheet that contain fill id's. After duplicate removal you would have something like the following in the fills:
>
>        <fills count="4">
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="darkGray" />
>               </fill>
>               <fill>
>                      <patternFill patternType="none">
>                            <fgColor indexed="44" />
>                      </patternFill>
>               </fill>
>               <fill>
>                      <patternFill patternType="solid">
>                            <fgColor indexed="44" />
>                      </patternFill>
>               </fill>
>        </fills>
>
> And I just happen to know that my spreadsheet never uses fill "none" - 44, so that could d be stripped out during orphan removal to bring the fill to:
>
>        <fills count="3">
>               <fill>
>                      <patternFill patternType="none" />
>               </fill>
>               <fill>
>                      <patternFill patternType="darkGray" />
>               </fill>
>               <fill>
>                      <patternFill patternType="solid">
>                            <fgColor indexed="44" />
>                      </patternFill>
>               </fill>
>        </fills>
>
> I suspect that the same process could be used for borders, fonts, formats and styles where the order of operation would be normalization and duplicate removal for fills, borders, fonts, formats, then styles. This followed by orphan removal for styles then fills, borders, fonts, and formats.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org