Posted to odf-dev@incubator.apache.org by "Nimarukan (JIRA)" <ji...@apache.org> on 2016/05/28 22:56:12 UTC

[jira] [Updated] (ODFTOOLKIT-434) PERFORMANCE/SPACE: Reduce memory per table cell

     [ https://issues.apache.org/jira/browse/ODFTOOLKIT-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nimarukan updated ODFTOOLKIT-434:
---------------------------------
    Attachment: 434-part5-odfdom-pkg-DOMRDFaParser-avoidGetAttributesIfNotHasAttributes.patch
                434-part4-odfdom-OfficeValueAttribute_setValue-reuseValueEnumString.patch
                434-part3-odfdom-OdfAttribute-reuse-OdfName_localName.patch
                434-part2-odfdom-OdfStylableElement-OdfElement-reuse-OdfName_localName.patch
                434-part1-odfdom_pkg_OdfName-precomputeQName-includePrefixInMapKey.patch

> PERFORMANCE/SPACE: Reduce memory per table cell
> -----------------------------------------------
>
>                 Key: ODFTOOLKIT-434
>                 URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-434
>             Project: ODF Toolkit
>          Issue Type: Improvement
>          Components: odfdom
>    Affects Versions: 0.6.2-incubating
>         Environment: odfdom-java-0.8.11-incubating-SNAPSHOT, simple-odf-0.8.2-incubating-SNAPSHOT, jdk1.8.0_79, MSWin7
>            Reporter: Nimarukan
>            Priority: Minor
>              Labels: performance
>         Attachments: 434-part1-odfdom_pkg_OdfName-precomputeQName-includePrefixInMapKey.patch, 434-part2-odfdom-OdfStylableElement-OdfElement-reuse-OdfName_localName.patch, 434-part3-odfdom-OdfAttribute-reuse-OdfName_localName.patch, 434-part4-odfdom-OfficeValueAttribute_setValue-reuseValueEnumString.patch, 434-part5-odfdom-pkg-DOMRDFaParser-avoidGetAttributesIfNotHasAttributes.patch
>
>
> h2. PERFORMANCE/SPACE: Reduce memory per table cell
> ODFTOOLKIT-333 provides a [test case|https://issues.apache.org/jira/secure/attachment/12806838/odftoolkit-333-test.zip] with file bigFile.ods, which is 1.3MB in normal compressed form, or ~180MB uncompressed.
> Reading the file takes 1.5GB or so, which can cause a 64bit JVM with default memory settings to run out of memory on a system with less than 6GB RAM (assuming default -Xmx size is one quarter system RAM). 
> (I ran the test case using simple-odf-0.8.2-incubating-SNAPSHOT and odfdom-java-0.8.11-incubating-SNAPSHOT from svn trunk, plus patches from ODFTOOLKIT-424, approach A, which reduces initial runtime by a factor of 12 or so over simpleapi 0.8.1 and odfdom 0.8.10.)
> With the changes proposed below, the ODFTOOLKIT-333 test case runs in 25% less time with unconstrained memory (java option {{-Xmx3000M}}).  With less memory than {{-Xmx2200M}}, the changes produce greater improvement because fewer full-gc passes occur.
> The changes:
> * part1: Precompute OdfName qName
> * part2: Use precomputed OdfName parts for table-cell element name, do not store new ones.
> * part3: Use precomputed OdfName parts for value-type attribute name, do not store new ones.
> * part4: Use OfficeValueTypeAttribute.Value for value-type attribute value, do not store new ones.
> * part5: Avoid creating an empty AttributeMap on p elements with no attributes.
> These changes reduce the memory requirement by about 20% (1.5GB to 1.2GB).
> Contents
> - [Initial diagnosis|#InitialDiagnosis]
> - [Reduce duplicate element name strings|#ReduceElementNameStrings]
> - [Reduce duplicate attribute name strings|#ReduceAttributeNameStrings]
> - [Reduce duplicate value type strings|#ReduceValueTypeStrings]
> - [Reduce empty attribute maps|#ReduceEmptyAttributeMaps]
> - [Table cell memory footprint|#TableCellMemoryFootprint]
> ** [Users can further reduce memory|#UsersCanFurtherReduceMemory]
> {anchor:InitialDiagnosis}
> h3. INITIAL DIAGNOSIS
> A heap dump during a profiled run showed (in Netbeans) that the top memory uses are:
> {code}
>   6.7M char[]
>   6.7M String
>   2.7M org.apache.xerces.dom.AttributeMap
>   1.3M Object[]
>   1.3M Vector
>   1.3M org.apache.xerces.dom.TextImpl
>   1.3M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
>   1.3M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
>   1.3M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
>   47K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
>   47K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> So it looks like there were about 47K rows holding 1.3M cells.
> But why so many Strings?
> Each table cell is represented as elements such as:
> {code}
> <table:table-cell office:value-type="string"><text:p>Test data 47014</text:p></table:table-cell>
> {code}
> Browsing the latest {{String}} instances shows a large number of them are:
> * element tag name parts like {{"table-cell"}} and {{"p"}},
> * attribute name parts like {{"office"}} and {{"value-type"}}
> * attribute values like {{"string"}}, 
> * and the content string values in the cells, like {{"Test data 47021"}}.
> {anchor:ReduceElementNameStrings}
> h3. REDUCE DUPLICATE ELEMENT TAG NAME STRINGS
> The element tag names {{"table-cell"}} and {{"p"}} should be shared, not duplicated for every cell.
> {panel}
>   1. {{TableTableCellElement}} defines a constant {{ELEMENT_NAME}} which is an {{OdfName}}.
>   2. {{TableTableCellElement}} passes the {{OdfName}} to {{TableTableCellElementBase}}.
>   3. {{TableTableCellElementBase}} passes the {{OdfName}} to {{OdfStylableElement}}.
>   4. {{OdfStylableElement}} passes {{name.getURI()}} and {{name.getQName()}} to {{OdfElement}}.
>      *CULPRIT 1*: {{OdfName.getQName()}} constructs a new string each time it is called, concatenating the namespace prefix and the local name.
>   5. {{OdfElement}} passes the {{qName}} to {{xerces.dom.ElementNSImpl}}.
>   6. {{ElementNSImpl(ownerDoc, ns, qname)}} stores the prefix and local name.
>      *CULPRIT 2*: {{ElementNSImpl}} creates strings for the prefix and local name, checks them, and stores the local name.
> {panel}
> To avoid creating strings for every element tag qname, prefix, and local name:
> {panel:title=part1}
>   1. {{OdfName}} needs to precompute the qName.
> {panel}
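> As a minimal sketch of the part1 idea (not the actual patch), the class below computes the qualified name once in its constructor and keys its cache on the prefix as well as the namespace and local name.  The class {{SharedName}}, its methods, and the namespace URI are hypothetical stand-ins for illustration only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a name class such as OdfName; not the ODFDOM code.
final class SharedName {
    private static final Map<String, SharedName> CACHE = new ConcurrentHashMap<>();

    private final String uri;
    private final String prefix;
    private final String localName;
    private final String qName; // computed once, in the constructor

    private SharedName(String uri, String prefix, String localName) {
        this.uri = uri;
        this.prefix = prefix;
        this.localName = localName;
        this.qName = prefix + ":" + localName;
    }

    // The cache key includes the prefix, so the same namespace bound to two
    // different prefixes yields two distinct names with distinct qNames.
    static SharedName of(String uri, String prefix, String localName) {
        String key = "{" + uri + "}" + prefix + ":" + localName;
        return CACHE.computeIfAbsent(key, k -> new SharedName(uri, prefix, localName));
    }

    String getUri() { return uri; }
    String getLocalName() { return localName; }

    // Returns the stored string; nothing is allocated per call.
    String getQName() { return qName; }

    // What a non-caching getter does: a fresh String on every call.
    String buildQName() { return prefix + ":" + localName; }

    public static void main(String[] args) {
        SharedName cell = SharedName.of("urn:example:table", "table", "table-cell");
        System.out.println(cell.getQName() == cell.getQName());     // same instance
        System.out.println(cell.buildQName() == cell.buildQName()); // distinct instances
    }
}
```

> In this sketch the lookup key is still built on each call to {{of}}, so it only pays off when the result is held in a constant, as {{ELEMENT_NAME}} is.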
> {panel:title=part2}
>   4. {{OdfStylableElement(ownerDoc, OdfName, ...)}}
>      must call {{OdfElement(ownerDoc, OdfName)}}
>      \[not {{OdfElement(ownerDoc, ns, qname)}}]
>   5. {{OdfElement(ownerDoc, OdfName)}}
>      must call {{ElementNSImpl(ownerDoc, ns, qname, localName)}}
>      \[not {{ElementNSImpl(ownerDoc, ns, qname)}}]
> {panel}
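> The difference between the two constructor forms can be sketched as follows.  {{NamedNode}} is a hypothetical stand-in, not the xerces class, and it omits the owner document: the two-argument form derives the local name from the qualified name, while the three-argument form stores the reference it is given.

```java
// Hypothetical stand-in for a DOM node class; it only illustrates why
// passing a precomputed local name avoids one String per node.
final class NamedNode {
    final String namespaceUri;
    final String qName;
    final String localName;

    // Derives the local name from the qualified name. substring() with a
    // non-zero start index allocates a new String for every node.
    NamedNode(String namespaceUri, String qName) {
        this(namespaceUri, qName, qName.substring(qName.indexOf(':') + 1));
    }

    // Stores the caller's strings as-is, so all nodes can share them.
    NamedNode(String namespaceUri, String qName, String localName) {
        this.namespaceUri = namespaceUri;
        this.qName = qName;
        this.localName = localName;
    }

    public static void main(String[] args) {
        String ns = "urn:example:table";
        String qName = "table:table-cell";
        String local = "table-cell";

        NamedNode derived1 = new NamedNode(ns, qName);
        NamedNode derived2 = new NamedNode(ns, qName);
        NamedNode shared1 = new NamedNode(ns, qName, local);
        NamedNode shared2 = new NamedNode(ns, qName, local);

        System.out.println(derived1.localName == derived2.localName); // distinct instances
        System.out.println(shared1.localName == shared2.localName);   // same instance
    }
}
```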
> After this change a profile run showed the following:
> {code}
>   4.8M char[]
>   4.8M String
>   3.1M org.apache.xerces.dom.AttributeMap
>   1.6M Object[]
>   1.6M Vector
>   1.5M org.apache.xerces.dom.TextImpl
>   1.5M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
>   1.5M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
>   1.5M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
>   55K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
>   55K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> (The counts are larger because this snapshot was taken later in the run.)
> Browsing the latest {{String}} instances shows the element names {{"table-cell"}} and {{"p"}} are no longer frequent.
> {anchor:ReduceAttributeNameStrings}
> h3. REDUCE DUPLICATE ATTRIBUTE NAME STRINGS
> A large number of remaining strings are attribute parts like {{"office"}}, {{"value-type"}}, {{"string"}}, plus the test string values in the cells, like {{"Test data 47021"}}.
> Attribute name parts like {{"office"}} and {{"value-type"}} should be shared, not duplicated for every cell.
> {panel}
>      *CULPRIT 3*: {{AttrNSImpl(ownerDoc, ns, qName)}} creates strings for the prefix and local name, checks them, and stores the local name.
> {panel}
> To share the attribute name strings, a similar change is needed:
> {panel:title=part3}
>   1. {{OdfAttribute(ownerDoc, OdfName)}}
>     must call {{AttrNSImpl(ownerDoc, ns, qName, localName)}}
>     \[not {{AttrNSImpl(ownerDoc, ns, qName)}}]
> {panel}
> After adding this change a profile run showed the following:
> {code}
>   3.4M char[]
>   3.4M String
>   3.4M org.apache.xerces.dom.AttributeMap
>   1.7M Object[]
>   1.7M Vector
>   1.6M org.apache.xerces.dom.TextImpl
>   1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
>   1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
>   1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
>   60K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
>   60K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> Browsing the latest instances shows {{"office"}} and {{"value-type"}} are no longer frequent.
> {anchor:ReduceValueTypeStrings}
> h3. REDUCE DUPLICATE VALUE TYPE STRINGS
> The {{value-type}} attribute value {{"string"}} is duplicated for each cell.
> To share {{value-type}} attribute value strings, such as {{"string"}} in {{office:value-type="string"}}, do not store the string from the input.
> Instead, use the value to find the enum {{OfficeValueTypeAttribute.Value}}.
> {panel:title=part4}
> 1. OfficeValueTypeAttribute_setAttribute(stringValue)
>    Find enum value with
>    OfficeValueTypeAttribute.Value.enumValueOf(stringValue)
>    If not null, use its string instead of the stringValue.
> {panel}
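> A minimal sketch of this lookup, assuming a simplified enum: {{ValueTypeInterning}} and its nested {{ValueType}} are hypothetical and list only a few value types, and {{enumValueOf}} here is a local helper written for the sketch, not the ODFDOM method.

```java
// Hypothetical illustration of replacing a parsed attribute value with the
// shared string held by a matching enum constant.
final class ValueTypeInterning {

    enum ValueType {
        BOOLEAN("boolean"), DATE("date"), FLOAT("float"), STRING("string");

        private final String text;

        ValueType(String text) { this.text = text; }

        @Override
        public String toString() { return text; }

        // Returns the matching constant, or null if the value is unknown.
        static ValueType enumValueOf(String value) {
            for (ValueType v : values()) {
                if (v.text.equals(value)) {
                    return v;
                }
            }
            return null;
        }
    }

    // Returns the enum's shared string when the value is known, otherwise
    // the original string, so unknown values are still preserved.
    static String canonical(String parsed) {
        ValueType v = ValueType.enumValueOf(parsed);
        return v != null ? v.toString() : parsed;
    }

    public static void main(String[] args) {
        // new String(...) simulates a value freshly created by the parser.
        String fromCell1 = new String("string");
        String fromCell2 = new String("string");
        System.out.println(fromCell1 == fromCell2);                       // distinct instances
        System.out.println(canonical(fromCell1) == canonical(fromCell2)); // same instance
    }
}
```

> Once the canonical string is stored, the per-cell copy from the parser is no longer referenced and can be collected.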
> After adding this change, a profile run showed the following:
> {code}
>   3.3M org.apache.xerces.dom.AttributeMap
>   1.7M char[]
>   1.7M String
>   1.7M Object[]
>   1.7M Vector
>   1.6M org.apache.xerces.dom.TextImpl
>   1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
>   1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
>   1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
>   58K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
>   58K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> Much better: the number of strings is now close to the number of cells.
> {anchor:ReduceEmptyAttributeMaps}
> h3. REDUCE EMPTY ATTRIBUTE MAPS
> However, the number of {{AttributeMap}} is too high.  Browsing instances of {{AttributeMap}} reveals that each cell has two elements: a {{"table-cell"}} element and a {{"p"}} (paragraph) element.
> {code}
> <table:table-cell office:value-type="string"><text:p>Test data 47014</text:p></table:table-cell>
> {code}
> Only the {{"table-cell"}} elements have an attribute ({{office:value-type="string"}}); the {{"p"}} elements have no attributes.
> An empty {{AttributeMap}} may be created and stored in an {{Element}} if xerces {{ElementImpl.getAttributes()}} is called when there are no attributes.  To avoid this, a caller should first check {{Element.hasAttributes()}} and call {{Element.getAttributes()}} only if it returns true.
> Setting a breakpoint on {{ElementImpl.getAttributes()}} reveals that {{odfdom.pkg.rdfa.DOMRDFaParser}} is the culprit.  To eliminate the creation of empty {{AttributeMap}}:
> {panel:title=part5}
> 1. Change DOMRDFaParser.process to check
>    Element.hasAttributes() first.  If it returns false, do not call
>    Element.getAttributes(); use a static EmptyAttributes
>    object instead.
> {panel}
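> The guard can be sketched with the standard {{org.w3c.dom}} API as follows.  {{AttributeWalk}} is a hypothetical example, not the {{DOMRDFaParser}} code; it skips attribute-less elements rather than substituting an empty attributes object as the patch does.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical tree walk that never calls getAttributes() on an element
// that reports having no attributes.
final class AttributeWalk {

    static int countAttributes(Element element) {
        int count = 0;
        if (element.hasAttributes()) { // guard: skip elements without attributes
            NamedNodeMap attrs = element.getAttributes();
            count += attrs.getLength();
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                count += countAttributes((Element) child);
            }
        }
        return count;
    }

    // Builds the cell from the test case: one attribute on the cell,
    // none on the nested paragraph.
    static Element sampleCell() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element cell = doc.createElement("table:table-cell");
            cell.setAttribute("office:value-type", "string");
            Element paragraph = doc.createElement("text:p");
            paragraph.appendChild(doc.createTextNode("Test data 47014"));
            cell.appendChild(paragraph);
            return cell;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countAttributes(sampleCell()));
    }
}
```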
> With this change, a heap dump during a profile run shows:
> {code}
>   1.7M char[]
>   1.7M String
>   1.7M Object[]
>   1.7M Vector
>   1.6M org.apache.xerces.dom.AttributeMap
>   1.6M org.apache.xerces.dom.TextImpl
>   1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
>   1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
>   1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
>   58K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
>   58K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> Now the number of {{AttributeMap}} matches the number of cells.
> {anchor:TableCellMemoryFootprint}
> h3. TABLE-CELL MEMORY FOOTPRINT
> The test case file has cells represented as follows:
> {code}
> <table:table-cell office:value-type="string"><text:p>Test data 47014</text:p></table:table-cell>
> {code}
> After these patches, all the strings are shared by many cells, except the content
> strings like "Test data 47014".  So the memory footprint is as follows:
> {code}
> - (17+2fields) Element "table-cell" (TableTableCellElement)
> - ( 4+2fields) Element "table-cell" AttributeMap
> - ( 4+2fields) Element "table-cell" AttributeMap Vector
> - ( 5+2fields) Element "table-cell" AttributeMap Vector Object array
>                       (4 array slots are null, and could be reclaimed in
>                        theory, but the vector is not public so not easy.)
> - ( 7+2fields) Element "table-cell" Attr "office:value-type='string'"
> - (17+2fields) Element "p" (OdfTextParagraph)
> - ( 5+2fields) TextImpl
> - ( 2+2fields) String
> ~ (15 char) char array "Test data 47014"
> ____________
>  ~61 fields + 9 * 2 (for object headers) + data
>  is about  80 words of memory.
>  or about 320 bytes (4-byte words in 32bit-JVM)
>  or about 640 bytes (8-byte words in 64bit-JVM)
> {code}
> As noted, especially for large data spreadsheets, the full literal DOM tree is not a space-efficient representation, so it requires the JVM to have access to plenty of memory.  The JVM default maximum memory is often one quarter of system RAM, so specifying a larger {{java -Xmx}} value may be required if the default is too small.
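> To see what limit a given JVM is actually running with, the standard {{Runtime.maxMemory()}} call can be used.  The sketch below also encodes the one-quarter-of-RAM rule of thumb mentioned above; the class {{HeapCheck}} and its helper methods are hypothetical, and the real default depends on the JVM and its settings.

```java
// Hypothetical helper for reasoning about heap limits.
final class HeapCheck {

    // Rule of thumb from the text: default max heap is one quarter of RAM.
    // This is an assumption, not a guarantee for any particular JVM.
    static long defaultMaxHeapBytes(long systemRamBytes) {
        return systemRamBytes / 4;
    }

    // True if a workload needing requiredBytes fits under the heap limit.
    static boolean fits(long requiredBytes, long maxHeapBytes) {
        return requiredBytes <= maxHeapBytes;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap for this JVM: " + (maxHeap >> 20) + " MB");

        long required = 1536L << 20; // about 1.5GB, the figure reported above
        System.out.println("Fits in this JVM's heap: " + fits(required, maxHeap));
    }
}
```

> Under that rule, a 6GB system is the smallest that yields a 1.5GB default heap, which matches the threshold given at the top of this issue.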
> {anchor:UsersCanFurtherReduceMemory}
> {panel:title=Users can further reduce memory footprint of this file.}
> In this file, the cell values are unformatted strings, so they could alternatively be stored using an attribute rather than a nested paragraph.
> {code}
> <table:table-cell office:value-type="string" office:string-value="Test data 47014"/>
> {code}
> This is longer xml text, and does not compress as well for some reason, so the file is larger on disk.
> But in memory, this removes the large {{text:p}} element as well as the {{TextImpl}} object, and adds the {{office:string-value}} attribute.  With this reduced xml, each cell has the following object sizes:
> {code}
> - (17+2fields) Element "table-cell" (TableTableCellElement)
> - ( 4+2fields) Element "table-cell" AttributeMap
> - ( 4+2fields) Element "table-cell" AttributeMap Vector
> - ( 5+2fields) Element "table-cell" AttributeMap Vector Object array
>                       (4 elements are null, so 32 B could be reclaimed in
>                        theory, but the vector is not public so not easy.)
> - ( 7+2fields) Element "table-cell" Attr "office:value-type='string'"
> - ( 7+2fields) Element "table-cell" Attr "office:string-value='Test data 47014'"
> - ( 2+2fields) String
> ~ (15 char) char array "Test data 47014"
> ______________
>  ~46 fields + 8 * 2 (for object headers) + data
>  is about  62 words of memory.
>  or about 248 bytes (4-byte words in 32bit-JVM)
>  or about 496 bytes (8-byte words in 64bit-JVM)
> {code}
> Even though the file is longer, this ~20% reduction in memory can reduce the runtime of the test case by 20%.
> {panel}
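> The word counts in the two footprint tables can be rechecked with a few lines of arithmetic.  The sketch below is only a tally of the per-object field counts listed above plus two header words per object; {{CellFootprint}} is a hypothetical name, and the character data itself is left out.

```java
// Hypothetical tally of the per-cell footprint estimates.
final class CellFootprint {

    // Field counts per object, in the order listed in the tables.
    static final int[] NESTED_PARAGRAPH = {17, 4, 4, 5, 7, 17, 5, 2};
    static final int[] STRING_ATTRIBUTE = {17, 4, 4, 5, 7, 7, 2};

    // Words = sum of fields + 2 header words per object. objectCount
    // includes the char array, which contributes a header but no fields.
    static int words(int[] fieldCounts, int objectCount) {
        int fields = 0;
        for (int n : fieldCounts) {
            fields += n;
        }
        return fields + 2 * objectCount;
    }

    public static void main(String[] args) {
        int nested = words(NESTED_PARAGRAPH, 9);
        int attribute = words(STRING_ATTRIBUTE, 8);
        System.out.println("nested paragraph: " + nested + " words, "
                + nested * 4 + " B (32-bit), " + nested * 8 + " B (64-bit)");
        System.out.println("string attribute: " + attribute + " words, "
                + attribute * 4 + " B (32-bit), " + attribute * 8 + " B (64-bit)");
    }
}
```

> This gives 79 and 62 words before counting the character data, in line with the rounded figures above.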



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)