Posted to odf-dev@incubator.apache.org by "Nimarukan (JIRA)" <ji...@apache.org> on 2016/05/28 22:56:12 UTC
[jira] [Updated] (ODFTOOLKIT-434) PERFORMANCE/SPACE: Reduce memory
per table cell
[ https://issues.apache.org/jira/browse/ODFTOOLKIT-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nimarukan updated ODFTOOLKIT-434:
---------------------------------
Attachment: 434-part5-odfdom-pkg-DOMRDFaParser-avoidGetAttributesIfNotHasAttributes.patch
434-part4-odfdom-OfficeValueAttribute_setValue-reuseValueEnumString.patch
434-part3-odfdom-OdfAttribute-reuse-OdfName_localName.patch
434-part2-odfdom-OdfStylableElement-OdfElement-reuse-OdfName_localName.patch
434-part1-odfdom_pkg_OdfName-precomputeQName-includePrefixInMapKey.patch
> PERFORMANCE/SPACE: Reduce memory per table cell
> -----------------------------------------------
>
> Key: ODFTOOLKIT-434
> URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-434
> Project: ODF Toolkit
> Issue Type: Improvement
> Components: odfdom
> Affects Versions: 0.6.2-incubating
> Environment: odfdom-java-0.8.11-incubating-SNAPSHOT, simple-odf-0.8.2-incubating-SNAPSHOT, jdk1.8.0_79, MSWin7
> Reporter: Nimarukan
> Priority: Minor
> Labels: performance
> Attachments: 434-part1-odfdom_pkg_OdfName-precomputeQName-includePrefixInMapKey.patch, 434-part2-odfdom-OdfStylableElement-OdfElement-reuse-OdfName_localName.patch, 434-part3-odfdom-OdfAttribute-reuse-OdfName_localName.patch, 434-part4-odfdom-OfficeValueAttribute_setValue-reuseValueEnumString.patch, 434-part5-odfdom-pkg-DOMRDFaParser-avoidGetAttributesIfNotHasAttributes.patch
>
>
> h2. PERFORMANCE/SPACE: Reduce memory per table cell
> ODFTOOLKIT-333 provides a [test case|https://issues.apache.org/jira/secure/attachment/12806838/odftoolkit-333-test.zip] with file bigFile.ods, which is 1.3MB in normal compressed form, or ~180MB uncompressed.
> Reading the file takes 1.5GB or so, which can cause a 64bit JVM with default memory settings to run out of memory on a system with less than 6GB RAM (assuming default -Xmx size is one quarter system RAM).
> (I ran the test case using simple-odf-0.8.2-incubating-SNAPSHOT and odfdom-java-0.8.11-incubating-SNAPSHOT from svn trunk, plus patches from ODFTOOLKIT-424, approach A, which reduces initial runtime by a factor of 12 or so over simpleapi 0.8.1 and odfdom 0.8.10.)
> With the changes proposed below, the ODFTOOLKIT-333 test case runs in 25% less time with unconstrained memory (java option {{-Xmx3000M}}). With less memory than {{-Xmx2200M}}, the changes produce greater improvement because fewer full-gc passes occur.
> The changes:
> * part1: Precompute OdfName qName
> * part2: Use precomputed OdfName parts for table-cell element name, do not store new ones.
> * part3: Use precomputed OdfName parts for value-type attribute name, do not store new ones.
> * part4: Use OfficeValueTypeAttribute.Value for value-type attribute value, do not store new ones.
> * part5: Avoid creating an empty AttributeMap on p elements with no attributes.
> These changes reduce the memory requirement by about 20% (1.5GB to 1.2GB).
> Contents
> - [Initial diagnosis|#InitialDiagnosis]
> - [Reduce duplicate element name strings|#ReduceElementNameStrings]
> - [Reduce duplicate attribute name strings|#ReduceAttributeNameStrings]
> - [Reduce duplicate value type strings|#ReduceValueTypeStrings]
> - [Reduce empty attribute maps|#ReduceEmptyAttributeMaps]
> - [Table cell memory footprint|#TableCellMemoryFootprint]
> ** [Users can further reduce memory|#UsersCanFurtherReduceMemory]
> {anchor:InitialDiagnosis}
> h3. INITIAL DIAGNOSIS
> A heap dump during a profiled run showed (in NetBeans) that the top memory users, by instance count, are:
> {code}
> 6.7M char[]
> 6.7M String
> 2.7M org.apache.xerces.dom.AttributeMap
> 1.3M Object[]
> 1.3M Vector
> 1.3M org.apache.xerces.dom.TextImpl
> 1.3M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
> 1.3M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
> 1.3M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
> 47K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
> 47K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> So it looks like there were about 47K rows holding 1.3M cells.
> But why so many Strings?
> Each table cell is represented as elements such as:
> {code}
> <table:table-cell office:value-type="string"><text:p>Test data 47014</text:p></table:table-cell>
> {code}
> Browsing the latest {{String}} instances shows a large number of them are:
> * element tag name parts like {{"table-cell"}} and {{"p"}},
> * attribute name parts like {{"office"}} and {{"value-type"}}
> * attribute values like {{"string"}},
> * and the content string values in the cells, like {{"Test data 47021"}}.
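The diagnosis above comes down to many distinct String objects holding the same few values. A minimal sketch of that distinction follows; the class and method names are hypothetical, and the simulated allocation only imitates what a parser might do, it is not ODFDOM code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: compares distinct String objects with distinct String values. */
final class StringDuplication {
    /** Number of distinct String objects, compared by identity. */
    static int distinctInstances(List<String> strings) {
        Set<String> byIdentity =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        byIdentity.addAll(strings);
        return byIdentity.size();
    }

    /** Number of distinct String values, compared by equals(). */
    static int distinctValues(List<String> strings) {
        return new HashSet<String>(strings).size();
    }

    /** Simulates name parts that are allocated afresh for each of n cells. */
    static List<String> simulateCells(int n) {
        List<String> held = new ArrayList<String>();
        for (int i = 0; i < n; i++) {
            held.add(new String("table-cell")); // fresh object, same content
            held.add(new String("p"));          // fresh object, same content
        }
        return held;
    }
}
```

A large gap between the two counts is the signal that the strings could be shared.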
> {anchor:ReduceElementNameStrings}
> h3. REDUCE DUPLICATE ELEMENT TAG NAME STRINGS
> The element tag names {{"table-cell"}} and {{"p"}} should be shared, not duplicated for every cell.
> {panel}
> 1. {{TableTableCellElement}} defines a constant {{ELEMENT_NAME}} which is an {{OdfName}}.
> 2. {{TableTableCellElement}} passes the {{OdfName}} to {{TableTableCellElementBase}}.
> 3. {{TableTableCellElementBase}} passes the {{OdfName}} to {{OdfStylableElement}}.
> 4. {{OdfStylableElement}} passes {{name.getURI()}} and {{name.getQName()}} to {{OdfElement}}.
> *CULPRIT 1*: {{OdfName.getQName()}} constructs a new string each time it is called, concatenating the namespace prefix and the local name.
> 5. {{OdfElement}} passes the {{qName}} to {{xerces.dom.ElementNSImpl}}.
> 6. {{ElementNSImpl(ownerDoc, ns, qname)}} stores the prefix and local name.
> *CULPRIT 2*: {{ElementNSImpl}} creates strings for the prefix and local name, checks them, and stores the local name.
> {panel}
> To avoid creating strings for every element tag qname, prefix, and local name:
> {panel:title=part1}
> 1. {{OdfName}} needs to precompute the qName.
> {panel}
> {panel:title=part2}
> 4. {{OdfStylableElement(ownerDoc, OdfName, ...)}}
> must call {{OdfElement(ownerDoc, OdfName)}}
> \[not {{OdfElement(ownerDoc, ns, qname)}}]
> 5. {{OdfElement(ownerDoc, OdfName)}}
> must call {{ElementNSImpl(ownerDoc, ns, qname, localName)}}
> \[not {{ElementNSImpl(ownerDoc, ns, qname)}}]
> {panel}
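The idea behind part1 and part2 can be sketched as a name object that computes its qName once and hands the same String instances to every caller. This is only an illustration: {{SharedName}} is a hypothetical stand-in, not the real {{OdfName}}, and the cache-key layout is an assumption loosely based on the part1 patch name (prefix included in the map key).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for a qualified name; not the real OdfName class. */
final class SharedName {
    private static final ConcurrentMap<String, SharedName> CACHE =
            new ConcurrentHashMap<String, SharedName>();

    private final String uri;
    private final String prefix;
    private final String localName;
    private final String qName; // computed once, in the constructor

    private SharedName(String uri, String prefix, String localName) {
        this.uri = uri;
        this.prefix = prefix;
        this.localName = localName;
        this.qName = prefix.isEmpty() ? localName : prefix + ":" + localName;
    }

    /** Returns the single shared instance; the prefix is part of the cache key. */
    static SharedName of(String uri, String prefix, String localName) {
        String key = "{" + uri + "}" + prefix + ":" + localName;
        return CACHE.computeIfAbsent(key, k -> new SharedName(uri, prefix, localName));
    }

    String getUri() { return uri; }
    String getLocalName() { return localName; }

    /** Returns the precomputed String; no allocation per call. */
    String getQName() { return qName; }

    /** What a naive getter does: allocates a new String on every call. */
    String buildQNameEachTime() { return prefix + ":" + localName; }
}
```

A constructor that receives such an object can pass {{getQName()}} and {{getLocalName()}} straight through, so no per-element name strings are created.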
> After this change a profile run showed the following:
> {code}
> 4.8M char[]
> 4.8M String
> 3.1M org.apache.xerces.dom.AttributeMap
> 1.6M Object[]
> 1.6M Vector
> 1.5M org.apache.xerces.dom.TextImpl
> 1.5M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
> 1.5M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
> 1.5M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
> 55K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
> 55K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> (The counts are larger because this snapshot was taken later in the run.)
> Browsing the latest {{String}} instances shows the element names {{"table-cell"}} and {{"p"}} are no longer frequent.
> {anchor:ReduceAttributeNameStrings}
> h3. REDUCE DUPLICATE ATTRIBUTE NAME STRINGS
> A large number of remaining strings are attribute parts like {{"office"}}, {{"value-type"}}, {{"string"}}, plus the test string values in the cells, like {{"Test data 47021"}}.
> Attribute name parts like {{"office"}} and {{"value-type"}} should be shared, not duplicated for every cell.
> {panel}
> *CULPRIT 3*: {{AttrNSImpl(ownerDoc, ns, qName)}} creates strings for the prefix and local name, checks them, and stores the local name.
> {panel}
> To share the attribute name strings, a similar change is needed:
> {panel:title=part3}
> 1. {{OdfAttribute(ownerDoc, OdfName)}}
> must call {{AttrNSImpl(ownerDoc, ns, qName, localName)}}
> \[not {{AttrNSImpl(ownerDoc, ns, qName)}}]
> {panel}
> After adding this change a profile run showed the following:
> {code}
> 3.4M char[]
> 3.4M String
> 3.4M org.apache.xerces.dom.AttributeMap
> 1.7M Object[]
> 1.7M Vector
> 1.6M org.apache.xerces.dom.TextImpl
> 1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
> 1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
> 1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
> 60K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
> 60K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> Browsing the latest instances shows {{"office"}} and {{"value-type"}} are no longer frequent.
> {anchor:ReduceValueTypeStrings}
> h3. REDUCE DUPLICATE VALUE TYPE STRINGS
> The {{value-type}} attribute value {{"string"}} is duplicated for each cell.
> To share {{value-type}} attribute value strings, such as {{"string"}} in {{office:value-type="string"}}, do not store the string from the input.
> Instead, use the value to find the enum {{OfficeValueTypeAttribute.Value}}.
> {panel:title=part4}
> 1. OfficeValueTypeAttribute_setAttribute(stringValue)
> Find enum value with
> OfficeValueTypeAttribute.Value.enumValueOf(stringValue)
> If not null, use its string instead of the stringValue.
> {panel}
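The part4 idea can be sketched as canonicalizing the parsed value through an enum lookup. The enum below is a hypothetical model of the {{office:value-type}} values, not the real {{OfficeValueTypeAttribute.Value}}; the {{enumValueOf}} name mirrors the one mentioned above, but its behavior here (returning null for unknown input) is an assumption.

```java
/** Hypothetical model of the office:value-type values. */
enum ValueType {
    BOOLEAN("boolean"), CURRENCY("currency"), DATE("date"), FLOAT("float"),
    PERCENTAGE("percentage"), STRING("string"), TIME("time"), VOID("void");

    private final String text;

    ValueType(String text) { this.text = text; }

    @Override
    public String toString() { return text; }

    /** Returns the matching constant, or null if the value is not recognized. */
    static ValueType enumValueOf(String s) {
        for (ValueType v : values()) {
            if (v.text.equals(s)) {
                return v;
            }
        }
        return null;
    }
}

/** Hypothetical helper that swaps a parsed string for the enum's shared string. */
final class ValueTypeCanonicalizer {
    static String canonicalize(String parsed) {
        ValueType v = ValueType.enumValueOf(parsed);
        // Known value: store the enum's single String instance.
        // Unknown value: keep the input unchanged.
        return v != null ? v.toString() : parsed;
    }
}
```

With this in place, every cell whose value type is "string" ends up referencing one String object instead of its own copy.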
> After adding this change, a profile run showed the following:
> {code}
> 3.3M org.apache.xerces.dom.AttributeMap
> 1.7M char[]
> 1.7M String
> 1.7M Object[]
> 1.7M Vector
> 1.6M org.apache.xerces.dom.TextImpl
> 1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
> 1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
> 1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
> 58K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
> 58K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> Much better, now the number of strings is near the number of cells.
> {anchor:ReduceEmptyAttributeMaps}
> h3. REDUCE EMPTY ATTRIBUTE MAPS
> However, the number of {{AttributeMap}} instances is too high: about twice the number of cells. Browsing instances of {{AttributeMap}} reveals that each cell has two elements: a {{"table-cell"}} element and a {{"p"}} (paragraph) element.
> {code}
> <table:table-cell office:value-type="string"><text:p>Test data 47014</text:p></table:table-cell>
> {code}
> Only the {{"table-cell"}} elements have an attribute ({{office:value-type="string"}}); the {{"p"}} elements have no attributes.
> An empty {{AttributeMap}} may be created and stored in an {{Element}} if xerces {{ElementImpl.getAttributes()}} is called when there are no attributes. To avoid this, a caller should check {{Element.hasAttributes()}} first and call {{Element.getAttributes()}} only if it returns true.
> Setting a breakpoint on {{ElementImpl.getAttributes()}} reveals that {{odfdom.pkg.rdfa.DOMRDFaParser}} is the culprit. To eliminate the creation of empty {{AttributeMap}}:
> {panel:title=part5}
> 1. Change DOMRDFaParser.process to check
> Element.hasAttributes() first. If it returns false, do not call
> Element.getAttributes(); instead, use a static EmptyAttributes
> object.
> {panel}
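The guard described in part5 can be sketched with the standard {{org.w3c.dom}} interfaces. The class below is hypothetical and is not the actual patch. Whether {{getAttributes()}} allocates a map lazily depends on the DOM implementation; the description above reports that behavior for Xerces {{ElementImpl}}.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Hypothetical helper: avoids getAttributes() on elements without attributes. */
final class AttributeAccess {
    /** Immutable, shared stand-in for "no attributes". */
    static final NamedNodeMap EMPTY = new NamedNodeMap() {
        public Node getNamedItem(String name) { return null; }
        public Node setNamedItem(Node arg) { throw readOnly(); }
        public Node removeNamedItem(String name) { throw notFound(); }
        public Node item(int index) { return null; }
        public int getLength() { return 0; }
        public Node getNamedItemNS(String ns, String localName) { return null; }
        public Node setNamedItemNS(Node arg) { throw readOnly(); }
        public Node removeNamedItemNS(String ns, String localName) { throw notFound(); }
    };

    private static DOMException readOnly() {
        return new DOMException(DOMException.NO_MODIFICATION_ALLOWED_ERR, "shared empty map");
    }

    private static DOMException notFound() {
        return new DOMException(DOMException.NOT_FOUND_ERR, "shared empty map");
    }

    /** Calls getAttributes() only when the element reports that it has attributes. */
    static NamedNodeMap attributesOf(Element e) {
        return e.hasAttributes() ? e.getAttributes() : EMPTY;
    }

    /** Convenience for trying the helper out; wraps the checked exception. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Callers that only read attributes can iterate over the returned map in both cases, since the shared empty map simply reports a length of zero.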
> With this change, a heap dump during a profile run shows:
> {code}
> 1.7M char[]
> 1.7M String
> 1.7M Object[]
> 1.7M Vector
> 1.6M org.apache.xerces.dom.AttributeMap
> 1.6M org.apache.xerces.dom.TextImpl
> 1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
> 1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
> 1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
> 58K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
> 58K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
> {code}
> Now the number of {{AttributeMap}} matches the number of cells.
> {anchor:TableCellMemoryFootprint}
> h3. TABLE-CELL MEMORY FOOTPRINT
> The test case file has cells represented as follows:
> {code}
> <table:table-cell office:value-type="string"><text:p>Test data 47014</text:p></table:table-cell>
> {code}
> After these patches, all the strings are shared by many cells, except the content
> strings like "Test data 47014". So the memory footprint is as follows:
> {code}
> - (17+2fields) Element "table-cell" (TableTableCellElement)
> - ( 4+2fields) Element "table-cell" AttributeMap
> - ( 4+2fields) Element "table-cell" AttributeMap Vector
> - ( 5+2fields) Element "table-cell" AttributeMap Vector Object array
> (4 array slots are null, and could be reclaimed in
> theory, but the vector is not public so not easy.)
> - ( 7+2fields) Element "table-cell" Attr "office:value-type='string'"
> - (17+2fields) Element "p" (OdfTextParagraph)
> - ( 5+2fields) TextImpl
> - ( 2+2fields) String
> ~ (15 char) char array "Test data 47014"
> ____________
> ~61 fields + 9 * 2 (for object headers) + data
> is about 80 words of memory.
> or about 320 bytes (4-byte words in 32bit-JVM)
> or about 640 bytes (8-byte words in 64bit-JVM)
> {code}
> As noted, especially for large data spreadsheets, the full literal DOM tree is not a space-efficient representation, so it requires the JVM to have access to plenty of memory. The JVM default maximum memory is often one quarter of system RAM, so specifying a larger {{java -Xmx}} value may be required if the default is too small.
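A program can check the heap limit it was actually given before loading a large document. The following is a minimal sketch using the standard {{Runtime.maxMemory()}} call; the class name and the idea of comparing against a required size are illustrative assumptions, and any threshold would have to come from measurement.

```java
/** Hypothetical helper: reports the JVM heap limit before loading a large file. */
final class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True if the configured limit is at least the given size in MiB. */
    static boolean heapLikelySufficient(long requiredMiB) {
        return maxHeapMiB() >= requiredMiB;
    }
}
```

For example, a caller might warn the user and suggest a larger {{-Xmx}} value when {{heapLikelySufficient(1500)}} returns false for a file like the one discussed here.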
> {anchor:UsersCanFurtherReduceMemory}
> {panel:title=Users can further reduce memory footprint of this file.}
> In this file, the cell values are unformatted strings, so they could alternatively be stored using an attribute rather than a nested paragraph.
> {code}
> <table:table-cell office:value-type="string" office:string-value="Test data 47014"/>
> {code}
> This is longer xml text, and does not compress as well for some reason, so the file is larger on disk.
> But in memory, this removes the large {{text:p}} element as well as the {{TextImpl}} object, and adds the {{office:string-value}} attribute. With this reduced xml, each cell has the following object sizes:
> {code}
> - (17+2fields) Element "table-cell" (TableTableCellElement)
> - ( 4+2fields) Element "table-cell" AttributeMap
> - ( 4+2fields) Element "table-cell" AttributeMap Vector
> - ( 5+2fields) Element "table-cell" AttributeMap Vector Object array
> (4 elements are null, so 32 B could be reclaimed in
> theory, but the vector is not public so not easy.)
> - ( 7+2fields) Element "table-cell" Attr "office:value-type='string'"
> - ( 7+2fields) Element "table-cell" Attr "office:string-value='Test data 47014'"
> - ( 2+2fields) String
> ~ (15 char) char array "Test data 47014"
> ______________
> ~46 fields + 8 * 2 (for object headers) + data
> is about 62 words of memory.
> or about 248 bytes (4-byte words in 32bit-JVM)
> or about 496 bytes (8-byte words in 64bit-JVM)
> {code}
> Even though the file is larger on disk, this ~20% reduction in memory can reduce the runtime of the test case by 20%.
> {panel}
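The two per-cell tallies can be reproduced with a few lines of arithmetic. This sketch assumes, as the tallies do, a two-word header per object and one word per field. It leaves out the character data itself, so the first layout comes out at 79 words, against the "about 80" quoted above. The class name is hypothetical.

```java
/** Hypothetical helper reproducing the per-cell word counts from the tallies. */
final class CellFootprint {
    /** Assumed object header size in words, as used in the tallies. */
    static final int HEADER_WORDS = 2;

    /** Sums fields plus one header per object; a 0 entry is a header-only object. */
    static int words(int... fieldsPerObject) {
        int total = 0;
        for (int fields : fieldsPerObject) {
            total += fields + HEADER_WORDS;
        }
        return total;
    }

    /** Converts words to bytes for a given word size (4 or 8). */
    static int bytes(int words, int wordSizeBytes) {
        return words * wordSizeBytes;
    }

    public static void main(String[] args) {
        // Nested paragraph layout: cell, map, vector, array, attr, p, text, string, char[].
        int nested = words(17, 4, 4, 5, 7, 17, 5, 2, 0);
        // Attribute layout: cell, map, vector, array, two attrs, string, char[].
        int flat = words(17, 4, 4, 5, 7, 7, 2, 0);
        System.out.println(nested + " words vs " + flat + " words");
        System.out.println("reduction: " + (100.0 * (nested - flat) / nested) + "%");
    }
}
```

Under these assumptions the attribute layout saves a little over a fifth of the per-cell words, which is in line with the ~20% figure.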