You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@poi.apache.org by bu...@apache.org on 2010/03/18 17:49:12 UTC

DO NOT REPLY [Bug 48936] New: Writing specific sequence of strings to XSSFSheet results in malformed XML

https://issues.apache.org/bugzilla/show_bug.cgi?id=48936

           Summary: Writing specific sequence of strings to XSSFSheet
                    results in malformed XML
           Product: POI
           Version: 3.6
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XSSF
        AssignedTo: dev@poi.apache.org
        ReportedBy: philippe.laflamme@gmail.com


Created an attachment (id=25147)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=25147)
This patch adds a test case. Created against 3_6 tag.

A specific sequence of strings results in broken CDATA section in
sharedStrings.xml file.

Attached patch adds a test case to 3_6 tag to reproduce the problem.

java.lang.reflect.InvocationTargetException
org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
        at
org.apache.poi.xssf.usermodel.XSSFFactory.createDocumentPart(XSSFFactory.java:61)
        at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:256)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:196)
        at
org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:179)
        at
org.apache.poi.xssf.TestBrokenCdata.testWritingSomeStringsResultsInMalformedCdata(TestBrokenCdata.java:53)
Caused by: java.lang.reflect.InvocationTargetException
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at
org.apache.poi.xssf.usermodel.XSSFFactory.createDocumentPart(XSSFFactory.java:59)
Caused by: java.io.IOException: error: Unexpected character encountered (lex
state 9): '!'
        at
org.apache.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:119)
        at
org.apache.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:97)

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 48936] Writing specific sequence of strings to XSSFSheet results in malformed XML

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48936

--- Comment #1 from Josh Micich <jo...@gildedtree.com> 2010-03-20 01:08:57 UTC ---
Created an attachment (id=25154)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=25154)
A simplified version of the strings file

The problem seems to be within piccolo (part of xmlbeans).  The encoding logic
is incorrectly identifying the ']]' in one string as ']]>' and thus attempts a
mis-aligned escape.  This logic is being performed in a buffer which gets
re-used multiple times across the document.  When piccolo reads one character
too far, it reads whatever was left in the buffer from the previous use.  If
the input data is constructed carefully, the '>' character may be incorrectly
detected.

There are many sets of data which can cause this problem though they can be a
little hard to find.  Here are a few guidelines:
  - there needs to be enough total data to cause a wrap around in the buffer. 
The problem occurs with strings written after the first buffer flush.
  - one of those strings needs to 
     - be longer than 32 chars, 
     - have more than 5 entities that need escaping (e.g. '>')
     - end in ']]'
  - the lengths of all previous strings are chosen to cause a '>' from the
previous buffer usage align with the']]'

This bug should probably be forwarded off to the xmlbeans team.  It will
probably take some effort to create a test case that is independent of POI.

xmlbeans 2.5.0 displays the same problem (POI currently uses 2.3.0)

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 48936] Writing specific sequence of strings to XSSFSheet results in malformed XML

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48936

Yegor Kozlov <ye...@dinom.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #2 from Yegor Kozlov <ye...@dinom.ru> 2010-04-25 09:03:35 EDT ---
Quite an interesting bug. 
The problem is in the way XmlBeans detects and writes CDATA blocks. Luckily, we
can control this funny behavior.

I traced the problem down to the internal class TextSaver in XmlBeans: 

http://svn.apache.org/viewvc/xmlbeans/tags/2.3.0/src/store/org/apache/xmlbeans/impl/store/Saver.java?view=markup

The logic for detection CDATA starts at line #1286. The heuristic is quite
complex, but it turned out that it can be controlled with two options:

  XmlOptions#setSaveCDataLengthThreshold(int)
  XmlOptions#setSaveCDataEntityCountThreshold(int)

The default value of cdataEntityCountThreshold is 5 and the default value of
cdataLengthThreshold is 32. These values perfectly agree with Josh's
observations. 

According to the docs, XmlBeans will use CDATA if the following condition is
true:
    textLength > cdataLengthThreshold && entityCount >
cdataEntityCountThreshold

The combination of XmlOptions.setSaveCDataEntityCountThreshold(0) and
XmlOptions.setSaveCDataLengthThreshold(-1) will make every text CDATA. 

The combination of XmlOptions.setSaveCDataEntityCountThreshold(MAXLENGTH) and
XmlOptions.setSaveCDataLengthThreshold(-1) will detect CDATA only if the text
is longer than MAXLENGTH chars. I used the following values to disable CDATA
when saving sharedStrings.xml:

  XmlOptions options = new XmlOptions(DEFAULT_XML_OPTIONS);
  options.setSaveCDataLengthThreshold(1000000); 
  options.setSaveCDataEntityCountThreshold(-1);


I committed the fix in r937792. 

Existing code using POI-3.6 can be fixed as follows:

  XmlOptions options = POIXMLDocumentPart.DEFAULT_XML_OPTIONS;
  options.setSaveCDataLengthThreshold(1000000);
  options.setSaveCDataEntityCountThreshold(-1);

Add these lines before calling  workbook.write(out)

Regards,
Yegor

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org