Posted to dev@poi.apache.org by bu...@apache.org on 2017/02/03 04:13:08 UTC
[Bug 60685] New: Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685
Bug ID: 60685
Summary: Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
Product: POI
Version: unspecified
Hardware: PC
Status: NEW
Severity: normal
Priority: P2
Component: POI Overall
Assignee: dev@poi.apache.org
Reporter: sharathkumarmn@gmail.com
Target Milestone: ---
Created attachment 34710
--> https://bz.apache.org/bugzilla/attachment.cgi?id=34710&action=edit
test document
When I try to parse the attached .pub file, it fails with the exception below:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 88
at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:343)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:215)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:176)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit.createQCPLCBit(QCPLCBit.java:90)
at org.apache.poi.hpbf.model.QuillContents.<init>(QuillContents.java:71)
at org.apache.poi.hpbf.HPBFDocument.<init>(HPBFDocument.java:67)
at org.apache.poi.hpbf.extractor.PublisherTextExtractor.<init>(PublisherTextExtractor.java:45)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:141)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 28 more
--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
[Bug 60685] Unable to parse .pub files
-java.lang.ArrayIndexOutOfBoundsException: 88
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685
--- Comment #4 from Javen O'Neal <on...@apache.org> ---
Workaround applied in r1801405. Will be included in POI 3.17 beta 2.
Looking for any volunteers willing to experiment with the .pub format and
extend our documented understanding here:
https://poi.apache.org/hpbf/file-format.html
[Bug 60685] Unable to parse .pub files
-java.lang.ArrayIndexOutOfBoundsException: 88
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685
--- Comment #1 from Javen O'Neal <on...@apache.org> ---
Knowing nothing about the Compound File Binary Format (is this the same as or a
predecessor to OLE2 containers?) [1][2], here is what I could work out from a
hexdump of the attached file:
CHNKINK record offset = 0x8200
QC Bit offset = 0x8340 - 0x8200 = 0x0140
Annotated contents of data[offset:offset+24]:
         +0    | +2          | +6    | +8    | +10   | +12         | +16         | +20
         recID | thingType   | optA  | optB  | optC  | bitType     | from        | len
00008340 18 00 | 54 4f 4b 4e | 00 00 | 01 00 | 00 00 | 50 4c 43 20 | 32 62 00 00 | 58 00 00 00
data     QCBit | "TOKN"      | false | true  | false | "PLC "      | 0x6232      | 0x58 = 88 bytes
Location     Len  Hex Value                Field Meaning (Little Endian conv, ASCII, hex to dec, etc)
00008200+00  [8]  43 48 4e 4b 49 4e 4b 20  "CHNKINK "
...
00008340+00  [2]  18 00                    QC Bit recID
00008340+02  [4]  54 4f 4b 4e              thingType "TOKN"
00008340+06  [2]  00 00                    optA 0x0000 -> false
00008340+08  [2]  01 00                    optB 0x0001 -> true
00008340+10  [2]  00 00                    optC 0x0000 -> false
00008340+12  [4]  50 4c 43 20              bitType "PLC "
00008340+16  [4]  32 62 00 00              data from 0x6232, the byte offset from the beginning of the CHNKINK record at 0x8200
00008340+20  [4]  58 00 00 00              data len 0x58 = 88 bytes
...
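For anyone who wants to poke at other files, the 24-byte QC Bit header laid out above could be decoded roughly like this. This is only a sketch: QcBitHeader and its field names are made up for illustration and are not POI classes, and the field meanings are the guesses from the table above, not a documented layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of POI): decodes one 24-byte QC Bit header. */
final class QcBitHeader {
    final int recId;          // +0,  2 bytes
    final String thingType;   // +2,  4 ASCII bytes, e.g. "TOKN"
    final boolean optA;       // +6,  2 bytes, non-zero means true
    final boolean optB;       // +8,  2 bytes
    final boolean optC;       // +10, 2 bytes
    final String bitType;     // +12, 4 ASCII bytes, e.g. "PLC "
    final long dataFrom;      // +16, 4 bytes, offset from the start of CHNKINK
    final long dataLen;       // +20, 4 bytes, length of the bit's data

    private QcBitHeader(ByteBuffer b) {
        recId = b.getShort() & 0xFFFF;
        thingType = ascii(b, 4);
        optA = (b.getShort() & 0xFFFF) != 0;
        optB = (b.getShort() & 0xFFFF) != 0;
        optC = (b.getShort() & 0xFFFF) != 0;
        bitType = ascii(b, 4);
        dataFrom = b.getInt() & 0xFFFFFFFFL;
        dataLen = b.getInt() & 0xFFFFFFFFL;
    }

    /** Reads the header that starts at {@code offset}; all fields are little endian. */
    static QcBitHeader parse(byte[] data, int offset) {
        ByteBuffer b = ByteBuffer.wrap(data, offset, 24).order(ByteOrder.LITTLE_ENDIAN);
        return new QcBitHeader(b);
    }

    private static String ascii(ByteBuffer b, int length) {
        byte[] raw = new byte[length];
        b.get(raw);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The 24 bytes found at 0x8340 in the attached file.
        byte[] header = {
            0x18, 0x00, 0x54, 0x4f, 0x4b, 0x4e, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00,
            0x50, 0x4c, 0x43, 0x20, 0x32, 0x62, 0x00, 0x00, 0x58, 0x00, 0x00, 0x00
        };
        QcBitHeader h = parse(header, 0);
        long chnkinkStart = 0x8200;
        System.out.printf("%s/%s from=0x%x len=%d absolute=0x%x%n",
                h.thingType, h.bitType, h.dataFrom, h.dataLen, chnkinkStart + h.dataFrom);
    }
}
```

With the bytes from this file, dataFrom comes out as 0x6232 and dataLen as 88, so the bit's data sits at 0x8200 + 0x6232 = 0xe432 in the stream.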
And the raw QCPLCBit record at 0x8200+0x6232=0xe432:
0000e430        03 00 00 00 0c 00 00 00 ff ff 01 00 06 01  |  ..............|
0000e440  00 00 11 01 00 00 4e 07 00 00 5a 07 00 00 16 00  |......N...Z.....|
0000e450  00 00 00 22 00 06 00 00 01 22 09 00 00 00 02 22  |..."....."....."|
0000e460  07 00 00 00 0a 00 00 00 01 22 0f 00 00 00 0a 00  |........."......|
0000e470  00 00 01 22 0a 00 00 00 0a 00 00 00 00 22 ff ff  |..."........."..|
0000e480  ff ff 04 00 00 00 04 00 00 00                    |..........|
Interpreting the QCPLCBit:
0000e432+0  03 00 00 00  3                                       number of PLCs
0000e432+4  0c 00 00 00  Type12 (holds hyperlinks, complicated)  type of PLCs
...
The QC Bit header specifies the QC PLC Bit record has a length of 88 bytes.
The QCPLCBit specifies it contains 3 hyperlink PLCs (Type 12, 0x0c).
From how I interpret the current code, there's no way that 3 PLC hyperlinks can
be specified in 88 bytes.
> oneStartsAt = 0x4c
> twoStartsAt = 0x68
> threePlusIncrement = 22
Therefore the third PLC probably starts at 0x68+22=0x7e and ends at
0x68+22*2=0x94.
With 0x58=88 bytes, the record ends before the second PLC would even start
(0x68=104), so there aren't enough bytes for a second, let alone a third PLC.
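The arithmetic above can be checked with a few lines of Java. This is just a sketch of my reading of the offsets: PlcOffsetCheck and its method names are invented for illustration, and the assumption that every PLC after the second is spaced by threePlusIncrement comes from the variable names, not from any format documentation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical check (not POI code): where would each Type12 PLC start? */
final class PlcOffsetCheck {

    /** Start offsets of {@code count} PLCs, relative to the start of the QCPLCBit data. */
    static List<Integer> startOffsets(int count, int oneStartsAt, int twoStartsAt, int increment) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // First PLC has its own fixed offset; later ones step from the second.
            starts.add(i == 0 ? oneStartsAt : twoStartsAt + (i - 1) * increment);
        }
        return starts;
    }

    /** Number of PLCs whose start offset lies inside a record of the given length. */
    static int countStartingWithin(List<Integer> starts, int recordLength) {
        int count = 0;
        for (int start : starts) {
            if (start < recordLength) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int recordLength = 0x58; // 88 bytes, from the QC Bit header
        List<Integer> starts = startOffsets(3, 0x4c, 0x68, 22);
        for (int start : starts) {
            System.out.printf("PLC starts at 0x%x (%d): %s%n", start, start,
                    start < recordLength ? "inside record" : "past end of record");
        }
    }
}
```

Only the first offset (0x4c = 76) falls inside the 88-byte record; 0x68 = 104 and 0x7e = 126 are both past its end.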
I guess we'd have to consult [MS-CFB][2] to figure out whether this QCPLCBit
record really can be 88 bytes long, or whether the file is corrupt and Publisher
silently skips over reading these hyperlinks.
[1] https://en.wikipedia.org/wiki/Compound_File_Binary_Format
[2] https://msdn.microsoft.com/en-us/library/dd942138.aspx
[Bug 60685] Unable to parse .pub files
-java.lang.ArrayIndexOutOfBoundsException: 88
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685
Dominik Stadler <do...@gmx.at> changed:
What |Removed |Added
----------------------------------------------------------------------------
OS| |All
[Bug 60685] Unable to parse .pub files
-java.lang.ArrayIndexOutOfBoundsException: 88
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685
gaurav.chd3@gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |gaurav.chd3@gmail.com
[Bug 60685] Unable to parse .pub files
-java.lang.ArrayIndexOutOfBoundsException: 88
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685
Javen O'Neal <on...@apache.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|POI Overall |HPBF
--- Comment #3 from Javen O'Neal <on...@apache.org> ---
The Microsoft Publisher binary .pub format is undocumented, as indicated here:
https://poi.apache.org/hpbf/index.html
OpenOffice/LibreOffice doesn't have documentation or an open source application
that reads this .pub format, to my knowledge, so that means we'd have to resort
to figuring out the format through lots of hard work.
Assuming the file you have provided is valid (opens without warnings or errors
in Microsoft Publisher), if you're mostly interested in text extraction, then
skipping over this hyperlink is probably preferable over throwing an exception.
We can log the error that we catch and move forward with extraction.
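Roughly, the skip-and-log idea would look like the sketch below. To be clear, this is not the actual POI change: LenientBitReader and parseAll are hypothetical names, and it uses java.util.logging only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch (not POI code): parse what we can, log and skip the rest. */
final class LenientBitReader {
    private static final Logger LOG = Logger.getLogger(LenientBitReader.class.getName());

    /**
     * Applies {@code parser} to every record. A record that makes the parser read
     * past the end of its data is logged and skipped instead of aborting the run.
     */
    static <T> List<T> parseAll(List<byte[]> records, Function<byte[], T> parser) {
        List<T> parsed = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            try {
                parsed.add(parser.apply(records.get(i)));
            } catch (ArrayIndexOutOfBoundsException e) {
                LOG.log(Level.WARNING, "Skipping record " + i + ": data shorter than expected", e);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        records.add(new byte[] {1, 2, 3, 4}); // long enough for the toy parser
        records.add(new byte[] {1});          // too short, will be skipped
        // Toy parser that needs at least three bytes per record.
        List<Integer> values = parseAll(records, data -> Integer.valueOf(data[2]));
        System.out.println("Parsed " + values.size() + " of " + records.size() + " records");
    }
}
```

The trade-off is that a skipped hyperlink bit is lost silently unless someone reads the log, which seems acceptable when the goal is text extraction.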
[Bug 60685] Unable to parse .pub files
-java.lang.ArrayIndexOutOfBoundsException: 88
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685
--- Comment #2 from Javen O'Neal <on...@apache.org> ---
The last real change to HPBF hyperlink support was nearly 9 years ago, and
even then the commit message indicated only partial hyperlink support. So it's
quite likely that we haven't fully implemented all hyperlink variations.
I expected to see some hyperlink URL as a string in the hexdump, but perhaps
this is a hyperlink to another element within the document.
Nonetheless, there are some nuggets of insight in the comments and variable
names to figure out what's going on in this QC PLC hyperlink bit.
https://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hpbf/model/qcbits/QCPLCBit.java?r1=690729&r2=690534
[Bug 60685] Unable to parse .pub files
-java.lang.ArrayIndexOutOfBoundsException: 88
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685
--- Comment #5 from Tim Allison <ta...@mitre.org> ---
Looks like we have ~8500 publisher files in our regression corpus if those
would be of any interest. Some are likely truncated...so it goes w Common
Crawl.