Posted to dev@poi.apache.org by bu...@apache.org on 2012/03/09 03:14:30 UTC

DO NOT REPLY [Bug 52863] New: java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

             Bug #: 52863
           Summary: java.lang.ArrayIndexOutOfBoundsException in
                    org.apache.poi.hwpf.sprm.SprmOperation.initSize
           Product: POI
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: HWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: simonsharry@gmail.com
    Classification: Unclassified


1. When converting a batch of Microsoft Word documents using the command

    java -jar tika-app-1.1-SNAPSHOT.jar -v -t

I get the following exception. The same happens with the Tika 1.1 release
candidate.

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5d3ac0
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI$TikaServer$1.run(TikaCLI.java:735)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 487
    at org.apache.poi.hwpf.sprm.SprmOperation.initSize(SprmOperation.java:174)
    at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:80)
    at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:48)
    at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:67)
    at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:103)
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:943)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:146)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:97)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 4 more

A user, Nick Burch, has advised me to raise this as a POI bug.

2. Here's the output of the BFF Validator tool:

<BFFValidation path="failing.doc" datetime="03/08/12 07:14:27" result="FAILED">
<ParseStack>
<Type builtinType="Docfile" docName="MS-DOC" sectionTitle="File Structure"
msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737">
<Info>Built-in type "Docfile": The root storage object of an OLE compound file.
For more information, see
http://msdn.microsoft.com/en-us/library/dd942138.aspx.</Info>
</Type>
<Type builtinType="Stream" docName="MS-DOC" sectionTitle="File Structure"
msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737"
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0">
<Info>Built-in type "Stream": Any stream object for OLE compound files. The
entire file contents for other files.</Info>
</Type>
<Type docName="MS-DOC" sectionTitle="Fib" sectionNumber="2.5.1"
msdnLink="http://msdn.microsoft.com/en-us/library/9AEAA2E7-4A45-468E-AB13-3F6193EB9394"
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
<Type docName="MS-DOC" sectionTitle="FibBase" sectionNumber="2.5.2"
msdnLink="http://msdn.microsoft.com/en-us/library/26FB6C06-4E5C-4778-AB4E-EDBF26A545BB"
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
<Type builtinType="USHORT" streamName="WordDocument" bitfield="True"
bitOffsetWithinStruct="84" hexBitOffsetWithinStruct="0x54" bitCount="4"
streamOffsetOfStruct="0" hexStreamOffsetOfStruct="0x0" streamOffset="10"
hexStreamOffset="0xa" childId="10" hexChildId="0xa">
<Info>Built-in type "USHORT": Unsigned 2-byte integer.</Info>
</Type>
</ParseStack>
<LastData><![CDATA[
EC A5 01 01 4D 20 09 04  00 00 08 12 BF 00 00 00  ....M...........
00 00 00 30 00 00 00 00  00 08 00 00 66 EF 00 00  ...0........f...
]]></LastData>
</BFFValidation>
--------------------------------------------
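
For readers who want to see what the validator stopped on, here is a minimal
sketch that decodes the first 12 bytes of the LastData dump above. The field
names (wIdent, nFib, lid, cQuickSaves) and their offsets are my reading of the
FibBase layout in MS-DOC section 2.5.2, and the class name is made up for this
example. The validator reports a 4-bit field at bit offset 84, i.e. bits 4..7
of the 16-bit word at byte offset 10.

```java
public class FibBasePeek {
    // First 12 bytes of the LastData dump from the validator output.
    static final byte[] HEAD = {
        (byte) 0xEC, (byte) 0xA5, 0x01, 0x01, 0x4D, 0x20,
        0x09, 0x04, 0x00, 0x00, 0x08, 0x12
    };

    // Reads an unsigned little-endian 16-bit value.
    static int u16(byte[] d, int off) {
        return (d[off] & 0xFF) | ((d[off + 1] & 0xFF) << 8);
    }

    // Bits 4..7 of the flag word at byte offset 10 (bit offset 84, bit count 4).
    static int quickSaves(byte[] d) {
        return (u16(d, 10) >> 4) & 0xF;
    }

    public static void main(String[] args) {
        System.out.printf("wIdent      = 0x%04X%n", u16(HEAD, 0)); // assumed field name
        System.out.printf("nFib        = 0x%04X%n", u16(HEAD, 2)); // assumed field name
        System.out.printf("lid         = 0x%04X%n", u16(HEAD, 6)); // assumed field name
        System.out.printf("cQuickSaves = 0x%X%n", quickSaves(HEAD));
    }
}
```

If I read MS-DOC correctly, cQuickSaves is expected to be 0xF once nFib is
0x00D9 or later, which would explain why the validator fails here. Treat that
as an assumption to check against the specification, not as a diagnosis of the
exception above.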

I would greatly appreciate a timely fix, as I have 2000+ documents that
POI/Tika fail on. I cannot proceed any further.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

HarrySimons <si...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #2 from HarrySimons <si...@gmail.com> 2012-03-09 12:14:24 UTC ---
Unfortunately, this is a classified document.



[Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

mseele@guh-software.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mseele@guh-software.de



[Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

Sergey Vladimirov <vl...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #6 from Sergey Vladimirov <vl...@gmail.com> ---
Since the problem is with the original file (i.e. its structure is broken), I'm
closing this bug as WONTFIX.

However, a workaround will be added in trunk to skip the problematic SPRMs. I
cannot guarantee that the file will be processed correctly after such errors,
but it is worth a try.
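
To illustrate what "skipping the problematic SPRMs" could look like, here is a
minimal, hypothetical sketch of a bounds-checked walk over a SPRM buffer. It is
not POI's SprmOperation or SprmIterator code; the class and method names are
invented. It assumes the usual MS-DOC encoding in which the top three bits of
the 16-bit opcode select the operand size, and it deliberately ignores the few
opcodes that use a different length rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual POI implementation. */
public class SprmScanSketch {

    /** Operand size in bytes for the SPRM at offset, or -1 if the buffer ends too early. */
    static int operandSize(byte[] grpprl, int offset) {
        if (offset + 2 > grpprl.length) return -1;       // opcode itself is truncated
        int opcode = (grpprl[offset] & 0xFF) | ((grpprl[offset + 1] & 0xFF) << 8);
        int spra = (opcode >> 13) & 0x7;                 // top three bits of the opcode
        switch (spra) {
            case 0: case 1: return 1;
            case 2: case 4: case 5: return 2;
            case 3: return 4;
            case 7: return 3;
            default:                                     // spra == 6: length-prefixed operand
                if (offset + 3 > grpprl.length) return -1; // length byte is missing
                return 1 + (grpprl[offset + 2] & 0xFF);
        }
    }

    /** Offsets of SPRMs that fit entirely in the buffer; stops at the first that does not. */
    static List<Integer> scan(byte[] grpprl) {
        List<Integer> offsets = new ArrayList<>();
        int pos = 0;
        while (pos < grpprl.length) {
            int size = operandSize(grpprl, pos);
            if (size < 0 || pos + 2 + size > grpprl.length) {
                break;                                   // truncated SPRM: skip the rest
            }
            offsets.add(pos);
            pos += 2 + size;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Opcode 0x2403 with a 1-byte operand, then opcode 0x840F missing one operand byte.
        byte[] truncated = {0x03, 0x24, 0x01, 0x0F, (byte) 0x84, 0x00};
        System.out.println(scan(truncated));
    }
}
```

The design choice here is to stop at the first SPRM that does not fit instead
of throwing, which matches the caveat above: whatever was read before the
damage is kept, and nothing after it can be trusted.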



DO NOT REPLY [Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

Sepp <se...@nightmail.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sepp@nightmail.ru



DO NOT REPLY [Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

Yegor Kozlov <ye...@dinom.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
           Severity|blocker                     |normal

--- Comment #1 from Yegor Kozlov <ye...@dinom.ru> 2012-03-09 11:33:23 UTC ---
Can you upload a failing document? 

Yegor



DO NOT REPLY [Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

--- Comment #4 from HarrySimons <si...@gmail.com> 2012-03-10 01:33:39 UTC ---
> Do you know the origin of these failing
> docs? Were they created by MS Word or
> by OpenOffice or by what?

They were created by a post-2003 and pre-2007 version of MS Word. 


> Without a sample file we can't do much.

The document's very name is 'Business Intelligence', so you can imagine my
difficulty. The other failing documents are similarly sensitive. I thought I
could remove the sensitive parts of this document and then upload it for the
Tika/POI developers. But merely re-saving the document in Word 2007 (i.e.,
without any edits whatsoever) makes the problem mostly go away. I say 'mostly'
because, while Tika/POI can then extract the text, they also append text like
this to the output:

_-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]

_-1388203796/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388843352/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388845272/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388297360/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388297680/ole-[42, 4D, D6, 09, 00, 00, 00, 00]

_-1388296720/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388203476/ole-[42, 4D, 66, 09, 00, 00, 00, 00]

_-1382869532/ole-[42, 4D, 36, 0C, 00, 00, 00, 00]

_-1388200596/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388200916/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1383036196/ole-[42, 4D, 12, 09, 00, 00, 00, 00]

_-1382867932/ole-[42, 4D, 86, 0A, 00, 00, 00, 00]

_-1382868252/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1380808936/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
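
A side note on the byte lists above: 42 4D is ASCII "BM", the signature that
starts a Windows bitmap file, and in that format the next four bytes hold the
file size in little-endian order. So these lines may be the headers of embedded
bitmaps; that is a guess from the bytes alone. A small sketch that checks the
signature and reads the declared size from one such line (class and method
names are made up for this example):

```java
public class OleBytesPeek {

    /** Parses a list such as "42, 4D, 0E, 0A, 00, 00, 00, 00" into byte values 0..255. */
    static int[] parse(String hexList) {
        String[] parts = hexList.split(",");
        int[] out = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i].trim(), 16);
        }
        return out;
    }

    /** True if the data starts with 'B' 'M'. */
    static boolean hasBmpSignature(int[] b) {
        return b.length >= 2 && b[0] == 0x42 && b[1] == 0x4D;
    }

    /** Little-endian 32-bit size stored in bytes 2..5; needs at least 6 bytes. */
    static long declaredSize(int[] b) {
        return (long) b[2] | ((long) b[3] << 8) | ((long) b[4] << 16) | ((long) b[5] << 24);
    }

    public static void main(String[] args) {
        int[] b = parse("42, 4D, 0E, 0A, 00, 00, 00, 00");
        System.out.println("BMP signature: " + hasBmpSignature(b));
        System.out.println("declared size: " + declaredSize(b) + " bytes");
    }
}
```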


Being a developer myself, I am fully aware how hard it can be to fix (certain)
bugs without appropriate test input. I will watch out for newer releases.



DO NOT REPLY [Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

--- Comment #3 from Yegor Kozlov <ye...@dinom.ru> 2012-03-09 12:17:03 UTC ---
Do you know the origin of these failing docs? Were they created by MS Word or
by OpenOffice or by what?

Without a sample file we can't do much.

Yegor

(In reply to comment #2)
> Unfortunately, this is a classified document.



DO NOT REPLY [Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

HarrySimons <si...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |simonsharry@gmail.com



DO NOT REPLY [Bug 52863] java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

--- Comment #5 from Sepp <se...@nightmail.ru> 2012-04-06 16:23:22 UTC ---
Created attachment 28554
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=28554
The same problem with MS PowerPoint files

Hi *,

I have the same problem with tika-app-1.1.jar and MS PowerPoint files. The zip
archive contains 2 PPT files. Tika.ppt is the "old" file, which cannot be
converted and fails with this error message:

System.ApplicationException : Extraction of text from the file 'Tika.ppt' failed.
  ----> org.apache.tika.exception.TikaException : Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2a784f5
  ----> java.lang.ArrayIndexOutOfBoundsException :
at TikaOnDotNet.TextExtractor.Extract(String filePath) in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\TextExtractor.cs:line 63
at TikaOnDotNet.tikadriver_examples.should_extract_from_ppt() in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\tikadriver_examples.cs:line 104
--TikaException
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at TikaOnDotNet.TextExtractor.Extract(String filePath) in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\TextExtractor.cs:line 55
--ArrayIndexOutOfBoundsException
at IKVM.Runtime.ByteCodeHelper.arraycopy_primitive_1(Array src, Int32 srcStart, Array dest, Int32 destStart, Int32 len)
at org.apache.poi.util.LittleEndian.getByteArray(Byte[] data, Int32 offset, Int32 size)
at org.apache.poi.hpsf.UnicodeString..ctor(Byte[] , Int32 )
at org.apache.poi.hpsf.TypedPropertyValue.readValue(Byte[] , Int32 )
at org.apache.poi.hpsf.Vector.read(Byte[] , Int32 )
at org.apache.poi.hpsf.TypedPropertyValue.readValue(Byte[] , Int32 )
at org.apache.poi.hpsf.VariantSupport.read(Byte[] src, Int32 offset, Int32 length, Int64 type, Int32 codepage)
at org.apache.poi.hpsf.Property..ctor(Int64 id, Byte[] src, Int64 offset, Int32 length, Int32 codepage)
at org.apache.poi.hpsf.Section..ctor(Byte[] src, Int32 offset)
at org.apache.poi.hpsf.PropertySet.init(Byte[] , Int32 , Int32 )
at org.apache.poi.hpsf.PropertySet..ctor(InputStream stream)
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(DirectoryNode , String )
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(DirectoryNode )
at org.apache.tika.parser.microsoft.OfficeParser.parse(DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
at org.apache.tika.parser.microsoft.OfficeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

The second file, Tika_new.ppt, is the same file re-saved with MS PowerPoint
2010 (File -> Save as...); it can be converted without any problems.

With tika-app-0.9.jar the file Tika.ppt can be converted too, so is the error
in the new version, tika-app-1.1.jar?

Thank you
Sepp
