Posted to dev@poi.apache.org by bu...@apache.org on 2011/10/04 02:27:55 UTC

DO NOT REPLY [Bug 51946] New: [BUG] TextPieceTable ArrayIndexOutOfBoundsException and IllegalStateException - Hong Kong encoding?

https://issues.apache.org/bugzilla/show_bug.cgi?id=51946

             Bug #: 51946
           Summary: [BUG] TextPieceTable <init>
                    ArrayIndexOutOfBoundsException and
                    IllegalStateException - Hong Kong encoding?
           Product: POI
           Version: 3.8-dev
          Platform: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: hockey69guy@yahoo.com
    Classification: Unclassified


I am unable to include a sample document because of its sensitive nature.

If there are any pointers to utilities that can further investigate the
documents, let me know and I'll see what further information I can supply.

A few of my documents trigger an arraycopy with a length that's greater than
the amount remaining in the stream buffer.  The files open successfully in
Word 2010 and may be older than Word 97 documents.  The documents likely use
an encoding from the Hong Kong region.
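
To make the failure mode concrete, here is a minimal sketch of a copy that asks for more bytes than the buffer still holds, next to a bounds-checked variant. The class BoundedCopy and the helper copyAvailable are hypothetical names used only for illustration; they are not part of POI, and this is not a proposed patch for TextPieceTable.

```java
class BoundedCopy {
    /**
     * Copies up to length bytes from src starting at offset, but never
     * reads past the end of src. Hypothetical helper, not a POI API.
     */
    static byte[] copyAvailable(byte[] src, int offset, int length) {
        if (offset < 0 || length < 0 || offset > src.length) {
            throw new IllegalArgumentException(
                    "bad offset/length: " + offset + "/" + length);
        }
        int available = src.length - offset;      // bytes left in the buffer
        int toCopy = Math.min(length, available); // clamp the request
        byte[] dest = new byte[toCopy];
        System.arraycopy(src, offset, dest, 0, toCopy);
        return dest;
    }

    public static void main(String[] args) {
        byte[] stream = new byte[100];
        try {
            // Unchecked: requests 20 bytes when only 10 remain after offset 90.
            System.arraycopy(stream, 90, new byte[20], 0, 20);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked copy failed: " + e);
        }
        // Checked: returns only the 10 bytes that are actually there.
        System.out.println(copyAvailable(stream, 90, 20).length);
    }
}
```

Clamping silently truncates the piece, so a real fix would probably also want to log or report the shortfall rather than hide it.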


A couple produce the following stack trace (daily build):
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:108)
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:71)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)



More than a handful are caught earlier on and produce this stack trace:
Caused by: java.lang.IllegalStateException: Told we're for characters 0 -> 6385, but actually covers 6373 characters!
    at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73)
    at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:115)
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:71)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
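
The exception above reports a declared range of 6385 characters against 6373 actually decoded, a shortfall of 12. One possible explanation, and only a guess in line with the "Hong Kong encoding?" question in the summary, is that a variable-width encoding makes the byte count and the character count disagree. The sketch below shows that effect in isolation. PieceLengthCheck and shortfall are hypothetical names, not POI code, and UTF-8 is used purely as a stand-in because it is guaranteed to be present in every JDK; the affected files would more likely use a double-byte charset such as Big5.

```java
import java.nio.charset.StandardCharsets;

class PieceLengthCheck {
    /** Declared character count minus decoded count; 0 means consistent. */
    static int shortfall(int start, int end, CharSequence decoded) {
        return (end - start) - decoded.length();
    }

    public static void main(String[] args) {
        // Four ASCII letters plus two CJK characters (3 bytes each in UTF-8).
        byte[] raw = "abcd\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        String decoded = new String(raw, StandardCharsets.UTF_8);

        // A table that counted one character per byte would declare 0 -> 10.
        int declaredEnd = raw.length;
        System.out.println("declared " + declaredEnd
                + ", decoded " + decoded.length()
                + ", shortfall " + shortfall(0, declaredEnd, decoded));
    }
}
```

Here 10 bytes decode to 6 characters, so the reported shortfall is 4. Whether the same mechanism is behind the 12-character gap in these documents is unconfirmed without a sample file.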

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 51946] [BUG] TextPieceTable ArrayIndexOutOfBoundsException and IllegalStateException - Hong Kong encoding?

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51946

--- Comment #2 from Jeremy <ho...@yahoo.com> 2011-10-04 20:00:26 UTC ---
I'm currently using a nightly build for pretty much all of my investigation,
and have actually had a bit of luck with getting improvements submitted.

The problem with many of these documents is that they come from older versions
of Word, likely from 1995-2001.  They may also have originated in Asian
countries.

The files aren't so corrupt that Word 2010 can't open them... but that's not
saying too much.  I've encountered numerous header signature issues, which I'm
kind of avoiding altogether since the largest % are from ~based files...
though a few are able to be opened by Word.

I'll take a look at using the validator on a few of the files and see what I
get in the next few days.


BTW, thanks Nick for the help on the Outlook issue #51873 a week ago.  If you
get a chance, can you revisit my final msg there?  There was a small bug in
the patch you placed into the trunk for me.


Thanks again.


(In reply to comment #1)
> You can use the Binary File Format Validator to check files are valid, see
> http://poi.apache.org/faq.html#faq-N10109
> Also, have you tried with a recent svn checkout / recent nightly build?



DO NOT REPLY [Bug 51946] [BUG] TextPieceTable ArrayIndexOutOfBoundsException and IllegalStateException - Hong Kong encoding?

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51946

Jeremy <rp...@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #3 from Jeremy <rp...@yahoo.com> 2011-12-27 17:06:08 UTC ---
Added link to bug 52349



[Bug 51946] [BUG] TextPieceTable ArrayIndexOutOfBoundsException and IllegalStateException - Hong Kong encoding?

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51946

Sergey Vladimirov <vl...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #4 from Sergey Vladimirov <vl...@gmail.com> ---
This is a duplicate of Bug 50955

*** This bug has been marked as a duplicate of bug 50955 ***



DO NOT REPLY [Bug 51946] [BUG] TextPieceTable ArrayIndexOutOfBoundsException and IllegalStateException - Hong Kong encoding?

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51946

Nick Burch <ni...@alfresco.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
         OS/Version|                            |All

--- Comment #1 from Nick Burch <ni...@alfresco.com> 2011-10-04 08:52:55 UTC ---
You can use the Binary File Format Validator to check files are valid, see
http://poi.apache.org/faq.html#faq-N10109

Also, have you tried with a recent svn checkout / recent nightly build?
