Posted to dev@poi.apache.org by bu...@apache.org on 2006/11/29 14:44:46 UTC

DO NOT REPLY [Bug 41076] New: - StringIndexOutOfBoundsException when extracting text from a Word document.

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=41076>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=41076

           Summary: StringIndexOutOfBoundsException when extracting text
                    from a Word document.
           Product: POI
           Version: 3.0-dev
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: critical
          Priority: P1
         Component: POI Overall
        AssignedTo: poi-dev@jakarta.apache.org
        ReportedBy: bjorn.wang@creuna.no


I use POI through Nutch.

Many Word documents cause the following error when being parsed for text extraction:
Can't be handled as Microsoft document.
java.lang.StringIndexOutOfBoundsException: String index out of range: -520

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
Mailing List:    http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta POI Project: http://jakarta.apache.org/poi/


DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.

Posted by bu...@apache.org.


bjorn.wang@creuna.no changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |http://marc.theaimsgroup.com
                   |                            |/?l=poi-
                   |                            |user&m=110183472231615&w=2






DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.

Posted by bu...@apache.org.





------- Additional Comments From stephen.polyak@pearson.com  2007-03-19 15:06 -------
Is this fixed in poi-bin-3.0-alpha3-20061212.zip? I just applied these jars and
I still see the same problem.



DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.

Posted by bu...@apache.org.





------- Additional Comments From stephen.polyak@pearson.com  2007-03-21 09:56 -------
Created an attachment (id=19768)
 --> (http://issues.apache.org/bugzilla/attachment.cgi?id=19768&action=view)
Here is a proposed fix to this issue. 

It simply catches the index out of bounds exception on the substring method
call and returns an empty string in that scenario.
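
The approach described above could look roughly like the following. This is
only a sketch and is not the code from attachment 19768: the class name
SafeSubstring and the method substringOrEmpty are hypothetical, and the real
patch may guard a different call site.

```java
// Hypothetical helper illustrating the "catch and return empty" idea.
// It is not taken from the attached patch.
final class SafeSubstring {
    private SafeSubstring() {}

    /** Returns text.substring(start, end), or "" if the indices do not fit the text. */
    static String substringOrEmpty(String text, int start, int end) {
        try {
            return text.substring(start, end);
        } catch (StringIndexOutOfBoundsException e) {
            // The offsets are out of range for this text. Drop this fragment
            // instead of failing the whole extraction.
            return "";
        }
    }

    public static void main(String[] args) {
        // Indices within range: prints "Hello"
        System.out.println(substringOrEmpty("Hello, world", 0, 5));
        // Indices out of range: the helper returns "", so this prints "true"
        System.out.println(substringOrEmpty("Hello, world", 600, 80).isEmpty());
    }
}
```

Note that a guard like this hides the symptom only: text belonging to the
affected fragment is silently dropped, and the offsets that produced the bad
indices are left as they are.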



DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.

Posted by bu...@apache.org.


eporter@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #19768|0                           |1
        is obsolete|                            |




------- Additional Comments From eporter@gmail.com  2007-03-26 10:45 -------
Created an attachment (id=19798)
 --> (http://issues.apache.org/bugzilla/attachment.cgi?id=19798&action=view)
A proposed fix which rewrites the loops

The code gets a List of text runs and a List of text pieces.  The existing code
assumes that each text piece starts where the previous piece ends, and it fails
when that is not the case.  This assumption is made in several places.
My proposed patch rewrites the loop to make the code smaller and simpler.  It
makes the first proposed patch obsolete, because the
StringIndexOutOfBoundsException won't happen anymore.
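
To illustrate the kind of loop that does not depend on pieces being
contiguous, here is a minimal sketch. It is not the code from attachment
19798, and it does not use the POI classes: PieceLookup, Piece and textForRun
are hypothetical names, and the offsets are assumed to be character positions
in the document. Each piece's text is indexed relative to that piece's own
start, so a gap between two pieces cannot produce a negative index.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of text pieces and runs, for illustration only.
final class PieceLookup {
    private PieceLookup() {}

    /** A text piece covering document offsets [start, end) and holding that text. */
    static final class Piece {
        final int start;
        final int end;
        final String text;

        Piece(int start, String text) {
            this.start = start;
            this.end = start + text.length();
            this.text = text;
        }
    }

    /** Collects the text in [runStart, runEnd), tolerating gaps between pieces. */
    static String textForRun(List<Piece> pieces, int runStart, int runEnd) {
        StringBuilder out = new StringBuilder();
        for (Piece p : pieces) {
            // Intersect the run with this piece only.
            int from = Math.max(runStart, p.start);
            int to = Math.min(runEnd, p.end);
            if (from < to) {
                // Offsets are relative to this piece's start, not to the
                // end of the previous piece.
                out.append(p.text, from - p.start, to - p.start);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The second piece starts at 8 although the first ends at 5.
        List<Piece> pieces = Arrays.asList(new Piece(0, "Hello"), new Piece(8, "world"));
        System.out.println(textForRun(pieces, 3, 10)); // prints "lowo"
    }
}
```

A run that falls entirely inside a gap simply yields an empty string here,
without any exception handling.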



DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.

Posted by bu...@apache.org.


pkleszczewski@infovidematrix.pl changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED






DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.

Posted by bu...@apache.org.





------- Additional Comments From nick@torchbox.com  2007-03-29 03:49 -------
I might be being stupid, but I can't actually figure out what file the most
recent patch applies to...

The patch header refers to WordExtractor.java, but the code doesn't look
anything like org.apache.poi.hwpf.extractor.WordExtractor.



DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.

Posted by bu...@apache.org.





------- Additional Comments From bjorn.wang@creuna.no  2006-11-29 05:46 -------
Created an attachment (id=19200)
 --> (http://issues.apache.org/bugzilla/attachment.cgi?id=19200&action=view)
Simplest possible testcase showing the StringIndexOutOfBoundsException 

