Posted to dev@poi.apache.org by bu...@apache.org on 2006/11/29 14:44:46 UTC
DO NOT REPLY [Bug 41076] New: - StringIndexOutOfBoundsException when extracting text from a Word document.
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=41076>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
http://issues.apache.org/bugzilla/show_bug.cgi?id=41076
Summary: StringIndexOutOfBoundsException when extracting text
from a Word document.
Product: POI
Version: 3.0-dev
Platform: Other
OS/Version: other
Status: NEW
Severity: critical
Priority: P1
Component: POI Overall
AssignedTo: poi-dev@jakarta.apache.org
ReportedBy: bjorn.wang@creuna.no
I use POI through Nutch.
Many Word documents cause the following error when being parsed for text extraction:
Can't be handled as Microsoft document.
java.lang.StringIndexOutOfBoundsException: String index out of range: -520
--
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
---------------------------------------------------------------------
To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta POI Project: http://jakarta.apache.org/poi/
DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.
Posted by bu...@apache.org.
bjorn.wang@creuna.no changed:
What |Removed |Added
----------------------------------------------------------------------------
URL| |http://marc.theaimsgroup.com/?l=poi-user&m=110183472231615&w=2
DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.
Posted by bu...@apache.org.
------- Additional Comments From stephen.polyak@pearson.com 2007-03-19 15:06 -------
Is this fixed in poi-bin-3.0-alpha3-20061212.zip? I just applied these jars and
I still see the same problem.
DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.
Posted by bu...@apache.org.
------- Additional Comments From stephen.polyak@pearson.com 2007-03-21 09:56 -------
Created an attachment (id=19768)
--> (http://issues.apache.org/bugzilla/attachment.cgi?id=19768&action=view)
Here is a proposed fix to this issue.
It simply catches the index out of bounds exception on the substring method
call and returns an empty string in that scenario.
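
For readers without access to the attachment, the approach described above can be sketched roughly as follows. This is not the attached patch: the class name SafeSubstringSketch and the helper safeSubstring are hypothetical, and the only library call relied on is String.substring(int, int), which throws StringIndexOutOfBoundsException when the indices are out of range or reversed.

```java
// Minimal sketch of "catch the exception and return an empty string".
// SafeSubstringSketch and safeSubstring are hypothetical names, not POI or Nutch code.
class SafeSubstringSketch {

    // Returns text.substring(start, end), or "" if the indices are invalid.
    static String safeSubstring(String text, int start, int end) {
        try {
            return text.substring(start, end);
        } catch (StringIndexOutOfBoundsException e) {
            // Invalid range (for example end < start): give up on this span
            // instead of aborting the whole extraction.
            return "";
        }
    }

    public static void main(String[] args) {
        // Valid range: prints the requested slice.
        System.out.println(safeSubstring("Hello, Word", 0, 5));
        // Reversed, out-of-range indices: prints an empty pair of brackets.
        System.out.println("[" + safeSubstring("Hello, Word", 600, 80) + "]");
    }
}
```

Note that this kind of guard hides the symptom: text belonging to the affected span is silently dropped rather than recovered.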
DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.
Posted by bu...@apache.org.
eporter@gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #19768 is obsolete|0 |1
------- Additional Comments From eporter@gmail.com 2007-03-26 10:45 -------
Created an attachment (id=19798)
--> (http://issues.apache.org/bugzilla/attachment.cgi?id=19798&action=view)
A proposed fix which rewrites the loops
The code gets a List of text runs and a List of text pieces. The existing code
fails when the start of one text piece is not the same as the end of the
previous piece. That assumption is made in several places.
My proposed patch rewrites the loop to make the code smaller and simpler. The
first proposed patch is made obsolete by this patch because the
StringIndexOutOfBoundsException won't happen anymore.
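
To illustrate the failure mode described above, here is a rough sketch of a loop that does not assume text pieces are contiguous. It is not the attached patch and does not use the real POI or Nutch classes: Piece, textForRange, and the half-open [start, end) character ranges are all assumptions made for illustration. Each run range is intersected with each piece, so a gap between pieces simply contributes no text instead of producing a negative substring index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; Piece and textForRange are illustrative names only.
class PieceOverlapSketch {

    // Stand-in for a text piece covering document positions [start, end).
    // Assumes text.length() == end - start.
    static final class Piece {
        final int start;
        final int end;
        final String text;

        Piece(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    // Collects the text of a run covering [runStart, runEnd) from all pieces.
    static String textForRange(List<Piece> pieces, int runStart, int runEnd) {
        StringBuilder out = new StringBuilder();
        for (Piece p : pieces) {
            // Clamp the run to this piece; no assumption about the previous piece.
            int from = Math.max(runStart, p.start);
            int to = Math.min(runEnd, p.end);
            if (from < to) {
                // Convert document positions to offsets inside the piece's text.
                out.append(p.text, from - p.start, to - p.start);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Piece> pieces = new ArrayList<>();
        pieces.add(new Piece(0, 5, "Hello"));
        pieces.add(new Piece(8, 13, "World")); // positions 5..8 belong to no piece
        // The run spans the gap; only the overlapping parts are appended.
        System.out.println(textForRange(pieces, 3, 10));
    }
}
```

Because every offset is derived from a clamped overlap, the indices passed to append always fall inside the piece's own text, whatever the layout of the pieces.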
DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.
Posted by bu...@apache.org.
pkleszczewski@infovidematrix.pl changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.
Posted by bu...@apache.org.
------- Additional Comments From nick@torchbox.com 2007-03-29 03:49 -------
I might be being stupid, but I can't actually figure out what file the most
recent patch applies to...
The patch header refers to WordExtractor.java, but the code doesn't look
anything like org.apache.poi.hwpf.extractor.WordExtractor
DO NOT REPLY [Bug 41076] - StringIndexOutOfBoundsException when extracting text from a Word document.
Posted by bu...@apache.org.
------- Additional Comments From bjorn.wang@creuna.no 2006-11-29 05:46 -------
Created an attachment (id=19200)
--> (http://issues.apache.org/bugzilla/attachment.cgi?id=19200&action=view)
Simplest possible testcase showing the StringIndexOutOfBoundsException