Posted to dev@pdfbox.apache.org by "Arjohn Kampman (JIRA)" <ji...@apache.org> on 2013/10/28 13:36:34 UTC

[jira] [Comment Edited] (PDFBOX-1607) StringIndexOutOfBoundsException in PDFParser

    [ https://issues.apache.org/jira/browse/PDFBOX-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806720#comment-13806720 ] 

Arjohn Kampman edited comment on PDFBOX-1607 at 10/28/13 12:35 PM:
-------------------------------------------------------------------

Unfortunately, using the non-sequential parser is not an option for us yet, since it has other parsing problems. So I investigated this parsing problem today.

First of all, this problem was introduced in svn revision 1451638 as part of PDFBOX-1513. Looking at the changes that were made to {{BaseParser}} in that revision, I fail to see how {{sBuf}} is related to the length of {{strmBuf}} in this line:

{{sBuf.deleteCharAt(strmBuf.length-1);}}

This looks like a genuine bug to me. The intention of this line was probably to discard the last character if the buffer contains an odd number of hexadecimal digits. If this line is fixed, the problematic documents parse successfully, albeit with an error being logged. That error is the result of the {{wasLastParsedObjectEOF}} flag in {{PDFParser.parse()}} being reset to false.
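
To illustrate the first issue, here is a minimal standalone sketch. It is not the attached patch and not the actual {{BaseParser}} code. The class name, the helper methods, and the 2048-element size of {{strmBuf}} are assumptions made for illustration; that size would be consistent with the index 2047 in the reported stack trace. The sketch contrasts indexing by the builder's own length with indexing by the length of an unrelated array:

```java
public class HexTrimSketch {
    // Hypothetical stand-in for the parser's stream buffer; the size is assumed.
    static final byte[] strmBuf = new byte[2048];

    // Intended behaviour: drop the last hex digit when the digit count is odd,
    // using the StringBuilder's own length as the index base.
    static String trimToHexPairs(String hexDigits) {
        StringBuilder sBuf = new StringBuilder(hexDigits);
        if (sBuf.length() % 2 != 0) {
            sBuf.deleteCharAt(sBuf.length() - 1);
        }
        return sBuf.toString();
    }

    // Faulty indexing as quoted in the comment: the index comes from strmBuf,
    // so it is out of range whenever sBuf holds fewer than 2048 characters.
    static boolean buggyTrimThrows(String hexDigits) {
        StringBuilder sBuf = new StringBuilder(hexDigits);
        try {
            sBuf.deleteCharAt(strmBuf.length - 1);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(trimToHexPairs("48656C6C6"));  // odd count: last digit dropped
        System.out.println(buggyTrimThrows("48656C6C6")); // short buffer: index out of range
    }
}
```

With a short hex string the faulty variant throws, while the length-based variant leaves complete digit pairs only.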

The attached patch fixes both issues. That patch is based on today's trunk code. Please consider applying this patch.


> StringIndexOutOfBoundsException in PDFParser
> --------------------------------------------
>
>                 Key: PDFBOX-1607
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1607
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.1
>         Environment: Windows 7, JRE 1.7.0_15-b03
>            Reporter: Alex Alishevskikh
>         Attachments: pdfbox-1607-fix.patch, pdf-govdocs-036902.pdf, pdf-govdocs-107566.pdf
>
>
> I have a few test files that parse fine in PDFBox 1.7.1 but not in 1.8.1:
> java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
>      at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
>      at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
>      at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
>      at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>      at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
>      at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)


