You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2015/06/04 18:59:38 UTC
[jira] [Commented] (PDFBOX-2823) StringIndexOutOfBoundsException when doing DateConverter.parseDate()

    [ https://issues.apache.org/jira/browse/PDFBOX-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573134#comment-14573134 ] 

Tilman Hausherr commented on PDFBOX-2823:
-----------------------------------------

I can confirm that it happens with 1.8.9, but not for 2.0. I don't think it is a regression... Yours is "one space and then nothing", the old issue was "nothing".

> StringIndexOutOfBoundsException when doing DateConverter.parseDate()
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-2823
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2823
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.9, 1.8.10
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 1.8.10
>
>         Attachments: PDFBOX-2823.pdf
>
>
> From Kevin J. in the user mailing list:
> We are currently using Apache Solr / Tika to index documents for searching. The exact version that is being used is version 1.8.8 of PDFBox.
> We can across a document that produced this stack trace (trimmed to the relevant part of PDFBox):
> {code}
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1
>         at java.lang.String.charAt(String.java:658)
>         at org.apache.pdfbox.util.DateConverter.parseDate(DateConverter.java:679)
>         at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:808)
>         at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:780)
>         at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:753)
>         at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:849)
>         at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:212)
> {code}
> Inspection of the document's binary revealed that it contained a creationDate consisting of a single white space (ASCII 0x20), which is probably illegal. I managed to create a small reproduction of the error using like so:
> {code}
> File file = new File("/path/to/document/bad.pdf");
> InputStream stream = new FileInputStream(file);
> PDFParser parser = new PDFParser(stream);
> parser.parse();
> PDDocumentInformation info = parser.getPDDocument().getDocumentInformation();
> Calendar creationDate = info.getCreationDate();
> System.out.println(creationDate.toString());
> {code}
> Which produces the same stack trace. I verified this against the latest build from the site on 1.8.9, and the behavior remains. This looks very similar to PDFBOX-1803, however that issue is marked as resolved in 1.8.5. So, my questions:
>   *   Is the exception an expected behavior? Ideally Tika would just index the document anyway, the creation date isn't important to us. Tika had an issue for this, TIKA-1233, that marks it as fixed by swallowing the exception, but looking at the comments for it, they removed the try/catch in r1593983 since it is marked as fixed here.
>   *   Is this a regression, or slightly different somehow from 1803? Shall I create a new issue or get the existing 1803 re-opened?
>   *   The PDF that reproduces the issue can be downloaded here: https://www.dropbox.com/s/tll5rscrlt95xuc/bad.pdf?dl=0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org