Posted to dev@pdfbox.apache.org by "Daniel Wilson (JIRA)" <ji...@apache.org> on 2009/05/12 20:29:45 UTC

[jira] Commented: (PDFBOX-465) invalid date formats

    [ https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708550#action_12708550 ] 

Daniel Wilson commented on PDFBOX-465:
--------------------------------------

>>Is the policy of pdfbox to be as forgiving as possible when reading pdf documents?

I won't claim to be able to state what the PDFBox policy may be.  But I will say we have been making it more and more forgiving in a LOT of areas.

>> Maybe toCalendar should return a new Calendar() if parsing fails, rather than throwing.

I like that idea and, unless other developers present reasons against it, I would happily implement it.

I would certainly be interested in seeing some test-case PDFs with these formats ... and, if possible, some code to parse some of them.
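
As a starting point for discussion, here is a minimal sketch of what forgiving parsing might look like. It is not the existing DateConverter code: the class name LenientDateParser, the method parseOrNull, and the pattern list are all hypothetical, and the patterns cover only a few of the samples from the report. Since java.util.Calendar is abstract and cannot be instantiated directly, this sketch returns null on failure so the caller can carry on without a creation date; returning a placeholder Calendar instead would be an equally possible choice.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of PDFBox. */
class LenientDateParser {

    // Candidate patterns, tried in order. Assumed from the samples in the report.
    private static final String[] PATTERNS = {
        "EEEE, MMMM dd, yyyy",      // Tuesday, November 04, 2008
        "EEE MMM dd HH:mm:ss yyyy", // Tue Aug 21 10:35:22 2007
        "yyyyMMdd HHmmss",          // 20090319 200122
        "H:mm M/d/yyyy"             // 9:47 5/12/2008 (month/day order is a guess)
    };

    /** Returns the parsed date, or null if no pattern matches the whole string. */
    static Calendar parseOrNull(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false); // reject out-of-range fields such as month 13
            ParsePosition position = new ParsePosition(0);
            Date parsed = format.parse(trimmed, position);
            // Accept only if the pattern consumed the entire input.
            if (parsed != null && position.getIndex() == trimmed.length()) {
                Calendar calendar = new GregorianCalendar();
                calendar.setTime(parsed);
                return calendar;
            }
        }
        return null; // e.g. "Unknown" or "200712172:2:3"
    }
}
```

A caller such as a getCreationDate-style accessor could then treat a null result as "no date available" instead of propagating an IOException.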

> invalid date formats 
> ---------------------
>
>                 Key: PDFBOX-465
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-465
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>            Reporter: Sean Bridges
>
> This is with the latest from svn, Revision: 773978
> From a sample of 13304 pdf documents generated in a very wide variety of ways, I got 94 invalid date formats.
> It seems that all of these have the following stack trace:
> Caused by: java.io.IOException: Error converting date:Friday, July 11, 2008
> 	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:240)
> 	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
> 	at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:783)
> 	at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
> 	at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:50)
> Some examples of invalid dates are,
> 20070430193647+713'00'
> Tue Aug 21 10:35:22 2007
> Tuesday, November 04, 2008
> 200712172:2:3 
> Unknown
> 20090319 200122
> 9:47 5/12/2008
> I don't think there is any hope of parsing all these date formats.  It would be nice if this was not a fatal error, and the parser could continue without a creation date.
> Is the policy of pdfbox to be as forgiving as possible when reading pdf documents?  Maybe toCalendar should return a new Calendar() if parsing fails, rather than throwing.
