Posted to dev@pdfbox.apache.org by "Sean Bridges (JIRA)" <ji...@apache.org> on 2009/05/12 20:07:45 UTC
[jira] Created: (PDFBOX-465) invalid date formats
invalid date formats
---------------------
Key: PDFBOX-465
URL: https://issues.apache.org/jira/browse/PDFBOX-465
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 0.8.0-incubator
Reporter: Sean Bridges
This is with the latest from svn, Revision: 773978
From a sample of 13304 pdf documents generated in a very wide variety of ways, I got 94 invalid date formats,
It seems that all of these have the stack trace of,
Caused by: java.io.IOException: Error converting date:Friday, July 11, 2008
at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:240)
at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:783)
at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:50)
Some examples of invalid dates are,
20070430193647+713'00'
Tue Aug 21 10:35:22 2007
Tuesday, November 04, 2008
200712172:2:3
Unknown
20090319 200122
9:47 5/12/2008
I don't think there is any hope of parsing all these date formats. It would be nice if this were not a fatal error, and the parser could continue without a creation date.
Is the policy of pdfbox to be as forgiving as possible when reading pdf documents? Maybe toCalendar should return a new Calendar() if parsing fails, rather than throwing.
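For illustration, a forgiving fallback along these lines might look like the sketch below. It is not the PDFBox implementation: the class name LenientDateSketch, the method parseOrNull, and the pattern list are all hypothetical, and dates without a zone are assumed to be UTC. It tries a few java.text.SimpleDateFormat patterns that match some of the samples above and returns null, rather than throwing, when none of them fits.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of PDFBox.
class LenientDateSketch {
    // Assumption: strings that carry no zone are interpreted as UTC.
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Illustrative patterns only; each matches one of the reported samples.
    private static final String[] PATTERNS = {
        "EEEE, MMMM dd, yyyy",      // Tuesday, November 04, 2008
        "EEE MMM dd HH:mm:ss yyyy", // Tue Aug 21 10:35:22 2007
        "yyyyMMdd HHmmss",          // 20090319 200122
        "H:mm M/d/yyyy"             // 9:47 5/12/2008
    };

    // Returns a Calendar if some pattern consumes the whole string, else null.
    static Calendar parseOrNull(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setTimeZone(UTC);
            format.setLenient(false); // reject out-of-range field values
            ParsePosition position = new ParsePosition(0);
            Date parsed = format.parse(trimmed, position);
            // Accept only if the entire input was consumed by this pattern.
            if (parsed != null && position.getIndex() == trimmed.length()) {
                Calendar calendar = Calendar.getInstance(UTC, Locale.ENGLISH);
                calendar.setTime(parsed);
                return calendar;
            }
        }
        return null; // unparseable: the caller continues without a date
    }
}
```

Returning null keeps "no usable date" distinguishable from a real date, which a freshly constructed Calendar (holding the current time) would not; either choice is a design decision for the project.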
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PDFBOX-465) invalid date formats
Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709455#action_12709455 ]
Sean Bridges commented on PDFBOX-465:
-------------------------------------
I'm also getting,
Caused by: java.io.IOException: Error: Invalid date format 'P8'
at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:157)
at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:784)
at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:63)
... 2 more
The PDF is invalid.
Removing the length check fixes it:
         date = date.substring( 2, date.length() );
     }
     if( date.length() < 4 )
-    {
-        throw new IOException( "Error: Invalid date format '" + date + "'" );
+    {
+        return null;
     }
     year = Integer.parseInt( date.substring( 0, 4 ) );
     if( date.length() >= 6 )
I'm not attaching the diff as a file, since my copy of the code has so many changes now that you won't be able to simply apply it. Most of my changes are trivial.
[jira] Commented: (PDFBOX-465) invalid date formats
Posted by "Daniel Wilson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708550#action_12708550 ]
Daniel Wilson commented on PDFBOX-465:
--------------------------------------
>>Is the policy of pdfbox to be as forgiving as possible when reading pdf documents?
I won't claim to be able to state what the PDFBox policy may be. But I will say we have been making it more and more forgiving in a LOT of areas.
>> Maybe toCalendar should return a new Calendar() if parsing fails, rather than throwing.
I like that idea and, unless other developers present reasons against it, would happily implement it.
I would certainly be interested in seeing some test-case PDFs with these formats ... and, if possible, some code to parse some of them.
[jira] Issue Comment Edited: (PDFBOX-465) invalid date formats
Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709455#action_12709455 ]
Sean Bridges edited comment on PDFBOX-465 at 5/14/09 9:14 AM:
--------------------------------------------------------------
I'm also getting,
Caused by: java.io.IOException: Error: Invalid date format 'P8''
at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:157)
at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:784)
at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:63)
... 2 more
The pdf is invalid,
/CreationDate (P8)
It looks like they are trying to UTF-16 encode the metadata properties for some reason.
Removing the length check fixes it:
         date = date.substring( 2, date.length() );
     }
     if( date.length() < 4 )
-    {
-        throw new IOException( "Error: Invalid date format '" + date + "'" );
+    {
+        return null;
     }
     year = Integer.parseInt( date.substring( 0, 4 ) );
     if( date.length() >= 6 )
I'm not attaching the diff as a file, since my copy of the code has so many changes now that you won't be able to simply apply it. Most of my changes are trivial.
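On the UTF-16 observation: PDF text strings may legitimately be UTF-16BE when they start with the byte order mark 0xFE 0xFF, so one possible way to handle such values before date parsing is to decode them first. The sketch below is only an illustration under stated assumptions: the class MetadataStringSketch and the method decode are hypothetical names, not PDFBox API, and ISO-8859-1 is used as a rough stand-in for PDFDocEncoding, which differs for some byte values.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class MetadataStringSketch {
    // Decodes the raw bytes of a PDF string value into Java text.
    static String decode(byte[] raw) {
        boolean hasUtf16Bom = raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
        if (hasUtf16Bom) {
            // Skip the two BOM bytes and decode the remainder as UTF-16BE.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Simplification: treat everything else as ISO-8859-1.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

A date value decoded this way can then be handed to the normal date parser instead of being rejected as a short, garbled string.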