Posted to dev@pdfbox.apache.org by "Sean Bridges (JIRA)" <ji...@apache.org> on 2009/05/12 20:07:45 UTC

[jira] Created: (PDFBOX-465) invalid date formats

invalid date formats 
---------------------

                 Key: PDFBOX-465
                 URL: https://issues.apache.org/jira/browse/PDFBOX-465
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 0.8.0-incubator
            Reporter: Sean Bridges


This is with the latest from svn, Revision: 773978

From a sample of 13304 PDF documents generated in a very wide variety of ways, I got 94 invalid date formats.

All of them seem to produce the following stack trace:

Caused by: java.io.IOException: Error converting date:Friday, July 11, 2008
	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:240)
	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
	at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:783)
	at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
	at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:50)

Some examples of the invalid dates:

20070430193647+713'00'
Tue Aug 21 10:35:22 2007
Tuesday, November 04, 2008
200712172:2:3 
Unknown
20090319 200122
9:47 5/12/2008

I don't think there is any hope of parsing all these date formats.  It would be nice if this were not a fatal error, so that the parser could continue without a creation date.

Is the policy of pdfbox to be as forgiving as possible when reading pdf documents?  Maybe toCalendar should return a new Calendar() if parsing fails, rather than throwing.
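
A minimal sketch of what that lenient behaviour could look like. This is not PDFBox code: strictParse is a hypothetical stand-in for a strict parser such as toCalendar, and it only accepts eight digits (yyyyMMdd). Note that java.util.Calendar is abstract, so "new Calendar()" would not compile; the sketch returns null instead, so callers can tell "no usable date" apart from a real one.

```java
import java.io.IOException;
import java.util.Calendar;
import java.util.GregorianCalendar;

class LenientDateExample {

    // Hypothetical strict parser (assumption, not the PDFBox implementation):
    // accepts exactly eight digits, yyyyMMdd, and throws on anything else.
    static Calendar strictParse(String text) throws IOException {
        if (text == null || !text.matches("\\d{8}")) {
            throw new IOException("Error converting date:" + text);
        }
        int year = Integer.parseInt(text.substring(0, 4));
        int month = Integer.parseInt(text.substring(4, 6));
        int day = Integer.parseInt(text.substring(6, 8));
        // Calendar months are zero-based, hence month - 1.
        return new GregorianCalendar(year, month - 1, day);
    }

    // Lenient wrapper: a malformed date becomes "no date" instead of a fatal error.
    static Calendar parseOrNull(String text) {
        try {
            return strictParse(text);
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("20080711") != null);      // prints "true"
        System.out.println(parseOrNull("Friday, July 11, 2008")); // prints "null"
    }
}
```

With a wrapper like this, a caller extracting text can simply skip the creation date when the result is null and carry on with the rest of the document.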



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PDFBOX-465) invalid date formats

Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709455#action_12709455 ] 

Sean Bridges commented on PDFBOX-465:
-------------------------------------

I'm also getting,

Caused by: java.io.IOException: Error: Invalid date format 'P8‘'
	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:157)
	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
	at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:784)
	at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
	at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:63)
	... 2 more

The PDF is invalid.

Returning null instead of throwing when the date string is shorter than four characters fixes it:

                     date = date.substring( 2, date.length() );
                 }
                 if( date.length() < 4 )
-                {
-                    throw new IOException( "Error: Invalid date format '" + date + "'" );
+                {                    
+                    return null;
                 }
                 year = Integer.parseInt( date.substring( 0, 4 ) );
                 if( date.length() >= 6 )

I'm not attaching the diff as a file, since my copy of the code now has so many changes that you wouldn't be able to apply it cleanly.  Most of my changes are trivial.



[jira] Commented: (PDFBOX-465) invalid date formats

Posted by "Daniel Wilson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708550#action_12708550 ] 

Daniel Wilson commented on PDFBOX-465:
--------------------------------------

>>Is the policy of pdfbox to be as forgiving as possible when reading pdf documents?

I won't claim to be able to state what the PDFBox policy may be.  But I will say we have been making it more and more forgiving in a LOT of areas.

>> Maybe toCalendar should return a new Calendar() if parsing fails, rather than throwing.

I like that idea and, unless other developers present reasons against it, would happily implement it.

I would certainly be interested in seeing some test-case PDFs with these formats, and if possible some code to parse some of them.
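
As a rough starting point for such parsing code, here is a sketch that tries a short list of fallback patterns with java.text.SimpleDateFormat. The class name and the pattern list are assumptions made for illustration, guessed from four of the reported strings; strings such as "Unknown" or "20070430193647+713'00'" still fail and yield null. The pattern for "9:47 5/12/2008" assumes month/day order, which the data alone cannot confirm.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;

class FallbackDateParser {

    // Guessed patterns for a few of the reported strings; not an exhaustive list.
    private static final String[] PATTERNS = {
        "EEE MMM d HH:mm:ss yyyy", // Tue Aug 21 10:35:22 2007
        "EEEE, MMMM d, yyyy",      // Tuesday, November 04, 2008
        "yyyyMMdd HHmmss",         // 20090319 200122
        "H:mm M/d/yyyy"            // 9:47 5/12/2008 (assumes month/day order)
    };

    // Returns the first successful parse, or null if no pattern matches the whole string.
    static Calendar parse(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        for (String pattern : PATTERNS) {
            // Day and month names in the samples are English.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date date = format.parse(trimmed, pos);
            // Require the pattern to consume the entire string, not just a prefix.
            if (date != null && pos.getIndex() == trimmed.length()) {
                Calendar cal = Calendar.getInstance();
                cal.setTime(date);
                return cal;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Calendar cal = parse("Tuesday, November 04, 2008");
        System.out.println(cal.get(Calendar.YEAR)); // prints "2008"
        System.out.println(parse("Unknown"));       // prints "null"
    }
}
```

Parsing happens in the JVM's default time zone, since none of these strings carries a usable offset.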



[jira] Issue Comment Edited: (PDFBOX-465) invalid date formats

Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709455#action_12709455 ] 

Sean Bridges edited comment on PDFBOX-465 at 5/14/09 9:14 AM:
--------------------------------------------------------------

I'm also getting,

Caused by: java.io.IOException: Error: Invalid date format 'P8''
	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:157)
	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
	at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:784)
	at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
	at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:63)
	... 2 more

The PDF is invalid:

/CreationDate (P8‘)

It looks like they are trying to UTF-16 encode the metadata properties for some reason.

Returning null instead of throwing when the date string is shorter than four characters fixes it:

                     date = date.substring( 2, date.length() );
                 }
                 if( date.length() < 4 )
-                {
-                    throw new IOException( "Error: Invalid date format '" + date + "'" );
+                {                    
+                    return null;
                 }
                 year = Integer.parseInt( date.substring( 0, 4 ) );
                 if( date.length() >= 6 )

I'm not attaching the diff as a file, since my copy of the code now has so many changes that you wouldn't be able to apply it cleanly.  Most of my changes are trivial.
