You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Fred Hansen (JIRA)" <ji...@apache.org> on 2013/07/10 19:39:50 UTC

[jira] [Commented] (PDFBOX-1633) DateConverter needs to work

    [ https://issues.apache.org/jira/browse/PDFBOX-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704799#comment-13704799 ] 

Fred Hansen commented on PDFBOX-1633:
-------------------------------------

The attachments are revisions to
    pdfbox/src/main/java/org/apache/pdfbox/util/DateConverter.java
    pdfbox/src/test/java/org/apache/pdfbox/util/TestDateUtil.java
The revised DateConverter retains UNCHANGED the old API, adds an
additional method--toCalendar(String, String[])--with better error
handling, adds date parsing primitives, and is thread-safe.
There are now over a hundred test cases. Only a very few of which
are commented out, and those only because they cannot be implemented
consistently with the others.

The revision closes the following bugs

PDFBOX-1633 (this bug)    DateConverter needs to work
    All the issues mentioned are resolved.

PDFBOX-465    invalid date formats
   The proposed date formats are now included and recognized correctly.

PDFBOX-1054    DateConverter: allow for external adding of date formats
    The added toCalendar method lets the client supply a supplementary
    list of formats. To preserve thread-safety, the internal list is not modified.

PDFBOX-1219   org.apache.jempbox.impl.DateConverter unable to parse date
PDFBOX-145    PDDocumentInformation.getCreationDate() throws IOException
PDFBOX-823    NullPointerException in DateConverter.toISO8601

These three complaints raise two philosophic questions:
    1) If an argument is null should the result be null?
In some cases the original DateConverter treats a null argument by returning null.
This has the immediate effect of preventing DateConverter from throwing
NullPointerException. But it solves no problem, it just passes the error along
to the caller. Nor is DateConverter consistent: toString and toISO8601
differ in their treatment of a null argument.

I took the position that returning null for null was not useful and thus deprecated
the two existing versions of toCalendar. The new toCalendarl method
    toCalendar(String, String[])
does not test for null, but throws a NullPointerException instead.

   2b) How should failure be reported?
The existing versions of toCalendar signal failure by throwing an IOException.
To my mind this is overkill; bad dates are an input defect, not a programming
problem. Often the right approach to failure is to return null. For toCalendar,
the standard offers a good alternative: return Jan 1 in the year 0. Unfortunately,
GregorianCalendar does not allow the year zero. So the new toCalendar method signals failure by returning Jan 1 of 999. (999 is the value of the named
constant INVALID_YEAR. If one gets a PDF created in year 999, one inhabits
a unique universe!) The older toCalendar methods call the new one, test for
INVALID_YEAR, and throw an IOException when it appears.

Notes:

The principal date parser recognizes a superset of PDF format dates 
called big-endian. They are of the format
    year [ -/]* month [ -/]* dayofmonth [ T]* hour [:] min [:] sec [.secFraction]
Time zones are parsed for at the end of every date format.

 Non-compatible changes:
   -  The default sign for time zone values was minus and is now plus.
   -  The default time zone was UTC and is now the local default.
   -  All locale-dependent processing is now done in Locale.ENGLISH.

One cannot match both m/d/y and d/m/y. 
I arbitrarily and without enthusiasm chose
            "dd MMM yy HH:mm:ss",  // for 26 May 2000 11:25:00
            "dd MMM yy HH:mm",  // for 26 May 2000 11:25
            "yyyy MMM d",   // ambiguity resolved only by omitting time
            "M/d/yy HH:mm:ss",
            "M/d/yy HH:mm",
            "M/d/yy",

There are copies of older versions of DateConverter in both jempbox
and fontbox. This is unclean. My proposal is to move DateConverter
to org.apache.commons/lang3. This will add only a third of a mega-byte
to the 11 megabyte pdfbox.jar.


                
> DateConverter needs to work
> ---------------------------
>
>                 Key: PDFBOX-1633
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1633
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.2
>         Environment: all os and java platforms
>            Reporter: Fred Hansen
>             Fix For: 1.8.3, 2.0.0
>
>         Attachments: DateConverter.java, TestDateUtil.java
>
>
> Most of the tests for org/apache/pdfbox/util/DateConverter.java in src/test/java/org/apache/pdfbox/util/TestDateUtil.java have been commented out.  DateConverter was broken.
> The attached patch fixes the problems.  Extensive comments document the problems. Here's a copy:
> /* the former version of DateConverter had these bugs:
>  *   -  In toISO8601 the conversion from millis to minutes was with 1000/1000;
>  *      should have been 1000/60.
>  *   -  PDFBox-402 was not completely implemented. The calendar fields in the
>  *      POTENTIAL_FORMATS are shared among threads. Hence we must create new
>  *      SimpleThreadFormats for each test.  (Or synchronize somehow).
>  *   -  Some formats with hh did not have an a field. I changed them to HH.
>  * 
>  *  these questionable features:
>  *   -  A timezone with neither plus sign nor minus is assumed to be minus.
>  *      This seems wrong, but I have not changed it.
>  *   -  toCalendar() returned a value in the default Locale.
>  *      PDF files do not have locales (I think) and even if they do
>  *      there is no reason to assume the Java default.
>  *      I have switched to Locale.ENGLISH which was already assumed 
>  *      in the date formats and toString.
>  * 
>  * and these infelicities:
>  *   -  Constants 60 and 1000 appeared.
>  *   -  zeroAppend was not used where applicable.
>  *      In one case it was inapplicable only because TimeZone.getOffset 
>  *      was suspected of returning a long. It does not.
>  *   -  Manually computed constants were used to in date.substring
>  *      thus reducing flexibility and maintainability. 
>  *   -  The TimeZone name reported by toCalendar was always "Unknown"
>  *      It is easy enough to compute a name.
>  *   -  Time zones were not accepted with most of the alternate parsing formats.
>  *      The new code allows a timezone after any format.
>  */

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira