Posted to dev@pdfbox.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/11/02 12:32:27 UTC

[jira] [Commented] (PDFBOX-3068) Null metadata in 2.0 in some files that had metadata in 1.8.10 with old parser

    [ https://issues.apache.org/jira/browse/PDFBOX-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985081#comment-14985081 ] 

Tim Allison commented on PDFBOX-3068:
-------------------------------------

Thank you!

> Null metadata in 2.0 in some files that had metadata in 1.8.10 with old parser
> ------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3068
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3068
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: Parsing
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0
>            Reporter: Tim Allison
>            Assignee: Tilman Hausherr
>             Fix For: 1.8.11, 2.0.0
>
>         Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
>
>
> Tilman's observation on 'Microsoft' below revealed that 1) we should use our BodyContentHandler so that title metadata doesn't slip into the body content, and 2) the title and all metadata values from PDDocumentInformation are null for at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
> {code}
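>         Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
>         // try-with-resources closes the document once the checks are done
>         try (PDDocument d = PDDocument.load(p.toFile())) {
>             // the title comes back null even though the info dictionary reports 8 keys
>             assertNull(d.getDocumentInformation().getTitle());
>             assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
>         }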
> {code} 
> From a manual review of a handful of documents listed in the metadata/metadata_value_count_diffs.csv file [here|https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip], this looks to be quite pervasive, unless I'm botching the right way to load the documents and metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)