Posted to users@pdfbox.apache.org by Robin Diederen <di...@nlcom.nl> on 2010/03/29 16:13:54 UTC

Problem extracting XMP metadata from PDF files

Hello all,

For the past few days I've been testing PDFBox 1.0. We use PDFBox to extract metadata and text from PDF files.

Most PDFs work fine, but some produce errors when exporting the XMP metadata. Typical errors are "[Fatal Error] :22:51: The reference to entity "bar" must end with the ';' delimiter." and "[Fatal Error] :22:56: The entity name must immediately follow the '&' in the entity reference."

A code fragment (somewhat simplified, but good enough for illustration):

// inputFile is the PDF being processed
PDDocument pdfDocument = PDDocument.load(inputFile);
PDDocumentCatalog pdfCatalog = pdfDocument.getDocumentCatalog();
PDDocumentInformation pdfInfo = pdfDocument.getDocumentInformation();
PDMetadata metaData = pdfCatalog.getMetadata();
// The fatal errors are reported during this call
XMPMetadata metaDataXMP = metaData.exportXMPMetadata();

I'm using JempBox 1.0.0.

If this is caused by faulty XMP metadata (malformed XML, for example), is there any way to make the parsing more lenient?
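
To illustrate the kind of leniency I mean, here is a minimal sketch. It assumes the failures come from unescaped '&' characters in the XMP packet, which is what the two error messages suggest but which I have not confirmed. It uses only the JDK's own XML parser, not PDFBox or JempBox; the class and method names are hypothetical, and obtaining the raw XMP text from the PDF's metadata stream is left out.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of PDFBox or JempBox.
final class XmpAmpersandRepair {
    // Matches an '&' that does not begin a complete entity or character
    // reference such as &lt; &#169; or &#xA9;
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escapes stray ampersands so a strict XML parser accepts the text.
    // Caveats: this would also rewrite '&' inside CDATA sections, and it
    // does not help with references to undeclared entities like &nbsp;
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    // Repairs the text first, then parses it with the standard JAXP parser.
    static Document parseLeniently(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XMP relies on namespaces
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(escapeBareAmpersands(xml))));
    }

    public static void main(String[] args) throws Exception {
        // Both kinds of stray '&' from the error messages, plus a valid entity
        String broken = "<title>foo &bar baz & qux &lt; done</title>";
        Document doc = parseLeniently(broken);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Something along these lines, applied before the XMP is parsed, would be enough for our use case, since we only read the metadata and never write it back.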

Best, Robin