Posted to users@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2023/08/12 18:35:20 UTC

Re: Xmpbox metadata parsing issue

Hi,

https://issues.apache.org/jira/browse/PDFBOX-5649

This is similar to your own problem, coincidentally.

Tilman

On 11.07.2023 11:25, Sylvere Babin wrote:
>
> Hello,
>
> We use PDFBox to read the XMP metadata of PDF documents in the 
> Factur-X standard, a Franco-German e-invoicing standard.
>
> The XML schema corresponding to this metadata is quite simple, and
> retrieving the values works perfectly with the
> org.apache.xmpbox.XMPMetadata.getSchema(String) method.
>
> By default, the prefix is fx:
>
> <rdf:Description
>     xmlns:fx="urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#"
>     rdf:about="">
>   <fx:DocumentType>INVOICE</fx:DocumentType>
>   <fx:DocumentFileName>factur-x.xml</fx:DocumentFileName>
>   <fx:Version>1.0</fx:Version>
>   <fx:ConformanceLevel>BASIC</fx:ConformanceLevel>
> </rdf:Description>
>
> In one case, there was a document with two schemas sharing the same
> namespace URI but using different prefixes (fx and zf).
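To make the situation concrete, here is a minimal sketch that uses only the JDK's namespace-aware DOM parser, not xmpbox. The sample packet, the class name SameNamespaceTwoPrefixes and the zf value EXTENDED are made up for illustration; only the namespace URI and the fx/zf prefixes come from the report. It shows that both descriptions resolve to the same namespace URI while keeping their own prefixes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class SameNamespaceTwoPrefixes {

    static final String FX_NS = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Hypothetical sample, reduced to the part that matters: two
    // rdf:Description blocks bind different prefixes to the same URI.
    static final String SAMPLE =
          "<rdf:RDF xmlns:rdf='" + RDF_NS + "'>"
        + "<rdf:Description rdf:about='' xmlns:fx='" + FX_NS + "'>"
        + "<fx:ConformanceLevel>BASIC</fx:ConformanceLevel>"
        + "</rdf:Description>"
        + "<rdf:Description rdf:about='' xmlns:zf='" + FX_NS + "'>"
        + "<zf:ConformanceLevel>EXTENDED</zf:ConformanceLevel>"
        + "</rdf:Description>"
        + "</rdf:RDF>";

    /** Returns "prefix=value" for every ConformanceLevel element in the given namespace. */
    static List<String> conformanceLevels(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without this, getElementsByTagNameNS and getPrefix are not usable.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList nodes = doc.getElementsByTagNameNS(namespaceUri, "ConformanceLevel");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getPrefix() + "=" + nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(conformanceLevels(SAMPLE, FX_NS));
    }
}
```

A lookup by namespace URI alone cannot tell the two blocks apart, which is why the prefix has to survive parsing for a prefix-based filter to work.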
>
> I tried the org.apache.xmpbox.XMPMetadata.getSchema(String, String) 
> method, which according to the documentation seems to handle this case 
> by filtering by prefix.
>
> I got a NullPointerException from this method (line 268), because the
> prefix of the Factur-X schema in the
> org.apache.xmpbox.XMPMetadata.schemas collection was null.
>
> So, I've run tests with a hundred example files provided by the 
> Factur-X consortium, and it seems that for any file, the schema with 
> the Factur-X URI always gets a null prefix, regardless of whether one 
> or more schemas exist with this namespace.
>
> This raises two points:
>
>  1. If the prefix can be null, the getSchema(String, String) method
>     should handle it.
>  2. Is the Factur-X metadata specification valid XMP, or is there a
>     bug in the prefix parsing?
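For point 1, a null-safe comparison could look like the sketch below. SchemaEntry is a hypothetical stand-in for a parsed schema so that the example runs without xmpbox on the classpath; it is neither the library's code nor its actual fix. The only idea it illustrates is that Objects.equals tolerates a null prefix, whereas calling equals on a null prefix throws.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

class NullSafeSchemaLookup {

    /** Hypothetical stand-in for a parsed schema; only the fields the lookup needs. */
    static final class SchemaEntry {
        final String prefix;        // may be null, as observed in the report
        final String namespaceUri;

        SchemaEntry(String prefix, String namespaceUri) {
            this.prefix = prefix;
            this.namespaceUri = namespaceUri;
        }
    }

    /** Finds the first entry matching prefix and namespace URI without dereferencing a null prefix. */
    static Optional<SchemaEntry> find(List<SchemaEntry> schemas, String prefix, String namespaceUri) {
        for (SchemaEntry schema : schemas) {
            if (Objects.equals(schema.namespaceUri, namespaceUri)
                    && Objects.equals(schema.prefix, prefix)) {
                return Optional.of(schema);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String fxNs = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";
        String pdfNs = "http://ns.adobe.com/pdf/1.3/";
        List<SchemaEntry> schemas = Arrays.asList(
                new SchemaEntry(null, fxNs),    // prefix lost, as in the report
                new SchemaEntry("pdf", pdfNs));

        // No exception: the null prefix simply does not match "fx".
        System.out.println(find(schemas, "fx", fxNs).isPresent());
        System.out.println(find(schemas, "pdf", pdfNs).isPresent());
    }
}
```

With a comparison like this, the caller's fallback to a lookup by namespace URI alone would be reached instead of an exception.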
>
> Here’s the PDF document: pdfExemple.pdf
> <https://cegidgroup-my.sharepoint.com/:b:/g/personal/sbabin_cegid_com/EVN8vpGbR1pEvaOuoIjyvfQBuhV1ZWFlYfAIKMfuAhd6Aw?e=cahEv2>
>
> Here’s the code I use to retrieve the Factur-X metadata values:
>
> import java.io.File;
> import java.io.IOException;
> import java.io.InputStream;
>
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
> import org.apache.pdfbox.pdmodel.common.PDMetadata;
> import org.apache.xmpbox.XMPMetadata;
> import org.apache.xmpbox.schema.XMPSchema;
> import org.apache.xmpbox.xml.DomXmpParser;
> import org.apache.xmpbox.xml.XmpParsingException;
>
> public class FacturX {
>
>     private static final String FX_NAMESPACE =
>             "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";
>
>     public static void main(String[] args) {
>         File finputFile = new File(args[0]);
>
>         // try-with-resources closes the document even if parsing fails
>         try (PDDocument doc = PDDocument.load(finputFile)) {
>             PDDocumentCatalog catalog = doc.getDocumentCatalog();
>             PDMetadata m = catalog.getMetadata();
>
>             DomXmpParser p = new DomXmpParser();
>             p.setStrictParsing(false);
>
>             XMPMetadata metadata;
>             try (InputStream xmlInputStream = m.createInputStream()) {
>                 metadata = p.parse(xmlInputStream);
>             }
>
>             // Get the Factur-X schema with the default "fx" prefix (case of
>             // two Factur-X schemas with different prefixes).
>             // This is the call that throws the NullPointerException.
>             XMPSchema fx = metadata.getSchema("fx", FX_NAMESPACE);
>
>             // If there is no schema with the fx prefix, search for the
>             // schema by namespace URI only
>             if (fx == null) {
>                 fx = metadata.getSchema(FX_NAMESPACE);
>             }
>
>             if (fx == null) {
>                 System.out.println("This PDF document is not a valid Factur-X file");
>             } else {
>                 String conformanceLevel =
>                         fx.getUnqualifiedTextPropertyValue("ConformanceLevel");
>                 String documentType =
>                         fx.getUnqualifiedTextPropertyValue("DocumentType");
>                 String version =
>                         fx.getUnqualifiedTextPropertyValue("Version");
>                 String documentFileName =
>                         fx.getUnqualifiedTextPropertyValue("DocumentFileName");
>             }
>         } catch (XmpParsingException | IOException e) {
>             e.printStackTrace();
>         }
>     }
> }
>
> Thanks for your help,
>
> *Sylvère Babin*
> Developer