Posted to users@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2023/08/12 18:35:20 UTC
Re: Xmpbox metadata parsing issue
Hi,
https://issues.apache.org/jira/browse/PDFBOX-5649
This is similar to your own problem, coincidentally.
Tilman
On 11.07.2023 11:25, Sylvere Babin wrote:
>
> Hello,
>
> We use PDFBox to read the XMP metadata of PDF documents in the
> Factur-X standard, a Franco-German e-invoicing standard.
>
> The XML schema corresponding to this metadata is quite simple, and
> retrieving the values works perfectly with the
> org.apache.xmpbox.XMPMetadata.getSchema(String) method.
>
> By default, the prefix is fx:
>
> <rdf:Description
>     xmlns:fx="urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#"
>     rdf:about="">
>   <fx:DocumentType>INVOICE</fx:DocumentType>
>   <fx:DocumentFileName>factur-x.xml</fx:DocumentFileName>
>   <fx:Version>1.0</fx:Version>
>   <fx:ConformanceLevel>BASIC</fx:ConformanceLevel>
> </rdf:Description>
>
> In one case, there was a document with two schemas with the same
> namespace URI but different prefixes (fx and zf).
>
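
To make the "same namespace URI, different prefixes" situation concrete, here is a minimal sketch that uses only the JDK's DOM parser, not xmpbox. The class name, the method name and the SAMPLE fragment are hypothetical; the fragment is a simplified stand-in for the real XMP packet, which is not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class PrefixesForNamespace {

    static final String FX_NS = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";

    // Hypothetical sample: two rdf:Description blocks that bind the same
    // namespace URI to two different prefixes (fx and zf).
    static final String SAMPLE =
          "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
        + "<rdf:Description xmlns:fx=\"" + FX_NS + "\" rdf:about=\"\">"
        + "<fx:DocumentType>INVOICE</fx:DocumentType>"
        + "</rdf:Description>"
        + "<rdf:Description xmlns:zf=\"" + FX_NS + "\" rdf:about=\"\">"
        + "<zf:DocumentType>INVOICE</zf:DocumentType>"
        + "</rdf:Description>"
        + "</rdf:RDF>";

    /** Returns the distinct prefixes used by elements in the given namespace, in document order. */
    static Set<String> prefixesFor(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without namespace awareness, getPrefix() and the *NS lookups do not work.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches every local name within the requested namespace.
            NodeList elements = doc.getElementsByTagNameNS(namespaceUri, "*");
            Set<String> prefixes = new LinkedHashSet<>();
            for (int i = 0; i < elements.getLength(); i++) {
                prefixes.add(elements.item(i).getPrefix());
            }
            return prefixes;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Calling PrefixesForNamespace.prefixesFor(PrefixesForNamespace.SAMPLE, PrefixesForNamespace.FX_NS) should return both fx and zf, so a lookup by namespace URI alone cannot tell the two blocks apart.
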
> I tried the org.apache.xmpbox.XMPMetadata.getSchema(String, String)
> method, which according to the documentation seems to handle this case
> by filtering by prefix.
>
> I got a NullPointerException from this method (line 268), because the
> prefix of the Factur-x schema in the
> org.apache.xmpbox.XMPMetadata.schemas collection was null.
>
> So, I've run tests with a hundred example files provided by the
> Factur-X consortium, and it seems that for any file, the schema with
> the Factur-X URI always gets a null prefix, regardless of whether one
> or more schemas exist with this namespace.
>
> This raises two points:
>
> 1. If the prefix can be null, the getSchema(String, String) method
> should handle it.
> 2. Does the Factur-X metadata specification conform to the XMP
> standard, or is there a bug in the prefix parsing?
>
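
On point 1, a null-tolerant comparison could look like the sketch below. It is an illustration of the idea only, not the actual xmpbox code and not necessarily the change tracked in the JIRA issue. SchemaEntry is a hypothetical stand-in for a parsed schema (just a prefix and a namespace URI), so the example runs without PDFBox on the classpath.

```java
import java.util.List;
import java.util.Objects;

final class NullSafeSchemaLookup {

    /** Hypothetical stand-in for a parsed schema; the prefix may be null. */
    static final class SchemaEntry {
        final String prefix;
        final String namespace;

        SchemaEntry(String prefix, String namespace) {
            this.prefix = prefix;
            this.namespace = namespace;
        }
    }

    /**
     * Returns the first entry matching both prefix and namespace, or null.
     * Objects.equals accepts null on either side, whereas calling
     * entry.prefix.equals(prefix) would throw when the stored prefix is null.
     */
    static SchemaEntry find(List<SchemaEntry> schemas, String prefix, String namespace) {
        for (SchemaEntry entry : schemas) {
            if (Objects.equals(entry.namespace, namespace)
                    && Objects.equals(entry.prefix, prefix)) {
                return entry;
            }
        }
        return null;
    }
}
```

With this kind of comparison, a schema stored with a null prefix simply fails to match "fx" and the caller gets null back, so the fallback to the URI-only lookup can run instead of the program crashing.
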
> Here’s the PDF document: pdfExemple.pdf
> <https://cegidgroup-my.sharepoint.com/:b:/g/personal/sbabin_cegid_com/EVN8vpGbR1pEvaOuoIjyvfQBuhV1ZWFlYfAIKMfuAhd6Aw?e=cahEv2>
>
> Here’s the code I use to retrieve the Factur-X metadata values :
>
> import java.io.File;
> import java.io.IOException;
> import java.io.InputStream;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
> import org.apache.pdfbox.pdmodel.common.PDMetadata;
> import org.apache.xmpbox.XMPMetadata;
> import org.apache.xmpbox.schema.XMPSchema;
> import org.apache.xmpbox.xml.DomXmpParser;
> import org.apache.xmpbox.xml.XmpParsingException;
>
> public class FacturX {
>
>     public static void main(String[] args) throws XmpParsingException, IOException {
>         File finputFile = new File(args[0]);
>         // try-with-resources closes the document even if parsing fails
>         try (PDDocument doc = PDDocument.load(finputFile)) {
>             PDDocumentCatalog catalog = doc.getDocumentCatalog();
>             PDMetadata m = catalog.getMetadata();
>             XMPMetadata metadata;
>             try (InputStream xmlInputStream = m.createInputStream()) {
>                 DomXmpParser p = new DomXmpParser();
>                 p.setStrictParsing(false);
>                 metadata = p.parse(xmlInputStream);
>             }
>
>             // Getting the factur-x schema with the default "fx" prefix
>             // (case of two factur-x schemas with different prefixes)
>             XMPSchema fx = metadata.getSchema("fx",
>                     "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#");
>
>             // If there is no schema with fx prefix, searching for the
>             // schema only with the namespace URI
>             if (fx == null) {
>                 fx = metadata.getSchema(
>                         "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#");
>             }
>
>             if (fx == null) {
>                 System.out.println("This PDF document is not a valid Factur-X file");
>             } else {
>                 String conformanceLevel =
>                         fx.getUnqualifiedTextPropertyValue("ConformanceLevel");
>                 String documentType =
>                         fx.getUnqualifiedTextPropertyValue("DocumentType");
>                 String version =
>                         fx.getUnqualifiedTextPropertyValue("Version");
>                 String documentFileName =
>                         fx.getUnqualifiedTextPropertyValue("DocumentFileName");
>             }
>         } catch (XmpParsingException | IOException e) {
>             e.printStackTrace();
>         }
>     }
> }
>
> Thanks for your help,
>
> *Sylvère Babin*
> Developer