Posted to dev@pdfbox.apache.org by GitBox <gi...@apache.org> on 2021/06/23 12:14:23 UTC

[GitHub] [pdfbox] gunnar-ifp opened a new pull request #123: lenient DomXmpParser

gunnar-ifp opened a new pull request #123:
URL: https://github.com/apache/pdfbox/pull/123

The XmpBox library is nice, but there are PDF files out in the wild that fail parsing. For example, dc.creator is a Bag instead of a Seq.

Ideally the parser would have a mode where it tries to read as many properties as possible by simply discarding unreadable ones. This is not good if you want to write back a PDF, but if you just want to extract metadata, such a mode would be nice. In this case the invalid dc.creator value would be dropped. This would require some more work.
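
To make the idea concrete, here is a rough sketch of what "discard unreadable properties" could look like. It is not XmpBox code: `LenientExtraction`, `readProperties` and the `reader` callback are hypothetical names, and it assumes an unreadable property is signalled with an `IllegalArgumentException`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper, not part of XmpBox.
final class LenientExtraction {
    private LenientExtraction() {}

    /**
     * Reads every raw property through 'reader'.
     * Assumption: 'reader' throws IllegalArgumentException for a property it cannot read.
     * In lenient mode such a property is skipped; otherwise the exception propagates.
     */
    static <V> Map<String, V> readProperties(Map<String, String> raw,
                                             BiFunction<String, String, V> reader,
                                             boolean lenient) {
        Map<String, V> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            try {
                result.put(entry.getKey(), reader.apply(entry.getKey(), entry.getValue()));
            } catch (IllegalArgumentException ex) {
                if (!lenient) {
                    throw ex; // strict behaviour: the first bad property aborts parsing
                }
                // lenient behaviour: drop the unreadable property and carry on
            }
        }
        return result;
    }
}
```

With `lenient` set to true, a malformed dc.creator would simply be missing from the result while the other properties are still returned.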

I've seen that there is a non-strict parsing mode, which I don't think should be confused with this proposed lenient mode, but as the name suggests it should be less strict. So in this mode Sequences could be read from Bags and vice versa. I left Alt cardinality as an error because it doesn't really fit in.
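
The rule described here can be sketched as a small compatibility check. This is only an illustration under assumptions: `ArrayKind` and `ArrayKindCheck.isAcceptable` are made-up names, not the types used by DomXmpParser.

```java
// Hypothetical names, not the XmpBox API.
enum ArrayKind { BAG, SEQ, ALT }

final class ArrayKindCheck {
    private ArrayKindCheck() {}

    /**
     * Returns true if an array of kind 'actual' may be read where the schema
     * declares 'expected'.
     */
    static boolean isAcceptable(ArrayKind expected, ArrayKind actual, boolean strict) {
        if (expected == actual) {
            return true; // an exact match is always fine
        }
        if (strict) {
            return false; // strict parsing tolerates no mismatch
        }
        // Non-strict parsing: Bag and Seq are interchangeable, Alt never is.
        return expected != ArrayKind.ALT && actual != ArrayKind.ALT;
    }
}
```

For the dc.creator case, `isAcceptable(ArrayKind.SEQ, ArrayKind.BAG, false)` is true, while the same call with `strict` set to true is false.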

Maybe in one of the modes an element that should be an array but isn't could automagically be wrapped into one...

(I also believe that a Bag could always be read from a Sequence...)
