Posted to users@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/07/07 20:07:11 UTC

FW: xmp parsing issue -- xmp should start with a processing instruction

All,
  This is a separate issue from the one I raised in PDFBox-2855.  This, too, was initially noted by Jeremy Anderson on TIKA-1285.  I'm not sure if this is a problem with the way our xmp was generated or with the xmp parser.  I'm fairly confident it's the former, but wanted to check.

In our test suite, we have a file that is intended to test multi-lingual titles in xmp.  I _think_ we generated this file with an older version of PDFBox+jempbox (vintage 1.6???), but I can't remember any more.

The XMP is:

<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:creator rdf:resource="mailto:plindenbaum@yahoo.fr"/>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Hello World</rdf:li>
          <rdf:li xml:lang="fr-ca">Bonjour World</rdf:li>
          <rdf:li xml:lang="zh-cn">????</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:date>2010-07-11</dc:date>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

The stacktrace is:
org.apache.xmpbox.xml.XmpParsingException: xmp should start with a processing instruction
                at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:135)
                at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:207)
                at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:127)


The original PDF file is available here: http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/test-documents/testPDFTripleLangTitle.pdf

This file was parsed without a problem by jempbox, but we get the exception with xmpbox in 2.0.0.  Should I open an issue for this, or is this user error, meaning we need to regenerate our test file to yield correct xmp?

Thank you.

            Best,

                   Tim


RE: FW: xmp parsing issue -- xmp should start with a processing instruction

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Tilman.  Will regenerate new test file.

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, July 07, 2015 2:11 PM
To: users@pdfbox.apache.org
Subject: Re: FW: xmp parsing issue -- xmp should start with a processing instruction

Hi,

We got more restrictive in 2.0 after doing the Bavaria pdfa tests. That 
file is missing "<?xpacket " at the beginning.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org




Re: FW: xmp parsing issue -- xmp should start with a processing instruction

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

We got more restrictive in 2.0 after running the Bavaria PDF/A tests. The
xmp in that file is missing the "<?xpacket " header at the beginning.

Tilman
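
For readers hitting the same exception, here is a minimal Java sketch of what the missing wrapper looks like. It uses only the JDK and does not call xmpbox; the class and method names (XPacketWrapper, hasXPacketHeader, wrap) are hypothetical and chosen for illustration. The header and trailer strings are the standard XMP packet wrapper. Whether DomXmpParser then accepts the packet is not verified here, since the parser may still reject other parts of the metadata.

```java
public class XPacketWrapper {
    // Standard XMP packet header: begin holds a byte-order mark, id is the fixed packet id.
    public static final String HEADER =
            "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>";
    // Standard XMP packet trailer; "w" marks the packet as writable in place.
    public static final String TRAILER = "<?xpacket end=\"w\"?>";

    // Returns true if the text begins (ignoring leading whitespace) with an xpacket instruction.
    public static boolean hasXPacketHeader(String xmp) {
        return xmp.trim().startsWith("<?xpacket ");
    }

    // Wraps bare XMP in the packet header and trailer; already-wrapped input is returned unchanged.
    public static String wrap(String xmp) {
        if (hasXPacketHeader(xmp)) {
            return xmp;
        }
        return HEADER + "\n" + xmp.trim() + "\n" + TRAILER;
    }

    public static void main(String[] args) {
        // Stand-in for the bare <x:xmpmeta> block quoted in the original message.
        String bare = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"></x:xmpmeta>";
        String wrapped = wrap(bare);
        System.out.println("bare has header:    " + hasXPacketHeader(bare));
        System.out.println("wrapped has header: " + hasXPacketHeader(wrapped));
    }
}
```

A check like hasXPacketHeader could be used to find test files that need regenerating before they reach the stricter 2.0 parser.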

