Posted to users@pdfbox.apache.org by Matthew Clemente <mj...@gmail.com> on 2018/07/20 19:15:42 UTC

Overwriting Metadata

Forgive me if this question has an obvious answer; perhaps I’m not taking
the right approach to the problem.

My goal is to save a copy of a PDF with modified metadata. In most
cases, I’ll be removing metadata (setting the author, title, and description
to blank), though in some cases I’ll be adding information to those fields.

I’ve tried both approaches from these StackOverflow answers:
https://stackoverflow.com/questions/40295264/how-to-add-metadata-to-pdf-document-using-pdfbox


That is, I’ve tried creating the metadata via XMPMetadata and using
importXMPMetadata(). I’ve also tried using the Document Information object
(inputDoc.getDocumentInformation().setCreator("Some meta");).

In both cases, if the field is empty in the original document I’ve loaded,
the new value is set without issue. However, if the metadata field already
contains a value, the new value is not applied.

Is there a way for me to overwrite metadata, or am I approaching this all
wrong?

Here’s a pdf I was using while testing (it has a title and author set, but
no subject): https://www.dropbox.com/s/olk2zhnh47ohtpk/testing.pdf?dl=0

Thanks, in advance, for any insight.

-- 
Matthew Clemente

Re: Overwriting Metadata

Posted by Tilman Hausherr <TH...@t-online.de>.

And

  doc.getDocumentInformation().setSubject("new subject");

sets the subject only in the document information.

Tilman


Re: Overwriting Metadata

Posted by Matthew Clemente <mj...@gmail.com>.

(I had accidentally sent this response to the wrong thread.)

First of all, I *really* appreciate all the time and help.

Particularly, thanks for pointing me to the Debugger. Foolishly, I’d just
been using Acrobat to look at the properties, which led to some
unpredictable results.

I *think* that my solution requires setting both the Document Information
and the Metadata, due to the behavior of Acrobat, Preview, and other PDF
readers.

For the sake of posterity (and feedback, if anyone has any further
experience or insights), this is what I’m seeing.

For an original PDF whose Document Information and XMP metadata match, if
I only set the XMP data, Acrobat seems to prefer the Document Information
when it displays properties. If I only set the Document Information,
Acrobat displays the XMP metadata. Consequently, in both cases it appeared
to me that my code was not overwriting the existing values, when it really
was. Using the Debugger, I was able to see the changes to the XMP metadata
and/or the Document Information.

In summary, setting both the XMP metadata and the document information
overwrites the existing data (and sets new metadata) without issue and
creates predictable results in various pdf readers.

Again, thanks for the help!

-- 
Matthew Clemente
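
The conclusion above (write to both stores so that every reader shows the same value) can be illustrated with a small, library-free model. This is only a sketch: the Pdf class, the two maps, and the displayed() "viewer" are hypothetical stand-ins invented for illustration, not PDFBox API, and the precedence rule is an assumption rather than documented Acrobat behavior.

```java
import java.util.HashMap;
import java.util.Map;

public class DualMetadataSketch {

    /** Hypothetical stand-in for a PDF: one map per metadata store. */
    public static final class Pdf {
        public final Map<String, String> info = new HashMap<>(); // models the /Info dictionary
        public final Map<String, String> xmp = new HashMap<>();  // models the XMP /Metadata stream
    }

    /** Writes a field to both stores; a null or empty value removes it from both. */
    public static void setInBoth(Pdf pdf, String field, String value) {
        if (value == null || value.isEmpty()) {
            pdf.info.remove(field);
            pdf.xmp.remove(field);
        } else {
            pdf.info.put(field, value);
            pdf.xmp.put(field, value);
        }
    }

    /** Models a viewer that consults one store first and falls back to the other. */
    public static String displayed(Pdf pdf, String field, boolean xmpFirst) {
        Map<String, String> first = xmpFirst ? pdf.xmp : pdf.info;
        Map<String, String> second = xmpFirst ? pdf.info : pdf.xmp;
        String value = first.get(field);
        return value != null ? value : second.get(field);
    }

    public static void main(String[] args) {
        Pdf pdf = new Pdf();
        setInBoth(pdf, "Title", "Old title");

        // Updating only /Info: a viewer that reads XMP first still shows the old value.
        pdf.info.put("Title", "New title");
        System.out.println(displayed(pdf, "Title", true));

        // Updating both stores: the result no longer depends on the viewer's preference.
        setInBoth(pdf, "Title", "New title");
        System.out.println(displayed(pdf, "Title", true));
        System.out.println(displayed(pdf, "Title", false));
    }
}
```

In real code the two writes would go through the Document Information object and the XMP metadata, as discussed in this thread; the model only shows why updating a single store can look like a failed overwrite.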


Re: Overwriting Metadata

Posted by Tilman Hausherr <TH...@t-online.de>.

On 20.07.2018 at 23:37, Matthew Clemente wrote:
> Thanks Tilman.
>
> I set up my code to match yours (it was pretty similar), and I’m 
> getting the same result. I can’t overwrite existing fields 
> via XMPMetadata.
>
> For what it’s worth, I’m using version 2.0.11 of PDFBox and XMPBox; 
> not sure if that would make a difference.
>
> I’m assuming, with the approach you’re using, that you are able to 
> change the Author and Title?

Your question is somewhat unclear... or I misunderstood it ... you wrote 
that you failed with both /Info and XMP /Metadata. With /Info (my small 
reply) I was able to change just the subject and keep the rest.

Does this work or not?

With the larger code I replaced the whole metadata and didn't try to 
replace just a single field.

Possible explanation: you looked at the PDFs with Adobe Reader. IIRC
that one displays what's in the XMP metadata first, i.e. when both
/Info and /Metadata are present.

What do you really want: to replace an individual field or to replace the
whole metadata?

To alter individual fields, this should work like this:

// xmpParser is a DomXmpParser; meta is the PDMetadata from the document catalog
XMPMetadata xmp = xmpParser.parse(meta.createInputStream());
DublinCoreSchema dc = xmp.getDublinCoreSchema();
if (dc != null)
{
    dc.setDescription("descr");
}
else
{
    // do as before
}


I took that code from the ExtractMetadata example from the source code 
download.

(I didn't test. It's in the middle of the night and I couldn't sleep)


Tilman
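
The "alter an individual field, otherwise fall back" idea above can also be sketched without XMPBox, using only the JDK's DOM parser on the raw XMP packet. This is a swapped-in technique for illustration, not the XMPBox approach from the message: the class name, the overwriteDescription() helper, and the simplified sample packet (no xpacket wrapper) are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XmpFieldUpdateSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /**
     * Returns the packet with the first dc:description value overwritten,
     * or null if the packet has no dc:description to overwrite
     * (the caller would then build fresh metadata instead).
     */
    public static String overwriteDescription(String xmpPacket, String newValue) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xmpPacket.getBytes(StandardCharsets.UTF_8)));

            NodeList descriptions = doc.getElementsByTagNameNS(DC_NS, "description");
            if (descriptions.getLength() == 0) {
                return null;
            }
            // dc:description is a language alternative: rdf:Alt containing rdf:li items
            Element description = (Element) descriptions.item(0);
            NodeList items = description.getElementsByTagNameNS(RDF_NS, "li");
            if (items.getLength() == 0) {
                return null;
            }
            items.item(0).setTextContent(newValue);

            // Serialize the modified DOM back to a string
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException
                | TransformerException e) {
            throw new IllegalArgumentException("Could not update XMP packet", e);
        }
    }

    public static void main(String[] args) {
        String packet = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
                + "<dc:description><rdf:Alt><rdf:li xml:lang=\"x-default\">old</rdf:li>"
                + "</rdf:Alt></dc:description>"
                + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(overwriteDescription(packet, "descr"));
    }
}
```

With XMPBox the parsing and serialization are handled by the library; the sketch only makes the branch visible: overwrite the existing entry when it is present, and otherwise create the schema as in the earlier full example.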



Re: Overwriting Metadata

Posted by Matthew Clemente <mj...@gmail.com>.

Thanks Tilman.

I set up my code to match yours (it was pretty similar), and I’m getting
the same result. I can’t overwrite existing fields via XMPMetadata.

For what it’s worth, I’m using version 2.0.11 of PDFBox and XMPBox; not
sure if that would make a difference.

I’m assuming, with the approach you’re using, that you are able to change
the Author and Title?

-- 
Matthew Clemente


Re: Overwriting Metadata

Posted by Tilman Hausherr <TH...@t-online.de>.

It works for me... here's my code:


import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import javax.xml.transform.TransformerException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.DublinCoreSchema;
import org.apache.xmpbox.xml.XmpSerializer;

public class ChangeMeta
{
    public static void main(String[] args) throws IOException, TransformerException
    {
        // try-with-resources closes the document even if saving fails
        try (PDDocument doc = PDDocument.load(new File("testing.pdf")))
        {
            // Build a fresh XMP packet containing only a Dublin Core description
            XMPMetadata xmp = XMPMetadata.createXMPMetadata();
            DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
            dc.setDescription("descr");

            // Serialize the packet to bytes
            XmpSerializer serializer = new XmpSerializer();
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            serializer.serialize(xmp, baos, true);

            // Replace the document's whole /Metadata stream with the new packet
            PDMetadata metadata = new PDMetadata(doc);
            metadata.importXMPMetadata(baos.toByteArray());
            doc.getDocumentCatalog().setMetadata(metadata);
            doc.save(new File("testing-new.pdf"));
        }
    }
}

And the proof that it worked:




Tilman
