Posted to users@pdfbox.apache.org by Adam Retter <ad...@googlemail.com> on 2016/07/18 12:15:29 UTC
Mangled diacritic characters in metadata
Using pdf-box-2.0.2:
I am trying to set dc:publisher to "Çâmára Münícìpål de Matelâñdia" in
the metadata of my PDF; however, my diacritical characters seem to get
mangled when I try to read the PDF back.
My writing code looks like:
PDDocument doc = ...
PDDocumentCatalog catalog = ...
// Reuse the catalog's existing metadata stream, or create a new one
PDMetadata metadataStream = Optional.ofNullable(catalog.getMetadata())
        .orElseGet(() -> new PDMetadata(doc));
XMPMetadata xmpMetadata = null;
try(COSInputStream is = metadataStream.createInputStream()) {
    xmpMetadata = new DomXmpParser().parse(is);
} catch(XmpParsingException e) {
    // No parseable XMP yet: start from an empty packet
    LOG.warn(e);
    xmpMetadata = XMPMetadata.createXMPMetadata();
}
DublinCoreSchema dcMetadata = xmpMetadata.createAndAddDublinCoreSchema();
dcMetadata.addPublisher("Çâmára Münícìpål de Matelâñdia");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XmpSerializer serializer = new XmpSerializer();
serializer.serialize(xmpMetadata, baos, false);
metadataStream.importXMPMetadata(baos.toByteArray());
// setMetadata takes the PDMetadata stream, not the XMPMetadata object
catalog.setMetadata(metadataStream);
My reading code looks like:
PDDocument doc = PDDocument.load(is);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
PDMetadata metadata = catalog.getMetadata();
// Copy the raw XMP bytes to disk; no charset conversion happens here
try(InputStream xmpIs = metadata.createInputStream()) {
    Files.copy(xmpIs, Paths.get("/tmp/metadata.xml"));
}
However in the output XML I am seeing this:
<dc:publisher>
<rdf:Bag>
<rdf:li>??m?ra M?n?c?p?l de Matel??dia</rdf:li>
</rdf:Bag>
</dc:publisher>
So I guess something is up with the character encoding somewhere? Is
this something I am doing incorrectly? Perhaps I need to specify UTF-8
(my character set) somewhere? Or is this a bug in pdf-box?
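
For reference, the pattern of question marks above is what Java produces when a string is encoded into a charset that cannot represent its characters. The sketch below is only an illustration, not code from this thread: the class MangleDemo and its roundTrip helper are hypothetical names, and US-ASCII is an assumed stand-in for whatever lossy charset might be involved. The string is written with \u escapes so the result does not depend on the compiler's source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MangleDemo {
    // The publisher string from the post, spelled with Unicode escapes
    static final String PUBLISHER =
            "\u00C7\u00E2m\u00E1ra M\u00FCn\u00EDc\u00ECp\u00E5l de Matel\u00E2\u00F1dia";

    // Encode to bytes and decode again with the same charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Lossy: every accented letter becomes '?'
        System.out.println(roundTrip(PUBLISHER, StandardCharsets.US_ASCII));
        // Lossless: UTF-8 can represent every character in the string
        System.out.println(roundTrip(PUBLISHER, StandardCharsets.UTF_8).equals(PUBLISHER));
    }
}
```

If a mangled value matches the first line of output, some step is probably encoding or decoding with a non-Unicode charset; the UTF-8 round trip leaves the string unchanged.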
Cheers Adam.
--
Adam Retter
skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
Re: Mangled diacritic characters in metadata
Posted by Adam Retter <ad...@googlemail.com>.
Thank you Maruan,
Apologies for the noise. I have now resolved this. I simplified my
code for the examples I gave in the email. The issue is not with
PDFBox; rather, a third-party library that processed the string
"Çâmára Münícìpål de Matelâñdia" before it reached PDFBox was mangling
it.
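
One way to locate this kind of upstream mangling is to dump the code points of the string at each stage before it is handed to PDFBox. A minimal sketch, assuming nothing about the third-party library involved; CodePointDump and describe are hypothetical names used only for illustration.

```java
import java.util.stream.Collectors;

final class CodePointDump {
    // Render each code point as U+XXXX so that a literal '?' (U+003F)
    // can be told apart from an accented letter that merely displays badly.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String intact = "\u00C7\u00E2m";  // first three characters of the publisher
        String mangled = "??m";           // what a lossy conversion leaves behind
        System.out.println(describe(intact));   // U+00C7 U+00E2 U+006D
        System.out.println(describe(mangled));  // U+003F U+003F U+006D
    }
}
```

If the dump already shows U+003F before the call to addPublisher, the damage happened earlier in the pipeline and not in the PDF metadata handling.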
Thanks again.
Adam.
On 19 July 2016 at 08:13, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
> Hi,
>
> I've tested various ways of saving the file (yours, serializing to a FileOutputStream, …) and all work when viewing the content in a browser or a text editor.
>
>
> <dc:publisher>
> <rdf:Bag>
> <rdf:li>Çâmára Münícìpål de Matelâñdia</rdf:li>
> </rdf:Bag>
> </dc:publisher>
>
> Where do you see that string?
>
> BR
> Maruan
Re: Mangled diacritic characters in metadata
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,
I've tested various ways of saving the file (yours, serializing to a FileOutputStream, …) and all work when viewing the content in a browser or a text editor.
<dc:publisher>
<rdf:Bag>
<rdf:li>Çâmára Münícìpål de Matelâñdia</rdf:li>
</rdf:Bag>
</dc:publisher>
Where do you see that string?
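
Since a viewer can itself mis-decode a file, one way to answer this without relying on a browser or editor is to check the dumped bytes directly. The following is a rough sketch and not part of PDFBox: Utf8Check and decodeStrictUtf8 are hypothetical names, and it assumes the metadata was dumped to a file such as /tmp/metadata.xml, as in the reading code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

final class Utf8Check {
    // Decode strictly: malformed input is reported instead of being
    // silently replaced, so an empty result means "not valid UTF-8".
    static Optional<String> decodeStrictUtf8(byte[] data) {
        try {
            return Optional.of(StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Paths.get(args.length > 0 ? args[0] : "/tmp/metadata.xml");
        if (!Files.isReadable(p)) {
            System.out.println("no readable file at " + p);
            return;
        }
        Optional<String> xml = decodeStrictUtf8(Files.readAllBytes(p));
        if (!xml.isPresent()) {
            System.out.println("file is not valid UTF-8");
        } else if (xml.get().contains("\u00C7\u00E2m\u00E1ra")) {
            System.out.println("diacritics are intact in the file");
        } else {
            System.out.println("valid UTF-8, but the expected text is missing");
        }
    }
}
```

If the file decodes cleanly and contains the expected text, the question marks come from the tool used to view it; if the expected text is missing, the string was altered before or while it was written.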
BR
Maruan