Posted to users@pdfbox.apache.org by Adam Retter <ad...@googlemail.com> on 2016/07/18 12:15:29 UTC

Mangled diacritic characters in metadata

Using PDFBox 2.0.2:

I am trying to set dc:publisher to "Çâmára Münícìpål de Matelâñdia" in
the metadata of my PDF; however, my diacritical characters seem to get
mangled when I read the PDF back.

My writing code looks like:

PDDocument doc = ...
PDDocumentCatalog catalog = ...

// Reuse the catalog's metadata stream if present, otherwise create one
PDMetadata metadataStream = Optional.ofNullable(catalog.getMetadata())
  .orElseGet(() -> new PDMetadata(doc));
XMPMetadata xmpMetadata = null;
try(COSInputStream is = metadataStream.createInputStream()) {
  xmpMetadata = new DomXmpParser().parse(is);
} catch(XmpParsingException e) {
  // No parseable XMP packet yet: start from an empty one
  LOG.warn(e);
  xmpMetadata = XMPMetadata.createXMPMetadata();
}
DublinCoreSchema dcMetadata = xmpMetadata.createAndAddDublinCoreSchema();
dcMetadata.addPublisher("Çâmára Münícìpål de Matelâñdia");
// setMetadata expects the PDMetadata stream, not the XMPMetadata object
catalog.setMetadata(metadataStream);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XmpSerializer serializer = new XmpSerializer();
serializer.serialize(xmpMetadata, baos, false);
metadataStream.importXMPMetadata(baos.toByteArray());


My reading code looks like:

PDDocument doc = PDDocument.load(is);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
PDMetadata metadata = catalog.getMetadata();
// Dump the raw XMP packet to disk; named metadataIn to avoid clashing with "is" above
try(InputStream metadataIn = metadata.createInputStream()) {
   Files.copy(metadataIn, Paths.get("/tmp/metadata.xml"));
}


However in the output XML I am seeing this:

<dc:publisher>
    <rdf:Bag>
        <rdf:li>??m?ra M?n?c?p?l de Matel??dia</rdf:li>
    </rdf:Bag>
</dc:publisher>


So I guess something is up with the character encoding somewhere. Am I
doing something incorrectly (perhaps I need to specify UTF-8, my
character set, somewhere), or is this a bug in PDFBox?

Cheers Adam.





-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Mangled diacritic characters in metadata

Posted by Adam Retter <ad...@googlemail.com>.
Thank you Maruan,

Apologies for the noise. I have now resolved this. I simplified my
code for the examples I gave in the email. The issue is not with
PDFBox; rather, a third-party library that processes the string
"Çâmára Münícìpål de Matelâñdia" before it reaches PDFBox was mangling
it.
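
For readers hitting the same symptom, here is a minimal sketch of how this kind of mangling can arise. The thread does not say which library was involved or what it did, so the US-ASCII round trip below is purely an assumption chosen because it reproduces the reported output; MangleDemo and roundTrip are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MangleDemo {
    // Encode a string to bytes and decode it again with the same charset,
    // as an intermediate processing step might do (assumed, not confirmed).
    static String roundTrip(String s, Charset cs) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte for every character the charset cannot represent.
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String publisher = "Çâmára Münícìpål de Matelâñdia";

        // US-ASCII cannot represent the accented characters, so each one
        // becomes '?': prints ??m?ra M?n?c?p?l de Matel??dia
        System.out.println(roundTrip(publisher, StandardCharsets.US_ASCII));

        // UTF-8 can represent all of them, so the round trip is lossless.
        System.out.println(roundTrip(publisher, StandardCharsets.UTF_8).equals(publisher));
    }
}
```

Once a character has been replaced by '?' in this way the original cannot be recovered downstream, which is why the fix has to happen before the string reaches PDFBox.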

Thanks again.

Adam.

On 19 July 2016 at 08:13, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
> Hi,
>
>
> I've tested various ways of saving the file (yours, serializing to a FileOutputStream, …) and all of them work when viewing the content in a browser or a text editor.
>
>
> <dc:publisher>
>     <rdf:Bag>
>         <rdf:li>Çâmára Münícìpål de Matelâñdia</rdf:li>
>     </rdf:Bag>
> </dc:publisher>
>
> Where do you see that string?
>
> BR
> Maruan





Re: Mangled diacritic characters in metadata

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

> On 18.07.2016 at 14:15, Adam Retter <ad...@googlemail.com> wrote:
> However in the output XML I am seeing this:
> 
> <dc:publisher>
>    <rdf:Bag>
>        <rdf:li>??m?ra M?n?c?p?l de Matel??dia</rdf:li>
>    </rdf:Bag>
> </dc:publisher>
> 
> 

I've tested various ways of saving the file (yours, serializing to a FileOutputStream, …) and all of them work when viewing the content in a browser or a text editor.


<dc:publisher>
    <rdf:Bag>
        <rdf:li>Çâmára Münícìpål de Matelâñdia</rdf:li>
    </rdf:Bag>
</dc:publisher>

Where do you see that string?
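
One way to answer that question is to look at the dumped bytes rather than at how a viewer renders them. The following is only a sketch under assumptions: MetadataBytesCheck and its methods are hypothetical helper names, and the default path is the /tmp/metadata.xml file from the reading code in the original message.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MetadataBytesCheck {
    // True if the bytes decode as strict UTF-8 with no malformed sequences.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // True if any byte lies outside the 7-bit ASCII range.
    static boolean hasNonAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "/tmp/metadata.xml";
        byte[] data = Files.readAllBytes(Paths.get(path));
        System.out.println("non-ASCII bytes present: " + hasNonAscii(data));
        System.out.println("valid UTF-8: " + isValidUtf8(data));
    }
}
```

If the file contains no non-ASCII bytes and the text shows literal '?' characters, the accented characters were probably lost before or during writing. If non-ASCII bytes are present and the file is valid UTF-8, the data is likely intact and the tool displaying it is decoding with the wrong charset. Note that an XML serializer could also emit numeric character references, which are pure ASCII, so the first check is a hint rather than proof.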

BR
Maruan


