Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/07/28 15:21:02 UTC

[jira] [Commented] (PDFBOX-3881) Handling of Byte Order Mark with Metadata-Fields

    [ https://issues.apache.org/jira/browse/PDFBOX-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105097#comment-16105097 ] 

Tilman Hausherr commented on PDFBOX-3881:
-----------------------------------------

We're conforming to the PDF specification:
{quote}
Conforming readers that process PDF files containing Unicode text strings shall be prepared to handle supplementary characters; that is, characters requiring more than two bytes to represent.
{quote}
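But considering that Adobe Reader does not display anything (even if I delete /Metadata and keep only /Info), I'll just change the ">" in the length check to ">=", so that a string consisting only of the byte order mark is no longer returned as "þÿ".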
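
For illustration, here is a minimal, self-contained sketch of what the ">=" change amounts to. It is not the PDFBox source: the class name BomDecodeSketch and the method decode() are hypothetical, and ISO-8859-1 is used only as an assumed stand-in for PDFDocEncoding.toString(), since it maps the bytes 0xFE 0xFF to "þÿ" as described in the report.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the getString() method quoted in the issue.
class BomDecodeSketch
{
    static String decode(byte[] bytes)
    {
        // ">=" instead of ">": a 2-byte array may consist of a bare BOM.
        if (bytes.length >= 2)
        {
            // 0xFE 0xFF is the UTF-16 big-endian byte order mark.
            if ((bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF)
            {
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            // 0xFF 0xFE is the UTF-16 little-endian byte order mark.
            if ((bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE)
            {
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }

        // Assumption: ISO-8859-1 replaces PDFDocEncoding.toString(bytes) here
        // so that the sketch has no PDFBox dependency.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        byte[] bomOnly = { (byte) 0xFE, (byte) 0xFF };
        byte[] bomPlusA = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };

        System.out.println("BOM only, length: " + decode(bomOnly).length());
        System.out.println("BOM + 'A': " + decode(bomPlusA));
    }
}
```

With the original "> 2" check, the two-byte input would skip both BOM branches and fall through to the fallback decoder, which is how "þÿ" ends up in the returned string.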

> Handling of Byte Order Mark with Metadata-Fields
> ------------------------------------------------
>
>                 Key: PDFBOX-3881
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3881
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.7
>         Environment: Windows
>            Reporter: Nico Prenzel
>            Assignee: Tilman Hausherr
>            Priority: Minor
>         Attachments: ERiCDruck_23776162_ESt_0_20170727_121644-pdfcreator.pdf
>
>
> PDDocumentInformation getters such as getAuthor() honor the byte order of the extracted string and strip the byte order mark (BOM).
> But if the extracted string consists only of the BOM, the corresponding string "þÿ" is returned.
> Is this the intended behavior?
> I would appreciate it if the BOM were also removed when the extracted string contains nothing else.
> Problematic code:
> {code:java}
> public String getString()
> {
>     if (this.bytes.length > 2)
>     {
>         if (((this.bytes[0] & 0xFF) == 254) && ((this.bytes[1] & 0xFF) == 255))
>         {
>             return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16BE);
>         }
>         if (((this.bytes[0] & 0xFF) == 255) && ((this.bytes[1] & 0xFF) == 254))
>         {
>             return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16LE);
>         }
>     }
>
>     return PDFDocEncoding.toString(this.bytes);
> }
> {code}
> An example PDF is attached.


