Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/07/28 15:21:02 UTC
[jira] [Commented] (PDFBOX-3881) Handling of Byte Order Mark with Metadata-Fields
[ https://issues.apache.org/jira/browse/PDFBOX-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105097#comment-16105097 ]
Tilman Hausherr commented on PDFBOX-3881:
-----------------------------------------
We're conforming to the PDF specification:
{quote}
Conforming readers that process PDF files containing Unicode text strings shall be prepared to handle supplementary characters; that is, characters requiring more than two bytes to represent.
{quote}
But considering that Adobe Reader does not display anything (even if I delete /Metadata and keep /Info only), I'll just change the ">" to ">=".
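A minimal sketch of what that one-character change might look like, for illustration only and not the actual commit. The class and method names (BomDecodeSketch, decode) are hypothetical, and ISO-8859-1 is used as a stand-in for PDFDocEncoding.toString() so that the example runs without PDFBox on the classpath:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not the PDFBox source.
class BomDecodeSketch
{
    static String decode(byte[] bytes)
    {
        // ">= 2" instead of "> 2": a string consisting only of a BOM
        // now enters this branch and decodes to an empty string.
        if (bytes.length >= 2)
        {
            if ((bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF)
            {
                // UTF-16BE byte order mark: skip it and decode the rest
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            if ((bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE)
            {
                // UTF-16LE byte order mark: skip it and decode the rest
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding.toString(bytes)
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        byte[] bomOnly = { (byte) 0xFE, (byte) 0xFF };
        // With ">= 2" the BOM-only input yields "" rather than "þÿ"
        System.out.println(decode(bomOnly).isEmpty());
    }
}
```

With the original "> 2" check, the two-byte input would fall through to the fallback decoder and come back as "þÿ", which is the behavior described in the report.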
> Handling of Byte Order Mark with Metadata-Fields
> ------------------------------------------------
>
> Key: PDFBOX-3881
> URL: https://issues.apache.org/jira/browse/PDFBOX-3881
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.7
> Environment: Windows
> Reporter: Nico Prenzel
> Assignee: Tilman Hausherr
> Priority: Minor
> Attachments: ERiCDruck_23776162_ESt_0_20170727_121644-pdfcreator.pdf
>
>
> PDDocumentInformation methods such as getAuthor() honor the byte order of the extracted string and remove the byte order mark.
> But if the extracted string contains only the byte order mark, the corresponding string "þÿ" is returned.
> Is this the intended behavior?
> I'd appreciate it if the byte order mark were also removed when the extracted string contains nothing else.
> Problematic code:
> {code:java}
> public String getString()
> {
>     if (this.bytes.length > 2)
>     {
>         if (((this.bytes[0] & 0xFF) == 254) && ((this.bytes[1] & 0xFF) == 255))
>         {
>             return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16BE);
>         }
>         if (((this.bytes[0] & 0xFF) == 255) && ((this.bytes[1] & 0xFF) == 254))
>         {
>             return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16LE);
>         }
>     }
>
>     return PDFDocEncoding.toString(this.bytes);
> }
> {code}
> The attachment is an example PDF.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)