Posted to dev@pdfbox.apache.org by "Nico Prenzel (JIRA)" <ji...@apache.org> on 2017/07/27 11:00:05 UTC

[jira] [Created] (PDFBOX-3881) Handling of Byte Order Mark with Metadata-Fields

Nico Prenzel created PDFBOX-3881:
------------------------------------

             Summary: Handling of Byte Order Mark with Metadata-Fields
                 Key: PDFBOX-3881
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3881
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 2.0.7
         Environment: Windows
            Reporter: Nico Prenzel
            Priority: Minor
         Attachments: ERiCDruck_23776162_ESt_0_20170727_121644-pdfcreator.pdf

The PDDocumentInformation getters, e.g. getAuthor(), honor the byte order of the extracted string and strip the byte order mark (BOM) from the result.

However, if the extracted string consists only of the byte order mark, the BOM is not stripped and the string "þÿ" is returned.

Is this the intended behavior?
I would appreciate it if the byte order mark were also removed when the extracted string contains nothing else.


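public String getString()
{
    // Condition highlighted by the reporter: a BOM-only string has
    // length 2, so it fails this check and skips the BOM handling.
    if (this.bytes.length > 2)
    {
        // 0xFE 0xFF: UTF-16 big-endian byte order mark
        if (((this.bytes[0] & 0xFF) == 254) && ((this.bytes[1] & 0xFF) == 255))
        {
            return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16BE);
        }
        // 0xFF 0xFE: UTF-16 little-endian byte order mark
        if (((this.bytes[0] & 0xFF) == 255) && ((this.bytes[1] & 0xFF) == 254))
        {
            return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16LE);
        }
    }

    // No BOM handled above: decode the raw bytes as PDFDocEncoding.
    return PDFDocEncoding.toString(this.bytes);
}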
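
For illustration, the requested behavior could be sketched as follows. This is a standalone approximation, not the PDFBox implementation and not a proposed patch: the class name BomAwareDecoder and the method decode are hypothetical, and ISO-8859-1 is used only as a stand-in for PDFDocEncoding so that the example compiles without PDFBox. The only change relative to the method above is that the length check uses ">= 2", so a string that contains nothing but a BOM decodes to an empty string.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class BomAwareDecoder {

    private BomAwareDecoder() {
    }

    /** Decodes text-string bytes, stripping a leading UTF-16 BOM if present. */
    static String decode(byte[] bytes) {
        // ">= 2" instead of "> 2": a BOM-only string (length == 2)
        // now takes the UTF-16 branch and yields "".
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                // UTF-16 big-endian BOM
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // UTF-16 little-endian BOM
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding here.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

With this sketch, BomAwareDecoder.decode(new byte[] {(byte) 0xFE, (byte) 0xFF}) returns an empty string rather than "þÿ", while strings with content after the BOM decode as before.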

The attached PDF is an example.



