Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2008/08/04 19:36:44 UTC
[jira] Commented: (PDFBOX-283) Character encoding/appearance issues when filling forms

    [ https://issues.apache.org/jira/browse/PDFBOX-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619602#action_12619602 ] 

Jukka Zitting commented on PDFBOX-283:
--------------------------------------

[Comment on SourceForge]
Date: 2008-06-27 11:31
Sender: nobody
Logged In: NO 

I'm not sure PrintWriter is much of a problem. If I understand it right
(probably not), the chars written through PrintWriter should be US-ASCII
anyway.

I ran into the same problem and made a patch which more or less fixes it.
My remaining problem is with multiline text boxes. If the multiline flag is
enabled, PDAppearance.setAppearanceValue will call convertToMultiline, which
replaces newlines in the value with a little PDF code. I think that, with my
changes, this PDF code will get escaped and will show up in the rendered
document.

But then I wonder what happens, with or without my patch, if the field
value contains ")", "\" or other chars that should be escaped. In fact, some
places in PDAppearance seem to treat PDAppearance.value as a plain unescaped
value, while others seem to expect it to be escaped PDF code.
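
To make the escaping concern concrete, here is a minimal sketch of
backslash-escaping the characters that would otherwise end or corrupt a PDF
literal string. The class and method names (PdfLiteralEscape, escape,
showText) are hypothetical and are not part of PDFBox; this only illustrates
what "escaped" would mean for a value written between parentheses.

```java
// Hypothetical helper, not PDFBox code: escapes a value for use inside
// a PDF literal string, i.e. between "(" and ")".
final class PdfLiteralEscape {
    private PdfLiteralEscape() {}

    /** Prefixes "(", ")" and "\" with a backslash; other chars pass through. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '(' || c == ')' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a show-text operation for an unescaped field value. */
    static String showText(String value) {
        return "(" + escape(value) + ") Tj";
    }
}
```

Without such a step, a value containing an unbalanced ")" would close the
string early and the rest of the value would be read as content stream
operators.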

I kept it unescaped and escaped it in insertGeneratedAppearance(). I was
thinking of instead storing an escaped version in PDAppearance.value and,
when multiline is on, escaping it line by line before applying
convertToMultiline, but that would increase breakage in the fontSize and
line length calculations. I don't know how to fix that, because I'm not sure
how much rendering calculation is wanted in PDFBox, and size calculations
depend on rendering considerations. That is, I don't know which of the
things that don't work are really meant to be fixed and which are designed
limitations to keep it manageable.
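
The "escape line by line" idea could look roughly like the sketch below. It
is an assumption-laden illustration: it does not reproduce what
convertToMultiline actually emits, it uses the standard PDF operator T* to
move to the next line (which requires the text leading to have been set
elsewhere), and the names MultilineSketch and toContent are hypothetical.

```java
// Hypothetical sketch, not PDFBox code: escape each line separately, then
// join the lines with the PDF "T*" (move to next line) operator.
final class MultilineSketch {
    private MultilineSketch() {}

    /** Backslash-escapes "(", ")" and "\" for a PDF literal string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(' || c == ')' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Emits one "(...) Tj" per input line, separated by "T*". */
    static String toContent(String value) {
        // Limit -1 keeps trailing empty lines instead of dropping them.
        String[] lines = value.split("\r\n|\r|\n", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                sb.append("T*\n"); // assumes the leading (TL) is already set
            }
            sb.append('(').append(escape(lines[i])).append(") Tj\n");
        }
        return sb.toString();
    }
}
```

Because the escaping happens per line, the line-break operators are never
passed through the escaper. The fontSize and line length calculations would
still need the unescaped text, which is the breakage mentioned above.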

The patch:

--- PDFBox-0.7.3-orig/src/org/pdfbox/pdmodel/interactive/form/PDAppearance.java 2006-09-26 21:14:58.000000000 +0200
+++ PDFBox-0.7.3/src/org/pdfbox/pdmodel/interactive/form/PDAppearance.java      2008-06-27 13:15:24.000000000 +0200
@@ -408,7 +408,12 @@
         {
             throw new IOException( "Error: Unknown justification value:" + q );
         }
-        printWriter.println("(" + value + ") Tj");
+               COSString val = new COSString(value);
+               ByteArrayOutputStream valOutStream = new ByteArrayOutputStream();
+               // writePDF only writes US-ASCII chars (if value has anything else, uses
+               // hexadecimal representation, which is ascii)
+               val.writePDF(valOutStream);
+               printWriter.println(new String(valOutStream.toByteArray(), "US-ASCII") + " Tj");
         printWriter.println("ET" );
         printWriter.flush();
     }
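
For readers without the PDFBox sources at hand, the behaviour the patch
comment relies on (a literal string for plain ASCII, a hexadecimal string
otherwise, so the output is always ASCII) can be sketched as below. This is
a standalone illustration under that stated assumption, not the COSString
implementation; PdfStringToken and toToken are hypothetical names, and the
caller decides which bytes represent the text.

```java
// Hypothetical sketch, not the PDFBox COSString code: writes bytes either as
// an escaped literal string or as a hex string, so the result is pure ASCII.
final class PdfStringToken {
    private PdfStringToken() {}

    static String toToken(byte[] bytes) {
        // Use the literal form only if every byte is printable ASCII.
        boolean printableAscii = true;
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v < 0x20 || v > 0x7E) {
                printableAscii = false;
                break;
            }
        }
        StringBuilder sb = new StringBuilder();
        if (printableAscii) {
            sb.append('(');
            for (byte b : bytes) {
                char c = (char) b;
                if (c == '(' || c == ')' || c == '\\') {
                    sb.append('\\'); // escape string delimiters and backslash
                }
                sb.append(c);
            }
            sb.append(')');
        } else {
            sb.append('<');
            for (byte b : bytes) {
                sb.append(String.format("%02X", b & 0xFF));
            }
            sb.append('>');
        }
        return sb.toString();
    }
}
```

Either form can be followed by " Tj" in the appearance stream. How a viewer
maps the bytes to glyphs still depends on the font's encoding, which this
sketch does not address.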


> Character encoding/appearance issues when filling forms
> -------------------------------------------------------
>
>                 Key: PDFBOX-283
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-283
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel.AcroForm
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1735902
> Originally submitted by scop on 2007-06-12 10:23.
> When filling a text field with non-ASCII characters such as in my surname "Skyttä" and saving the document in a UTF-8 environment, something goes wrong with the appearance of the text.
> The value itself seems to be stored correctly, but when opening the doc, the "ä" does not appear as itself but as what you get when UTF-8 is mistakenly treated as ISO-8859-1 (two garbage characters).
> PDAppearance uses the platform default encoding in quite a few places, which apparently has the potential to mess things up.  In particular, insertGeneratedAppearance() generates a PrintWriter from an OutputStream without specifying the encoding.  In fact, if I hack that to use ISO-8859-1, the appearance of my "ä" case is correct, but that obviously won't work with anything other than chars that are valid ISO-8859-1.
> In which char encoding should the value be written to the appearance stream (at end of insertGeneratedAppearance())?
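
The symptom in the quoted report can be reproduced in isolation with plain
JDK classes. The sketch below is only an illustration (EncodingDemo and its
methods are hypothetical names, not PDFBox code): it shows how one "ä"
becomes two characters when UTF-8 bytes are read as ISO-8859-1, and how
wrapping the stream in an OutputStreamWriter with an explicit charset removes
the dependency on the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not PDFBox code.
final class EncodingDemo {
    private EncodingDemo() {}

    /** Encodes as UTF-8, then wrongly decodes the bytes as ISO-8859-1. */
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Writes text through a PrintWriter whose charset is chosen explicitly. */
    static byte[] writeWith(String s, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, charset));
        writer.print(s);
        writer.flush();
        return out.toByteArray();
    }
}
```

With this, misdecode("\u00e4") yields the two characters U+00C3 and U+00A4,
and writeWith("\u00e4", StandardCharsets.ISO_8859_1) produces a single byte
while the UTF-8 variant produces two.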
