Posted to users@pdfbox.apache.org by "Steffen R." <ra...@googlemail.com> on 2013/06/12 10:18:19 UTC

Problem with Unicode text in PDF form text field

Hello,

I am facing a problem that might be a bug. This is the scenario: Loading a
PDF, filling in some form text fields and saving it back to PDF. When I do
this

PDDocument doc = null;
        try
        {
            doc = PDDocument.load( "Test.pdf" );

            PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
            PDVariableText field = (PDVariableText) form.getField("testField");
            field.setValue("Test it 123456789012345 äüö?ß! á Ф ф Й й άγγελος");

            doc.save( "TestFilled.pdf" );
        } catch (COSVisitorException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        finally
        {
            if( doc != null )
            {
                try {
                    doc.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

with the attached PDF file (created from scratch with Acrobat XI Standard),
the field is filled in the saved PDF file, but the characters are not
displayed as written in the code. And now the most curious thing: if you
click into the form field, the correct text is shown. Very strange.

Is someone facing a similar problem? Is this a known bug? Does a workaround
or patch exist?

I took a look at the source code. It seems that besides the normal field
value, an additional "appearance" for displaying the field value is added,
and the way it is currently implemented may not support Unicode.

Thanks in advance for any help,
Steffen Harbich

Re: Problem with Unicode text in PDF form text field

Posted by "Steffen R." <ra...@googlemail.com>.
Hi Mike,

unfortunately, this is not the problem I have. The form value is displayed
correctly when I click into the field in the saved PDF, but not when the
field is unfocused. In the meantime I found an issue in the PDFBox bug
tracker that describes the same encoding-related problem:

https://issues.apache.org/jira/browse/PDFBOX-1231

I tried the font_and_umlaut.patch, which does cause the German umlauts to
be displayed correctly, but the Russian and Greek characters are shown as
question marks.

In the method

 private void insertGeneratedAppearance( PDAnnotationWidget fieldWidget,
         OutputStream output, PDFont pdFont, List tokens,
         PDAppearanceStream appearanceStream ) throws IOException

the following line seems to be the problem:

 printWriter.println("(" + value + ") Tj");

The value, which contains Unicode characters, is written directly, without
the leading UTF-16BE byte order mark 0xFE 0xFF that the COSString class
emits. That class should probably be used at this line too. Somehow.
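
The difference described above can be illustrated with plain Java, without PDFBox. This is a minimal sketch, not the PDFBox implementation: the class and method names are hypothetical, and the use of ISO-8859-1 as the single-byte encoding on the write path is an assumption made only to reproduce the question marks. It shows how a BOM-prefixed UTF-16BE PDF hex string is formed, and what happens to Cyrillic characters when they are forced through a single-byte encoding. Note that in a content stream the operand of Tj is interpreted through the current font's encoding, so whether the BOM-prefixed form alone is sufficient at this line remains unverified here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of PDFBox.
final class PdfTextStringSketch {

    /** Encodes text as a PDF hex string: BOM FE FF followed by the UTF-16BE bytes. */
    static String toUtf16BeHexString(String text) {
        // StandardCharsets.UTF_16BE does not emit a BOM itself, so it is added by hand.
        byte[] payload = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder("<FEFF");
        for (byte b : payload) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    /** Assumed lossy path: characters outside ISO-8859-1 are replaced by '?'. */
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String cyrillicEf = "\u0424"; // the Cyrillic capital letter Ef from the test value
        System.out.println(toUtf16BeHexString(cyrillicEf)); // <FEFF0424>
        System.out.println(throughLatin1(cyrillicEf));      // ?
        System.out.println(throughLatin1("\u00E4"));        // a-umlaut survives the round trip
    }
}
```

The lossy path matches the observed behaviour after the patch: umlauts fit into a single-byte Latin encoding, while Russian and Greek characters do not.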

Issue PDFBOX-1231 has been present since version 1.6. Is there any effort
to fix it soon?

Greetings,
Steffen


On Wed, Jun 12, 2013 at 1:19 PM, Bain, Michael <Mi...@mckesson.com> wrote:

> You may be in luck!  I had a similar problem with missing characters that
> weren’t really missing and I just figured it out last night.  I found this
> answer on StackOverflow that describes in detail the challenge of text
> replacement in a pre-existing PDF.
>
>
> http://stackoverflow.com/questions/15964704/java-pdfbox-reading-and-modifying-a-pdf-with-special-characters-diacritics
>
> See the answer by Plinth.  Basically what I found in mine was that any
> character that had not been previously used within the PDF when it was
> rendered disappeared.  Reading his post made me realize that only a subset
> of the font was being included within the embedded font in the file.  I
> ended up just adding a junk line with all of the characters to my file
> during rendering to test this, and it cleared up the problem.  The color
> and size of the line don’t seem to matter, it is just whether or not the
> rendering decides if the character is needed within the subset or not.
>  Hope this helps!
>
> Thanks...Mike

RE: Problem with Unicode text in PDF form text field

Posted by "Bain, Michael" <Mi...@McKesson.com>.
You may be in luck!  I had a similar problem with missing characters that weren’t really missing and I just figured it out last night.  I found this answer on StackOverflow that describes in detail the challenge of text replacement in a pre-existing PDF.

http://stackoverflow.com/questions/15964704/java-pdfbox-reading-and-modifying-a-pdf-with-special-characters-diacritics

See the answer by Plinth.  Basically what I found in mine was that any character that had not been previously used within the PDF when it was rendered disappeared.  Reading his post made me realize that only a subset of the font was being included within the embedded font in the file.  I ended up just adding a junk line with all of the characters to my file during rendering to test this, and it cleared up the problem.  The color and size of the line don’t seem to matter, it is just whether or not the rendering decides if the character is needed within the subset or not.  Hope this helps!

Thanks...Mike
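
The "junk line with all of the characters" idea can be sketched in plain Java. This is only an illustration under assumptions: the class and method names are hypothetical, and nothing here touches the PDF or the font itself. It merely collects the distinct characters that the texts to be inserted will need, so they can be written once while the document is rendered and thereby end up in the embedded font subset.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of PDFBox or any PDF library.
final class RequiredGlyphsSketch {

    /** Collects the distinct Unicode code points used by the given texts, in sorted order. */
    static Set<Integer> requiredCodePoints(String... values) {
        Set<Integer> codePoints = new TreeSet<>();
        for (String value : values) {
            value.codePoints().forEach(codePoints::add);
        }
        return codePoints;
    }

    /** Builds a single line containing each required character exactly once. */
    static String asProbeLine(Set<Integer> codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<Integer> needed = requiredCodePoints("Test it", "\u00E4\u00FC\u00F6");
        System.out.println(asProbeLine(needed));
    }
}
```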
