Posted to users@pdfbox.apache.org by Andrea Vacondio <an...@gmail.com> on 2017/07/10 17:22:12 UTC

UTF16 encoded string to PDFDocEncoding

Hi, we came across a case where we are basically cloning outline items
whose original title is a UTF-16BE encoded text string containing the
value 00A0 (non-breaking space). When we later use that string to set the
title of a new outline item, the A0 is recognised as a € sign.
Here is a simple test:

        COSString victim = COSString
                .parseHex("FEFF004300680061007000740065007200A0");
        PDOutlineItem node = new PDOutlineItem();
        node.setTitle(victim.getString());

If you look at the node dictionary you'll see that the title value is
Chapter€

Re: UTF16 encoded string to PDFDocEncoding

Posted by Tilman Hausherr <TH...@t-online.de>.

Fixed in https://issues.apache.org/jira/browse/PDFBOX-3864

Tilman

On 11.07.2017 at 16:06, Tilman Hausherr wrote:
> The cause is "gaps" in the PDFDocEncoding specification that were
> missed in the implementation. I'll create an issue later.
>
> Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: UTF16 encoded string to PDFDocEncoding

Posted by Tilman Hausherr <TH...@t-online.de>.

The cause is "gaps" in the PDFDocEncoding specification that were
missed in the implementation. I'll create an issue later.
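
To illustrate what such a gap means, here is a minimal sketch in plain Java. It is not PDFBox's code: the class TextStringSketch and its helpers are hypothetical, and the check covers only a simplified subset of PDFDocEncoding (printable ASCII plus the A1..FF block). The point it shows is that U+00A0 and U+00AD have no single-byte code of their own, so a string containing them should be written as UTF-16BE with a FEFF byte order mark rather than as single bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not PDFBox's actual implementation.
final class TextStringSketch {

    // Hypothetical helper: true if c has a single-byte PDFDocEncoding code.
    // Simplified on purpose: only printable ASCII and the block A1..FF are
    // accepted, and the gaps U+00A0 and U+00AD are rejected.
    static boolean isPdfDocEncodable(char c) {
        if (c >= 0x20 && c <= 0x7E) {
            return true;
        }
        return c >= 0xA1 && c <= 0xFF && c != 0xAD;
    }

    // Single bytes when every character is encodable, otherwise
    // the FEFF byte order mark followed by UTF-16BE.
    static byte[] encode(String text) {
        boolean singleByte = true;
        for (int i = 0; i < text.length(); i++) {
            if (!isPdfDocEncodable(text.charAt(i))) {
                singleByte = false;
                break;
            }
        }
        if (singleByte) {
            // For the accepted subset the code equals the Latin-1 byte.
            return text.getBytes(StandardCharsets.ISO_8859_1);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xFE);
        out.write(0xFF);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        out.write(utf16, 0, utf16.length);
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 43686170746572
        System.out.println(toHex(encode("Chapter")));
        // prints FEFF004300680061007000740065007200A0
        System.out.println(toHex(encode("Chapter\u00A0")));
    }
}
```

With a check like this, the title from the test case keeps its UTF-16BE form instead of being reduced to a single A0 byte.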

Tilman


Re: UTF16 encoded string to PDFDocEncoding

Posted by Andrea Vacondio <an...@gmail.com>.

I'm talking about the node dictionary, try adding this:
System.out.println(node.getTitle());
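
For readers without PDFBox at hand, the symptom can be sketched in plain Java. This assumes the title ended up stored as single bytes with the non-breaking space written as byte A0, which is my reading of the behaviour and not something taken from the PDFBox sources. The class PdfDocDecodeSketch is hypothetical and handles only the one PDFDocEncoding code that matters here: A0 is the euro sign.

```java
// Illustrative sketch only; not PDFBox's actual implementation.
final class PdfDocDecodeSketch {

    // Decodes bytes as a heavily simplified PDFDocEncoding:
    // A0 is mapped to the euro sign (U+20AC), and every other byte
    // is treated as Latin-1, which is an assumption made for brevity.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int code = b & 0xFF;
            if (code == 0xA0) {
                sb.append('\u20AC');
            } else {
                sb.append((char) code);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Chapter" followed by the single byte A0, with no FEFF marker.
        byte[] stored = {0x43, 0x68, 0x61, 0x70, 0x74, 0x65, 0x72, (byte) 0xA0};
        String title = decode(stored);
        // The last character is U+20AC, not U+00A0.
        System.out.println(Integer.toHexString(title.charAt(title.length() - 1)));
    }
}
```

Under that assumption, reading the title back gives Chapter€ even though the string passed to setTitle ended with a non-breaking space.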

On Tue, Jul 11, 2017 at 12:20 PM, Andreas Lehmkühler <an...@lehmi.de>
wrote:

> > Andreas Lehmkühler <an...@lehmi.de> wrote on 11 July 2017 at 12:17:
> >
> > How do you look at the dictionary?
> >
> > The following code:
> > COSString victim = COSString.parseHex("FEFF004300680061007000740065007200A0");
> > System.out.println(victim.toHexString());
> > System.out.println(victim.getString());
> Oops, something is missing ....
>
> The output looks good to me:
> FEFF004300680061007000740065007200A0
> Chapter
> Note the second line ends with a space
>
> Andreas

Re: UTF16 encoded string to PDFDocEncoding

Posted by Andreas Lehmkühler <an...@lehmi.de>.

> Andreas Lehmkühler <an...@lehmi.de> wrote on 11 July 2017 at 12:17:
> 
> How do you look at the dictionary?
> 
> The following code:
> COSString victim = COSString.parseHex("FEFF004300680061007000740065007200A0");
> System.out.println(victim.toHexString());
> System.out.println(victim.getString());
Oops, something is missing ....

The output looks good to me:
FEFF004300680061007000740065007200A0
Chapter 
Note the second line ends with a space


Andreas


Re: UTF16 encoded string to PDFDocEncoding

Posted by Andreas Lehmkühler <an...@lehmi.de>.

> Andrea Vacondio <an...@gmail.com> wrote on 10 July 2017 at 19:22:
>
> If you look at the node dictionary you'll see that the title value is
> Chapter€
How do you look at the dictionary?

The following code:
COSString victim = COSString.parseHex("FEFF004300680061007000740065007200A0");
System.out.println(victim.toHexString());
System.out.println(victim.getString());
