Posted to users@pdfbox.apache.org by Andrea Vacondio <an...@gmail.com> on 2017/07/10 17:22:12 UTC
UTF16 encoded string to PDFDocEncoding
Hi, we came across a case where we are cloning outline items and the
original outline title is a UTF-16BE encoded text string containing the
value 00A0 (non-breaking space). When we later use that string to set the
title of a new outline item, the A0 is recognised as a € sign.
Here is a simple test:
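import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;

// FEFF byte order mark, then "Chapter" followed by U+00A0 in UTF-16BE
COSString victim = COSString.parseHex("FEFF004300680061007000740065007200A0");
PDOutlineItem node = new PDOutlineItem();
node.setTitle(victim.getString());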
If you look at the node dictionary you'll see that the title value is
Chapter€
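The mismatch described above can be reproduced at the byte level without PDFBox. The following is only a sketch using the plain JDK: the class and method names are made up for illustration, and `decodePdfDocByte` is a deliberately simplified stand-in for a PDFDocEncoding decoder (it models printable ASCII, byte A0 and the range A1..FF only), not PDFBox's implementation. It shows that the UTF-16BE string ends in U+00A0, and that a single byte A0 read back as PDFDocEncoding yields the euro sign.

```java
import java.nio.charset.StandardCharsets;

final class OutlineTitleSketch {
    // Hex value from the test above: FEFF byte order mark, then UTF-16BE code units.
    static final String HEX = "FEFF004300680061007000740065007200A0";

    // Convert a hex string to raw bytes (two hex digits per byte).
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decode a text string that starts with the 2-byte UTF-16BE byte order mark.
    static String decodeUtf16Be(byte[] bytes) {
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }

    // Simplified, hypothetical stand-in for a PDFDocEncoding decoder: only
    // printable ASCII, A0 and A1..FF (minus AD) are modelled here.
    static char decodePdfDocByte(int code) {
        if (code == 0xA0) {
            return '\u20AC'; // in PDFDocEncoding, byte A0 is the euro sign
        }
        boolean ascii = code >= 0x20 && code <= 0x7E;
        boolean latin = code >= 0xA1 && code <= 0xFF && code != 0xAD;
        return (ascii || latin) ? (char) code : '\uFFFD';
    }

    public static void main(String[] args) {
        String title = decodeUtf16Be(fromHex(HEX));
        char last = title.charAt(title.length() - 1);
        System.out.printf("decoded last char: U+%04X%n", (int) last);
        // Writing that char as the single byte A0 and reading it back:
        System.out.printf("read back as:      U+%04X%n", (int) decodePdfDocByte(last & 0xFF));
    }
}
```

So the decoding step is fine; the character only changes if it is later stored as a single byte A0.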
Re: UTF16 encoded string to PDFDocEncoding
Posted by Tilman Hausherr <TH...@t-online.de>.
Fixed in https://issues.apache.org/jira/browse/PDFBOX-3864
Tilman
On 11.07.2017 at 16:06, Tilman Hausherr wrote:
> The cause is "gaps" in the PDFDocEncoding specification that were
> missed in the implementation. I'll create an issue later.
>
> Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
Re: UTF16 encoded string to PDFDocEncoding
Posted by Tilman Hausherr <TH...@t-online.de>.
The cause is "gaps" in the PDFDocEncoding specification that were
missed in the implementation. I'll create an issue later.
Tilman
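One way to picture such "gaps" on the encoding side is a check that refuses to treat every code point below 0x100 as a single-byte value. The sketch below is an assumption-laden illustration in plain JDK Java, not the change made in PDFBOX-3864: the class and method names are hypothetical, and the table is simplified (it ignores the special characters PDFDocEncoding defines in 18..1F and 80..9E, and it sends the euro sign to the UTF-16BE branch rather than mapping it to byte A0). Anything that does not fit is written as UTF-16BE with a byte order mark.

```java
import java.nio.charset.StandardCharsets;

final class TextStringEncoderSketch {
    // Simplified, hypothetical check: printable ASCII and A1..FF (minus AD)
    // map to the same byte value; everything else, including U+00A0, is a gap.
    static boolean fitsSingleByte(char c) {
        if (c >= 0x20 && c <= 0x7E) {
            return true;
        }
        return c >= 0xA1 && c <= 0xFF && c != 0xAD;
    }

    // Encode as single bytes when every char fits, else as BOM + UTF-16BE.
    static byte[] encode(String text) {
        boolean allFit = true;
        for (int i = 0; i < text.length(); i++) {
            if (!fitsSingleByte(text.charAt(i))) {
                allFit = false;
                break;
            }
        }
        if (allFit) {
            // Safe here because only identity-mapped chars passed the check.
            return text.getBytes(StandardCharsets.ISO_8859_1);
        }
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE); // no BOM added
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Upper-case hex dump, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode("Chapter")));
        System.out.println(toHex(encode("Chapter\u00A0")));
    }
}
```

With a check like this, a title ending in U+00A0 keeps its UTF-16BE form instead of being squeezed into a single byte A0.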
Re: UTF16 encoded string to PDFDocEncoding
Posted by Andrea Vacondio <an...@gmail.com>.
I'm talking about the node dictionary. Try adding this:
System.out.println(node.getTitle());
On Tue, Jul 11, 2017 at 12:20 PM, Andreas Lehmkühler <an...@lehmi.de> wrote:
> > How do you look at the dictionary?
> >
> > The following code:
> > COSString victim = COSString.parseHex( "FEFF004300680061007000740065007200A0" );
> > System.out.println( victim.toHexString() );
> > System.out.println( victim.getString() );
> Oops, something is missing ...
>
> The output looks good to me:
> FEFF004300680061007000740065007200A0
> Chapter
> Note the second line ends with a space
Re: UTF16 encoded string to PDFDocEncoding
Posted by Andreas Lehmkühler <an...@lehmi.de>.
> On 11 July 2017 at 12:17, Andreas Lehmkühler <an...@lehmi.de> wrote:
> How do you look at the dictionary?
>
> The following code:
> COSString victim = COSString.parseHex( "FEFF004300680061007000740065007200A0" );
> System.out.println( victim.toHexString() );
> System.out.println( victim.getString() );
Oops, something is missing ...
The output looks good to me:
FEFF004300680061007000740065007200A0
Chapter
Note the second line ends with a space
Andreas
Re: UTF16 encoded string to PDFDocEncoding
Posted by Andreas Lehmkühler <an...@lehmi.de>.
> On 10 July 2017 at 19:22, Andrea Vacondio <an...@gmail.com> wrote:
> If you look at the node dictionary you'll see that the title value is
> Chapter€
How do you look at the dictionary?
The following code:
COSString victim = COSString.parseHex( "FEFF004300680061007000740065007200A0" );
System.out.println( victim.toHexString() );
System.out.println( victim.getString() );