You are viewing a plain text version of this content. The canonical link for it is here.

Posted to fop-dev@xmlgraphics.apache.org by Craig Ringer <cr...@postnewspapers.com.au> on 2012/03/29 01:24:49 UTC

PDFName.getName() returns escaped name?!

Hi all

I've been working with PDFName in my code and have run into a bit of an 
oddity I was hoping for comments on.

For any given string `fred', the operation:

    ( new PDFName(fred) ).getName().equals(fred)

isn't guaranteed to be true, because PDFName.getName() returns the 
*escaped* name. It strips the leading slash added by 
PDFName.escapeName(), so most of the time the returned name will be the 
same, but it's a good candidate for creating exciting bugs.

I'd like to be able to use PDFName instead of String as a map key (for 
clarity, mostly), but need to be able to get the original encapsulated 
string quickly (without decoding) and reliably.

I'd like to change PDFName so that it keeps a reference to the original 
name string and returns that from getName(). It should encode the name 
on the first call to the new getEncodedName() method, storing it in a 
local member, so short-lived PDFName objects don't waste time encoding 
strings. I'd also like to have getEncodedName() return a byte[] not a 
String, since an encoded PDF name isn't actually text data.

BTW, is there any reason Fop's PDF library uses java.lang.String when 
working with sequences of PDF data bytes? For example, the output of 
PDFName.escapeName(...) isn't really a "string" at all, in that it's not 
meaningful text in any encoding, it's just a byte sequence jammed into 
the lower 8 bits of unicode code points. It's pretty confusing having it 
as a String (logically an array of unicode characters) rather than as a 
byte[]. Right now, fop also writes 8-bit characters in names incorrectly 
- the toHex(...) and PDFName.escapeName(...) methods translate values 
between 128 and 255 inclusive of each *unicode* *character* in a String 
to hex and write that out. This is incorrect, because PDF names should 
be UTF-8, so it should be encoding to a UTF-8 byte sequence then escaping.

--
Craig Ringer

POST Newspapers
276 Onslow Rd, Shenton Park
Ph: 08 9381 3088     Fax: 08 9388 2258
ABN: 50 008 917 717
http://www.postnewspapers.com.au/

Re: PDFName.getName() returns escaped name?!

Posted by Craig Ringer <ri...@ringerc.id.au>.

On 04/04/12 18:02, Vincent Hennebert wrote:
> Hi Craig,
>
> Thanks for your extensive study!
>
> On 03/04/12 10:31, Craig Ringer wrote:
>> On 03/04/12 01:16, Vincent Hennebert wrote:
>>
>>>> From a quick look that sounds about right. Are you developing with a 1.7 JDK?
>>>> You would have to make your code 1.5-compatible. Also, it would be good if
>>>> you could back your optimizations with profiling data. If code
>>>> safety/readability have to be compromised there has to be a good reason
>>>> for it.
>> Yep, I'm on 1.7, but was building for a 1.5 target and source level.
> Careful, because that doesn’t ensure you that your code will run on
> a 1.5 JVM. You may still be using elements of the standard library that
> appeared only in 1.6+.
Thanks - I forgot about that. Sigh. The glacial pace at which people
update JREs and JDKs drives me quite nuts sometimes - but at least fop
has moved to 1.5!

Time to do some JDK archaeology...

--
Craig Ringer

Re: PDFName.getName() returns escaped name?!

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Craig,

Thanks for your extensive study!

On 03/04/12 10:31, Craig Ringer wrote:
> On 03/04/12 01:16, Vincent Hennebert wrote:
> 
>>> From a quick look that sounds about right. Are you developing with a 1.7 JDK?
>>> You would have to make your code 1.5-compatible. Also, it would be good if
>>> you could back your optimizations with profiling data. If code
>>> safety/readability have to be compromised there has to be a good reason
>>> for it.
> 
> Yep, I'm on 1.7, but was building for a 1.5 target and source level.

Careful, because that doesn’t ensure you that your code will run on
a 1.5 JVM. You may still be using elements of the standard library that
appeared only in 1.6+.


<snip/>
>> That said, please do feel free to give it a try if that route appeals to
>> you. We would certainly consider a switch if it looks more promising in
>> the long term.
> 
> I'm wondering how practical it'd be to progressively adopt PDFBox,
> actually, rather than doing an abrupt and total switch.
> 
> The PDF primitives in PDFBox (COSName, COSNumber, COSDictionary, etc)
> are modeled extremely similarlines as those in FOP's PDF library, and
> while they won't be drop-in replacements they behave very similarly,
> just with different method names and in some cases different datatype
> assumptions (PDFName instead of String dictionary keys for example). The
> only truly big difference looks to be in the handling of indirect
> objects, where FOP uses one class that may be direct or indirect, and
> PDFBox uses a dedicated `COSObject' class that wraps other objects for
> indirect objects.

Yep, I think we want to do the same in FOP. As a first step during our
work on object streams, we had a chance to factorize into PDFDocument
the outputting of obj/endobj for indirect objects. Before, there were
‘if (hasObjectNumber())’ statements spread all over the PDF classes.


> If it proves possible to do a largely 1:1 substitution of PDFBox
> primitive PDF classes for FOP ones that'd *greatly* simplify the job of
> moving to using the PDFBox `PD' model for document output, and would let
> the job be done in a couple of distinct and separate chunks. PDFFont,
> PDFXObject, etc would build on top of the COS types like COSDictionary,
> COSBase and COSObject, rather than on top of PDFObject, PDFDictionary, etc.

I’m not too sure about this one? Even if you manage to convert the FOP
higher-level PDF classes to use PDFBox primitives, what does that bring
you? IIUC you would still have to make major changes to move to the PD
model.


<snip/>

>  If I get time (big if, at the moment) I'll see if I can have a play and
> determine what kind of work is involved in doing this.

Let us know how you’re getting on.


> --
> Craig Ringer

Thanks,
Vincent

Re: PDFName.getName() returns escaped name?!

Posted by Craig Ringer <ri...@ringerc.id.au>.

On 03/04/12 01:16, Vincent Hennebert wrote:

>> From a quick look that sounds about right. Are you developing with a 1.7 JDK?
>> You would have to make your code 1.5-compatible. Also, it would be good if
>> you could back your optimizations with profiling data. If code
>> safety/readability have to be compromised there has to be a good reason
>> for it.

Yep, I'm on 1.7, but was building for a 1.5 target and source level.

The posted PDFName is really an example of what I'm thinking of; it's
not something that can be seriously merged without a *lot* of changes
across the rest of the tree to convert String name handling to PDFName,
to check for leading "/"s being added, to check for use of toString() in
name output to PDF data, etc. It's more a starting point for discussion
and to see what kind of interest there is in working on the pdf lib.

It looks like folks here are pretty receptive to significantly reworking
the PDF lib if there's someone willing to stump up the time. That might
well be me if it turns out porting the whole renderer to PDFBox instead
turns out to be lots harder than reworking the fop pdf lib. I suspect not.

>
> That said, please do feel free to give it a try if that route appeals to
> you. We would certainly consider a switch if it looks more promising in
> the long term.

I'm wondering how practical it'd be to progressively adopt PDFBox,
actually, rather than doing an abrupt and total switch.

The PDF primitives in PDFBox (COSName, COSNumber, COSDictionary, etc)
are modeled extremely similarlines as those in FOP's PDF library, and
while they won't be drop-in replacements they behave very similarly,
just with different method names and in some cases different datatype
assumptions (PDFName instead of String dictionary keys for example). The
only truly big difference looks to be in the handling of indirect
objects, where FOP uses one class that may be direct or indirect, and
PDFBox uses a dedicated `COSObject' class that wraps other objects for
indirect objects.

If it proves possible to do a largely 1:1 substitution of PDFBox
primitive PDF classes for FOP ones that'd *greatly* simplify the job of
moving to using the PDFBox `PD' model for document output, and would let
the job be done in a couple of distinct and separate chunks. PDFFont,
PDFXObject, etc would build on top of the COS types like COSDictionary,
COSBase and COSObject, rather than on top of PDFObject, PDFDictionary, etc.

It's not a perfect 1:1 mapping; notable imperfect matches where one or
more classes map to one or more different classes include:

  FOP                 PDFBOX
  ---------------------------------------------------
  java.lang.String,   COSName (for PDF names)
  FOP PDFText         java.lang.String (for Unicode text)
                      byte[] or a binbuf/binstr class (for binary data)
  ---------------------------------------------------
  PDFNumber           COSNumber
                      COSInteger
                      COSFloat
  ---------------------------------------------------
  ???
  (j.l.Boolean?)      COSBoolean
  ---------------------------------------------------
  PDFObject           COSBase  (direct)
                      COSObject  (wraps a COSBase, indirect)
  ----------------------------------------------------
  PDFDocument         COSDocument
                      [+a new class with fop-only functionality]
                      PDDocument

Of those, the first already needs a cleanup as noted previously; fop
uses String for at least two distinct and incompatible things (Unicode
text and raw binary data) that must be separated anyway. The second
doesn't look to be a big deal as COSNumber has factory methods to take
care of it, it's pretty transparent. I'm not too worried about boolean
handling either. The bigger ones are the different model for indirect
objects, and the non-trivial work to move PDFDocument to extend/wrap
COSDocument. I haven't looked into this in enough depth to have a
meaningful assessment of how hard either of those would be yet.

OTOH, some of the important ones map pretty directly, allowing for the
differences in handling of indirect and direct objects higher up the
inheritance tree:

 FOP                 PDFBOX
 --------------------------------------------------
 PDFDictionary       COSDictionary
 PDFArray            COSArray
 PDFStream           COSStream
 PDFNull             COSNull

 If I get time (big if, at the moment) I'll see if I can have a play and
determine what kind of work is involved in doing this.

--
Craig Ringer

Re: PDFName.getName() returns escaped name?!

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Craig,

On 30/03/12 01:23, Craig Ringer wrote:
> On 03/30/2012 05:09 AM, J.Pietschmann wrote:
>> Am 29.03.2012 01:24, schrieb Craig Ringer:
>>> I'd also like to have getEncodedName() return a byte[] not a
>>> String, since an encoded PDF name isn't actually text data.
>> Sounds like a reasonable idea.
>>
>>> BTW, is there any reason Fop's PDF library uses java.lang.String when
>>> working with sequences of PDF data bytes?
>> I'd chalk this up to "historical reasons", as usual. Fell free to
>> provide a patch which cleans this up.
>>
>> J.Pietschmann
> 
> Here's how I'd like to rewrite PDFName; untested code as an example of what
> I'm getting at. This is just a standalone file; a patch that incorporates it
> into the main sources will be a lot more work that I'm holding off on until I
> know folks here agree with the approach.

>From a quick look that sounds about right. Are you developing with a 1.7 JDK?
You would have to make your code 1.5-compatible. Also, it would be good if
you could back your optimizations with profiling data. If code
safety/readability have to be compromised there has to be a good reason
for it.


> In any case, after reading more of the PDF library I'm rethinking the wisdom
> of trying to make this change. The change its self is correct, but it'll be
> really hard to safely integrate into the rest of the PDF library because of
> the difficulty of auditing every site to ensure nothing breaks. Java likes to
> call `toString' automatically in places, meaning that anywhere that doesn't
> use the proper PDFWritable output methods PDFName inherits will break by
> producing bad PDF data that might be quite hard to spot. I'd start by making
> PDFName.toString() throw (for testing), but that'd only catch issues in code
> that test paths actually hit.
> 
> Given the number of these kinds of issues in fop's pdf library I'm more and
> more inclined to wonder if it should just be replaced with PDFBox. It's *full*
> of text encoding issues, it crams 8-bit binary data into the lower 8 bits of
> Unicode strings, etc. Most of the classes that extend basics like
> PDFDictionary act like the base class isn't public API and break if anyone
> else changes the dictionary in ways they don't expect, too; they should
> "have-a" PDFDictionary not "be-a" PDFDictionary really.
> 
> PDFBox is far from perfect, but it has a clean separation between the model
> classes (PDxxxx) and the basic PDF data types (COSxxx); it has a clean
> PDFName, PDFString, etc; it has a good PDF parser already, etc. Maybe it'd be
> easier for me to whip up a port of FOP's PDF output code to PDFBox? I suspect
> I'm insane to mention the possibility of doing that without evaluating the
> amount of work involved first, so I'm not promising anything, but by the looks
> it might be easier than doing the cleanups I'd like to do in fop.

We’ve already talked about that before:
http://markmail.org/message/r6pjwwmxgaawhzcp

There is a /lot/ of work to do on the PDF package to turn it into
a proper library, still it may be more practical to progressively
refactor it rather than starting from scratch a new renderer based on
PDFBox.

That said, please do feel free to give it a try if that route appeals to
you. We would certainly consider a switch if it looks more promising in
the long term.

> Thoughts?
> 
> -- 
> Craig Ringer

Thanks,
Vincent

Re: PDFName.getName() returns escaped name?!

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Am 30.03.2012 02:23, schrieb Craig Ringer:
> Given the number of these kinds of issues in fop's pdf library I'm more
> and more inclined to wonder if it should just be replaced with PDFBox.

This has already been proposed, and there is a general agreement this
would be one of the better ways to proceed. Unfortunately, nobody could
provide the resources to actually do this. You can certainly help here,
a good start would be to create a Wiki page on
http://wiki.apache.org/xmlgraphics-fop
and outline a plan for a transition.

J.Pietschmann

Re: PDFName.getName() returns escaped name?!

Posted by Craig Ringer <ri...@ringerc.id.au>.

On 03/30/2012 05:09 AM, J.Pietschmann wrote:
> Am 29.03.2012 01:24, schrieb Craig Ringer:
>> I'd also like to have getEncodedName() return a byte[] not a
>> String, since an encoded PDF name isn't actually text data.
> Sounds like a reasonable idea.
>
>> BTW, is there any reason Fop's PDF library uses java.lang.String when
>> working with sequences of PDF data bytes?
> I'd chalk this up to "historical reasons", as usual. Fell free to
> provide a patch which cleans this up.
>
> J.Pietschmann

Here's how I'd like to rewrite PDFName; untested code as an example of 
what I'm getting at. This is just a standalone file; a patch that 
incorporates it into the main sources will be a lot more work that I'm 
holding off on until I know folks here agree with the approach.

In any case, after reading more of the PDF library I'm rethinking the 
wisdom of trying to make this change. The change its self is correct, 
but it'll be really hard to safely integrate into the rest of the PDF 
library because of the difficulty of auditing every site to ensure 
nothing breaks. Java likes to call `toString' automatically in places, 
meaning that anywhere that doesn't use the proper PDFWritable output 
methods PDFName inherits will break by producing bad PDF data that might 
be quite hard to spot. I'd start by making PDFName.toString() throw (for 
testing), but that'd only catch issues in code that test paths actually hit.

Given the number of these kinds of issues in fop's pdf library I'm more 
and more inclined to wonder if it should just be replaced with PDFBox. 
It's *full* of text encoding issues, it crams 8-bit binary data into the 
lower 8 bits of Unicode strings, etc. Most of the classes that extend 
basics like PDFDictionary act like the base class isn't public API and 
break if anyone else changes the dictionary in ways they don't expect, 
too; they should "have-a" PDFDictionary not "be-a" PDFDictionary really.

PDFBox is far from perfect, but it has a clean separation between the 
model classes (PDxxxx) and the basic PDF data types (COSxxx); it has a 
clean PDFName, PDFString, etc; it has a good PDF parser already, etc. 
Maybe it'd be easier for me to whip up a port of FOP's PDF output code 
to PDFBox? I suspect I'm insane to mention the possibility of doing that 
without evaluating the amount of work involved first, so I'm not 
promising anything, but by the looks it might be easier than doing the 
cleanups I'd like to do in fop.

Thoughts?

--
Craig Ringer

Re: PDFName.getName() returns escaped name?!

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Am 29.03.2012 01:24, schrieb Craig Ringer:
> I'd also like to have getEncodedName() return a byte[] not a
> String, since an encoded PDF name isn't actually text data.

Sounds like a reasonable idea.

> BTW, is there any reason Fop's PDF library uses java.lang.String when
> working with sequences of PDF data bytes?

I'd chalk this up to "historical reasons", as usual. Fell free to
provide a patch which cleans this up.

J.Pietschmann