You are viewing a plain text version of this content. The canonical link for it is here.
Posted to fop-dev@xmlgraphics.apache.org by Craig Ringer <cr...@postnewspapers.com.au> on 2012/03/29 01:24:49 UTC
PDFName.getName() returns escaped name?!
Hi all
I've been working with PDFName in my code and have run into a bit of an
oddity I was hoping for comments on.
For any given string `fred', the operation:
( new PDFName(fred) ).getName().equals(fred)
isn't guaranteed to be true, because PDFName.getName() returns the
*escaped* name. It strips the leading slash added by
PDFName.escapeName(), so most of the time the returned name will be the
same, but it's a good candidate for creating exciting bugs.
I'd like to be able to use PDFName instead of String as a map key (for
clarity, mostly), but need to be able to get the original encapsulated
string quickly (without decoding) and reliably.
I'd like to change PDFName so that it keeps a reference to the original
name string and returns that from getName(). It should encode the name
on the first call to the new getEncodedName() method, storing it in a
local member, so short-lived PDFName objects don't waste time encoding
strings. I'd also like to have getEncodedName() return a byte[] not a
String, since an encoded PDF name isn't actually text data.
BTW, is there any reason Fop's PDF library uses java.lang.String when
working with sequences of PDF data bytes? For example, the output of
PDFName.escapeName(...) isn't really a "string" at all, in that it's not
meaningful text in any encoding, it's just a byte sequence jammed into
the lower 8 bits of unicode code points. It's pretty confusing having it
as a String (logically an array of unicode characters) rather than as a
byte[]. Right now, fop also writes 8-bit characters in names incorrectly
- the toHex(...) and PDFName.escapeName(...) methods translate values
between 128 and 255 inclusive of each *unicode* *character* in a String
to hex and write that out. This is incorrect, because PDF names should
be UTF-8, so it should be encoding to a UTF-8 byte sequence then escaping.
--
Craig Ringer
POST Newspapers
276 Onslow Rd, Shenton Park
Ph: 08 9381 3088 Fax: 08 9388 2258
ABN: 50 008 917 717
http://www.postnewspapers.com.au/
Re: PDFName.getName() returns escaped name?!
Posted by Craig Ringer <ri...@ringerc.id.au>.
On 04/04/12 18:02, Vincent Hennebert wrote:
> Hi Craig,
>
> Thanks for your extensive study!
>
> On 03/04/12 10:31, Craig Ringer wrote:
>> On 03/04/12 01:16, Vincent Hennebert wrote:
>>
>>>> From a quick look that sounds about right. Are you developing with a 1.7 JDK?
>>>> You would have to make your code 1.5-compatible. Also, it would be good if
>>>> you could back your optimizations with profiling data. If code
>>>> safety/readability have to be compromised there has to be a good reason
>>>> for it.
>> Yep, I'm on 1.7, but was building for a 1.5 target and source level.
> Careful, because that doesn’t ensure you that your code will run on
> a 1.5 JVM. You may still be using elements of the standard library that
> appeared only in 1.6+.
Thanks - I forgot about that. Sigh. The glacial pace at which people
update JREs and JDKs drives me quite nuts sometimes - but at least fop
has moved to 1.5!
Time to do some JDK archaeology...
--
Craig Ringer
Re: PDFName.getName() returns escaped name?!
Posted by Vincent Hennebert <vh...@gmail.com>.
Hi Craig,
Thanks for your extensive study!
On 03/04/12 10:31, Craig Ringer wrote:
> On 03/04/12 01:16, Vincent Hennebert wrote:
>
>>> From a quick look that sounds about right. Are you developing with a 1.7 JDK?
>>> You would have to make your code 1.5-compatible. Also, it would be good if
>>> you could back your optimizations with profiling data. If code
>>> safety/readability have to be compromised there has to be a good reason
>>> for it.
>
> Yep, I'm on 1.7, but was building for a 1.5 target and source level.
Careful, because that doesn’t ensure you that your code will run on
a 1.5 JVM. You may still be using elements of the standard library that
appeared only in 1.6+.
<snip/>
>> That said, please do feel free to give it a try if that route appeals to
>> you. We would certainly consider a switch if it looks more promising in
>> the long term.
>
> I'm wondering how practical it'd be to progressively adopt PDFBox,
> actually, rather than doing an abrupt and total switch.
>
> The PDF primitives in PDFBox (COSName, COSNumber, COSDictionary, etc)
> are modeled extremely similarlines as those in FOP's PDF library, and
> while they won't be drop-in replacements they behave very similarly,
> just with different method names and in some cases different datatype
> assumptions (PDFName instead of String dictionary keys for example). The
> only truly big difference looks to be in the handling of indirect
> objects, where FOP uses one class that may be direct or indirect, and
> PDFBox uses a dedicated `COSObject' class that wraps other objects for
> indirect objects.
Yep, I think we want to do the same in FOP. As a first step during our
work on object streams, we had a chance to factorize into PDFDocument
the outputting of obj/endobj for indirect objects. Before, there were
‘if (hasObjectNumber())’ statements spread all over the PDF classes.
> If it proves possible to do a largely 1:1 substitution of PDFBox
> primitive PDF classes for FOP ones that'd *greatly* simplify the job of
> moving to using the PDFBox `PD' model for document output, and would let
> the job be done in a couple of distinct and separate chunks. PDFFont,
> PDFXObject, etc would build on top of the COS types like COSDictionary,
> COSBase and COSObject, rather than on top of PDFObject, PDFDictionary, etc.
I’m not too sure about this one? Even if you manage to convert the FOP
higher-level PDF classes to use PDFBox primitives, what does that bring
you? IIUC you would still have to make major changes to move to the PD
model.
<snip/>
> If I get time (big if, at the moment) I'll see if I can have a play and
> determine what kind of work is involved in doing this.
Let us know how you’re getting on.
> --
> Craig Ringer
Thanks,
Vincent
Re: PDFName.getName() returns escaped name?!
Posted by Craig Ringer <ri...@ringerc.id.au>.
On 03/04/12 01:16, Vincent Hennebert wrote:
>> From a quick look that sounds about right. Are you developing with a 1.7 JDK?
>> You would have to make your code 1.5-compatible. Also, it would be good if
>> you could back your optimizations with profiling data. If code
>> safety/readability have to be compromised there has to be a good reason
>> for it.
Yep, I'm on 1.7, but was building for a 1.5 target and source level.
The posted PDFName is really an example of what I'm thinking of; it's
not something that can be seriously merged without a *lot* of changes
across the rest of the tree to convert String name handling to PDFName,
to check for leading "/"s being added, to check for use of toString() in
name output to PDF data, etc. It's more a starting point for discussion
and to see what kind of interest there is in working on the pdf lib.
It looks like folks here are pretty receptive to significantly reworking
the PDF lib if there's someone willing to stump up the time. That might
well be me if it turns out porting the whole renderer to PDFBox instead
turns out to be lots harder than reworking the fop pdf lib. I suspect not.
>
> That said, please do feel free to give it a try if that route appeals to
> you. We would certainly consider a switch if it looks more promising in
> the long term.
I'm wondering how practical it'd be to progressively adopt PDFBox,
actually, rather than doing an abrupt and total switch.
The PDF primitives in PDFBox (COSName, COSNumber, COSDictionary, etc)
are modeled extremely similarlines as those in FOP's PDF library, and
while they won't be drop-in replacements they behave very similarly,
just with different method names and in some cases different datatype
assumptions (PDFName instead of String dictionary keys for example). The
only truly big difference looks to be in the handling of indirect
objects, where FOP uses one class that may be direct or indirect, and
PDFBox uses a dedicated `COSObject' class that wraps other objects for
indirect objects.
If it proves possible to do a largely 1:1 substitution of PDFBox
primitive PDF classes for FOP ones that'd *greatly* simplify the job of
moving to using the PDFBox `PD' model for document output, and would let
the job be done in a couple of distinct and separate chunks. PDFFont,
PDFXObject, etc would build on top of the COS types like COSDictionary,
COSBase and COSObject, rather than on top of PDFObject, PDFDictionary, etc.
It's not a perfect 1:1 mapping; notable imperfect matches where one or
more classes map to one or more different classes include:
FOP PDFBOX
---------------------------------------------------
java.lang.String, COSName (for PDF names)
FOP PDFText java.lang.String (for Unicode text)
byte[] or a binbuf/binstr class (for binary data)
---------------------------------------------------
PDFNumber COSNumber
COSInteger
COSFloat
---------------------------------------------------
???
(j.l.Boolean?) COSBoolean
---------------------------------------------------
PDFObject COSBase (direct)
COSObject (wraps a COSBase, indirect)
----------------------------------------------------
PDFDocument COSDocument
[+a new class with fop-only functionality]
PDDocument
Of those, the first already needs a cleanup as noted previously; fop
uses String for at least two distinct and incompatible things (Unicode
text and raw binary data) that must be separated anyway. The second
doesn't look to be a big deal as COSNumber has factory methods to take
care of it, it's pretty transparent. I'm not too worried about boolean
handling either. The bigger ones are the different model for indirect
objects, and the non-trivial work to move PDFDocument to extend/wrap
COSDocument. I haven't looked into this in enough depth to have a
meaningful assessment of how hard either of those would be yet.
OTOH, some of the important ones map pretty directly, allowing for the
differences in handling of indirect and direct objects higher up the
inheritance tree:
FOP PDFBOX
--------------------------------------------------
PDFDictionary COSDictionary
PDFArray COSArray
PDFStream COSStream
PDFNull COSNull
If I get time (big if, at the moment) I'll see if I can have a play and
determine what kind of work is involved in doing this.
--
Craig Ringer
Re: PDFName.getName() returns escaped name?!
Posted by Vincent Hennebert <vh...@gmail.com>.
Hi Craig,
On 30/03/12 01:23, Craig Ringer wrote:
> On 03/30/2012 05:09 AM, J.Pietschmann wrote:
>> Am 29.03.2012 01:24, schrieb Craig Ringer:
>>> I'd also like to have getEncodedName() return a byte[] not a
>>> String, since an encoded PDF name isn't actually text data.
>> Sounds like a reasonable idea.
>>
>>> BTW, is there any reason Fop's PDF library uses java.lang.String when
>>> working with sequences of PDF data bytes?
>> I'd chalk this up to "historical reasons", as usual. Fell free to
>> provide a patch which cleans this up.
>>
>> J.Pietschmann
>
> Here's how I'd like to rewrite PDFName; untested code as an example of what
> I'm getting at. This is just a standalone file; a patch that incorporates it
> into the main sources will be a lot more work that I'm holding off on until I
> know folks here agree with the approach.
>From a quick look that sounds about right. Are you developing with a 1.7 JDK?
You would have to make your code 1.5-compatible. Also, it would be good if
you could back your optimizations with profiling data. If code
safety/readability have to be compromised there has to be a good reason
for it.
> In any case, after reading more of the PDF library I'm rethinking the wisdom
> of trying to make this change. The change its self is correct, but it'll be
> really hard to safely integrate into the rest of the PDF library because of
> the difficulty of auditing every site to ensure nothing breaks. Java likes to
> call `toString' automatically in places, meaning that anywhere that doesn't
> use the proper PDFWritable output methods PDFName inherits will break by
> producing bad PDF data that might be quite hard to spot. I'd start by making
> PDFName.toString() throw (for testing), but that'd only catch issues in code
> that test paths actually hit.
>
> Given the number of these kinds of issues in fop's pdf library I'm more and
> more inclined to wonder if it should just be replaced with PDFBox. It's *full*
> of text encoding issues, it crams 8-bit binary data into the lower 8 bits of
> Unicode strings, etc. Most of the classes that extend basics like
> PDFDictionary act like the base class isn't public API and break if anyone
> else changes the dictionary in ways they don't expect, too; they should
> "have-a" PDFDictionary not "be-a" PDFDictionary really.
>
> PDFBox is far from perfect, but it has a clean separation between the model
> classes (PDxxxx) and the basic PDF data types (COSxxx); it has a clean
> PDFName, PDFString, etc; it has a good PDF parser already, etc. Maybe it'd be
> easier for me to whip up a port of FOP's PDF output code to PDFBox? I suspect
> I'm insane to mention the possibility of doing that without evaluating the
> amount of work involved first, so I'm not promising anything, but by the looks
> it might be easier than doing the cleanups I'd like to do in fop.
We’ve already talked about that before:
http://markmail.org/message/r6pjwwmxgaawhzcp
There is a /lot/ of work to do on the PDF package to turn it into
a proper library, still it may be more practical to progressively
refactor it rather than starting from scratch a new renderer based on
PDFBox.
That said, please do feel free to give it a try if that route appeals to
you. We would certainly consider a switch if it looks more promising in
the long term.
> Thoughts?
>
> --
> Craig Ringer
Thanks,
Vincent
Re: PDFName.getName() returns escaped name?!
Posted by "J.Pietschmann" <j3...@yahoo.de>.
Am 30.03.2012 02:23, schrieb Craig Ringer:
> Given the number of these kinds of issues in fop's pdf library I'm more
> and more inclined to wonder if it should just be replaced with PDFBox.
This has already been proposed, and there is a general agreement this
would be one of the better ways to proceed. Unfortunately, nobody could
provide the resources to actually do this. You can certainly help here,
a good start would be to create a Wiki page on
http://wiki.apache.org/xmlgraphics-fop
and outline a plan for a transition.
J.Pietschmann
Re: PDFName.getName() returns escaped name?!
Posted by Craig Ringer <ri...@ringerc.id.au>.
On 03/30/2012 05:09 AM, J.Pietschmann wrote:
> Am 29.03.2012 01:24, schrieb Craig Ringer:
>> I'd also like to have getEncodedName() return a byte[] not a
>> String, since an encoded PDF name isn't actually text data.
> Sounds like a reasonable idea.
>
>> BTW, is there any reason Fop's PDF library uses java.lang.String when
>> working with sequences of PDF data bytes?
> I'd chalk this up to "historical reasons", as usual. Fell free to
> provide a patch which cleans this up.
>
> J.Pietschmann
Here's how I'd like to rewrite PDFName; untested code as an example of
what I'm getting at. This is just a standalone file; a patch that
incorporates it into the main sources will be a lot more work that I'm
holding off on until I know folks here agree with the approach.
In any case, after reading more of the PDF library I'm rethinking the
wisdom of trying to make this change. The change its self is correct,
but it'll be really hard to safely integrate into the rest of the PDF
library because of the difficulty of auditing every site to ensure
nothing breaks. Java likes to call `toString' automatically in places,
meaning that anywhere that doesn't use the proper PDFWritable output
methods PDFName inherits will break by producing bad PDF data that might
be quite hard to spot. I'd start by making PDFName.toString() throw (for
testing), but that'd only catch issues in code that test paths actually hit.
Given the number of these kinds of issues in fop's pdf library I'm more
and more inclined to wonder if it should just be replaced with PDFBox.
It's *full* of text encoding issues, it crams 8-bit binary data into the
lower 8 bits of Unicode strings, etc. Most of the classes that extend
basics like PDFDictionary act like the base class isn't public API and
break if anyone else changes the dictionary in ways they don't expect,
too; they should "have-a" PDFDictionary not "be-a" PDFDictionary really.
PDFBox is far from perfect, but it has a clean separation between the
model classes (PDxxxx) and the basic PDF data types (COSxxx); it has a
clean PDFName, PDFString, etc; it has a good PDF parser already, etc.
Maybe it'd be easier for me to whip up a port of FOP's PDF output code
to PDFBox? I suspect I'm insane to mention the possibility of doing that
without evaluating the amount of work involved first, so I'm not
promising anything, but by the looks it might be easier than doing the
cleanups I'd like to do in fop.
Thoughts?
--
Craig Ringer
Re: PDFName.getName() returns escaped name?!
Posted by "J.Pietschmann" <j3...@yahoo.de>.
Am 29.03.2012 01:24, schrieb Craig Ringer:
> I'd also like to have getEncodedName() return a byte[] not a
> String, since an encoded PDF name isn't actually text data.
Sounds like a reasonable idea.
> BTW, is there any reason Fop's PDF library uses java.lang.String when
> working with sequences of PDF data bytes?
I'd chalk this up to "historical reasons", as usual. Fell free to
provide a patch which cleans this up.
J.Pietschmann