Posted to fop-dev@xmlgraphics.apache.org by Alexander Kiel <al...@gmx.net> on 2009/10/01 18:47:24 UTC

Writing PDF Documents and other source code parts

Hi,

My goal is to implement basic OpenType support for FOP. But while
working on font subsetting and embedding, I ended up looking at the
actual PDF output routines.

I think this module needs refactoring. If you have a look at the
PDFWritable interface, there is a really ugly method. The method
outputInline takes an OutputStream and a Writer, which are related to
each other. The comment says that the Writer is buffered and that every
time you want to write something to the OutputStream, you have to flush
the Writer first. That's crude.

What is really needed is some output interface which is able to do both,
write chars and write bytes.
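
A minimal sketch of what such an interface could look like. The names
PdfOutput, writeText, writeBytes and getPosition are hypothetical and not
part of FOP, and encoding text as ISO-8859-1 is an assumption made only
for this illustration. Because text is encoded and written to the same
stream immediately, there is no second buffered Writer that has to be
flushed before writing bytes.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sink that accepts both text and raw bytes on one stream. */
final class PdfOutput {
    private final OutputStream out;
    private long position; // number of bytes written so far

    PdfOutput(OutputStream out) {
        this.out = out;
    }

    /** Encodes the text as ISO-8859-1 (an assumption) and writes it at once. */
    void writeText(String text) throws IOException {
        writeBytes(text.getBytes(StandardCharsets.ISO_8859_1));
    }

    /** Writes binary content, e.g. an embedded font stream, unchanged. */
    void writeBytes(byte[] data) throws IOException {
        out.write(data);
        position += data.length;
    }

    /** Current byte offset, which a caller could use for cross-references. */
    long getPosition() {
        return position;
    }
}
```

A caller would wrap the target stream once, e.g. new PdfOutput(stream),
and then mix writeText and writeBytes calls freely. Buffering, if wanted,
can be added below this class by passing in a BufferedOutputStream.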

I also had a look at PDFBox regarding writing PDFs. Maybe we shouldn't
refactor FOP's own, somewhat legacy PDF code at all. But I don't like
the PDFBox code either.

So I'm a bit at a loss now. The problem shows up in whatever code I
look at, for example:

TTFSubSetFile 

    This class is all about reading a TrueType file, taking some glyph
    mapping (the glyphs used) into account and returning a byte array
    that contains a TrueType file with just that subset of glyphs. It
    extends TTFFile, which represents a TrueType file but also contains
    all the reading code. So reading, writing and representing a
    real-world object are mixed together in a really ugly way.

PDFFactory

    This class does two things: creating and registering PDF objects.
    A factory should only create objects. The class also has nearly
    1800 lines of code. Maybe it is a factory for too many things?

    If I look at the method that interests me, "makeFontFile", the
    comment says "Embeds a font.", but the method name is
    "makeFontFile". "makeFontFile" makes sense in a factory, but
    "Embeds a font." hints that the created font file is actually
    embedded in the PDF document. The method also has nearly 100
    lines of code, which do all sorts of things that I can't
    understand quickly. At some point the TTFSubSetFile is created and
    the resulting bytes go into some PDFTTFStream - okay.

    So don't be surprised by memory problems. Here you have whole
    300 KB+ fonts sitting in arrays.

MultiByteFont

    It seems to me that MultiByteFont tracks the glyph usage:
    "getUsedGlyphs", "mapChar", "subSet". I always thought that
    fonts are immutable objects, representing a font program that
    can be shared all over the application. Enjoy building
    a common font source in FOP!
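
    One way to keep the font object immutable is to move the usage
    tracking into a separate, per-document object. The sketch below
    is only an illustration: GlyphUsage, use and asMap are
    hypothetical names, not FOP classes or methods, and always
    reserving subset index 0 for glyph 0 (.notdef) is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-document record of which glyphs were used. */
final class GlyphUsage {
    // original glyph index -> index in the subset, in order of first use
    private final Map<Integer, Integer> subsetIndexes = new LinkedHashMap<>();

    GlyphUsage() {
        // Assumption: glyph 0 (.notdef) is always part of the subset.
        subsetIndexes.put(0, 0);
    }

    /** Records a glyph as used and returns its index in the subset. */
    int use(int glyphIndex) {
        Integer existing = subsetIndexes.get(glyphIndex);
        if (existing != null) {
            return existing;
        }
        int next = subsetIndexes.size();
        subsetIndexes.put(glyphIndex, next);
        return next;
    }

    /** Read-only view that a subsetting routine could consume. */
    Map<Integer, Integer> asMap() {
        return Collections.unmodifiableMap(subsetIndexes);
    }
}
```

    With something like this, the shared font only answers questions
    about the font program, and each document being rendered owns
    its own GlyphUsage instance.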

I don't know how I should integrate my own code into this. I think a
lot of refactoring is necessary to get the FOP parts into a state where
I can integrate new code.

But I'm not sure where to start, and not sure whether there are enough
tests. I don't know the overall structure. I'm simply a bit helpless.

I have a nice fonts.opentype package here with 155 classes and 279 tests
covering 93 % of the classes and 80 % of the lines. I can already read
all of the TrueType metrics and OpenType kerning info. I have a class
for every entity of the OpenType spec and a Reader for every such class.
That means you can test reading every substructure in isolation. I think
that this is a really nice API for reading OpenType files.
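
To illustrate the entity-plus-Reader idea, here is a minimal sketch for
one small structure, a table directory entry (4-byte tag, then checksum,
offset and length as unsigned 32-bit big-endian values). TableRecord and
TableRecordReader are hypothetical names chosen for this example and are
not taken from the fonts.opentype package described above.

```java
import java.io.DataInput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Immutable value for one entry of the OpenType table directory. */
final class TableRecord {
    final String tag;
    final long checksum;
    final long offset;
    final long length;

    TableRecord(String tag, long checksum, long offset, long length) {
        this.tag = tag;
        this.checksum = checksum;
        this.offset = offset;
        this.length = length;
    }
}

/** Reads exactly one TableRecord and knows nothing about the rest of the file. */
final class TableRecordReader {
    TableRecord read(DataInput in) throws IOException {
        byte[] tag = new byte[4];
        in.readFully(tag);
        // DataInput reads big-endian ints; mask to treat them as unsigned.
        long checksum = in.readInt() & 0xFFFFFFFFL;
        long offset = in.readInt() & 0xFFFFFFFFL;
        long length = in.readInt() & 0xFFFFFFFFL;
        return new TableRecord(
                new String(tag, StandardCharsets.US_ASCII), checksum, offset, length);
    }
}
```

A test for such a reader only needs a 16-byte array wrapped in a
DataInputStream, without any real font file.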

So now that I have seen what TTFSubSetFile really has to do, I will
start adding write support for OpenType files. Then I will write a
manipulation routine that can build a subset of a file. But I don't
like getting the glyph mapping info for this manipulation from a
MultiByteFont, which really should be immutable.

I found it sufficient to write a KerningMapBuilder which stuffs kerning
pairs into a really nice double nested Map construction. As the comment
on CustomFont#replaceKerningMap says:

    the kerning map (Map<Integer, Map<Integer, Integer>, 
    the integers are character codes)

Such a highly specialized, self-explaining, problem-oriented data
structure is spread all over the font system. Know your tools!
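
For contrast, a small dedicated type could hide the nested map behind
problem-oriented methods. This is only a sketch: KerningTable, put and
get are hypothetical names, not FOP API, and returning 0 for a pair
without an entry is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical wrapper that keeps the nested map an implementation detail. */
final class KerningTable {
    // left character code -> (right character code -> adjustment)
    private final Map<Integer, Map<Integer, Integer>> pairs = new HashMap<>();

    /** Stores the adjustment for the pair (left, right). */
    void put(int left, int right, int adjustment) {
        pairs.computeIfAbsent(left, k -> new HashMap<>()).put(right, adjustment);
    }

    /** Returns the adjustment for the pair, or 0 if none is stored. */
    int get(int left, int right) {
        Map<Integer, Integer> row = pairs.get(left);
        if (row == null) {
            return 0;
        }
        Integer value = row.get(right);
        return value == null ? 0 : value;
    }
}
```

Callers would then ask for get('A', 'V') instead of walking two maps
and checking for null at each level.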

So where to start?

Best Regards
Alex

-- 
e-mail: alexanderkiel@gmx.net
web:    www.alexanderkiel.net

Re: Writing PDF Documents and other source code parts

Posted by Alexander Kiel <al...@gmx.net>.

Hi Vincent,

> I can’t really help I’m afraid, as I personally don’t have the necessary
> knowledge. It’s probably time to submit what you already have as a patch
> attached to a Bugzilla entry:
> https://issues.apache.org/bugzilla/enter_bug.cgi?product=Fop
> That will allow us to have a look and maybe provide some additional
> guidance.

Okay. I'll work towards a clean patch that includes just the new
classes, without integration.

> How feasible would it be to write a thin layer on top of your library
> that would bridge the gap between it and the current one? That would be
> a temporary layer until the PDF code is in turn refactored, allowing you
> to keep the new library clean (do we really want write support for
> OpenType files??). Refactoring the PDF code now will lead you too far.
> Keep concentrated on fonts (as much as possible) for now.

It will be hard to write such a layer, but let's see.

I think we need OpenType write support, because if we want to embed
subsets of fonts, we need to manipulate the font program and write it
back into a byte stream. TTFSubSetFile already does this. From its
class comment:

    Reads a TrueType file and generates a subset that can be used
    to embed a TrueType CID font. TrueType tables needed for embedded
    CID fonts are: "head", "hhea", "loca", "maxp", "cvt ", "prep",
    "glyf", "hmtx" and "fpgm".


> BTW, have you submitted your ICLA? 155 new classes... We’re gonna need
> one :-)

No, not yet.

Best Regards
Alex

-- 
mobile: +49 (0) 176 6152 9741
phone:  +49 (0) 341 2289 188
e-mail: alexanderkiel@gmx.net
web:    www.alexanderkiel.net

Re: Writing PDF Documents and other source code parts

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Alexander,

I can’t really help I’m afraid, as I personally don’t have the necessary
knowledge. It’s probably time to submit what you already have as a patch
attached to a Bugzilla entry:
https://issues.apache.org/bugzilla/enter_bug.cgi?product=Fop
That will allow us to have a look and maybe provide some additional
guidance.

How feasible would it be to write a thin layer on top of your library
that would bridge the gap between it and the current one? That would be
a temporary layer until the PDF code is in turn refactored, allowing you
to keep the new library clean (do we really want write support for
OpenType files??). Refactoring the PDF code now will lead you too far.
Keep concentrated on fonts (as much as possible) for now.

BTW, have you submitted your ICLA? 155 new classes... We’re gonna need
one :-)

Thanks,
Vincent


Re: Writing PDF Documents and other source code parts

Posted by Adrian Cumiskey <de...@cumiskey.com>.

Hi Alexander,

I feel your pain bro [1] - I recommend you have a good read through the 
mail archives.  This is unfortunately a very familiar story when trying 
to add new features to the FOP code base.  Fonts are probably the one 
area that is most in need of refactoring.  I get a reassuring feeling 
that you are really trying to do things the right way, and I'm sure any 
improvements you are able to make there will be greatly appreciated.  
Good luck!

Adrian.

[1] http://fop-dev.markmail.org/message/re7w3pwvc64amuwq?q=god+object
