Posted to fop-dev@xmlgraphics.apache.org by Alexander Kiel <al...@gmx.net> on 2009/09/24 17:53:29 UTC

Best Interface for reading OpenType Files

Hi,

I am currently thinking about which interface to use for reading
OpenType files.

There are two possibilities:

 - reading on top of an InputStream or
 - reading on top of a RandomAccessFile or FileChannel.

Currently the implementation in FOP uses the class FontFileReader which
expects an InputStream. But it immediately calls IOUtils.toByteArray(in)
and works on that byte array instead. So it needs to hold the file
completely in memory.

FontBox, which is part of PDFBox, uses an abstract class called
TTFDataStream with template methods. It has two implementations: one
called RAFDataStream, which operates on top of a RandomAccessFile, and
one called MemoryTTFDataStream, which operates on top of a byte array.

I started out using pure InputStreams. That means I implemented the
whole OpenType file reading using a hierarchy of FilterInputStreams. At
the lowest level I have a DataInputStream which takes any InputStream
and provides methods to read the basic data types of OpenType, just like
java.io.DataInputStream does for Java data types. On top of that, I have
streams that can read some small-scale data structures, then streams
which can read whole tables, and finally a stream which can read the
whole OpenType file.

To read an OpenType file, all you have to write is:

    InputStream in = ...
    OpenTypeFileInputStream otfIn = new OpenTypeFileInputStream(in);
    OpenTypeFile otf = otfIn.readOpenTypeFile();

In my opinion this system works really well. You can take any
InputStream, the reading is decoupled from the OpenType classes
themselves, and you can test pieces of the OpenType structure using only
the individual streams.

But! My approach has one flaw. I need to seek extensively while reading
an OpenType file. The whole file format consists of headers with offsets
and data structures which one has to read at those offsets.

To get this seeking to work with streams, I use mark(), reset() and
skip(). My common approach at the beginning of such a structure is to
mark, then read the header and, for every part, reset to the start, mark
again, skip to the offset and read the part.

But with this approach I end up holding the whole file in memory.
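
A minimal sketch of the mark/reset/skip pattern described above, not the
actual implementation. The class MarkResetSeekSketch, the method
readUInt16At and the tiny byte layout are made up for illustration. The
mark(Integer.MAX_VALUE) call is what forces a buffering stream to keep
everything read since the mark.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper illustrating seeking with mark(), reset() and skip(). */
final class MarkResetSeekSketch {

    /** Reads an unsigned 16-bit value located offset bytes after the current mark. */
    static int readUInt16At(InputStream in, long offset) throws IOException {
        in.reset();                      // back to the start of the structure
        in.mark(Integer.MAX_VALUE);      // mark again so a later reset() still works
        long remaining = offset;
        while (remaining > 0) {          // skip() may skip fewer bytes than requested
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                throw new IOException("Cannot skip to offset " + offset);
            }
            remaining -= skipped;
        }
        return new DataInputStream(in).readUnsignedShort();
    }

    public static void main(String[] args) throws IOException {
        // Made-up structure: a header with two 16-bit offsets (6 and 4),
        // followed by the two 16-bit parts they point to.
        byte[] data = {0, 6, 0, 4, 0x12, 0x34, (byte) 0xAB, (byte) 0xCD};
        InputStream in = new ByteArrayInputStream(data);
        in.mark(Integer.MAX_VALUE);      // start of the structure
        DataInputStream header = new DataInputStream(in);
        int firstOffset = header.readUnsignedShort();
        int secondOffset = header.readUnsignedShort();
        System.out.printf("%04X %04X%n",
                readUInt16At(in, firstOffset), readUInt16At(in, secondOffset));
    }
}
```

Note that there is only one mark per stream, so a nested structure that
marks on its own replaces the mark of the enclosing structure.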

To make it worse, this mark(), reset(), skip() interface doesn't support
hierarchical marking. If I seek inside smaller-scale structures, the
mark position of the larger-scale structure is overwritten. I don't
think that it is possible to build hierarchical mark support on top of
any markable InputStream. (Oh look, I did it later as I wrote this
longish mail.) I think one has to reimplement BufferedInputStream,
holding one's own byte array. In fact I did this on top of
ByteArrayInputStream. The key problem is that one can't get a position
out of an InputStream, which is not surprising, as the concept of
streams doesn't include a position.

It is possible to read the parts in offset order. But there are
duplicated offsets (more than one offset pointing to the same part) and
parts that have to go into an array in a semantic order which doesn't
have to be the offset order. So I first have to reorder the offsets to
read the parts in offset order, and then I have to reorder the parts
I read to get them back into the semantic order. That said, it is still
possible that the offsets are in fact in the semantic order of the
parts, but the spec doesn't say this.
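
The double reordering can be sketched roughly as follows. This is only
an illustration under assumptions: OffsetOrderSketch and the PartReader
callback are hypothetical names, and the callback stands in for whatever
code reads one part from the stream at a given offset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: read parts in offset order, return them in semantic order. */
final class OffsetOrderSketch {

    /** Hypothetical callback that reads the part starting at the given offset. */
    interface PartReader<T> {
        T readAt(int offset);
    }

    static <T> List<T> readInOffsetOrder(int[] offsets, PartReader<T> reader) {
        // Sort and de-duplicate, so each distinct part is read once, front to back.
        TreeSet<Integer> sortedOffsets = new TreeSet<>();
        for (int offset : offsets) {
            sortedOffsets.add(offset);
        }
        Map<Integer, T> partsByOffset = new HashMap<>();
        for (int offset : sortedOffsets) {
            partsByOffset.put(offset, reader.readAt(offset));
        }
        // Put the parts back into the order in which the header listed them.
        List<T> parts = new ArrayList<>(offsets.length);
        for (int offset : offsets) {
            parts.add(partsByOffset.get(offset));
        }
        return parts;
    }
}
```

With this shape a forward-only stream is enough for one level of
structure, at the cost of the extra bookkeeping at every level.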

I don't want to depend on RandomAccessFile or FileChannel, because I
need to be able to test reading of substructures out of byte arrays.
What I need is an interface from which I can read bytes and which allows
multiple relative seeks. By multiple relative seeks I mean something
like multiple marks. As I wrote this, I implemented such a thing inside
my DataInputStream. There is now a method:

    public SkipHandle mark();

and the SkipHandle class looks like this:

    public class SkipHandle {
        
        private final long relativePos;

        public void skipTo(long offset);
    }

SkipHandle is a non-static inner class of DataInputStream.
DataInputStream counts the bytes read and skipped to get an idea of its
actual position. The SkipHandle gets the actual stream position on
creation so that it is able to skip on DataInputStream relative to its
creation position. If the skip would be negative, SkipHandle resets the
whole stream to the start (on creation of DataInputStream, a normal mark
is set) and skips afterwards.
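
A rough, self-contained sketch of how such a handle could work, written
from the description above and not taken from the real code. The names
CountingSeekStream and newSkipHandle() are assumptions (the method in
the mail is called mark()), and the underlying stream is assumed to
support mark().

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch of a position-counting stream with relative skip handles. */
class CountingSeekStream extends FilterInputStream {

    private long position;               // bytes read or skipped so far

    CountingSeekStream(InputStream in) {
        super(in);
        if (!in.markSupported()) {
            throw new IllegalArgumentException("underlying stream must support mark()");
        }
        in.mark(Integer.MAX_VALUE);      // one mark at the very start, never replaced
    }

    @Override public int read() throws IOException {
        int b = in.read();
        if (b >= 0) position++;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) position += n;
        return n;
    }

    @Override public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        if (skipped > 0) position += skipped;
        return skipped;
    }

    // Ordinary marking is disabled so that nobody replaces the start mark.
    @Override public boolean markSupported() { return false; }
    @Override public synchronized void mark(int readLimit) { }
    @Override public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported, use newSkipHandle()");
    }

    /** Remembers the current position as the base for later relative seeks. */
    SkipHandle newSkipHandle() { return new SkipHandle(position); }

    private void seek(long target) throws IOException {
        if (target < position) {         // backwards: rewind to the start first
            in.reset();
            position = 0;
        }
        while (position < target) {
            if (skip(target - position) <= 0) {
                throw new IOException("Cannot seek to " + target);
            }
        }
    }

    /** Non-static inner class, so it can seek on its enclosing stream. */
    final class SkipHandle {
        private final long basePos;
        private SkipHandle(long basePos) { this.basePos = basePos; }
        void skipTo(long offset) throws IOException { seek(basePos + offset); }
    }
}
```

Because every handle stores its own base position, handles for nested
structures do not disturb each other, which the single mark of a plain
InputStream cannot offer.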

It works, but I find it a little bit ugly. First, I have to set a
mark(Integer.MAX_VALUE) on DataInputStream creation, because I always
want to be able to reset the whole stream, but I don't have any
information about how many bytes are still to come. Then I have to
disable mark support on my DataInputStream so that nobody kills my own
mark.

But the biggest problem is that DataInputStream now has a non-standard
mark(), skipTo() API. It's not like a normal FilterInputStream anymore.
You can't use normal marking, because it's disabled, and you have to
learn this new API instead.

Streams simply aren't the right API for reading things like OpenType
files, which require massive seeking. But all the seekable APIs are
tied to files.

The TTFDataStream API of FontBox is completely custom. I would like to
avoid such things. 

So I simply don't know a standard Java API which allows byte reading and
seeking over an arbitrary source and throws IOExceptions on its methods.
What about NIO? I don't see any skipping or seeking on channels.

Any idea is welcome.


Best Regards
Alex
 
-  
e-mail: alexanderkiel@gmx.net
web:    www.alexanderkiel.net


Re: Best Interface for reading OpenType Files

Posted by Vincent Hennebert <vh...@gmail.com>.
Hi Alexander,

Alexander Kiel wrote:
> Hi Vincent,
> 
>> I see. I had in mind to use OpenTypeDataInputStream as the common
>> interface. It actually makes sense to use ImageInputStream instead.
>> Simpler and just as flexible. That will add a direct dependency on
>> a class in the javax.imageio package, but this is not a problem as it is
>> part of the standard library. That ImageInputStream interface is
>> unfortunately named really.
> 
> What did you mean with your last sentence? That ImageInputStream isn't
> named good?

Yes. AFAICT its methods have nothing to do with images. This interface
should probably have been given a more neutral name.

<snip/>
>>>>>> - does the use of serializable objects make sense? What would be more
>>>>>>   efficient: re-parsing font data all the time or re-loading
>>>>>>   serializable object representation of them?
>>>>> You mean the font metrics XML files? I've alwas asking me for what
>>>>> propose they are there. No, I don't think, we need this. I really don't
>>>>> want to serialize the Advanced OpenType Features! It took me already a
>>>>> good amount of code to parse just a bit of it.
>>>> What I meant was to use the java.io.Serializable interface. I don’t
>>>> indeed think XML representations are any useful, apart maybe for
>>>> debugging purpose or to have a more human-readable version of the font
>>>> file.
>>>> IIC there would be next to nothing to do to cache Serializable objects
>>>> on the hard drive and retrieve them?
>>> Hmmm. Ok. But if we want to use Serializable for that, your classes have
>>> to be very stable. Versioning the Serializable stuff is a real burden in
>>> my opinion. So we will need a cache which detects version changes and
>>> invalidate the objects if so. Do you know such a lib?
>> I was thinking that just catching the InvalidClassException when reading
>> the object would be enough to conclude that the cache is no longer valid
>> and must be re-created. Maybe I’m wrong? I must confess that I have no
>> experience with serialization.
> 
> Yes this could work. But I find it always difficult and time consuming
> to design classes for serialization. And reading the serialized version
> is most likely not much faster than reading the actual OpenType file. So
> I would really want to wait until we have a real performance problem.

Sure. Nothing wrong with that.


Thanks,
Vincent

Re: Best Interface for reading OpenType Files

Posted by Alexander Kiel <al...@gmx.net>.
Hi Vincent,

> I see. I had in mind to use OpenTypeDataInputStream as the common
> interface. It actually makes sense to use ImageInputStream instead.
> Simpler and just as flexible. That will add a direct dependency on
> a class in the javax.imageio package, but this is not a problem as it is
> part of the standard library. That ImageInputStream interface is
> unfortunately named really.

What did you mean by your last sentence? That ImageInputStream isn't
well named?

> > So if I should vote, it would properly vote for spring.
> 
> Well I’m not sure I like the abundance of XML in spring actually. POJOs
> powaaa! Also, spring may be overkill to just deploy FOP. Anyway, this is
> probably a bit early to discuss that. (What do you think of the
> following though: http://code.google.com/p/google-guice/ ?)

I had heard of it before, but hadn't looked into it. So I took your
pointer as motivation to have a look at it. I watched the Google I/O
talk "Big Modular Java with Guice" [1] on YouTube. It looks very
promising. I'm not against this XML config stuff, but if I can get the
same with annotations and standard Java code, why not? Of course I like
this whole type safety stuff, but with IntelliJ I get this in Spring XML
too.

[1]: <http://www.youtube.com/watch?v=hBVJbzAagfs>

> >>>> - does the use of serializable objects make sense? What would be more
> >>>>   efficient: re-parsing font data all the time or re-loading
> >>>>   serializable object representation of them?
> >>> You mean the font metrics XML files? I've alwas asking me for what
> >>> propose they are there. No, I don't think, we need this. I really don't
> >>> want to serialize the Advanced OpenType Features! It took me already a
> >>> good amount of code to parse just a bit of it.
> >> What I meant was to use the java.io.Serializable interface. I don’t
> >> indeed think XML representations are any useful, apart maybe for
> >> debugging purpose or to have a more human-readable version of the font
> >> file.
> >> IIC there would be next to nothing to do to cache Serializable objects
> >> on the hard drive and retrieve them?
> > 
> > Hmmm. Ok. But if we want to use Serializable for that, your classes have
> > to be very stable. Versioning the Serializable stuff is a real burden in
> > my opinion. So we will need a cache which detects version changes and
> > invalidate the objects if so. Do you know such a lib?
> 
> I was thinking that just catching the InvalidClassException when reading
> the object would be enough to conclude that the cache is no longer valid
> and must be re-created. Maybe I’m wrong? I must confess that I have no
> experience with serialization.

Yes, this could work. But I always find it difficult and time-consuming
to design classes for serialization. And reading the serialized version
is most likely not much faster than reading the actual OpenType file. So
I would really prefer to wait until we have a real performance problem.

Best Regards
Alex


Re: Best Interface for reading OpenType Files

Posted by Vincent Hennebert <vh...@gmail.com>.
Hi Alexander,

Alexander Kiel wrote:
> Hi Vincent,
> 
>>>> Here are my two cents: if you make use of classes in javax.imageio at
>>>> only one place in your font library, then there’s no need to worry about
>>>> creating a more neutral layer. If OTOH you need to use those classes
>>>> everywhere, then it makes sense to use a simplified abstraction layer.
>>>> That abstraction layer could be shipped as a separate module and evolve
>>>> separately. An implementation could be based on imageIO, Apache Commons
>>>> IO (?), your own implementation based on byte arrays for testing
>>>> purpose, etc.
>>> Thanks for that. I think, I will write a OpenTypeDataInputStream which
>>> is not a FilterInputStream, but takes a ImageInputStream as constructor
>>> argument like a FilterInputStream would take a InputStream. This
>>> OpenTypeDataInputStream will be the API for all the Streams on top of
>>> it. So I would have only one point which depends on ImageInputStream.
>> You may want to use a factory a la SAXParserFactory. Although that may
>> go a bit far.
> 
> Hmmm. I don't see the benefit of such a factory here. The
> OpenTypeDataInputStream would look like this:
> 
> public class OpenTypeDataInputStream {
> <snip/>
> }
> 
> This is the common FilterInputStream pattern. OpenTypeDataInputStream
> only depends on ImageInputStream which is an interface.
> OpenTypeDataInputStream is really simple and straitforward, so that I
> can't imagine different implementations. Except implementations on top
> of other things as ImageInputStream. But than we are at the question, if
> we want ImageInputStream the common interface for different
> implementations (on top of files, streams, byte arrays) or if we want
> OpenTypeDataInputStream to do that. I think that ImageInputStream is the
> right place, because it abstracts from getting bytes and be able to
> seek. OpenTypeDataInputStream on the other hand implements the semantics
> of the common OpenType data types, which are well defined in the
> specification.

I see. I had in mind to use OpenTypeDataInputStream as the common
interface. It actually makes sense to use ImageInputStream instead.
Simpler and just as flexible. That will add a direct dependency on
an interface in the javax.imageio.stream package, but this is not
a problem as it is part of the standard library. That ImageInputStream
interface is really unfortunately named.

<snip/>
>> There’s no such thing as IoC container in FOP. I’m not sure how easy it
>> would be to introduce one. Although that would probably be A Good Thing.
>> So do design your font library with IoC in mind.
> 
> Yes, I will. We can use IoC even without a container. And if we want to
> choose one, I have plenty experience with spring.

Good!


> So if I should vote, it would properly vote for spring.

Well I’m not sure I like the abundance of XML in spring actually. POJOs
powaaa! Also, spring may be overkill to just deploy FOP. Anyway, this is
probably a bit early to discuss that. (What do you think of the
following though: http://code.google.com/p/google-guice/ ?)


>>>> - does the use of serializable objects make sense? What would be more
>>>>   efficient: re-parsing font data all the time or re-loading
>>>>   serializable object representation of them?
>>> You mean the font metrics XML files? I've alwas asking me for what
>>> propose they are there. No, I don't think, we need this. I really don't
>>> want to serialize the Advanced OpenType Features! It took me already a
>>> good amount of code to parse just a bit of it.
>> What I meant was to use the java.io.Serializable interface. I don’t
>> indeed think XML representations are any useful, apart maybe for
>> debugging purpose or to have a more human-readable version of the font
>> file.
>> IIC there would be next to nothing to do to cache Serializable objects
>> on the hard drive and retrieve them?
> 
> Hmmm. Ok. But if we want to use Serializable for that, your classes have
> to be very stable. Versioning the Serializable stuff is a real burden in
> my opinion. So we will need a cache which detects version changes and
> invalidate the objects if so. Do you know such a lib?

I was thinking that just catching the InvalidClassException when reading
the object would be enough to conclude that the cache is no longer valid
and must be re-created. Maybe I’m wrong? I must confess that I have no
experience with serialization.
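
For illustration, that idea could look roughly like the following Java
sketch. FontCacheSketch, loadOrNull and store are made-up names and
nothing like this exists in FOP as far as this thread says. One caveat:
InvalidClassException is only raised when the serialVersionUID (declared
or computed) no longer matches, so a class that pins its
serialVersionUID would have to bump it on incompatible changes.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical sketch of a serialization-based cache that tolerates class changes. */
final class FontCacheSketch {

    /** Returns the cached object, or null if the cache is missing or stale. */
    static Object loadOrNull(File cacheFile) throws IOException {
        if (!cacheFile.isFile()) {
            return null;                 // nothing cached yet
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(cacheFile))) {
            return in.readObject();
        } catch (InvalidClassException | ClassNotFoundException e) {
            // The classes changed since the cache was written: treat it as
            // invalid so that the caller re-parses the font and stores it again.
            return null;
        }
    }

    /** Writes the object to the cache file, replacing any previous content. */
    static void store(File cacheFile, Serializable value) throws IOException {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(cacheFile))) {
            out.writeObject(value);
        }
    }
}
```

A caller would try loadOrNull() first and fall back to parsing the font
file followed by store() whenever it gets null.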


HTH,
Vincent

Re: Best Interface for reading OpenType Files

Posted by Alexander Kiel <al...@gmx.net>.
Hi Vincent,

> >> Here are my two cents: if you make use of classes in javax.imageio at
> >> only one place in your font library, then there’s no need to worry about
> >> creating a more neutral layer. If OTOH you need to use those classes
> >> everywhere, then it makes sense to use a simplified abstraction layer.
> >> That abstraction layer could be shipped as a separate module and evolve
> >> separately. An implementation could be based on imageIO, Apache Commons
> >> IO (?), your own implementation based on byte arrays for testing
> >> purpose, etc.
> > 
> > Thanks for that. I think, I will write a OpenTypeDataInputStream which
> > is not a FilterInputStream, but takes a ImageInputStream as constructor
> > argument like a FilterInputStream would take a InputStream. This
> > OpenTypeDataInputStream will be the API for all the Streams on top of
> > it. So I would have only one point which depends on ImageInputStream.
> 
> You may want to use a factory a la SAXParserFactory. Although that may
> go a bit far.

Hmmm. I don't see the benefit of such a factory here. The
OpenTypeDataInputStream would look like this:

public class OpenTypeDataInputStream {

    private final ImageInputStream in;

    public OpenTypeDataInputStream(ImageInputStream in) {
        this.in = in;
    }

    public final int readUnsignedShort() throws IOException {
        [...]
    }
    
    public final Tag readTag() throws IOException {
        [...]
    }

}

This is the common FilterInputStream pattern. OpenTypeDataInputStream
only depends on ImageInputStream, which is an interface.
OpenTypeDataInputStream is really simple and straightforward, so I
can't imagine different implementations, except implementations on top
of things other than ImageInputStream. But then we are at the question
of whether we want ImageInputStream to be the common interface for
different implementations (on top of files, streams, byte arrays) or
whether we want OpenTypeDataInputStream to do that. I think that
ImageInputStream is the right place, because it abstracts getting bytes
and being able to seek. OpenTypeDataInputStream, on the other hand,
implements the semantics of the common OpenType data types, which are
well defined in the specification.
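
To make this concrete, here is a small runnable sketch of the same
pattern on top of a byte array. It is only an approximation under
assumptions: the class is called OpenTypeDataInputSketch to keep it
apart from the real class, a String stands in for the Tag class, and
the seek() and over() helpers are additions for the example. The
ImageInputStream calls used (readUnsignedShort, readUnsignedInt,
readFully, seek) are part of the standard javax.imageio.stream API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

/** Sketch only: reads a few OpenType basic types from an ImageInputStream. */
class OpenTypeDataInputSketch {

    private final ImageInputStream in;

    OpenTypeDataInputSketch(ImageInputStream in) {
        this.in = in;
    }

    /** uint16, big-endian (ImageInputStream defaults to network byte order). */
    int readUnsignedShort() throws IOException {
        return in.readUnsignedShort();
    }

    /** uint32, returned as a long so that the full unsigned range fits. */
    long readUnsignedInt() throws IOException {
        return in.readUnsignedInt();
    }

    /** Four-byte table tag such as "head"; a String stands in for Tag here. */
    String readTag() throws IOException {
        byte[] raw = new byte[4];
        in.readFully(raw);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    /** Absolute seek, delegated to the underlying ImageInputStream. */
    void seek(long pos) throws IOException {
        in.seek(pos);
    }

    /** Convenience for tests: wraps a byte array, no file needed. */
    static OpenTypeDataInputSketch over(byte[] data) {
        return new OpenTypeDataInputSketch(
                new MemoryCacheImageInputStream(new ByteArrayInputStream(data)));
    }
}
```

ImageInputStream also allows nested mark()/reset() pairs, so
hierarchical marking comes for free with this choice.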

> > If you only need the metrics, parsing the glyf or CFF table would be
> > really unnecessary. So maybe a TableFilter interface would be useful.
> > Like this:
> > 
> > public class OpenTypeFileInputStream {
> > 
> >     private TableFilter tableFilter = TableFilter.NO_FILTERING;
> > 
> >     public OpenTypeFileInputStream(OpenTypeDataInputStream in) {}
> > 
> >     public void setTableFilter(TableFilter tableFilter) {}
> > }
> > 
> > public interface TableFilter {
> > 
> >     public static final TableFilter NO_FILTERING = new TableFilter() {
> >         public boolean doReadTable(Tag tableTag) { return true; }
> >     };
> > 
> >     boolean doReadTable(Tag tableTag);
> > }
> > 
> > A client which isn't aware of TableFilter would not notice any burden
> > using the API. And the implementation in OpenTypeFileInputStream isn't
> > so difficult.
> 
> This is an interesting idea. But how would you combine filters?
> I’d suggest to keep it aside for the moment, and implement it if we are
> actually running into performance issues. After all, if some caching is
> done, the font should be parsed only once.

The idea of TableFilter is borrowed from java.io.FileFilter. If you look
at org.apache.commons.io.filefilter.AndFileFilter and so on, you get an
idea of how one could combine such filters.

Of course, we would have to implement some sort of dependencies between
tables if we want to save the user from surprises.
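
A possible shape for such a combination, modelled loosely on
AndFileFilter. This is a sketch under assumptions: AndTableFilter is a
hypothetical class, the interface is repeated here so that the example
compiles on its own, and a String stands in for the Tag class.

```java
/** Repeated from the proposal above, with String standing in for Tag. */
interface TableFilter {

    /** Accepts every table, so unaware clients see no filtering at all. */
    TableFilter NO_FILTERING = tableTag -> true;

    boolean doReadTable(String tableTag);
}

/** Hypothetical combinator: a table is read only if all filters accept it. */
final class AndTableFilter implements TableFilter {

    private final TableFilter[] filters;

    AndTableFilter(TableFilter... filters) {
        this.filters = filters.clone();  // defensive copy of the varargs array
    }

    @Override
    public boolean doReadTable(String tableTag) {
        for (TableFilter filter : filters) {
            if (!filter.doReadTable(tableTag)) {
                return false;            // one rejection is enough
            }
        }
        return true;
    }
}
```

A metrics-only client could then combine one filter that rejects "glyf"
with another that rejects "CFF " and pass the result to
setTableFilter().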

> There’s no such thing as IoC container in FOP. I’m not sure how easy it
> would be to introduce one. Although that would probably be A Good Thing.
> So do design your font library with IoC in mind.

Yes, I will. We can use IoC even without a container. And if we want to
choose one, I have plenty of experience with Spring. So if I had to
vote, I would probably vote for Spring.

> >> - does the use of serializable objects make sense? What would be more
> >>   efficient: re-parsing font data all the time or re-loading
> >>   serializable object representation of them?
> > 
> > You mean the font metrics XML files? I've alwas asking me for what
> > propose they are there. No, I don't think, we need this. I really don't
> > want to serialize the Advanced OpenType Features! It took me already a
> > good amount of code to parse just a bit of it.
> 
> What I meant was to use the java.io.Serializable interface. I don’t
> indeed think XML representations are any useful, apart maybe for
> debugging purpose or to have a more human-readable version of the font
> file.
> IIC there would be next to nothing to do to cache Serializable objects
> on the hard drive and retrieve them?

Hmmm. Ok. But if we want to use Serializable for that, your classes have
to be very stable. Versioning the Serializable stuff is a real burden in
my opinion. So we will need a cache which detects version changes and
invalidates the objects if so. Do you know such a lib?


Best Regards
Alex

Re: Best Interface for reading OpenType Files

Posted by Vincent Hennebert <vh...@gmail.com>.
Hi Alexander,

Alexander Kiel wrote:
> Hi Vincent,
> 
>>> I had a look at SeekableStream and I can imagine how the needs resulted
>>> in the ImageInputStream interface. I haven't decided yet if I should use
>>> ImageInputStream directly. Maybe someone else can throw it's two cents
>>> in here.
>> Here are my two cents: if you make use of classes in javax.imageio at
>> only one place in your font library, then there’s no need to worry about
>> creating a more neutral layer. If OTOH you need to use those classes
>> everywhere, then it makes sense to use a simplified abstraction layer.
>> That abstraction layer could be shipped as a separate module and evolve
>> separately. An implementation could be based on imageIO, Apache Commons
>> IO (?), your own implementation based on byte arrays for testing
>> purpose, etc.
> 
> Thanks for that. I think, I will write a OpenTypeDataInputStream which
> is not a FilterInputStream, but takes a ImageInputStream as constructor
> argument like a FilterInputStream would take a InputStream. This
> OpenTypeDataInputStream will be the API for all the Streams on top of
> it. So I would have only one point which depends on ImageInputStream.

You may want to use a factory a la SAXParserFactory. Although that may
go a bit far.


>> - is memory consumption that much of a problem anyway? I mean, fonts are
>>   intrinsically big, complex objects and there’s not much we can do
>>   about that. Many scripts in the world can’t do without advanced
>>   features. Making the parsing of some tables optional doesn’t look to
>>   me like the right way to optimise things. That would unnecessarily
>>   complicate the code.
> 
> If you only need the metrics, parsing the glyf or CFF table would be
> really unnecessary. So maybe a TableFilter interface would be useful.
> Like this:
> 
> public class OpenTypeFileInputStream {
> 
>     private TableFilter tableFilter = TableFilter.NO_FILTERING;
> 
>     public OpenTypeFileInputStream(OpenTypeDataInputStream in) {}
> 
>     public void setTableFilter(TableFilter tableFilter) {}
> }
> 
> public interface TableFilter {
> 
>     public static final TableFilter NO_FILTERING = new TableFilter() {
>         public boolean doReadTable(Tag tableTag) { return true; }
>     };
> 
>     boolean doReadTable(Tag tableTag);
> }
> 
> A client which isn't aware of TableFilter would not notice any burden
> using the API. And the implementation in OpenTypeFileInputStream isn't
> so difficult.

This is an interesting idea. But how would you combine filters?
I’d suggest to keep it aside for the moment, and implement it if we are
actually running into performance issues. After all, if some caching is
done, the font should be parsed only once.


>> - instead of seekable streams, what about a filter that would re-order
>>   the font stream, caching whatever is necessary before re-sending it to
>>   the consumer object?
> 
> I don't want to do this. In the OpenType GPOS and GSUB tables you have
> maybe 5 levels of nested structures with headers and offsets. It gets
> really complex there.

I see... All right then.


>> - what about giving the font library a “playground” directory by
>>   inversion of control, that it can use to cache things? And if no
>>   directory is given it would use the memory. Maybe a common interface
>>   could be used for that, targeting either the hard drive or the memory.
> 
> Sure. By the way - is there any IoC container used in FOP? I did not see
> one so far. How is the bootstrapping done? This could be important for a
> central FontSource or such thing.

There’s no such thing as IoC container in FOP. I’m not sure how easy it
would be to introduce one. Although that would probably be A Good Thing.
So do design your font library with IoC in mind.


>> - does the use of serializable objects make sense? What would be more
>>   efficient: re-parsing font data all the time or re-loading
>>   serializable object representation of them?
> 
> You mean the font metrics XML files? I've alwas asking me for what
> propose they are there. No, I don't think, we need this. I really don't
> want to serialize the Advanced OpenType Features! It took me already a
> good amount of code to parse just a bit of it.

What I meant was to use the java.io.Serializable interface. Indeed,
I don’t think XML representations are of much use, apart maybe for
debugging purposes or to have a more human-readable version of the font
file.
IIUC there would be next to nothing to do to cache Serializable objects
on the hard drive and retrieve them?


>> - what about looking at how fontconfig [1] (a font configuration library
>>   for Linux systems) does things? I know it makes use of a cache to
>>   speed up things. Maybe there are good ideas to borrow from there.
>>
>> [1] http://www.fontconfig.org/wiki/
> 
> I don't see speed a a problem as long as we parse every font only once.
> Parsing the OpenType font "Old Standard Regular" and converting it into
> a CustomFont is currently about 100 ms. 
> 
> 
> Best Regards
> Alex

HTH,
Vincent

Re: Best Interface for reading OpenType Files

Posted by Alexander Kiel <al...@gmx.net>.
Hi Vincent,

> > I had a look at SeekableStream and I can imagine how the needs resulted
> > in the ImageInputStream interface. I haven't decided yet if I should use
> > ImageInputStream directly. Maybe someone else can throw it's two cents
> > in here.
> 
> Here are my two cents: if you make use of classes in javax.imageio at
> only one place in your font library, then there’s no need to worry about
> creating a more neutral layer. If OTOH you need to use those classes
> everywhere, then it makes sense to use a simplified abstraction layer.
> That abstraction layer could be shipped as a separate module and evolve
> separately. An implementation could be based on imageIO, Apache Commons
> IO (?), your own implementation based on byte arrays for testing
> purpose, etc.

Thanks for that. I think I will write an OpenTypeDataInputStream which
is not a FilterInputStream, but takes an ImageInputStream as constructor
argument like a FilterInputStream would take an InputStream. This
OpenTypeDataInputStream will be the API for all the streams on top of
it. So I would have only one point which depends on ImageInputStream.

> And another bunch of thoughts and questions:
> - I think priority should be given to having a sound API that can be
>   re-used by other projects than FOP, rather than memory optimization.

Agree.

> - is memory consumption that much of a problem anyway? I mean, fonts are
>   intrinsically big, complex objects and there’s not much we can do
>   about that. Many scripts in the world can’t do without advanced
>   features. Making the parsing of some tables optional doesn’t look to
>   me like the right way to optimise things. That would unnecessarily
>   complicate the code.

If you only need the metrics, parsing the glyf or CFF table would be
really unnecessary. So maybe a TableFilter interface would be useful.
Like this:

public class OpenTypeFileInputStream {

    private TableFilter tableFilter = TableFilter.NO_FILTERING;

    public OpenTypeFileInputStream(OpenTypeDataInputStream in) {}

    public void setTableFilter(TableFilter tableFilter) {}
}

public interface TableFilter {

    public static final TableFilter NO_FILTERING = new TableFilter() {
        public boolean doReadTable(Tag tableTag) { return true; }
    };

    boolean doReadTable(Tag tableTag);
}

A client which isn't aware of TableFilter would not notice any burden
using the API. And the implementation in OpenTypeFileInputStream isn't
so difficult.

> - instead of seekable streams, what about a filter that would re-order
>   the font stream, caching whatever is necessary before re-sending it to
>   the consumer object?

I don't want to do this. In the OpenType GPOS and GSUB tables you have
maybe 5 levels of nested structures with headers and offsets. It gets
really complex there.

> - what about giving the font library a “playground” directory by
>   inversion of control, that it can use to cache things? And if no
>   directory is given it would use the memory. Maybe a common interface
>   could be used for that, targeting either the hard drive or the memory.

Sure. By the way - is there any IoC container used in FOP? I did not see
one so far. How is the bootstrapping done? This could be important for a
central FontSource or such thing.

> - does the use of serializable objects make sense? What would be more
>   efficient: re-parsing font data all the time or re-loading
>   serializable object representation of them?

You mean the font metrics XML files? I've always asked myself what
purpose they serve. No, I don't think we need this. I really don't
want to serialize the Advanced OpenType Features! It already took me a
good amount of code to parse just a bit of them.

> - what about looking at how fontconfig [1] (a font configuration library
>   for Linux systems) does things? I know it makes use of a cache to
>   speed up things. Maybe there are good ideas to borrow from there.
> 
> [1] http://www.fontconfig.org/wiki/

I don't see speed as a problem as long as we parse every font only once.
Parsing the OpenType font "Old Standard Regular" and converting it into
a CustomFont currently takes about 100 ms.


Best Regards
Alex

-- 
e-mail: alexanderkiel@gmx.net
web:    www.alexanderkiel.net


Re: Best Interface for reading OpenType Files

Posted by Vincent Hennebert <vh...@gmail.com>.
Hi Alexander,

Knowing just about nothing of font file formats and libraries, I’ll play
the ignorant guy asking naive questions, hoping that it might give you
good ideas.


Alexander Kiel wrote:
> Hi Jeremias,
> 
> On Fri, 2009-09-25 at 08:37 +0200, Jeremias Maerki wrote:
>> I don't think that relying directly on the ImageIO API is a problem
>> since it's been part of the core Java class library since Java 1.4. It's
>> available in all JVMs that claim to be at least Java 1.4 compliant. I
>> don't really see the benefit in hiding the API behind an additional
>> layer. ImageIO is here to stay. But that's just my opinion.
>>
>> Please note that SeekableStream is a predecessor of the ImageIO
>> ImageInputStream as the image codecs in XML Graphics Commons originally
>> came from JAI via Batik. It's not something we built specifically for
>> our project here.
> 
> I had a look at SeekableStream and I can imagine how the needs resulted
> in the ImageInputStream interface. I haven't decided yet if I should use
> ImageInputStream directly. Maybe someone else can throw it's two cents
> in here.

Here are my two cents: if you make use of classes in javax.imageio at
only one place in your font library, then there’s no need to worry about
creating a more neutral layer. If OTOH you need to use those classes
everywhere, then it makes sense to use a simplified abstraction layer.
That abstraction layer could be shipped as a separate module and evolve
separately. An implementation could be based on ImageIO, Apache Commons
IO (?), your own implementation based on byte arrays for testing
purposes, etc.

And another bunch of thoughts and questions:
- I think priority should be given to having a sound API that can be
  re-used by other projects than FOP, rather than memory optimization.
- is memory consumption that much of a problem anyway? I mean, fonts are
  intrinsically big, complex objects and there’s not much we can do
  about that. Many scripts in the world can’t do without advanced
  features. Making the parsing of some tables optional doesn’t look to
  me like the right way to optimise things. That would unnecessarily
  complicate the code.
- instead of seekable streams, what about a filter that would re-order
  the font stream, caching whatever is necessary before re-sending it to
  the consumer object?
- what about giving the font library a “playground” directory by
  inversion of control, that it can use to cache things? And if no
  directory is given it would use the memory. Maybe a common interface
  could be used for that, targeting either the hard drive or the memory.
- does the use of serializable objects make sense? What would be more
  efficient: re-parsing font data all the time or re-loading
  serializable object representation of them?
- what about looking at how fontconfig [1] (a font configuration library
  for Linux systems) does things? I know it makes use of a cache to
  speed up things. Maybe there are good ideas to borrow from there.

[1] http://www.fontconfig.org/wiki/
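
The "playground directory" idea above could be sketched roughly as follows.
This is only an illustration: CacheStore, MemoryCacheStore,
DirectoryCacheStore and CacheStores are hypothetical names, not existing FOP
classes, and the sketch assumes that cache keys are already safe file names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical storage abstraction; the name and methods are illustrative only. */
interface CacheStore {
    void put(String key, byte[] data) throws IOException;

    /** Returns the stored bytes, or null if nothing is stored under the key. */
    byte[] get(String key) throws IOException;
}

/** Keeps cached entries on the heap; used when no directory was supplied. */
final class MemoryCacheStore implements CacheStore {
    private final Map<String, byte[]> entries = new HashMap<>();

    @Override
    public void put(String key, byte[] data) {
        entries.put(key, data.clone());
    }

    @Override
    public byte[] get(String key) {
        byte[] data = entries.get(key);
        return data == null ? null : data.clone();
    }
}

/** Keeps each entry as one file in a caller-supplied directory. */
final class DirectoryCacheStore implements CacheStore {
    private final Path dir;

    DirectoryCacheStore(Path dir) {
        this.dir = dir;
    }

    @Override
    public void put(String key, byte[] data) throws IOException {
        // Assumption: the key is a plain file name without path separators.
        Files.write(dir.resolve(key), data);
    }

    @Override
    public byte[] get(String key) throws IOException {
        Path file = dir.resolve(key);
        return Files.exists(file) ? Files.readAllBytes(file) : null;
    }
}

final class CacheStores {
    /** Uses the directory if the caller supplied one, otherwise falls back to memory. */
    static CacheStore forDirectory(Path dirOrNull) {
        return dirOrNull == null ? new MemoryCacheStore() : new DirectoryCacheStore(dirOrNull);
    }
}
```

The font library would only ever see CacheStore; whether the data ends up on
the hard drive or in memory is decided by whoever configures the library.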

Hoping my questions aren’t too stupid...

Vincent


>> An inquiry on fop-users [1] reminded me to just briefly mention an
>> important point about the font subsystem: the fact that some font data
>> is loaded again and again for each rendering run. We've discussed this 
>> (and possible solution approaches: "font sources") in the past (see
>> mailing list archives, particularly [2]). Unfortunately, this hasn't
>> been realized, yet. Some improvements were made in the last couple of
>> years, but we're not quite there, yet. So I'm happy that you've started
>> working in this area. This will surely be at least a big step in the
>> right direction.
>>
>> [1] http://markmail.org/thread/r6etkcadyaahgyhe
>> [2] http://markmail.org/message/4cmbj5x3zkvflrax
> 
> I read the FOPFontSubsystemDesign [1] wiki page. At the moment I don't
> understand the whole system well enough to see what's needed by the rest
> of FOP. I think a deeper discussion about the font subsystem would
> be outside the subject of this discussion. So maybe we should start a new
> thread on the list. But before this, I should get my OpenType reading
> finished and submit the patch.
> 
> Best Regards
> Alex
> 
> [1] http://wiki.apache.org/xmlgraphics-fop/FOPFontSubsystemDesign
> 
> 
> e-mail: alexanderkiel@gmx.net
> web:    www.alexanderkiel.net
> 

Re: Best Interface for reading OpenType Files

Posted by Alexander Kiel <al...@gmx.net>.
Hi Jeremias,

On Fri, 2009-09-25 at 08:37 +0200, Jeremias Maerki wrote:
> I don't think that relying directly on the ImageIO API is a problem
> since it's been part of the core Java class library since Java 1.4. It's
> available in all JVMs that claim to be at least Java 1.4 compliant. I
> don't really see the benefit in hiding the API behind an additional
> layer. ImageIO is here to stay. But that's just my opinion.
> 
> Please note that SeekableStream is a predecessor of the ImageIO
> ImageInputStream as the image codecs in XML Graphics Commons originally
> came from JAI via Batik. It's not something we built specifically for
> our project here.

I had a look at SeekableStream and I can imagine how the needs resulted
in the ImageInputStream interface. I haven't decided yet if I should use
ImageInputStream directly. Maybe someone else can throw their two cents
in here.

> An inquiry on fop-users [1] reminded me to just briefly mention an
> important point about the font subsystem: the fact that some font data
> is loaded again and again for each rendering run. We've discussed this 
> (and possible solution approaches: "font sources") in the past (see
> mailing list archives, particularly [2]). Unfortunately, this hasn't
> been realized, yet. Some improvements were made in the last couple of
> years, but we're not quite there, yet. So I'm happy that you've started
> working in this area. This will surely be at least a big step in the
> right direction.
> 
> [1] http://markmail.org/thread/r6etkcadyaahgyhe
> [2] http://markmail.org/message/4cmbj5x3zkvflrax

I read the FOPFontSubsystemDesign [1] wiki page. At the moment I don't
understand the whole system well enough to see what's needed by the rest
of FOP. I think a deeper discussion about the font subsystem would
be outside the subject of this discussion. So maybe we should start a new
thread on the list. But before this, I should get my OpenType reading
finished and submit the patch.

Best Regards
Alex

[1] http://wiki.apache.org/xmlgraphics-fop/FOPFontSubsystemDesign


e-mail: alexanderkiel@gmx.net
web:    www.alexanderkiel.net


Re: Best Interface for reading OpenType Files

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
Alexander,

I don't think that relying directly on the ImageIO API is a problem
since it's been part of the core Java class library since Java 1.4. It's
available in all JVMs that claim to be at least Java 1.4 compliant. I
don't really see the benefit in hiding the API behind an additional
layer. ImageIO is here to stay. But that's just my opinion.

Please note that SeekableStream is a predecessor of the ImageIO
ImageInputStream as the image codecs in XML Graphics Commons originally
came from JAI via Batik. It's not something we built specifically for
our project here.

I'm not aware of any implementation of ImageInputStream or
SeekableStream that takes a byte array as constructor parameter.
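
For unit tests, one possible workaround (a minimal sketch, not something
taken from FOP or XML Graphics Commons) is to wrap the byte array in a
ByteArrayInputStream and hand that to javax.imageio's
MemoryCacheImageInputStream. ByteArrayFontInput and readNumTables are
hypothetical names; the example reads the numTables field, which the
OpenType offset table stores as an unsigned 16-bit value at byte offset 4.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

final class ByteArrayFontInput {

    /** Wraps a byte array in a seekable ImageInputStream (big-endian by default). */
    static ImageInputStream open(byte[] data) {
        return new MemoryCacheImageInputStream(new ByteArrayInputStream(data));
    }

    /** Reads the numTables field (uint16 at offset 4) of an sfnt offset table. */
    static int readNumTables(byte[] data) throws IOException {
        ImageInputStream in = open(data);
        try {
            in.seek(4);                    // skip the 4-byte sfnt version
            return in.readUnsignedShort();
        } finally {
            in.close();
        }
    }
}
```

ImageInputStream implementations default to big-endian byte order, which
matches OpenType, so no setByteOrder() call is needed here.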

An inquiry on fop-users [1] reminded me to just briefly mention an
important point about the font subsystem: the fact that some font data
is loaded again and again for each rendering run. We've discussed this 
(and possible solution approaches: "font sources") in the past (see
mailing list archives, particularly [2]). Unfortunately, this hasn't
been realized, yet. Some improvements were made in the last couple of
years, but we're not quite there, yet. So I'm happy that you've started
working in this area. This will surely be at least a big step in the
right direction.

[1] http://markmail.org/thread/r6etkcadyaahgyhe
[2] http://markmail.org/message/4cmbj5x3zkvflrax

On 24.09.2009 22:18:21 Alexander Kiel wrote:
> Hi Jeremias,
> 
> On Thu, 2009-09-24 at 21:06 +0200, Jeremias Maerki wrote:
> > Right, and that accounts for a pretty large portion of FOP's memory
> > consumption problem nowadays. With the use of OpenType fonts, this gets
> > worse as they can be quite big. I'm glad you noticed that.
> 
> Yes, but currently I read all OpenType tables I'm aware of. The Java
> data structures are considerably bigger than the original file. The biggest
> fonts I have seen are about 400 KB in size. I don't know the size of the Java
> structures at the moment. If it's really a problem, I can profile this later. So
> maybe I should add config options which select the tables to read.
> Currently the data is moved into a CustomFont and the TTFFile is thrown
> away (I hope so). But CustomFont doesn't have the power of advanced
> OpenType features. So I think we will end up with some interfaces
> instead which may be implemented by the TTFFile itself or by classes
> using the TTFFile in the background. That said, we will end up with some
> amount of data in memory anyway. 
> 
> > May I suggest using ImageIO's ImageInputStream? That already has an
> > implementation that buffers the stream in a temporary file (if allowed)
> > so you basically have random access. I've used that extensively in the
> > image loading framework in XML Graphics Commons and it seems to be
> > ideal for what you need to do. You even get the hierarchical mark/reset.
> 
> Thanks for that! I did not know of ImageIO's ImageInputStream before.
> It's an interface, which is good. It is capable of all the stuff I need.
> The only things to complain about are that it has more functionality than
> needed and that it's named a bit oddly for fonts. Maybe I should specify an
> interface which is a subset of ImageInputStream and provide a simple
> wrapper to ImageInputStream so that I can use the implementations.
> 
> > I don't think NIO will help much here. I'd really suggest
> > ImageInputStream which should have everything you need. You can probably
> > even reuse some utility code I've written for the image loading
> > framework:
> > http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/util/ImageUtil.java?view=markup
> > http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/util/ImageInputStreamAdapter.java?view=markup
> > 
> > The following class has some code to get an ImageInputStream from a URI.
> > If it's a file URL it tries to get an ImageInputStream with random
> > access. In all other cases, the content is buffered by ImageIO's default
> > buffering implementations (depending on the settings).
> > http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/impl/AbstractImageSessionContext.java?view=markup
> > That code might even be extracted to be useful to you.
> > 
> > See also: http://java.sun.com/j2se/1.4.2/docs/api/javax/imageio/ImageIO.html
> > (methods setUseCache() and setCacheDirectory())
> 
> Thanks for those pointers. What do you think? Should I specify my own
> SeekableInputStream which isn't able to do all those bit operations and
> some of the DataInput operations I don't need, or should I use
> ImageInputStream directly? Is there a simple implementation on top of
> byte arrays for unit testing? OK, I could use ImageInputStreamImpl for
> that...
> 
> If I think about it more deeply, it would not be so clever for a font API to
> depend on javax.imageio. There is the "x" in javax and the "image" in
> imageio which I don't like.
> 
> It's a pity that there is no common byte-only random access input source
> interface in Java, isn't it?
> 
> Best Regards
> Alex
> 
> -- 
> e-mail: alexanderkiel@gmx.net
> web:    www.alexanderkiel.net
> 




Jeremias Maerki


Re: Best Interface for reading OpenType Files

Posted by Alexander Kiel <al...@gmx.net>.
Hi Jeremias,

On Thu, 2009-09-24 at 21:06 +0200, Jeremias Maerki wrote:
> Right, and that accounts for a pretty large portion of FOP's memory
> consumption problem nowadays. With the use of OpenType fonts, this gets
> worse as they can be quite big. I'm glad you noticed that.

Yes, but currently I read all OpenType tables I'm aware of. The Java
data structures are considerably bigger than the original file. The biggest
fonts I have seen are about 400 KB in size. I don't know the size of the Java
structures at the moment. If it's really a problem, I can profile this later. So
maybe I should add config options which select the tables to read.
Currently the data is moved into a CustomFont and the TTFFile is thrown
away (I hope so). But CustomFont doesn't have the power of advanced
OpenType features. So I think we will end up with some interfaces
instead which may be implemented by the TTFFile itself or by classes
using the TTFFile in the background. That said, we will end up with some
amount of data in memory anyway. 

> May I suggest using ImageIO's ImageInputStream? That already has an
> implementation that buffers the stream in a temporary file (if allowed)
> so you basically have random access. I've used that extensively in the
> image loading framework in XML Graphics Commons and it seems to be
> ideal for what you need to do. You even get the hierarchical mark/reset.

Thanks for that! I did not know of ImageIO's ImageInputStream before.
It's an interface, which is good. It is capable of all the stuff I need.
The only things to complain about are that it has more functionality than
needed and that it's named a bit oddly for fonts. Maybe I should specify an
interface which is a subset of ImageInputStream and provide a simple
wrapper to ImageInputStream so that I can use the implementations.
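
Such a subset interface and its wrapper might look roughly like this. It is
only a sketch under assumptions: SeekableInput and
ImageInputStreamSeekableInput are made-up names, and the choice of methods is
just one possible cut of what a font reader needs.

```java
import java.io.Closeable;
import java.io.IOException;
import javax.imageio.stream.ImageInputStream;

/** Hypothetical byte-only, seekable input; the name and methods are illustrative. */
interface SeekableInput extends Closeable {
    /** Returns the next byte (0..255), or -1 at the end of the input. */
    int read() throws IOException;

    /** Reads up to len bytes into buf; returns the count, or -1 at the end. */
    int read(byte[] buf, int off, int len) throws IOException;

    /** Returns the current absolute byte position. */
    long position() throws IOException;

    /** Moves to an absolute byte position. */
    void seek(long pos) throws IOException;
}

/** Thin wrapper that delegates every call to an ImageInputStream. */
final class ImageInputStreamSeekableInput implements SeekableInput {
    private final ImageInputStream in;

    ImageInputStreamSeekableInput(ImageInputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        return in.read(buf, off, len);
    }

    @Override
    public long position() throws IOException {
        return in.getStreamPosition();
    }

    @Override
    public void seek(long pos) throws IOException {
        in.seek(pos);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

With this split, only the wrapper class imports javax.imageio; the font
reading code would depend on SeekableInput alone.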

> I don't think NIO will help much here. I'd really suggest
> ImageInputStream which should have everything you need. You can probably
> even reuse some utility code I've written for the image loading
> framework:
> http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/util/ImageUtil.java?view=markup
> http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/util/ImageInputStreamAdapter.java?view=markup
> 
> The following class has some code to get an ImageInputStream from a URI.
> If it's a file URL it tries to get an ImageInputStream with random
> access. In all other cases, the content is buffered by ImageIO's default
> buffering implementations (depending on the settings).
> http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/impl/AbstractImageSessionContext.java?view=markup
> That code might even be extracted to be useful to you.
> 
> See also: http://java.sun.com/j2se/1.4.2/docs/api/javax/imageio/ImageIO.html
> (methods setUseCache() and setCacheDirectory())

Thanks for those pointers. What do you think? Should I specify my own
SeekableInputStream which isn't able to do all those bit operations and
some of the DataInput operations I don't need, or should I use
ImageInputStream directly? Is there a simple implementation on top of
byte arrays for unit testing? OK, I could use ImageInputStreamImpl for
that...

If I think about it more deeply, it would not be so clever for a font API to
depend on javax.imageio. There is the "x" in javax and the "image" in
imageio which I don't like.

It's a pity that there is no common byte-only random access input source
interface in Java, isn't it?

Best Regards
Alex

-- 
e-mail: alexanderkiel@gmx.net
web:    www.alexanderkiel.net


Re: Best Interface for reading OpenType Files

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
On 24.09.2009 17:53:29 Alexander Kiel wrote:
> Hi,
> 
> I am currently thinking about the interface to use for reading OpenType
> files.
> 
> There are two possibilities:
> 
>  - reading on top of an InputStream or
>  - reading on top of a RandomAccessFile or FileChannel.
> 
> Currently the implementation in FOP uses the class FontFileReader which
> expects an InputStream. But it immediately calls IOUtils.toByteArray(in)
> and works on that byte array instead. So it needs to hold the file
> completely in memory.

Right, and that accounts for a pretty large portion of FOP's memory
consumption problem nowadays. With the use of OpenType fonts, this gets
worse as they can be quite big. I'm glad you noticed that.

> FontBox which is part of PDFBox uses some abstract class called
> TTFDataStream with template methods which has two implementations, one
> called RAFDataStream which operates on top of a RandomAccessFile and one
> called MemoryTTFDataStream which operates on top of a byte array.

So if you access the font via a URL that is not a file URL, you still
get a memory problem.

> I started using pure InputStreams. That means I implemented the whole
> OpenType file reading using a hierarchy of FilterInputStreams. At the
> lowest level I have a DataInputStream which takes any InputStream and
> provides methods to read the basic data types of OpenType just like
> java.io.DataInputStream does for Java data types. On top of that, I have
> streams that can read some small scale data structures, then streams
> which can read whole tables and finally a stream which can read the
> whole OpenType file.

Yeah, that's the ideal world.

> To read an OpenType file, all you have to write is:
> 
>     InputStream in = ...
>     OpenTypeFileInputStream otfIn = new OpenTypeFileInputStream(in);
>     OpenTypeFile otf = otfIn.readOpenTypeFile();
> 
> In my opinion this system works really well. You can take any
> InputStream, the reading is decoupled from the OpenType classes themselves
> and you can test pieces of OpenType structure using only the individual
> streams.
> 
> But! My approach has one flaw. I need to seek extensively while reading
> an OpenType file. The whole file format consists of headers with offsets
> and data structures which one has to read from those offsets.
> 
> To get this seeking to work with streams, I use mark(), reset() and skip().
> My common approach at the beginning of such a structure is to mark, then
> read the header and for every part, reset to the start, mark again, skip
> to the offset and read the part.
> 
> But with this approach I end up holding the whole file in memory.
> 
> To make it worse, this mark(), reset(), skip() interface doesn't support
> hierarchical marking. If I seek inside smaller scale structures the mark
> position of the larger scale structure is overwritten. I don't think
> that it is possible to build hierarchical mark support on top of any
> markable InputStream. (Oh look I did it later as I wrote this longish
> mail.) I think one has to reimplement BufferedInputStream holding one's
> own byte array. In fact I did this on top of ByteArrayInputStream. The
> key problem is that one can't get a position out of an InputStream which
> does not surprise as the concept of streams doesn't have a position. 

May I suggest using ImageIO's ImageInputStream? That already has an
implementation that buffers the stream in a temporary file (if allowed)
so you basically have random access. I've used that extensively in the
image loading framework in XML Graphics Commons and it seems to be
ideal for what you need to do. You even get the hierarchical mark/reset.
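
To illustrate the hierarchical mark/reset, here is a small sketch. The
NestedMarkDemo class and the toy structure it reads are made up for this
example: a 16-bit offset, relative to the structure start, that points to a
16-bit child value, followed by a second header field. Each mark() pushes the
current position on a stack, so the inner mark does not overwrite the outer
one.

```java
import java.io.IOException;
import javax.imageio.stream.ImageInputStream;

final class NestedMarkDemo {

    /**
     * Reads a toy parent structure and returns {childValue, nextHeaderField}.
     * The layout is an assumption made for this example only.
     */
    static int[] readParent(ImageInputStream in) throws IOException {
        long start = in.getStreamPosition();
        int childOffset = in.readUnsignedShort();

        in.mark();                          // outer mark: position after the offset field
        in.seek(start + childOffset);       // jump to the child
        in.mark();                          // inner mark: pushed on top of the outer one
        int child = in.readUnsignedShort();
        in.reset();                         // pops the inner mark: back at the child start
        in.reset();                         // pops the outer mark: back after the offset field

        int next = in.readUnsignedShort();  // continue reading the header
        return new int[] {child, next};
    }
}
```

The inner mark/reset pair stands in for whatever a nested reader would do on
its own; the point is only that the outer position survives it.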

> It is possible to read the parts in offset order. But there are
> duplicated offsets (more than one offset pointing to the same part) and
> parts that have to go into an array in a semantic order which doesn't
> have to be the offset order. So I have to first reorder the offsets to
> read the parts in offset order and then I have to reorder the read parts
> again to get them back into the semantic order. That said - it is still
> possible that the offsets are in fact in the semantic order of the
> parts, but the spec doesn't say this.
> 
> I don't want to depend on RandomAccessFile or FileChannel, because I
> need to be able to test reading of substructures out of byte arrays.

Good decision IMO.

> What I need is an Interface from which I can read bytes and which allows
> multiple relative seeks. With multiple relative seeks I mean something
> like multiple marks. As I wrote this, I implemented such a thing inside
> my DataInputStream. There is now a method:
> 
>     public SkipHandle mark();
> 
> and the SkipHandle class looks like this:
> 
>     public class SkipHandle {
>         
>         private final long relativePos;
> 
>         public void skipTo(long offset);
>     }
> 
> SkipHandle is a non-static inner class of DataInputStream.
> DataInputStream counts the bytes read and skipped to get an idea of its
> actual position. The SkipHandle gets the actual stream position on
> creation so that it is able to skip on DataInputStream relative to its
> creation position. If the skip would be negative, SkipHandle resets the
> whole stream to the start (on creation of DataInputStream, a normal mark
> is set) and skips afterwards.
> 
> It works, but I find it a little bit ugly. First I have to set a
> mark(Integer.MAX_VALUE) on DataInputStream creation, because I always
> want to be able to reset the whole stream, but I don't have any
> information about how many bytes are on the road. Then I have to disable
> markSupport on my DataInputStream so that nobody kills my own mark.
> 
> But the biggest problem is that DataInputStream has now a non-standard
> mark(), skipTo() API. It's not like a normal FilterInputStream anymore.
> You can't use normal marking, because it's disabled and you have to
> learn this new API instead. 
> 
> Streams simply aren't the right API for reading stuff like OpenType
> files which require massive seeking. But all the seekable APIs are
> tied to files.
> 
> The TTFDataStream API of FontBox is completely custom. I would like to
> avoid such things. 
> 
> So I simply don't know a standard Java API which allows byte reading and
> seeking over an arbitrary source and throws IOExceptions on its methods.
> What about NIO? I don't see any skipping or seeking on channels.

I don't think NIO will help much here. I'd really suggest
ImageInputStream which should have everything you need. You can probably
even reuse some utility code I've written for the image loading
framework:
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/util/ImageUtil.java?view=markup
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/util/ImageInputStreamAdapter.java?view=markup

The following class has some code to get an ImageInputStream from a URI.
If it's a file URL it tries to get an ImageInputStream with random
access. In all other cases, the content is buffered by ImageIO's default
buffering implementations (depending on the settings).
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/impl/AbstractImageSessionContext.java?view=markup
That code might even be extracted to be useful to you.

See also: http://java.sun.com/j2se/1.4.2/docs/api/javax/imageio/ImageIO.html
(methods setUseCache() and setCacheDirectory())
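
A minimal sketch of how those two settings fit together with
ImageIO.createImageInputStream(); CachedFontInput is a hypothetical helper
name, not part of XML Graphics Commons. Note that both settings are static
and therefore shared, not per stream.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;
import javax.imageio.stream.ImageInputStream;

final class CachedFontInput {

    /**
     * Opens a seekable stream over arbitrary input. With the cache enabled,
     * ImageIO may buffer the data in a temporary file instead of in memory.
     */
    static ImageInputStream open(InputStream source, File cacheDirOrNull) throws IOException {
        ImageIO.setUseCache(true);                 // allow a disk-based cache
        ImageIO.setCacheDirectory(cacheDirOrNull); // null selects the default temp directory
        ImageInputStream in = ImageIO.createImageInputStream(source);
        if (in == null) {
            // createImageInputStream() returns null if no provider accepts the input
            throw new IOException("No ImageInputStream provider for " + source.getClass());
        }
        return in;
    }
}
```

The returned stream supports seek() in both directions, so the caller does
not need to know whether the bytes live in a temp file or on the heap.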

> Any idea is welcome.
> 
> 
> Best Regards
> Alex
>  
> -- 
> e-mail: alexanderkiel@gmx.net
> web:    www.alexanderkiel.net
> 




Jeremias Maerki