Posted to fop-users@xmlgraphics.apache.org by Ted Young <ty...@ncoastsoft.com> on 2008/05/18 06:10:02 UTC

Making a Large Dictionary

Greetings to the list,

I am using FOP version 0.94 under Java 1.6 update 6.

I am using FOP to create a PDF dictionary from a source XML file.  There are
only 12,000 entries in this dictionary.  The tricky bit is that the words
being defined are in Ancient Egyptian Hieroglyphs.  Each word (or phrase) is
stored in SVG format.  So ultimately my FO document contains 12,000
instream-foreign-object tags each containing SVG.  This configuration alone
taxes the 1 GB memory limit I am able to give to my virtual machine (a known
issue with running Java under Windows).

Each dictionary entry would like to contain zero or more alternative words
and phrases (think thesaurus).  This increases the number of
instream-foreign-object tags containing SVG to the order of 30,000 or
40,000.  Even breaking this up into individual chapters I have a very hard
time rendering these documents.  They consume vast amounts of memory and
bring my system to a halt even under Linux.

So I was wondering if anyone had any suggestions on how I could optimize my
FO document and use of SVG.  Since all of the "see also" words and phrases
can be found elsewhere in my document is there a way to generate a PDF-layer
reference to content located at another part of a document? I am thinking of
something analogous to symbols in Flash; a way of having FOP render the SVG
word or phrase once and have it instruct the PDF to reuse that content in
many locations.

Converting these to images is less than ideal since ancient Egyptian
hieroglyphs contain a lot of fine detail that would be lost or blurred if
rasterized.  One of my main reasons for choosing this approach was that I
knew FOP would preserve the hieroglyphs in vector format.

For what it is worth, without the SVG content FOP runs fantastically.  So
this seems to be due to the sheer volume of vector data added to the
rendering process by the inclusion of all of these SVG elements.

Since this is my first time ever mailing any FOP-related mailing list, let
me take this opportunity to say that I have been using FOP for over seven
years now and have enjoyed every minute of it.  This is a fantastic product
and I think the improvements in this new branch in performance and API are
nothing less than spectacular!  Thank you to everyone contributing to this
project.

Ted Young

Re: Making a Large Dictionary

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Andreas Delmelle wrote:
> I'm not 100% certain, but I think using i-f-o currently means that
> inserting the same SVG 1000 times would lead to 1000 separate SVG DOMs
> as foreign nodes in the FO tree (?)

I think so. The advantage is that these should be reclaimed once the
page sequence ends, while the SVG used by external-graphic persists
in the image cache.
Therefore: if it is possible to break up the document into many small
page sequences (without cross references), and processing time is
not much of a concern, then using instream-foreign-object should be
an advantage; otherwise external-graphic is probably better.
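
As a rough sketch of the "many small page sequences" option, the
hypothetical Java helper below wraps every N entries in their own
fo:page-sequence. The class name, the master-reference value "page", and
the premise that each entry is already available as FO markup are
assumptions made for illustration; none of them come from this thread.

```java
import java.util.List;

/** Hypothetical helper: splits a long run of entries into several short page sequences. */
class PageSequenceChunker {

    /**
     * Wraps every perSequence entries in their own fo:page-sequence.
     * entryBlocks holds ready-made FO markup, one string per dictionary entry.
     */
    static String toPageSequences(List<String> entryBlocks, int perSequence) {
        if (perSequence < 1) {
            throw new IllegalArgumentException("perSequence must be at least 1");
        }
        StringBuilder fo = new StringBuilder();
        for (int start = 0; start < entryBlocks.size(); start += perSequence) {
            int end = Math.min(start + perSequence, entryBlocks.size());
            // "page" is an assumed simple-page-master name defined elsewhere in the document.
            fo.append("<fo:page-sequence master-reference=\"page\">\n");
            fo.append("  <fo:flow flow-name=\"xsl-region-body\">\n");
            for (String block : entryBlocks.subList(start, end)) {
                fo.append("    ").append(block).append('\n');
            }
            fo.append("  </fo:flow>\n");
            fo.append("</fo:page-sequence>\n");
        }
        return fo.toString();
    }
}
```

Each fo:page-sequence starts on a new page, so the chunk size is a
trade-off between memory use and the number of forced page breaks.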

In any case, I'm always amazed by the applications people find
for XSLFO in general and FOP in particular.

J.Pietschmann

---------------------------------------------------------------------
To unsubscribe, e-mail: fop-users-unsubscribe@xmlgraphics.apache.org
For additional commands, e-mail: fop-users-help@xmlgraphics.apache.org

Re: Making a Large Dictionary

Posted by Vincent Hennebert <vi...@anyware-tech.com>.

Hi Ted,

Thanks for your warm feedback :-) It is great to know that FOP is
successfully used in big real-life projects. You might want to add an
entry to the SuccessStories wiki page:
http://wiki.apache.org/xmlgraphics-fop/SuccessStories
Or simply add a link to your own webspace where you would document your
process.

Thanks again,
Vincent


-- 
Vincent Hennebert                            Anyware Technologies
http://people.apache.org/~vhennebert         http://www.anyware-tech.com
Apache FOP Committer                         FOP Development/Consulting

RE: Making a Large Dictionary

Posted by Ted Young <ty...@tx.rr.com>.

Thank you to all who have replied.  The answer by Andreas Delmelle was right
on!  I switched over to using external-graphic and not only can I render
within my 1 GB memory limit, my PDF files seem to display faster.

Today I was thinking that if this worked, it would be neat to be able to use
the same technique to embed PDF data; effectively creating content which can
be efficiently repeated throughout the document.  That is when I came home
to find Peter Coppens's email regarding his extension to do just that:

http://www.jeremias-maerki.ch/development/fop/

You seem to be on the right track, my friend.

And finally to Mr. Pietschmann's statement:

> In any case, I'm always amazed by the applications people find
> for XSLFO in general and FOP in particular.

Oh yeah!  This is a fantastic tool and anyone with a little programming
experience and a little imagination can really go to town.

I once had to generate API documentation for a large system of code (many
languages) that was documented by several developers (it took 5.5 man-years
to document).  I took their documentation, plus some software analysis
tools, graphviz, and FOP and ended up with over 5,000 pages of documentation
(which the customer required to be printed). 

I love the fact that I can take a Java object model and use JAXB to
serialize it to XML, which is then forwarded to an XSLT transformer which
converts it into XSL-FO, which is rendered to PDF by FOP.  And all of this is
done entirely in SAX!!!  Using the Collection and Iterator interfaces, one
can load large datasets into JAXB lazily.  So, at no point in time is the
entire dataset (in any format) ever in memory.  And all in only a dozen
lines of code!!!
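
The chaining described above can be sketched with plain JAXP. This is a
minimal illustration and not the actual dictionary code: a counting
handler stands in for FOP, and the class and method names are made up.
In a real setup the SAXResult would presumably wrap the handler returned
by FOP's Fop.getDefaultHandler(), and the input could be a JAXBSource
rather than a string.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: XSLT output is pushed as SAX events straight into a consumer. */
class SaxPipelineSketch {

    /** Stand-in for the FO consumer: counts fo:block start events instead of laying out pages. */
    static class BlockCounter extends DefaultHandler {
        int blocks;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("http://www.w3.org/1999/XSL/Format".equals(uri) && "block".equals(localName)) {
                blocks++;
            }
        }
    }

    /** Runs the stylesheet over the XML and returns how many fo:block elements reached the handler. */
    static int countBlocks(String xslt, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            BlockCounter counter = new BlockCounter();
            // No intermediate FO file or string is built: the transformer calls the handler directly.
            transformer.transform(new StreamSource(new StringReader(xml)), new SAXResult(counter));
            return counter.blocks;
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```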

Adding in the Ancient Egyptian is a cinch with the JSesh library that
converts MDC (a standard transcription system for AE hieroglyphs) to SVG.

The only difficult part was taking the source of the dictionary which laid
out the hieroglyphs using absolute coordinates and converting that to MDC
(which doesn't use coordinates at all, but rather logical associations
between glyphs; such as A is on top of B, etc.).  This required the
development of a layout analysis engine, which will need some more tweaking
apparently.

Anyway, when I put the dictionary online I hope to document the whole
process (including code); this is all open-source/free-to-abuse stuff.
I think there are some really good FOP tutorials in this.

Ted


Re: Making a Large Dictionary

Posted by Andreas Delmelle <an...@telenet.be>.

On May 18, 2008, at 06:10, Ted Young wrote:

Hi

> I am using FOP version 0.94 under Java 1.6 update 6.
>
> I am using FOP to create a PDF dictionary from a source XML file.   
> There are only 12,000 entries in this dictionary.  The tricky bit  
> is that the words being defined are in Ancient Egyptian  
> Hieroglyphs.  Each word (or phrase) is stored in SVG format.  So  
> ultimately my FO document contains 12,000 instream-foreign-object  
> tags each containing SVG.  This configuration alone taxes the 1 GB  
> memory limit I am able to give to my virtual machine (a known issue  
> with running Java under Windows).
>
> Each dictionary entry would like to contain zero or more  
> alternative words and phrases (think thesaurus).  This increases  
> the number of instream-foreign-object tags containing SVG to the  
> order of 30,000 or 40,000.  Even breaking this up into individual  
> chapters I have a very hard time rendering these documents.  They  
> consume vast amounts of memory and bring my system to a halt even  
> under Linux.

Breaking up the document into chapters will not save /very/ much if  
there are a lot of cross-references, especially when they point  
forward to later chapters...

> So I was wondering if anyone had any suggestions on how I could  
> optimize my FO document and use of SVG.  Since all of the "see  
> also" words and phrases can be found elsewhere in my document is  
> there a way to generate a PDF-layer reference to content located at  
> another part of a document? I am thinking of something analogous to  
> symbols in Flash; a way of having FOP render the SVG word or phrase  
> once and have it instruct the PDF to reuse that content in many  
> locations.

If you insert the SVG as fo:external-graphic and the source URI
for the image is the same, it should normally be re-used at the other
places it is referenced. Since fo:instream-foreign-objects do not
have a URI, I don't think this allows for any internal caching or
re-using of the images...
I'm not 100% certain, but I think using i-f-o currently means that
inserting the same SVG 1000 times would lead to 1000 separate SVG
DOMs as foreign nodes in the FO tree (?)

Trying external-graphic would, of course, require an extraction of
all the SVG that is currently put in fo:instream-foreign-object to
separate files.
If you don't already have separate files, then as a tryout, this
could probably be done by applying an XSL transform to your current
FO document, replacing fo:instream-foreign-object with
fo:external-graphic, and using extensions to produce one output
document per encountered SVG.

Taking care of duplicates may turn out to be a challenge with this  
approach, though...
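
One way to keep duplicates under control, sketched here under
assumptions: remember every SVG string already seen and hand out the
same file name for identical content, so the same src URI is emitted
each time. SvgRegistry and the glyph-N.svg naming are hypothetical, and
java.nio.file requires Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of FOP: assigns one file name per distinct SVG string. */
class SvgRegistry {

    // Maps SVG source text to the file name chosen for it, in first-seen order.
    private final Map<String, String> fileNameBySvg = new LinkedHashMap<>();

    /** Returns an fo:external-graphic element, reusing the file name of an identical earlier SVG. */
    String externalGraphicFor(String svg) {
        String name = fileNameBySvg.get(svg);
        if (name == null) {
            name = "glyph-" + (fileNameBySvg.size() + 1) + ".svg";
            fileNameBySvg.put(svg, name);
        }
        return "<fo:external-graphic src=\"url(" + name + ")\"/>";
    }

    /** Number of distinct SVG documents seen so far. */
    int distinctCount() {
        return fileNameBySvg.size();
    }

    /** Writes each distinct SVG exactly once into the given directory. */
    void writeAll(Path dir) throws IOException {
        for (Map.Entry<String, String> entry : fileNameBySvg.entrySet()) {
            Files.write(dir.resolve(entry.getValue()),
                    entry.getKey().getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

The relative src values should be resolved against whatever base URL FOP
is configured with, so the files written by writeAll would need to live
there. Keeping the SVG text as map keys costs memory too; a content hash
could replace the keys if that becomes a concern.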

If you already retrieve the SVG from separate files, or from a
database, you could adapt the stylesheet to produce
fo:external-graphic instead, and pass it the URI that is used as a source.

HTH!

Andreas

Re: Making a Large Dictionary

Posted by Peter Coppens <pc...@gmail.com>.

Ted,

No idea whether this is practical for your case, nor whether it would
actually make a difference, but you could try a two-step approach.

First render the 12,000 entries to PDF (a 'page' for each entry), then
use a second FO stylesheet that refers to the PDF pages as external
graphics. See http://www.jeremias-maerki.ch/development/fop/ on how to
use PDF (pages) as external graphics. If you are lucky, FOP will use
PDF XObjects for each of the PDF pages and include them only once.

Perhaps it helps,

Peter

