Posted to fop-users@xmlgraphics.apache.org by Stephan Thesing <th...@gmx.de> on 2011/08/03 10:23:48 UTC

FOP and large documents (again)

Hello,

as we all know, FOP has performance issues when handling documents
with a lot of pages in a page-sequence and/or when pages contain
"forward" references to pages or bookmarks.

Unfortunately, producing shorter page-sequences is not always a viable
option, nor is leaving out forward references.
In my setting, e.g., a TOC is required in the document, which adds forward
references to bookmarks and pages across all page-sequences; in addition, formal
requirements mandate that page numbers be given in the form "page x of y",
which adds a forward reference to the last page of the document on every page.

Bluntly, I think FOP should be able to handle this.

Looking at the code (as far as I understand it), for each page-sequence
all KnuthElements are first computed by the layout managers;
the sequence is split only at forced page breaks.
Then possible page-break positions are searched for over the whole sequence.

Only after this are the actual output areas computed and pages produced.
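
In other words, the flow per page-sequence seems to be roughly the following
(a simplified pseudo-Java sketch, not the actual FOP code; the helper names are
loose paraphrases rather than the real methods):

    // Simplified sketch of the current flow as I read it (not actual FOP code).
    List<KnuthElement> elements = new ArrayList<KnuthElement>();
    for (LayoutManager lm : childLayoutManagers) {
        // 1. All elements of the whole page-sequence are produced up front
        //    (the element list is split only at forced page breaks).
        elements.addAll(lm.getNextKnuthElements(context, alignment));
    }
    // 2. Break positions are then computed over the complete element list.
    List<PageBreakPosition> breaks = findPageBreaks(elements);
    // 3. Only afterwards are areas generated and pages handed to the renderer.
    for (PageBreakPosition pos : breaks) {
        buildAreasForPage(pos);
    }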

Clearly, this doesn't scale for large page-sequences...

Is there a reason why this approach was chosen, instead of computing KnuthElements "lazily" (on demand), putting them on the current page, and passing the page to the renderer as soon as it is filled?

Naturally, this does not solve the problem of forward references: one either needs a renderer that can resolve references after the page has already been rendered, or one needs two layout passes (as TeX does), where the first pass collects the references and ids that the second pass can then use.
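
Just to make the two-pass idea concrete, I am thinking of something like this
(a toy sketch; layoutPass() and its arguments are invented, nothing like this
exists in FOP today):

    // Toy sketch of TeX-style two-pass reference resolution; names invented.
    Map<String, Integer> idToPage = new HashMap<String, Integer>();

    // Pass 1: lay the document out once, recording on which page every id ends
    // up; unresolved references ("page x of y", TOC entries) get placeholders.
    layoutPass(document, idToPage, true /* record only */);

    // Pass 2: lay it out again; every forward reference can now be resolved
    // from the table collected in pass 1.
    layoutPass(document, idToPage, false /* resolve from the table */);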

I am sure I overlooked something, so if there is additional information, I would be grateful for a pointer...

Please let me know your view on this :-)

Best regards
   Stephan
-- 
Dr.-Ing. Stephan Thesing
Elektrastr. 50
81925 München
GERMANY




Re: FOP and large documents (again)

Posted by Simon Pepping <sp...@leverkruid.eu>.
On Tue, Aug 16, 2011 at 04:23:47AM +0200, Stephan Thesing wrote:
> I would rather change the layout managers to produce KnuthElements in the way TeX does and leave page alignment to the page collection stage.
> I see no other manageable way to do this than to add a new interface to the
> layout managers for this demand-driven approach.
> Essentially, this would result in an implementation of content generation
> parallel to the existing getNextKnuthElements() and addAreas() interface.

By demand-driven, do you mean that the page collection process requests
Knuth elements from the layout managers? That indeed seems the way to
go.

You may get a long way, but it will be difficult to reproduce all
details and pass all unit tests. On the other hand, efforts that aimed
at full compatibility have failed so far, so an effort that focuses on
the target functionality first and on full compatibility later may be
the best strategy.

> PS: Is there any more in-depth documentation about the way the
> layout managers work apart from the Wiki Pages?

Years ago I had documentation called DnI (Design and
Implementation). I deleted it because it had become obsolete. It has
quite a few details on the layout engine, and maybe some of it is
still useful. You can find it in old revisions, e.g.
http://svn.apache.org/viewvc/xmlgraphics/fop/trunk/src/documentation/content/xdocs/DnI/?pathrev=947983.
It is based on DocBook 4.2. You may have some trouble building it,
because the Makefile still uses FOP 0.20.5.

Apart from that, we do not have any documentation.

Simon



Re: FOP and large documents (again)

Posted by Vincent Hennebert <vh...@gmail.com>.
Hi Stephan,

[Moving to fop-dev as we are diving into the gory details.]
[... And sorry for the delay.]

Some time ago I worked on a prototype implementation of a new layout
engine that would create layout elements on the fly. The main goal was
to address the changing IPD problem, which, as you noticed, is not
properly handled by the current code. The idea was also to be able to
easily switch between a best-fit and a total-fit approach for page
breaking, by simply tweaking a few parameters.
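
The intuition being that best-fit is just total-fit with the look-ahead cut
down to the current page, something like this (purely illustrative; these are
not the prototype's actual parameters):

    // Purely illustrative; these are not the prototype's actual parameters.
    // With an unbounded look-ahead the breaker optimises over the whole
    // sequence (total fit); with a look-ahead of a single page it simply
    // commits to the best break for the current page and moves on (best fit).
    breaker.setLookAheadPages(Integer.MAX_VALUE);   // total fit
    breaker.setLookAheadPages(1);                    // best fit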

In that exercise, tables proved to be the most challenging part. The
current table building algorithm relies on how much of the table would
end up after a page break, and sets the lengths of the Knuth elements
accordingly. To do that it needs to assume that pages have a certain,
fixed width. This algorithm will probably have to be redesigned in
order to implement any kind of best-fit approach.

Tables are, in my view, /the/ key part of the layout engine: if you
can get tables working, then all the rest will follow.
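
To illustrate why the fixed-width assumption matters: the block-progression
length of the Knuth box for a table row can only be known once the cell
content has been line-broken at a concrete IPD, roughly like this (an
illustrative sketch only, not the actual table layout manager code; Cell and
breakIntoLines() are made up):

    // Illustrative sketch only, not the actual table layout manager code.
    int rowBpd(List<Cell> cells, int availableIpd) {
        int maxBpd = 0;
        for (Cell cell : cells) {
            // Line-break the cell content against the column width derived
            // from the available IPD; the number of lines gives the height.
            int cellIpd = availableIpd * cell.columnWidthPercent / 100;
            int lines = breakIntoLines(cell.text, cellIpd).size();
            maxBpd = Math.max(maxBpd, lines * cell.lineHeight);
        }
        return maxBpd; // becomes the length of the Knuth box for this row
    }
    // If the IPD of the next page differs, these lengths are no longer valid
    // and the element list has to be regenerated.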

I wrote a document illustrating the challenges I faced, which you might
find interesting:
http://people.apache.org/~vhennebert/prototype/
The code for the prototype is in the Temp_Interleaved_Page_Line_Breaking
branch:
http://svn.apache.org/viewvc/xmlgraphics/fop/branches/Temp_Interleaved_Page_Line_Breaking/prototype/
(You can ignore the early Ruby version.)

It’s horribly flawed and outdated but might give you some ideas for your
implementation. I have an improved version on my hard drive but haven’t
had a chance to clean it up and publish it yet.

Good luck,
Vincent


On 16/08/11 03:23, Stephan Thesing wrote:
> Hello,
> 
> indeed, as the code currently is, it will be hard to make this into a
> first-fit layout algorithm for pages.
> 
> As the generation of KnuthElements (or ListElements) by the LayoutManagers
> seems to be quite interwoven with tasks like page alignment and other things,
> I would rather not adapt this to a "demand-driven" generation of elements as
> needed by a first-fit approach.
> Also, possible IPD changes between pages pose a problem (they are also a
> problem for the current code, which is "not nice" in that case).
> 
> I would rather change the layout managers to produce KnuthElements in the way TeX does and leave page alignment to the page collection stage.
> I see no other manageable way to do this than to add a new interface to the
> layout managers for this demand-driven approach.
> Essentially, this would result in an implementation of content generation
> parallel to the existing getNextKnuthElements() and addAreas() interface.
> 
> I can spend some effort on this, and since I clearly need a scalable
> first-fit page layout, I will give it a try.
> 
> Best regards
>    Stephan
> 
> PS: Is there any more in-depth documentation about the way the
> layout managers work apart from the Wiki Pages?
> 
> 
> -------- Original Message --------
>> Date: Wed, 3 Aug 2011 10:55:35 +0200
>> From: Simon Pepping <sp...@leverkruid.eu>
>> To: fop-users@xmlgraphics.apache.org
>> Subject: Re: FOP and large documents (again)
> 
>> On Wed, Aug 03, 2011 at 10:23:48AM +0200, Stephan Thesing wrote:
>>> Looking at the code (as far as I understand it), for each page-sequence
>>> all KnuthElements are first computed by the layout managers;
>>> the sequence is split only at forced page breaks.
>>> Then possible page-break positions are searched for over the whole sequence.
>>>
>>> Only after this are the actual output areas computed and pages produced.
>>>
>>> Clearly, this doesn't scale for large page-sequences...
>>>
>>> Is there a reason why this approach was chosen, instead of computing
>>> KnuthElements "lazily" (on demand), putting them on the current page, and
>>> passing the page to the renderer as soon as it is filled?
>>
>> Both line breaking and page breaking use Knuth's total-fit algorithm.
>> The algorithm requires the complete content before it can be applied.
>> Clearly TeX does not do this; for page breaking it uses a best fit
>> approach.
>>
>> For FOP it would be better if it could apply either strategy, at the
>> demand of the user. But FOP is coded such that it first collects all
>> content, in the process doing all line breaking in paragraphs, before
>> it starts its page breaking algorithm. Therefore a best fit page
>> breaking algorithm does not solve the memory problem. Changing this so
>> that page breaking (best or total fit at the user's choice) is
>> considered while collecting content has proven too hard (or too
>> time-consuming) until now. See e.g.
>> http://svn.apache.org/viewvc/xmlgraphics/fop/branches/Temp_Interleaved_Page_Line_Breaking/.
>>
>> There is a best fit page breaking algorithm, which is mainly used for
>> cases with varying page widths. But it is a hack in the sense that it
>> throws away all collected content beyond the current page, and
>> restarts the process.
>>
>> So, help needed.
>>
>> Simon
>>
>>
> 

Re: FOP and large documents (again)

Posted by Stephan Thesing <th...@gmx.de>.
Hello,

indeed, as the code currently is, it will be hard to make this into a
first-fit layout algorithm for pages.

As the generation of KnuthElements (or ListElements) by the LayoutManagers
seems to be quite interwoven with tasks like page alignment and other things,
I would rather not adapt this to a "demand-driven" generation of elements as
needed by a first-fit approach.
Also, possible IPD changes between pages pose a problem (they are also a
problem for the current code, which is "not nice" in that case).

I would rather change the layout managers to produce KnuthElements in the way TeX does and leave page alignment to the page collection stage.
I see no other manageable way to do this than to add a new interface to the
layout managers for this demand-driven approach.
Essentially, this would result in an implementation of content generation
parallel to the existing getNextKnuthElements() and addAreas() interface.
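
To give an idea of the direction, the new interface could look roughly like
this (only a sketch; the interface and its methods are invented and do not
exist in FOP):

    // Rough sketch of a demand-driven element supply; names are invented.
    interface ElementSource {
        /** True as long as this layout manager can still produce elements. */
        boolean hasMoreElements();
        /** Produce the next chunk of elements for the given context,
         *  in particular for the IPD of the page currently being filled. */
        List<KnuthElement> nextElements(LayoutContext context);
    }
    // The page collection stage would pull from such sources until the current
    // page is full, decide on a break, build the areas, and hand the page to
    // the renderer before asking for more content.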

I can spend some effort on this, and since I clearly need a scalable
first-fit page layout, I will give it a try.

Best regards
   Stephan

PS: Is there any more in-depth documentation about the way the
layout managers work apart from the Wiki Pages?


-------- Original Message --------
> Date: Wed, 3 Aug 2011 10:55:35 +0200
> From: Simon Pepping <sp...@leverkruid.eu>
> To: fop-users@xmlgraphics.apache.org
> Subject: Re: FOP and large documents (again)

> On Wed, Aug 03, 2011 at 10:23:48AM +0200, Stephan Thesing wrote:
> > Looking at the code (as far as I understand it), for each page-sequence
> > all KnuthElements are first computed by the layout managers;
> > the sequence is split only at forced page breaks.
> > Then possible page-break positions are searched for over the whole sequence.
> > 
> > Only after this are the actual output areas computed and pages produced.
> > 
> > Clearly, this doesn't scale for large page-sequences...
> > 
> > Is there a reason why this approach was chosen, instead of computing
> > KnuthElements "lazily" (on demand), putting them on the current page, and
> > passing the page to the renderer as soon as it is filled?
> 
> Both line breaking and page breaking use Knuth's total-fit algorithm.
> The algorithm requires the complete content before it can be applied.
> Clearly TeX does not do this; for page breaking it uses a best fit
> approach.
> 
> For FOP it would be better if it could apply either strategy, at the
> demand of the user. But FOP is coded such that it first collects all
> content, in the process doing all line breaking in paragraphs, before
> it starts its page breaking algorithm. Therefore a best fit page
> breaking algorithm does not solve the memory problem. Changing this so
> that page breaking (best or total fit at the user's choice) is
> considered while collecting content has proven too hard (or too
> time-consuming) until now. See e.g.
> http://svn.apache.org/viewvc/xmlgraphics/fop/branches/Temp_Interleaved_Page_Line_Breaking/.
> 
> There is a best fit page breaking algorithm, which is mainly used for
> cases with varying page widths. But it is a hack in the sense that it
> throws away all collected content beyond the current page, and
> restarts the process.
> 
> So, help needed.
> 
> Simon
> 
> 

-- 
Dr.-Ing. Stephan Thesing
Elektrastr. 50
81925 München
GERMANY




Re: FOP and large documents (again)

Posted by Stephan Thesing <th...@gmx.de>.
Hi,

since I will need a scalable first-fit algorithm for page layout, I will
look into this (or have a student look at it).
I can live with the suboptimal page layout that first-fit will produce.
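
For the record, the first-fit strategy I have in mind is essentially the
obvious greedy loop (a sketch only; the helper classes are made up, and it
ignores proper break points, keeps and justification):

    // Greedy first-fit page filling; sketch only, helper classes are made up.
    Page page = newPage();
    int usedBpd = 0;
    for (KnuthElement el : contentElements()) {    // produced on demand
        int len = el.isPenalty() ? 0 : el.getWidth();
        if (usedBpd + len > page.availableBpd()) {
            renderer.render(page);                 // page is full: ship it out
            page = newPage();
            usedBpd = 0;
        }
        page.add(el);
        usedBpd += len;
    }
    renderer.render(page);                         // last, partially filled page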


Best regards
   Stephan

-------- Original Message --------
> Date: Wed, 3 Aug 2011 10:55:35 +0200
> From: Simon Pepping <sp...@leverkruid.eu>
> To: fop-users@xmlgraphics.apache.org
> Subject: Re: FOP and large documents (again)

> On Wed, Aug 03, 2011 at 10:23:48AM +0200, Stephan Thesing wrote:
> > Looking at the code (as far as I understand it), for each page-sequence
> > all KnuthElements are first computed by the layout managers;
> > the sequence is split only at forced page breaks.
> > Then possible page-break positions are searched for over the whole sequence.
> > 
> > Only after this are the actual output areas computed and pages produced.
> > 
> > Clearly, this doesn't scale for large page-sequences...
> > 
> > Is there a reason why this approach was chosen, instead of computing
> > KnuthElements "lazily" (on demand), putting them on the current page, and
> > passing the page to the renderer as soon as it is filled?
> 
> Both line breaking and page breaking use Knuth's total-fit algorithm.
> The algorithm requires the complete content before it can be applied.
> Clearly TeX does not do this; for page breaking it uses a best fit
> approach.
> 
> For FOP it would be better if it could apply either strategy, at the
> demand of the user. But FOP is coded such that it first collects all
> content, in the process doing all line breaking in paragraphs, before
> it starts its page breaking algorithm. Therefore a best fit page
> breaking algorithm does not solve the memory problem. Changing this so
> that page breaking (best or total fit at the user's choice) is
> considered while collecting content has proven too hard (or too
> time-consuming) until now. See e.g.
> http://svn.apache.org/viewvc/xmlgraphics/fop/branches/Temp_Interleaved_Page_Line_Breaking/.
> 
> There is a best fit page breaking algorithm, which is mainly used for
> cases with varying page widths. But it is a hack in the sense that it
> throws away all collected content beyond the current page, and
> restarts the process.
> 
> So, help needed.
> 
> Simon
> 
> 

-- 
Dr.-Ing. Stephan Thesing
Elektrastr. 50
81925 München
GERMANY




Re: FOP and large documents (again)

Posted by Simon Pepping <sp...@leverkruid.eu>.
On Wed, Aug 03, 2011 at 10:23:48AM +0200, Stephan Thesing wrote:
> Looking at the code (as far as I understand it), for each page-sequence
> all KnuthElements are first computed by the layout managers;
> the sequence is split only at forced page breaks.
> Then possible page-break positions are searched for over the whole sequence.
> 
> Only after this are the actual output areas computed and pages produced.
> 
> Clearly, this doesn't scale for large page-sequences...
> 
> Is there a reason why this approach was chosen, instead of computing KnuthElements "lazily" (on demand), putting them on the current page, and passing the page to the renderer as soon as it is filled?

Both line breaking and page breaking use Knuth's total-fit algorithm.
The algorithm requires the complete content before it can be applied.
Clearly TeX does not do this; for page breaking it uses a best fit
approach.

For FOP it would be better if it could apply either strategy, at the
demand of the user. But FOP is coded such that it first collects all
content, in the process doing all line breaking in paragraphs, before
it starts its page breaking algorithm. Therefore a best fit page
breaking algorithm does not solve the memory problem. Changing this so
that page breaking (best or total fit at the user's choice) is
considered while collecting content has proven too hard (or too
time-consuming) until now. See e.g.
http://svn.apache.org/viewvc/xmlgraphics/fop/branches/Temp_Interleaved_Page_Line_Breaking/.

There is a best fit page breaking algorithm, which is mainly used for
cases with varying page widths. But it is a hack in the sense that it
throws away all collected content beyond the current page, and
restarts the process.

So, help needed.

Simon
