You are viewing a plain text version of this content. The canonical link for it is here.

Posted to fop-dev@xmlgraphics.apache.org by Dario Laera <la...@cs.unibo.it> on 2008/10/27 18:36:42 UTC

Choosing a better threshold in line breaking

Hi all,

I found the reason why breaking paragraphs into short lines is really  
slow and memory hungry: the threshold of the adjustment ratio, set to  
20 at the last try, is too high and makes EVERY legal breakpoint a  
feasible breakpoint too. A check should be performed to avoid such  
situation and to choose then a better threshold.

For example, if I have a 2 columns in A4 page layout the line width is  
~140000. The glue stretchability, as far as I can see in TextLM class,  
is often set to "3 * LineLayoutManager.DEFAULT_SPACE_WIDTH" that is  
equal to ~10000. When you compute the adj ratio for a line that have  
just one glue you get r = 140000/10000 = 14, that is lower than the  
threshold = 20.0, so an active node is added.

A better threshold can be chosen as follow: let idealDifference be a  
reasonable size we choose as good threshold. We can assume "3 *  
LineLayoutManager.DEFAULT_SPACE_WIDTH" as default stretchability a  
compute a better threshold in that way:

	idealRatio = idealDifference / (3 *  
LineLayoutManager.DEFAULT_SPACE_WIDTH);

and bound that value:

	1.0 <= idealRatio <= 20.

How to choose idealDifference? A naive solution, but probably not so  
bad, can be:

	idealDifference = iLineWidth / 2;

A more sophisticated, maybe too much sophisticated, solution can  
choose it by looking at the average box length: we can see how many  
average box can fit a line (wordsPerLine) and execute:

	avgWord = avgBox + LineLayoutManager.DEFAULT_SPACE_WIDTH;
	idealDifference = iLineWidth - (avgWord * (wordsPerLine / 2));


The test I've run have showed sensible performance improvement.  
Comments are welcome!


Dario

Re: Choosing a better threshold in line breaking

Posted by Vincent Hennebert <vh...@gmail.com>.

Dario Laera wrote:
> 
> Il giorno 28/ott/08, alle ore 18:00, Andreas Delmelle ha scritto:
> 
>> Have you investigated whether the column-balancing approach is of
>> influence here? I mean: are there significant differences between
>> using one narrower column (the same width you would get when using two
>> columns, only it will roughly yield double the amount of pages).
> 
> A quick test haven't raised any difference except the fact that 2
> columns used slightly less memory, but I think it's not related to
> breaking as the number of active nodes is the same.

Column balancing chimes in only if there’s a block with span="all" in
the FO file. Otherwise, having two columns is roughly the same as having
a single column on twice narrower pages and I wouldn’t expect a big
difference in memory consumption, which is confirmed by your test.

Vincent

Re: Choosing a better threshold in line breaking

Posted by Andreas Delmelle <an...@telenet.be>.

On Oct 29, 2008, at 10:51, Dario Laera wrote:

>
> Il giorno 28/ott/08, alle ore 18:00, Andreas Delmelle ha scritto:
>
>> Have you investigated whether the column-balancing approach is of  
>> influence here? I mean: are there significant differences between  
>> using one narrower column (the same width you would get when using  
>> two columns, only it will roughly yield double the amount of pages).
>
> A quick test haven't raised any difference except the fact that 2  
> columns used slightly less memory, but I think it's not related to  
> breaking as the number of active nodes is the same.

OK. The fact that less memory is needed for 2 columns, can probably  
be explained by the number of pages being roughly doubled.

As Vincent pointed out, the main cause is then the fact that (3 *  
space-width) represents a bigger portion of short lines. Basing that  
width on the actual line-width indeed sounds like the right idea to  
begin with.

If no one beats me to it, I'll have a look at changing that one of  
the coming days.

Cheers

Andreas

Re: Choosing a better threshold in line breaking

Posted by Dario Laera <la...@cs.unibo.it>.

Il giorno 28/ott/08, alle ore 18:00, Andreas Delmelle ha scritto:

> Have you investigated whether the column-balancing approach is of  
> influence here? I mean: are there significant differences between  
> using one narrower column (the same width you would get when using  
> two columns, only it will roughly yield double the amount of pages).

A quick test haven't raised any difference except the fact that 2  
columns used slightly less memory, but I think it's not related to  
breaking as the number of active nodes is the same.

Re: Choosing a better threshold in line breaking

Posted by Andreas Delmelle <an...@telenet.be>.

On Oct 27, 2008, at 23:53, Dario Laera wrote:

> Il giorno 27/ott/08, alle ore 18:53, Andreas Delmelle ha scritto:
>
>> Do you mean that this last try is /always/ performed (even when we  
>> already have a set of feasible breaks)?
>
> It's not always performed (so it's formally correct), but in my  
> tests it's rarely avoided, more precisely just once, with the file  
> "my_franklin_rep-jus.fo" that is composed of many paragraph in 1  
> column with justified text. What I think (obviously, I may be  
> wrong, as it has been proved in other mails ;) is that another  
> intermediate try, with a judicious threshold, can be performed,  
> leading to the same final result but with much better performance,  
> if this intermediate try doesn't fail like the previous.

Have you investigated whether the column-balancing approach is of  
influence here? I mean: are there significant differences between  
using one narrower column (the same width you would get when using  
two columns, only it will roughly yield double the amount of pages).


Cheers

Andreas

Re: Choosing a better threshold in line breaking

Posted by Dario Laera <la...@cs.unibo.it>.

Il giorno 27/ott/08, alle ore 18:53, Andreas Delmelle ha scritto:

> Do you mean that this last try is /always/ performed (even when we  
> already have a set of feasible breaks)?

It's not always performed (so it's formally correct), but in my tests  
it's rarely avoided, more precisely just once, with the file  
"my_franklin_rep-jus.fo" that is composed of many paragraph in 1  
column with justified text. What I think (obviously, I may be wrong,  
as it has been proved in other mails ;) is that another intermediate  
try, with a judicious threshold, can be performed, leading to the same  
final result but with much better performance, if this intermediate  
try doesn't fail like the previous.
Anyway I always run my tests with hyphenation enabled, I should try  
disabling to see if the second try is run with threshold=5 and if this  
doesn't fails.

Dario

Re: Choosing a better threshold in line breaking

Posted by Andreas Delmelle <an...@telenet.be>.

On Oct 27, 2008, at 18:36, Dario Laera wrote:

> Hi all,
>
> I found the reason why breaking paragraphs into short lines is  
> really slow and memory hungry: the threshold of the adjustment  
> ratio, set to 20 at the last try, is too high and makes EVERY legal  
> breakpoint a feasible breakpoint too. A check should be performed  
> to avoid such situation and to choose then a better

Well, AFAIU, this is intended, BUT...
This 'last try' should /only/ occur in case the earlier tries did not  
yield any breaks.
IIC, it is really meant as a last resort, and should indeed lead to  
*any* possible break being considered (anything goes at that point).

Do you mean that this last try is /always/ performed (even when we  
already have a set of feasible breaks)?

Cheers

Andreas

Re: Choosing a better threshold in line breaking

Posted by Dario Laera <la...@cs.unibo.it>.

Il giorno 28/ott/08, alle ore 13:53, Vincent Hennebert ha scritto:

> If you could run statistics on more real-life documents (how often is
> the first run without hyphenation sufficient, the third run required,
> justified and left-aligned text, single / two-column on A4 paper,  
> etc),
> that would be fantastic.

I've run the examples in the repository with some debug info, you can  
find the refined output in the attachment. The interesting output  
lines are those with high "lines" value (to see when long paragraphs  
becomes difficult to break) and those following two consecutive "RETRY".
hyphen.fo was the most interesting case: it clearly states that even  
for medium paragraph (10 lines) th=1.0 plus hyphenation is not enough.  
This is a bit language dependent: italian paragraphs don't need to  
increase the threshold, I think this is due  to the fact that italian  
lang allows for more hyphenation points than other langs like english,  
but I think we shouldn't care about this issue. I tried then to format  
hyphen.fo using at the second try th=5.0, and it was always enough  
regardless of the alignment. Finally, I've formatted the same fo with  
hyphenation disabled and the result was mixed: sometimes the third  
attempt was necessary, some others not.
The franklin*.fo files contains paragraphs longer than hyphen.fo, but  
with hyphenation disabled, so those paragraphs gets broken at the  
second attempt even if they are start-aligned.
In inhprop.fo a center-aligned non-hyphenated paragraph 4 lines long  
fall down in the forced mode, changing the alignment would make the  
third attempt unnecessary.

The results of these tests can be summarized as follows:
  * non-hyphenated paragraphs are handled efficiently for both justify  
and start alignment as the second attempt is usually sufficient  
(steps: 1.0, 5.0, 20.0);
  * hyphenated paragraphs should benefit from a th=5.0 attempt that  
isn't performed (steps: 1.0, 1.0 + hyph, 20.0 + hyph);
  * center-aligned mid/long sized paragraphs are likely to need  
threshold higher than 5.0.

If you have a typical user xsl-fo file which behavior is worth to be  
examined send it to me, please.

Dario

Re: Choosing a better threshold in line breaking

Posted by Dario Laera <la...@cs.unibo.it>.

Il giorno 28/ott/08, alle ore 13:53, Vincent Hennebert ha scritto:
>> A more sophisticated, maybe too much sophisticated, solution can  
>> choose
>> it by looking at the average box length: we can see how many  
>> average box
>> can fit a line (wordsPerLine) and execute:
>>
>>    avgWord = avgBox + LineLayoutManager.DEFAULT_SPACE_WIDTH;
>>    idealDifference = iLineWidth - (avgWord * (wordsPerLine / 2));
>
> I’m not sure I’m following you here. What’s the value of wordsPerLine?
> Is is set manually to a value that’s considered to be a reasonable  
> one?
> Because if it’s computed automatically, the formula can be simplified:
>    wordsPerLine = lineWidth / avgWord, so
>    idealDifference = lineWidth - lineWidth / 2
>                    = lineWidth / 2

I compute wordsPerLine as you wrote but the simplified version is  
slightly different because using integers and not floats, so  
wordsPerLine * avgWord may be different from lineWidth. But I realize  
this precision is unnecessary and probably useless.
	

> Anyway, the adjustment ratio is already a notion that is independent  
> of
> the line width; that’s precisely the purpose of a ratio. In the case  
> of
> left-justified mode, the only available stretchability is due to the
> space at the end of the line; the question is to determine up to how
> much we accept that space to be...
> Ok, by writing that I think I know what you mean now :-) But the issue
> should probably be considered the other way around: the problem is not
> so much the adjustment ratio as the amount of space allowed at the end
> of the line. In the case of narrow columns, that “3 times the width of
> a space character” is too big WRT the line width. Instead of having
> a fixed value, it should be changed into a small proportion of the  
> line
> width.
> At the origin that 3 * space-width value was probably chosen for
> “normal” line widths, that is lines containing an optimal amount of
> words. I’ve read somewhere that the optimal number of letters per line
> is 60. Taking the Times font, the average width of lowercase letters  
> is
> 459, so the optimal line width roughly is 459*60 = 27540. The width of
> the space character is 250, so 3 times a space character at the end of
> a line makes 2.7% of that line. So let’s go for an elastic space of 3%
> the line width, and then we can always chose the same adjustment  
> ratio;
> the number of active nodes would be “automatically” limited, whatever
> the line width.

Good idea!

> The two-column case is not surprising: the columns are too narrow,  
> which
> makes line-breaking particularly challenging. The one-column
> left-justified case surprises me a bit, however. I would have expected
> that text could be broken without even needing hyphenation. I find it
> a bit ironical that justifying text actually is easier for the
> line-breaking algorithm...
> At any rate, that adjustment ratio of 20 for the last run is surely  
> too
> much. It can probably be reduced to 5. Actually, I’m not even sure
> a third run with a high adjustment ratio is desirable. Maybe we should
> simply re-run the algorithm in forcing mode, and accept the underfull
> lines that will be introduced.

I agree.

> If you could run statistics on more real-life documents (how often is
> the first run without hyphenation sufficient, the third run required,
> justified and left-aligned text, single / two-column on A4 paper,  
> etc),
> that would be fantastic.

I already performed this tests but with paragraphs that probably are  
larger than normal. I'll give you more realistic reports asap,  
possibly regarding the example fo files in the repository too.


Dario

Re: Choosing a better threshold in line breaking

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Dario,

This is an interesting study. There is probably room for improvement, as
an adjustment ratio of 20, even if this is the last resort, is really
high. More below.

Dario Laera wrote:
> Hi all,
> 
> I found the reason why breaking paragraphs into short lines is really
> slow and memory hungry: the threshold of the adjustment ratio, set to 20
> at the last try, is too high and makes EVERY legal breakpoint a feasible
> breakpoint too. A check should be performed to avoid such situation and
> to choose then a better threshold.
> 
> For example, if I have a 2 columns in A4 page layout the line width is
> ~140000. The glue stretchability, as far as I can see in TextLM class,
> is often set to "3 * LineLayoutManager.DEFAULT_SPACE_WIDTH" that is
> equal to ~10000. When you compute the adj ratio for a line that have
> just one glue you get r = 140000/10000 = 14, that is lower than the
> threshold = 20.0, so an active node is added.
> 
> A better threshold can be chosen as follow: let idealDifference be a
> reasonable size we choose as good threshold. We can assume "3 *
> LineLayoutManager.DEFAULT_SPACE_WIDTH" as default stretchability a
> compute a better threshold in that way:
> 
>     idealRatio = idealDifference / (3 *
> LineLayoutManager.DEFAULT_SPACE_WIDTH);
> 
> and bound that value:
> 
>     1.0 <= idealRatio <= 20.
> 
> How to choose idealDifference? A naive solution, but probably not so
> bad, can be:
> 
>     idealDifference = iLineWidth / 2;
> 
> A more sophisticated, maybe too much sophisticated, solution can choose
> it by looking at the average box length: we can see how many average box
> can fit a line (wordsPerLine) and execute:
> 
>     avgWord = avgBox + LineLayoutManager.DEFAULT_SPACE_WIDTH;
>     idealDifference = iLineWidth - (avgWord * (wordsPerLine / 2));

I’m not sure I’m following you here. What’s the value of wordsPerLine?
Is is set manually to a value that’s considered to be a reasonable one?
Because if it’s computed automatically, the formula can be simplified:
    wordsPerLine = lineWidth / avgWord, so
    idealDifference = lineWidth - lineWidth / 2
                    = lineWidth / 2

Anyway, the adjustment ratio is already a notion that is independent of
the line width; that’s precisely the purpose of a ratio. In the case of
left-justified mode, the only available stretchability is due to the
space at the end of the line; the question is to determine up to how
much we accept that space to be...
Ok, by writing that I think I know what you mean now :-) But the issue
should probably be considered the other way around: the problem is not
so much the adjustment ratio as the amount of space allowed at the end
of the line. In the case of narrow columns, that “3 times the width of
a space character” is too big WRT the line width. Instead of having
a fixed value, it should be changed into a small proportion of the line
width.
At the origin that 3 * space-width value was probably chosen for
“normal” line widths, that is lines containing an optimal amount of
words. I’ve read somewhere that the optimal number of letters per line
is 60. Taking the Times font, the average width of lowercase letters is
459, so the optimal line width roughly is 459*60 = 27540. The width of
the space character is 250, so 3 times a space character at the end of
a line makes 2.7% of that line. So let’s go for an elastic space of 3%
the line width, and then we can always chose the same adjustment ratio;
the number of active nodes would be “automatically” limited, whatever
the line width.

> > Do you mean that this last try is /always/ performed (even when we 
> > already have a set of feasible breaks)?
> 
> It's not always performed (so it's formally correct), but in my tests 
> it's rarely avoided, more precisely just once, with the file 
> "my_franklin_rep-jus.fo" that is composed of many paragraph in 1 column 
> with justified text. What I think (obviously, I may be wrong, as it has 
> been proved in other mails ;) is that another intermediate try, with 
> a judicious threshold, can be performed, leading to the same final 
> result but with much better performance, if this intermediate try 
> doesn't fail like the previous.
> Anyway I always run my tests with hyphenation enabled, I should try 
> disabling to see if the second try is run with threshold=5 and if this 
> doesn't fails.

The two-column case is not surprising: the columns are too narrow, which
makes line-breaking particularly challenging. The one-column
left-justified case surprises me a bit, however. I would have expected
that text could be broken without even needing hyphenation. I find it
a bit ironical that justifying text actually is easier for the
line-breaking algorithm...
At any rate, that adjustment ratio of 20 for the last run is surely too
much. It can probably be reduced to 5. Actually, I’m not even sure
a third run with a high adjustment ratio is desirable. Maybe we should
simply re-run the algorithm in forcing mode, and accept the underfull
lines that will be introduced.

If you could run statistics on more real-life documents (how often is
the first run without hyphenation sufficient, the third run required,
justified and left-aligned text, single / two-column on A4 paper, etc),
that would be fantastic.

Thanks,
Vincent