Posted to fop-dev@xmlgraphics.apache.org by Manuel Mall <ma...@apache.org> on 2006/12/19 15:55:10 UTC

UAX#14 implementation

Just a quick heads-up that I finally took the plunge and added UAX#14 line 
breaking to FOP. This is based on code donated by Joerg quite some time 
ago, on which I did some work in October 2005. This was documented 
on the list at the time.

One of the major stumbling blocks in progressing this was the conflict 
between the recursive / nested getNextKnuthElement calls and the need 
to do the UAX#14 line breaking processing across inline boundaries.

In the end I decided, in the interest of making at least some progress 
in this area, not to attempt the 'all singing, all dancing' solution, 
but simply to apply this to the TextLayoutManager only. Yes, that gives 
us only limited new functionality, but hopefully it's still an 
improvement. Also, the code is based on the Unicode 4.1 standard and 
not 5.0, but that can be fixed later.

It's looking OK so far and most of the layout engine tests pass. The 
change consists of a new package org.apache.fop.text.linebreak 
containing two classes and changes to the TextLayoutManager. Nothing 
else has been touched so far.

It's not ready for a commit yet, but hopefully it will be in a few days.

The question that arises is whether this should go into the planned 
release, or whether that is too risky and I should hold the commit until 
the release is out, or do it in a branch.

Another issue is that one of the two new files is actually generated by 
a little Java program (also from Joerg) from Unicode data files. While 
it would be a 'nice to have' for this generation to be integrated into 
the FOP build, I would initially commit the generated file into the 
repository. To integrate the generation into the build we would either 
need to have the Unicode data files in the Apache repository (not sure 
about licensing issues here), or the build would need to fetch those 
files, creating an external dependency, which usually is a hassle for 
people behind corporate firewalls etc. That's why I propose to apply 
the KISS principle initially.

Manuel

Re: UAX#14 implementation

Posted by Vincent Hennebert <vi...@anyware-tech.com>.
Nice work, Manuel! That will be a great addition to FOP.

I have never studied the problem in detail, so I can only give a general
opinion. But I think we should follow as closely as possible the Unicode
standard, even if that leads to behaviors incompatible with the current
one. It seems the Unicode standard is designed to nicely handle all
sorts of high-level typographical issues. It would be great to be able
to say "FOP is Unicode compliant". And users can refer to a well-known,
well-defined standard if they want to understand FOP's behavior or
achieve special effects.

So, by all means, go for it!

Vincent


Manuel Mall wrote:
> On Wednesday 20 December 2006 20:43, Manuel Mall wrote:
>> On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
>> <snip/>
>>
>>> It's looking OK so far and most of the layout engine tests pass.
>> Just discovered the first instance of an existing testcase which
>> gives a different result.
> 
> Here is another one: The current FOP implementation treats spaces other 
> than NBSP, e.g. U+2009 (Thin Space) and U+200A (Hair Space) as 
> suppressible around line breaks. I believe that is incorrect as the 
> spec explicitly limits whitespace handling to the normal space U+0020. 
> The test case which shows that is block_white-space_4.xml. It tests for 
> specific Knuth element sequences which are now different because these 
> spaces are now treated as not suppressible.
> 
> After making the appropriate adjustment to the checks in that testcase 
> ALL testcases are now passing!
> 
>> <snip/>
>>
> Manuel

Re: UAX#14 implementation

Posted by Manuel Mall <mm...@arcus.com.au>.
On Wednesday 20 December 2006 20:43, Manuel Mall wrote:
> On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
> <snip/>
>
> > It's looking OK so far and most of the layout engine tests pass.
>
> Just discovered the first instance of an existing testcase which
> gives a different result.

Here is another one: The current FOP implementation treats spaces other 
than NBSP, e.g. U+2009 (Thin Space) and U+200A (Hair Space) as 
suppressible around line breaks. I believe that is incorrect as the 
spec explicitly limits whitespace handling to the normal space U+0020. 
The test case which shows that is block_white-space_4.xml. It tests for 
specific Knuth element sequences which are now different because these 
spaces are now treated as not suppressible.

After making the appropriate adjustment to the checks in that testcase 
ALL testcases are now passing!
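
To illustrate the distinction, here is a minimal Java sketch, not the 
actual TextLayoutManager code. The class and method names (SpaceSketch, 
isSuppressible, trimAtLineBreak) are made up for this example, and it 
assumes that only U+0020 is removed at a line break while U+2009 and 
U+200A are kept.

```java
/** Illustration only; names and behaviour are assumptions, not FOP code. */
final class SpaceSketch {

    /** Assumption: only the normal space U+0020 is suppressible. */
    public static boolean isSuppressible(char c) {
        return c == '\u0020';
    }

    /** Strips suppressible spaces from both ends of a broken line. */
    public static String trimAtLineBreak(String line) {
        int start = 0;
        int end = line.length();
        while (start < end && isSuppressible(line.charAt(start))) {
            start++;
        }
        while (end > start && isSuppressible(line.charAt(end - 1))) {
            end--;
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        // A normal space at the start, a thin space (U+2009) at the end.
        String line = " word\u2009";
        String trimmed = trimAtLineBreak(line);
        // Print the remaining length so the invisible thin space is visible.
        System.out.println(trimmed.length());
    }
}
```

With this rule a thin space or hair space at the edge of a line survives 
the break, which is why the expected Knuth element sequences in the 
testcase had to change.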

> <snip/>
>
Manuel

Re: UAX#14 implementation

Posted by Manuel Mall <mm...@arcus.com.au>.
On Wednesday 20 December 2006 23:22, Chris Bowditch wrote:
> Manuel Mall wrote:
> > On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
> > <snip/>
> >
> > >>It's looking OK so far and most of the layout engine tests pass.
> >
> > Just discovered the first instance of an existing testcase which
> > gives a different result. Under UAX#14 the following text (Note
> > this is plain text not FO markup!):
> >
> > text-align="center" .conditionality="retain"
> > linefeed-treatment="preserve".
> >
> > which appears in inline_border_padding_conditionality_2.xml has
> > only a single break opportunity which is before the word
> > linefeed-treatment. The space between "center" and .conditionality
> > is not a break
>
> Interesting. Just to clarify: are you saying that in the previous
> release 0.92beta the line breaking code identified 2 BP but in 0.93
> just the one BP is identified?
>
Not quite - what I am saying is that in the current FOP trunk version 2 
break points are identified but in my local UAX#14 version of FOP only 
one break point is identified. After looking through the UAX#14 
specification, the behaviour of my implementation appears to be correct.

> <snip/>
>
> Chris

Manuel

Re: UAX#14 implementation

Posted by Chris Bowditch <bo...@hotmail.com>.
Manuel Mall wrote:

> On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
> <snip/>
> 
>>It's looking OK so far and most of the layout engine tests pass. 
> 
> 
> Just discovered the first instance of an existing testcase which gives a 
> different result. Under UAX#14 the following text (Note this is plain 
> text not FO markup!):
> 
> text-align="center" .conditionality="retain" 
> linefeed-treatment="preserve".
> 
> which appears in inline_border_padding_conditionality_2.xml has only a 
> single break opportunity which is before the word linefeed-treatment. 
> The space between "center" and .conditionality is not a break 

Interesting. Just to clarify: are you saying that in the previous 
release 0.92beta the line breaking code identified 2 BP but in 0.93 just 
the one BP is identified?

<snip/>

Chris




Re: UAX#14 implementation

Posted by Manuel Mall <mm...@arcus.com.au>.
On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>
>
> It's looking OK so far and most of the layout engine tests pass. 

Just discovered the first instance of an existing testcase which gives a 
different result. Under UAX#14 the following text (Note this is plain 
text not FO markup!):

text-align="center" .conditionality="retain" 
linefeed-treatment="preserve".

which appears in inline_border_padding_conditionality_2.xml has only a 
single break opportunity which is before the word linefeed-treatment. 
The space between "center" and .conditionality is not a break 
opportunity as it precedes a punctuation character (Rule LB13). In our 
existing code this space is a valid break opportunity and under the 
specific circumstances this gives a different layout result.

I don't think this is actually a problem but it is a noticeable 
difference. It just shows that UAX#14 is designed to break typical 
written text and not programming language code which this text snippet 
resembles.
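
As a rough illustration of the effect, the following Java sketch looks 
only at spaces and refuses a break when the next character is one of a 
small, hand-picked set of punctuation characters. It is not the UAX#14 
pair table and not the code in org.apache.fop.text.linebreak; the class 
name, the method name and the character set are assumptions made for 
this example, and the two lines of the snippet are assumed to be joined 
by a single space.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration only; not the FOP or UAX#14 algorithm. */
final class BreakSketch {

    // Hypothetical, heavily reduced stand-in for the characters before
    // which no break is allowed, even after spaces.
    private static final String NO_BREAK_BEFORE = ")]}!?.,:;/";

    /** Returns the indices at which a new line could start after a space. */
    public static List<Integer> breakOpportunities(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            char prev = text.charAt(i - 1);
            char cur = text.charAt(i);
            // A space followed by a non-space is a candidate, unless the
            // following character is in the no-break-before set.
            if (prev == ' ' && cur != ' ' && NO_BREAK_BEFORE.indexOf(cur) < 0) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "text-align=\"center\" .conditionality=\"retain\" "
                + "linefeed-treatment=\"preserve\".";
        for (int pos : breakOpportunities(text)) {
            System.out.println(pos + ": " + text.substring(pos));
        }
    }
}
```

For the snippet above this sketch reports a single position, the one 
where linefeed-treatment starts. Code that treats every space as a break 
opportunity would also report the position before .conditionality.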

<snip/>
> Manuel

Manuel