Posted to fop-dev@xmlgraphics.apache.org by Manuel Mall <ma...@apache.org> on 2006/12/19 15:55:10 UTC
UAX#14 implementation
Just a quick heads up that I finally took the plunge to add UAX#14 line
breaking to FOP. This is based on code donated by Joerg quite some time
ago on which I did some work in October 2005. This had been documented
on list at the time.
One of the major stumbling blocks in progressing this was the conflict
between the recursive / nested getNextKnuthElement calls and the need
to do the UAX#14 line breaking processing across inline boundaries.
In the end I decided, in the interest of making at least some progress
in this area, to not attempt the 'all singing all dancing solution',
but to simply apply this to the TextLayoutManager only. Yes, that gives
us only limited new functionality, but hopefully it's still an
improvement. Also, the code is based on the Unicode 4.1 standard and
not 5.0, but that can be fixed later.
It's looking OK so far and most of the layout engine tests pass. The
change consists of a new package org.apache.fop.text.linebreak
containing two classes and changes to the TextLayoutManager. Nothing
else has been touched so far.
It's not ready for a commit yet, but hopefully will be in a few days.
The question is whether this should go into the planned release, or
whether that is too risky and I should either wait with the commit
until the release is out or do it in a branch.
Another issue is that one of the two new files is actually generated by
a little Java program (also from Joerg) from Unicode data files. While
it would be a 'nice to have' for this generation to be integrated into
the FOP build I would initially commit the generated file into the
repository. To integrate the generation into the build we would either
need to have the Unicode data files in the Apache repository (not sure
about licensing issues here) or the build would need to fetch those
files, causing an external dependency which usually is a hassle for
people behind corporate firewalls etc. That's why I propose to apply
the KISS principle initially.
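
To make the generation step concrete, here is a minimal sketch of reading data in the LineBreak.txt layout (one code point or a first..last range, a semicolon, the line breaking class, and an optional # comment). This is not Joerg's generator: the class name LineBreakDataSketch, its Range type and the methods parse and classOf are made up for illustration, and the sample lines are hand-written in the file's format rather than read from the real data file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the generator mentioned above.
final class LineBreakDataSketch {

    // One entry of the data file: an inclusive code point range and its class.
    static final class Range {
        final int first;
        final int last;
        final String lbClass;

        Range(int first, int last, String lbClass) {
            this.first = first;
            this.last = last;
            this.lbClass = lbClass;
        }
    }

    // Parses lines such as "0020;SP # SPACE" or "0041..005A;AL # ...".
    static List<Range> parse(List<String> lines) {
        List<Range> ranges = new ArrayList<>();
        for (String raw : lines) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue; // blank line or comment-only line
            }
            String[] fields = line.split(";");
            String[] bounds = fields[0].trim().split("\\.\\.");
            int first = Integer.parseInt(bounds[0], 16);
            int last = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : first;
            ranges.add(new Range(first, last, fields[1].trim()));
        }
        return ranges;
    }

    // Linear lookup; a real generator would emit a compact table instead.
    static String classOf(List<Range> ranges, int codePoint) {
        for (Range r : ranges) {
            if (codePoint >= r.first && codePoint <= r.last) {
                return r.lbClass;
            }
        }
        return "XX"; // UAX#14 class for unknown characters
    }

    public static void main(String[] args) {
        List<Range> ranges = parse(Arrays.asList(
                "# hand-written sample in the LineBreak.txt layout",
                "0020;SP # SPACE",
                "002E;IS # FULL STOP",
                "0041..005A;AL # LATIN CAPITAL LETTER A..Z"));
        System.out.println(classOf(ranges, '.'));
        System.out.println(classOf(ranges, 'M'));
    }
}
```

Committing the generated file, as proposed, means a parser like this only has to run when the Unicode data changes, not on every build.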
Manuel
Re: UAX#14 implementation
Posted by Vincent Hennebert <vi...@anyware-tech.com>.
Nice work, Manuel! That will be a great addition to Fop.
I have never studied the problem in detail, so I can only give a general
opinion. But I think we should follow as closely as possible the Unicode
standard, even if that leads to behaviors incompatible with the current
one. It seems the Unicode standard is designed to nicely handle all
sorts of high-level typographical issues. It would be great to be able
to say "Fop is Unicode compliant". And users can refer to a well-known,
well-defined standard if they want to understand Fop's behavior or
achieve special effects.
So, by all means, go for it!
Vincent
Manuel Mall wrote:
> On Wednesday 20 December 2006 20:43, Manuel Mall wrote:
>> On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
>> <snip/>
>>
>>> It's looking OK so far and most of the layout engine tests pass.
>> Just discovered the first instance of an existing testcase which
>> gives a different result.
>
> Here is another one: The current FOP implementation treats spaces other
> than NBSP, e.g. U+2009 (Thin Space) and U+200A (Hair Space) as
> suppressible around line breaks. I believe that is incorrect as the
> spec explicitly limits whitespace handling to the normal space U+0020.
> The test case which shows that is block_white-space_4.xml. It tests for
> specific Knuth element sequences which are now different because these
> spaces are now treated as not suppressible.
>
> After making the appropriate adjustment to the checks in that testcase
> ALL testcases are now passing!
>
>> <snip/>
>>
> Manuel
Re: UAX#14 implementation
Posted by Manuel Mall <mm...@arcus.com.au>.
On Wednesday 20 December 2006 20:43, Manuel Mall wrote:
> On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
> <snip/>
>
> > It's looking OK so far and most of the layout engine tests pass.
>
> Just discovered the first instance of an existing testcase which
> gives a different result.
Here is another one: The current FOP implementation treats spaces other
than NBSP, e.g. U+2009 (Thin Space) and U+200A (Hair Space) as
suppressible around line breaks. I believe that is incorrect as the
spec explicitly limits whitespace handling to the normal space U+0020.
The test case which shows that is block_white-space_4.xml. It tests for
specific Knuth element sequences which are now different because these
spaces are now treated as not suppressible.
After making the appropriate adjustment to the checks in that testcase
ALL testcases are now passing!
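
As an illustration of the distinction only (this is not the TextLayoutManager code, and the names SuppressibleSpaceSketch, isSuppressible and suppressAtLineEdges are made up), the rule can be sketched as a trim that removes nothing but U+0020 at the edges of a line. Note that Java's String.strip() would not do here, because it also removes U+2009 and U+200A.

```java
// Hypothetical sketch: only the normal space U+0020 is suppressible.
final class SuppressibleSpaceSketch {

    // Assumption taken from the message above: no other space qualifies.
    static boolean isSuppressible(char c) {
        return c == '\u0020';
    }

    // Drops suppressible spaces at both edges of a line, nothing else.
    static String suppressAtLineEdges(String line) {
        int start = 0;
        int end = line.length();
        while (start < end && isSuppressible(line.charAt(start))) {
            start++;
        }
        while (end > start && isSuppressible(line.charAt(end - 1))) {
            end--;
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        // The trailing thin space (U+2009) survives, the normal spaces do not.
        String line = "  some text \u2009";
        String result = suppressAtLineEdges(line);
        System.out.println(result.length() + " of " + line.length() + " chars kept");
    }
}
```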
> <snip/>
>
Manuel
Re: UAX#14 implementation
Posted by Manuel Mall <mm...@arcus.com.au>.
On Wednesday 20 December 2006 23:22, Chris Bowditch wrote:
> Manuel Mall wrote:
> > On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
> > <snip/>
> >
> >>It's looking OK so far and most of the layout engine tests pass.
> >
> > Just discovered the first instance of an existing testcase which
> > gives a different result. Under UAX#14 the following text (Note
> > this is plain text not FO markup!):
> >
> > text-align="center" .conditionality="retain"
> > linefeed-treatment="preserve".
> >
> > which appears in inline_border_padding_conditionality_2.xml has
> > only a single break opportunity which is before the word
> > linefeed-treatment. The space between "center" and .conditionality
> > is not a break
>
> Interesting. Just to clarify: are you saying that in the previous
> release 0.92beta the line breaking code identified 2 BP but in 0.93
> just the one BP is identified?
>
Not quite. What I am saying is that the current FOP trunk identifies
two break points, whereas my local UAX#14 version of FOP identifies
only one. After looking through the UAX#14 specification, the
behaviour of my implementation appears to be correct.
> <snip/>
>
> Chris
Manuel
Re: UAX#14 implementation
Posted by Chris Bowditch <bo...@hotmail.com>.
Manuel Mall wrote:
> On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
> <snip/>
>
>>It's looking OK so far and most of the layout engine tests pass.
>
>
> Just discovered the first instance of an existing testcase which gives a
> different result. Under UAX#14 the following text (Note this is plain
> text not FO markup!):
>
> text-align="center" .conditionality="retain"
> linefeed-treatment="preserve".
>
> which appears in inline_border_padding_conditionality_2.xml has only a
> single break opportunity which is before the word linefeed-treatment.
> The space between "center" and .conditionality is not a break
Interesting. Just to clarify: are you saying that in the previous
release 0.92beta the line breaking code identified 2 BP but in 0.93 just
the one BP is identified?
<snip/>
Chris
Re: UAX#14 implementation
Posted by Manuel Mall <mm...@arcus.com.au>.
On Tuesday 19 December 2006 23:55, Manuel Mall wrote:
<snip/>
>
> It's looking OK so far and most of the layout engine tests pass.
Just discovered the first instance of an existing testcase which gives a
different result. Under UAX#14 the following text (Note this is plain
text not FO markup!):
text-align="center" .conditionality="retain"
linefeed-treatment="preserve".
which appears in inline_border_padding_conditionality_2.xml has only a
single break opportunity which is before the word linefeed-treatment.
The space between "center" and .conditionality is not a break
opportunity as it precedes a punctuation character (rule LB13). In our existing
code this space is a valid break opportunity and under the specific
circumstances this gives a different layout result.
I don't think this is actually a problem but it is a noticeable
difference. It just shows that UAX#14 is designed to break typical
written text and not programming language code which this text snippet
resembles.
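
The difference can be reproduced with a deliberately simplified sketch. It is neither the new org.apache.fop.text.linebreak code nor a UAX#14 implementation: it looks only at breaks after runs of U+0020 and applies only the "no break before closing punctuation, even after spaces" rule. The class name Lb13Sketch, the method breaksAfterSpaces and the hand-picked character list are illustrative assumptions, and the line feed in the snippet is written as a space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the LB13 effect on spaces only.
final class Lb13Sketch {

    // Hand-picked sample of characters from the classes LB13 names
    // (CL, EX, IS, SY); the real classes contain many more characters.
    private static final String NO_BREAK_BEFORE = ")]}!?.,:;/";

    // Returns the offsets at which a line may start after a run of spaces.
    static List<Integer> breaksAfterSpaces(String text, boolean applyLb13) {
        List<Integer> breaks = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            char prev = text.charAt(i - 1);
            char cur = text.charAt(i);
            if (prev != ' ' || cur == ' ') {
                continue; // only consider the end of a run of spaces
            }
            if (applyLb13 && NO_BREAK_BEFORE.indexOf(cur) >= 0) {
                continue; // no break before this character, even after spaces
            }
            breaks.add(i);
        }
        return breaks;
    }

    public static void main(String[] args) {
        String s = "text-align=\"center\" .conditionality=\"retain\" "
                + "linefeed-treatment=\"preserve\".";
        // Without the rule: a break before ".conditionality" and "linefeed".
        System.out.println(breaksAfterSpaces(s, false));
        // With the rule: only the break before "linefeed" remains.
        System.out.println(breaksAfterSpaces(s, true));
    }
}
```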
<snip/>
> Manuel
Manuel