Posted to fop-dev@xmlgraphics.apache.org by Manuel Mall <mm...@arcus.com.au> on 2005/10/31 08:25:12 UTC

Unicode compliant Line Breaking

In a previous post Joerg pointed to the Unicode Standard Annex #14 on 
Line Breaking (http://www.unicode.org/reports/tr14/) and his initial 
implementation: http://people.apache.org/~pietsch/linebreak.tar.gz.

I have since had a closer look at both UAX#14 and Joerg's code. Because I 
liked what I saw I went about adapting Joerg's code to Unicode 4.1 
and added fairly extensive JUnit test cases to it mainly because it 
really helps to go through the various different cases mentioned in the 
spec in some structured fashion.

The results are now available for public inspection: 
http://people.apache.org/~manuel/fop/linebreak.tar.gz

1. I would like to propose that Unicode conformant line breaking be 
integrated into FOP trunk because it:
a) Moves FOP more towards being a universal formatter and not just a 
formatter for western languages
b) Moves FOP more towards becoming a high quality typesetting system 
(something that was really started by integrating Knuth style breaking)
The reason I think this needs to be voted on is because Unicode line 
breaking will in subtle ways change the current line breaking behaviour 
and therefore constitutes a (significant) change in FOP's overall 
rendering.

2. I would also like to propose that the Unicode conformant line 
breaking be implemented using our own pair-table based implementation 
and not using Java's line breaker, because:
a) It gives us full control and allows FOP to follow the Unicode 
standard (and its updates and errata) closely and therefore keep FOP's 
Unicode compliance level independent of the Java version.
b) It allows us to tailor the algorithm to match the needs of XSL-FO and 
FOP.
c) It allows us to provide user customisation features (down the track) 
not available through using the Java APIs.

Of course there are downsides, like:
a) Are we falling for the 'not invented here' syndrome?
b) Duplicating code which is already in the Java base system
c) Increasing the memory footprint of FOP
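
For comparison, the Java line breaker referred to in point 2 is reachable
through java.text.BreakIterator. The following is only a minimal sketch:
the class and method names are invented for illustration and are not part
of FOP, and which rules the JDK applies depends on the Java version in use
(the dependency described in 2a).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not FOP code: lists the boundaries reported by the
// JDK's built-in line breaker for a piece of text.
class JdkLineBreakDemo {

    static List<Integer> breakPositions(String text) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> positions = new ArrayList<>();
        // first() is always 0; iteration ends when next() returns DONE.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(breakPositions("Unicode conformant line breaking"));
    }
}
```

The returned list includes the start and end of the text, so the actual
break opportunities are the values in between.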

3. Assuming we get enough +1 for the above proposals the first item to 
decide after that would be: Where should the code live?
a) Joerg would like to see it in Jakarta Commons but hasn't got the time 
to start the project. 
b) Jeremias suggested XMLGraphics Commons. 
c) Personally I think it is too early to factor it out. More experience 
with its design and use cases should be gathered before making it 
standalone and at this point in time it really is only 2 core Java 
classes. I would like to suggest that it initially lives under FOP in 
something like org.apache.fop.text. Should the need and energy levels 
(= developer enthusiasm) become available later to make this into a 
Jakarta Commons or XMLGraphics Commons project so be it.

Assuming now that this will be agreed as well the next step would be the 
more detailed design of the integration. But this is well beyond the 
scope of this e-mail as there are some tricky issues involved and they 
probably need to be tackled in conjunction with the white space 
handling issues. Many of the problems are related to our LayoutManager 
structures which create barriers when it comes to the need to process 
character sequences across those boundaries as is the case for both 
line breaking and white space handling. Add to that the design of the 
different Knuth sequences required to model the different break cases 
in conjunction with conditional border/padding and white space removal 
around line breaking and different types of line justifications and 
there is some real work ahead.

Cheers

Manuel

Should add my votes:

1.) +1
2.) +1
3.c) +1

Re: Unicode compliant Line Breaking

Posted by Manuel Mall <mm...@arcus.com.au>.

On Tue, 1 Nov 2005 07:27 pm, thomas.deweese@kodak.com wrote:
> Hi Manuel,
>
> Manuel Mall <mm...@arcus.com.au> wrote on 11/01/2005 04:24:05 AM:
> > On Tue, 1 Nov 2005 01:33 am, thomas.deweese@kodak.com wrote:
<snip> 
> > Had a look at the Batik code: Same algorithm as Joerg wrote (not
> > surprising as UAX#14 actually contains real C code) very similar
> > data structures internally. Data structures are hard coded and not
> > generated from the Unicode text files.
>
>    I would not think it would be worth the while to parse the Unicode
> files on startup every time (they aren't small).  Passing in the
> table mapping chars to types might be a useful extension (but in
> honesty I doubt .5% of users would ever provide their own, unless the
> code only included say Western Language by default).
Sorry, not very well explained on my part. Joerg's code includes a Java 
code generator that builds the tables from the Unicode text files. This 
is something that would be done at product build time not each time on 
startup. It just makes it easier IMO to maintain the data in sync with 
the Unicode standard.
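
To illustrate what such a build-time step might look like, here is a
sketch only. It is not Joerg's generator: the class name and the layout of
the emitted table are invented here, and the input is assumed to follow
the "range;class # comment" layout of the Unicode LineBreak.txt data file.

```java
import java.util.List;

// Hypothetical build-time generator sketch: turns LineBreak.txt style lines
// into Java source for a hard-coded range table.
class LineBreakTableGen {

    // One data entry: an inclusive code point range and its line break class.
    static final class Range {
        final int start;
        final int end;
        final String lbClass;

        Range(int start, int end, String lbClass) {
            this.start = start;
            this.end = end;
            this.lbClass = lbClass;
        }
    }

    // Parses e.g. "0041..005A;AL # comment"; returns null for blank or comment-only lines.
    static Range parseLine(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return null;
        }
        String[] fields = data.split(";");
        String[] bounds = fields[0].trim().split("\\.\\.");
        int start = Integer.parseInt(bounds[0], 16);
        int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
        return new Range(start, end, fields[1].trim());
    }

    // Emits the Java source text that would be compiled into the product.
    static String generate(List<String> lines) {
        StringBuilder src = new StringBuilder("static final Object[][] RANGES = {\n");
        for (String line : lines) {
            Range r = parseLine(line);
            if (r != null) {
                src.append(String.format("    {0x%04X, 0x%04X, \"%s\"},\n",
                        r.start, r.end, r.lbClass));
            }
        }
        return src.append("};\n").toString();
    }
}
```

Run as part of the build, the output would be written to a source file and
compiled, so nothing is parsed at startup.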

<snip/>

Manuel

Re: Unicode compliant Line Breaking

Posted by th...@kodak.com.

Hi Manuel,

Manuel Mall <mm...@arcus.com.au> wrote on 11/01/2005 04:24:05 AM:

> On Tue, 1 Nov 2005 01:33 am, thomas.deweese@kodak.com wrote:
> >         Just an FYI, Batik also currently has an implementation of
> > the Unicode TR14 word breaking alg.
> > (org.apache.batik.gvt.flow.TextLineBreak).

> Thomas, thanks for the pointer (Note to myself - need to become more 
> aware of what's in the Batik code base. Feeble excuse - Joerg didn't 
> seem to know either).

    It's a fairly recent addition, to support proposals for flowing 
text in SVG 1.2.

> Had a look at the Batik code: Same algorithm as Joerg wrote (not 
> surprising as UAX#14 actually contains real C code) very similar data 
> structures internally. Data structures are hard coded and not generated 
> from the Unicode text files. 

   I would not think it would be worth the while to parse the Unicode
files on startup every time (they aren't small).  Passing in the table
mapping chars to types might be a useful extension (but in honesty
I doubt .5% of users would ever provide their own, unless the code
only included say Western Language by default).

> The API is different, especially it relies 
> on Batik specific types being passed across not just plain Strings (but 
> this could probably be handled by a wrapper).

   AttributedString (the type passed across the interface) is a
JDK class: java.text.AttributedString.  We do define new attributes
(keys) to hang the word break info off of.
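
   As a rough illustration of hanging break info off an
AttributedString: the attribute key below is hypothetical and is not
the key Batik actually defines, and the helper names are made up for
this sketch.

```java
import java.text.AttributedCharacterIterator;
import java.text.AttributedString;

class BreakAttributeDemo {

    // Hypothetical attribute key. Attribute has a protected constructor,
    // so custom keys are defined by subclassing it.
    static final class BreakAttr extends AttributedCharacterIterator.Attribute {
        static final BreakAttr BREAK_AFTER = new BreakAttr("break-after");

        private BreakAttr(String name) {
            super(name);
        }
    }

    // Marks the character at 'index' as one after which a line break is allowed.
    static AttributedString markBreakAfter(String text, int index) {
        AttributedString as = new AttributedString(text);
        as.addAttribute(BreakAttr.BREAK_AFTER, Boolean.TRUE, index, index + 1);
        return as;
    }

    // Reads the marker back through the standard iterator interface.
    static boolean hasBreakAfter(AttributedString as, int index) {
        AttributedCharacterIterator it = as.getIterator();
        it.setIndex(index);
        return Boolean.TRUE.equals(it.getAttribute(BreakAttr.BREAK_AFTER));
    }
}
```

   A caller that only has plain Strings could wrap them this way
before handing them across the interface.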

> This probably strengthens the argument of making all of this part of 
> XMLGraphics Commons....grumble...grumble...

   Yes, this is mostly why I mentioned it.  On the other hand the
code is not that large or really overly complex.

> My main reason for hesitation with the XMLGraphics Commons approach is 
> simple manpower. We need to set up the infrastructure (subversion, 
> mailing lists, web site, etc.). We need to maintain this. 

   Sure, some of this will happen anyway because of the current 
problems we have with the PDFTranscoder (Batik depends on FOP which
depends on Batik :( ).  Those dependencies need to be straightened
out.

> We would basically publish APIs currently internal to Batik 
> and FOP with all the resultant support headaches. For example, 
> I would not like to see my time diluted at the moment by having 
> to discuss API needs outside of FOP/Batik. 

   Yes, this is the big issue, as soon as an API becomes public
it is a lot more work to maintain it. 

> Actually I am reluctant to even dive into the Batik code base 
> at the moment. FOP is complicated enough to digest.

   The hope is that by exposing some of these API's we will 
attract some people as contributors that would otherwise be 
'scared off' by the size and complexity of the FOP and
Batik code bases.

Re: Unicode compliant Line Breaking

Posted by Manuel Mall <mm...@arcus.com.au>.

On Tue, 1 Nov 2005 01:33 am, thomas.deweese@kodak.com wrote:
> Hi all,
>
>         Just an FYI, Batik also currently has an implementation of
> the Unicode TR14 word breaking alg.
> (org.apache.batik.gvt.flow.TextLineBreak).
>
>         As far as performance is concerned it should be fairly fast
> as it is mostly just table based.
>
Thomas, thanks for the pointer (Note to myself - need to become more 
aware of what's in the Batik code base. Feeble excuse - Joerg didn't 
seem to know either).
Had a look at the Batik code: Same algorithm as Joerg wrote (not 
surprising as UAX#14 actually contains real C code) very similar data 
structures internally. Data structures are hard coded and not generated 
from the Unicode text files. The API is different, especially it relies 
on Batik specific types being passed across not just plain Strings (but 
this could probably be handled by a wrapper).

This probably strengthens the argument of making all of this part of 
XMLGraphics Commons....grumble...grumble...

My main reason for hesitation with the XMLGraphics Commons approach is 
simple manpower. We need to set up the infrastructure (subversion, 
mailing lists, web site, etc.). We need to maintain this. We would 
basically publish APIs currently internal to Batik and FOP with 
all the resultant support headaches. For example, I would not like to 
see my time diluted at the moment by having to discuss API needs 
outside of FOP/Batik. Actually I am reluctant to even dive into the 
Batik code base at the moment. FOP is complicated enough to digest.

Hmmm... not sure where to go from here.

Manuel

Re: Unicode compliant Line Breaking

Posted by th...@kodak.com.

Hi all,

        Just an FYI, Batik also currently has an implementation of the
Unicode TR14 word breaking alg. (org.apache.batik.gvt.flow.TextLineBreak).

        As far as performance is concerned it should be fairly fast as it
is mostly just table based.

The Web Maestro <th...@gmail.com> wrote on 10/31/2005 11:04:54 AM:

> IMO, Unicode conformant line-breaking is an important goal for FOP to 
> achieve. But before I vote, I have a question:
> 
> On Oct 30, 2005, at 11:25 PM, Manuel Mall wrote:
> 
> <snip>
> 
> > 2. I would also like to propose that the Unicode conformant line
> > breaking be implemented using our own pair-table based implementation
> > and not using Java's line breaker, because:
> 
> Does it make sense to make the use of our own implementation over Java's 
> 'configurable'? That way, our users could choose whether or not to use 
> it. In my case, we had no need for Unicode, and IIC the extra code 
> would merely serve to hinder FOP's performance & increase FOP's memory 
> footprint (unless it's only called when using Unicode). In addition, a 
> future Java implementation could bring a robust (and maintained) 
> Unicode solution.
> 
> <snip>
> 
> > 3. Assuming we get enough +1 for the above proposals the first item to
> > decide after that would be: Where should the code live?
> > a) Joerg would like to see it in Jakarta Commons but hasn't got the time
> > to start the project.
> > b) Jeremias suggested XMLGraphics Commons.
> > c) Personally I think it is too early to factor it out. More experience
> > with its design and use cases should be gathered before making it
> > standalone and at this point in time it really is only 2 core Java
> > classes. I would like to suggest that it initially lives under FOP in
> > something like org.apache.fop.text. Should the need and energy levels
> > (= developer enthusiasm) become available later to make this into a
> > Jakarta Commons or XMLGraphics Commons project so be it.
> 
> I would think it would be best to start it under XML Graphics Commons 
> (as that's where I suspect it will likely end up), and move it if 
> necessary from there.
> 
> Regards,
> 
> Web Maestro Clay
> -- 
> <th...@gmail.com> - <http://homepage.mac.com/webmaestro/>
> My religion is simple. My religion is kindness.
> - HH The 14th Dalai Lama of Tibet
>

Re: Unicode compliant Line Breaking

Posted by The Web Maestro <th...@gmail.com>.

IMO, Unicode conformant line-breaking is an important goal for FOP to 
achieve. But before I vote, I have a question:

On Oct 30, 2005, at 11:25 PM, Manuel Mall wrote:

<snip>

> 2. I would also like to propose that the Unicode conformant line
> breaking be implemented using our own pair-table based implementation
> and not using Java's line breaker, because:

Does it make sense to make the use of our own implementation over Java's 
'configurable'? That way, our users could choose whether or not to use 
it. In my case, we had no need for Unicode, and IIC the extra code 
would merely serve to hinder FOP's performance & increase FOP's memory 
footprint (unless it's only called when using Unicode). In addition, a 
future Java implementation could bring a robust (and maintained) 
Unicode solution.

<snip>

> 3. Assuming we get enough +1 for the above proposals the first item to
> decide after that would be: Where should the code live?
> a) Joerg would like to see it in Jakarta Commons but hasn't got the time
> to start the project.
> b) Jeremias suggested XMLGraphics Commons.
> c) Personally I think it is too early to factor it out. More experience
> with its design and use cases should be gathered before making it
> standalone and at this point in time it really is only 2 core Java
> classes. I would like to suggest that it initially lives under FOP in
> something like org.apache.fop.text. Should the need and energy levels
> (= developer enthusiasm) become available later to make this into a
> Jakarta Commons or XMLGraphics Commons project so be it.

I would think it would be best to start it under XML Graphics Commons 
(as that's where I suspect it will likely end up), and move it if 
necessary from there.

Regards,

Web Maestro Clay
-- 
<th...@gmail.com> - <http://homepage.mac.com/webmaestro/>
My religion is simple. My religion is kindness.
- HH The 14th Dalai Lama of Tibet

Re: Unicode compliant Line Breaking

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Simon Pepping wrote:
> I mean, will our current method of finding possible line breaking
> points using the hyphenation tables be part of a TR14 compliant system
> to find line break opportunities?

In some sense yes, but I'm not sure what you really mean.

Currently, spaces and slashes ("/") as well as hyphenation points
are considered break opportunities. TR14 doesn't care about hyphenation
but expands significantly on the other points. For example, in the
string "foo-bar" the position after the dash is a break opportunity,
as people usually expect, but in -1234 the position after the dash
isn't a break opportunity, also as people usually expect. The TR
encodes as much of such expectations as is possible with a limited
context.

A few places in TextLayoutManager which use BREAK_CHARS will have to
be changed, either keeping info from a previous scanning using a
BreakIterator or something, or looking up the line break Unicode
properties and looking up whether a break may occur in the
line-break matrix. Hyphenation points are generated elsewhere and
remain unaffected.
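
To make the matrix lookup concrete, here is a toy sketch. It knows only
four line break classes, its classifier and table are simplifications
invented for this example, and it is neither the UAX#14 pair table nor
code from the linebreak tarball. It does reproduce the two dash cases
above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy pair-table breaker. A real table covers all UAX#14 classes and also
// distinguishes indirect breaks (breaks that need intervening spaces).
class PairTableSketch {

    static final int AL = 0; // alphabetic
    static final int NU = 1; // numeric
    static final int HY = 2; // hyphen-minus
    static final int SP = 3; // space

    // BREAK[before][after]: true if a break may occur between the two classes.
    static final boolean[][] BREAK = {
        //           AL     NU     HY     SP
        /* AL */ { false, false, false, false },
        /* NU */ { false, false, false, false },
        /* HY */ { true,  false, false, false }, // after a dash, but not before digits
        /* SP */ { true,  true,  true,  false }, // after a run of spaces
    };

    // Crude stand-in for looking up the Unicode line break property.
    static int classify(char c) {
        if (c == ' ') {
            return SP;
        }
        if (c == '-') {
            return HY;
        }
        if (Character.isDigit(c)) {
            return NU;
        }
        return AL; // simplification: everything else counts as alphabetic
    }

    // Returns every index i at which a break before text.charAt(i) is allowed.
    static List<Integer> breakOpportunities(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            int before = classify(text.charAt(i - 1));
            int after = classify(text.charAt(i));
            if (BREAK[before][after]) {
                result.add(i);
            }
        }
        return result;
    }
}
```

With this table "foo-bar" yields a single opportunity after the dash and
"-1234" yields none.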

J.Pietschmann

Re: Unicode compliant Line Breaking

Posted by Simon Pepping <sp...@leverkruid.nl>.

On Tue, Nov 01, 2005 at 11:17:08PM +0100, J.Pietschmann wrote:
> Simon Pepping wrote:
> >Is our current hyphenation method a subset of Unicode's method?
> 
> Umm. What's the relation between hyphenation and TR14 (except for
> handling soft hyphens)? I guess you confuse finding line breaks
> in general and line breaking due to hyphenation.

I mean, will our current method of finding possible line breaking
points using the hyphenation tables be part of a TR14 compliant system
to find line break opportunities?

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl

Re: Unicode compliant Line Breaking

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Simon Pepping wrote:
> Is our current hyphenation method a subset of Unicode's method?

Umm. What's the relation between hyphenation and TR14 (except for
handling soft hyphens)? I guess you confuse finding line breaks
in general and line breaking due to hyphenation.

> I seem to recall that the hyphenation code collects words across LM
> boundaries.

As it should. Word boundaries and FO boundaries are different things:
  <block>A w<wrapper text-decoration="underline">o</wrapper>rd</block>

J.Pietschmann

Re: Unicode compliant Line Breaking

Posted by Simon Pepping <sp...@leverkruid.nl>.

On Mon, Oct 31, 2005 at 03:25:12PM +0800, Manuel Mall wrote:
> In a previous post Joerg pointed to the Unicode Standard Annex #14 on 
> Line Breaking (http://www.unicode.org/reports/tr14/) and his initial 
> implementation: http://people.apache.org/~pietsch/linebreak.tar.gz.
> 
> I have since had a closer look at both UAX#14 and Joerg's code. Because I 
> liked what I saw I went about adapting Joerg's code to Unicode 4.1 
> and added fairly extensive JUnit test cases to it mainly because it 
> really helps to go through the various different cases mentioned in the 
> spec in some structured fashion.

Is our current hyphenation method a subset of Unicode's method?

> Assuming now that this will be agreed as well the next step would be the 
> more detailed design of the integration. But this is well beyond the 
> scope of this e-mail as there are some tricky issues involved and they 
> probably need to be tackled in conjunction with the white space 
> handling issues. Many of the problems are related to our LayoutManager 
> structures which create barriers when it comes to the need to process 
> character sequences across those boundaries as is the case for both 
> line breaking and white space handling. Add to that the design of the 

I seem to recall that the hyphenation code collects words across LM
boundaries.

It seems a useful goal to implement Unicode hyphenation. But since it
is a major effort, it does not fit in working towards a release. In
any case it would have to be in a separate branch until it proves to
work and to implement a substantial part of hyphenation. Then it does
not immediately matter if it is a separate project or a part of FOP.

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl

Re: Unicode compliant Line Breaking

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

1. +1
2. +1
3.b) +1 for the separable parts although c) is also ok for now.

+1 to try to find synergies with the code in Batik.

If I were you I'd create a branch and put your stuff in there. It's
easier for everyone to follow and to help (wishful thinking).

On 31.10.2005 08:25:12 Manuel Mall wrote:
> In a previous post Joerg pointed to the Unicode Standard Annex #14 on 
> Line Breaking (http://www.unicode.org/reports/tr14/) and his initial 
> implementation: http://people.apache.org/~pietsch/linebreak.tar.gz.
> 
> I have since had a closer look at both UAX#14 and Joerg's code. Because I 
> liked what I saw I went about adapting Joerg's code to Unicode 4.1 
> and added fairly extensive JUnit test cases to it mainly because it 
> really helps to go through the various different cases mentioned in the 
> spec in some structured fashion.
> 
> The results are now available for public inspection: 
> http://people.apache.org/~manuel/fop/linebreak.tar.gz
> 
> 1. I would like to propose that Unicode conformant line breaking be 
> integrated into FOP trunk because it:
> a) Moves FOP more towards being a universal formatter and not just a 
> formatter for western languages
> b) Moves FOP more towards becoming a high quality typesetting system 
> (something that was really started by integrating Knuth style breaking)
> The reason I think this needs to be voted on is because Unicode line 
> breaking will in subtle ways change the current line breaking behaviour 
> and therefore constitutes a (significant) change in FOP's overall 
> rendering.
> 
> 2. I would also like to propose that the Unicode conformant line 
> breaking be implemented using our own pair-table based implementation 
> and not using Java's line breaker, because:
> a) It gives us full control and allows FOP to follow the Unicode 
> standard (and its updates and errata) closely and therefore keep FOP's 
> Unicode compliance level independent of the Java version.
> b) It allows us to tailor the algorithm to match the needs of XSL-FO and 
> FOP.
> c) It allows us to provide user customisation features (down the track) 
> not available through using the Java APIs.
> 
> Of course there are downsides, like:
> a) Are we falling for the 'not invented here' syndrome?
> b) Duplicating code which is already in the Java base system
> c) Increasing the memory footprint of FOP
> 
> 3. Assuming we get enough +1 for the above proposals the first item to 
> decide after that would be: Where should the code live?
> a) Joerg would like to see it in Jakarta Commons but hasn't got the time 
> to start the project. 
> b) Jeremias suggested XMLGraphics Commons. 
> c) Personally I think it is too early to factor it out. More experience 
> with its design and use cases should be gathered before making it 
> standalone and at this point in time it really is only 2 core Java 
> classes. I would like to suggest that it initially lives under FOP in 
> something like org.apache.fop.text. Should the need and energy levels 
> (= developer enthusiasm) become available later to make this into a 
> Jakarta Commons or XMLGraphics Commons project so be it.
> 
> Assuming now that this will be agreed as well the next step would be the 
> more detailed design of the integration. But this is well beyond the 
> scope of this e-mail as there are some tricky issues involved and they 
> probably need to be tackled in conjunction with the white space 
> handling issues. Many of the problems are related to our LayoutManager 
> structures which create barriers when it comes to the need to process 
> character sequences across those boundaries as is the case for both 
> line breaking and white space handling. Add to that the design of the 
> different Knuth sequences required to model the different break cases 
> in conjunction with conditional border/padding and white space removal 
> around line breaking and different types of line justifications and 
> there is some real work ahead.
> 
> Cheers
> 
> Manuel
> 
> Should add my votes:
> 
> 1.) +1
> 2.) +1
> 3.c) +1



Jeremias Maerki