Posted to fop-dev@xmlgraphics.apache.org by Manuel Mall <mm...@arcus.com.au> on 2005/11/01 10:24:05 UTC

Re: Unicode compliant Line Breaking

On Tue, 1 Nov 2005 01:33 am, thomas.deweese@kodak.com wrote:
> Hi all,
>
>         Just an FYI, Batik also currently has an implementation of
> the Unicode TR14 word breaking alg.
> (org.apache.batik.gvt.flow.TextLineBreak).
>
>         As far as performance is concerned it should be fairly fast
> as it is mostly just table based.
>
Thomas, thanks for the pointer (Note to myself - need to become more 
aware of what's in the Batik code base. Feeble excuse - Joerg didn't 
seem to know either).
Had a look at the Batik code: same algorithm as Joerg wrote (not
surprising, as UAX#14 actually contains real C code) and very similar
data structures internally. The data structures are hard coded and not
generated from the Unicode text files. The API is different; in
particular, it relies on Batik-specific types being passed across
rather than plain Strings (but this could probably be handled by a
wrapper).

This probably strengthens the argument of making all of this part of 
XMLGraphics Common....grumble...grumble...

My main reason for hesitation with the XMLGraphics Common approach is
simple manpower. We need to set up the infrastructure (subversion,
mailing lists, web site, etc.). We need to maintain this. We would
basically publish APIs currently internal to Batik and FOP, with
all the resultant support headaches. For example, I would not like to
see my time diluted at the moment by having to discuss API needs
outside of FOP/Batik. Actually I am reluctant to even dive into the
Batik code base at the moment. FOP is complicated enough to digest.

Hmmm... not sure where to go from here.

Manuel

Re: Unicode compliant Line Breaking

Posted by Manuel Mall <mm...@arcus.com.au>.

On Tue, 1 Nov 2005 07:27 pm, thomas.deweese@kodak.com wrote:
> Hi Manuel,
>
> Manuel Mall <mm...@arcus.com.au> wrote on 11/01/2005 04:24:05 AM:
> > On Tue, 1 Nov 2005 01:33 am, thomas.deweese@kodak.com wrote:
<snip> 
> > Had a look at the Batik code: Same algorithm as Joerg wrote (not
> > surprising as UAX#14 actually contains real C code) very similar
> > data structures internally. Data structures are hard coded and not
> > generated from the Unicode text files.
>
>    I would not think it would be worth the while to parse the Unicode
> files on startup every time (they aren't small).  Passing in the
> table mapping chars to types might be a useful extension (but in
> honesty I doubt .5% of users would ever provide their own, unless the
> code only included say Western Language by default).
Sorry, not very well explained on my part. Joerg's code includes a Java
code generator that builds the tables from the Unicode text files. This
would be done at product build time, not on every startup. IMO it just
makes it easier to keep the data in sync with the Unicode standard.
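
To illustrate the build-time idea, here is a minimal sketch (not
Joerg's actual generator). It assumes input lines in the general
LineBreak.txt style, "code point or range ; class # comment", and emits
a Java array initializer that a build step could write into a source
file. The class name LineBreakTableSketch, the Range type and the
RANGES array are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class LineBreakTableSketch {

    /** A run of code points sharing one line-break class (hypothetical type). */
    static final class Range {
        final int start;
        final int end;
        final String lbClass;

        Range(int start, int end, String lbClass) {
            this.start = start;
            this.end = end;
            this.lbClass = lbClass;
        }
    }

    /** Parses lines such as "0041..005A;AL # comment" or "0020;SP". */
    static List<Range> parse(List<String> lines) {
        List<Range> ranges = new ArrayList<>();
        for (String raw : lines) {
            // Drop the trailing comment, if any.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue; // blank or comment-only line
            }
            String[] fields = line.split(";");
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            ranges.add(new Range(start, end, fields[1].trim()));
        }
        return ranges;
    }

    /** Emits Java source text for a table; a build step would write it to a file. */
    static String generate(List<Range> ranges) {
        StringBuilder sb = new StringBuilder("static final Object[][] RANGES = {\n");
        for (Range r : ranges) {
            sb.append(String.format("    {0x%04X, 0x%04X, \"%s\"},\n",
                    r.start, r.end, r.lbClass));
        }
        return sb.append("};\n").toString();
    }

    public static void main(String[] args) {
        // Sample data only; a real build would read the Unicode data file.
        List<String> sample = List.of(
                "# sample data in the LineBreak.txt style",
                "0020;SP # SPACE",
                "0041..005A;AL # LATIN CAPITAL LETTER A..Z");
        System.out.print(generate(parse(sample)));
    }
}
```

Because generation runs once per build, the shipped code only carries
the finished table and never parses the Unicode files at startup.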

<snip/>

Manuel

Re: Unicode compliant Line Breaking

Posted by th...@kodak.com.

Hi Manuel,

Manuel Mall <mm...@arcus.com.au> wrote on 11/01/2005 04:24:05 AM:

> On Tue, 1 Nov 2005 01:33 am, thomas.deweese@kodak.com wrote:
> >         Just an FYI, Batik also currently has an implementation of
> > the Unicode TR14 word breaking alg.
> > (org.apache.batik.gvt.flow.TextLineBreak).

> Thomas, thanks for the pointer (Note to myself - need to become more 
> aware of what's in the Batik code base. Feeble excuse - Joerg didn't 
> seem to know either).

    It's a fairly recent addition, to support proposals for flowing 
text in SVG 1.2.

> Had a look at the Batik code: Same algorithm as Joerg wrote (not 
> surprising as UAX#14 actually contains real C code) very similar data 
> structures internally. Data structures are hard coded and not generated 
> from the Unicode text files. 

   I would not think it would be worthwhile to parse the Unicode
files on every startup (they aren't small).  Passing in the table
mapping chars to types might be a useful extension (but in honesty
I doubt 0.5% of users would ever provide their own, unless the code
only included, say, Western languages by default).

> The API is different, especially it relies 
> on Batik specific types being passed across not just plain Strings (but 
> this could probably be handled by a wrapper).

   AttributedString (the type passed across the interface) is a
JDK class: java.text.AttributedString.  We do define new attributes
(keys) to hang the word break info off of.
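
   As a rough sketch of that pattern (not Batik's actual code): a
custom key is a subclass of AttributedCharacterIterator.Attribute, and
break information is attached to ranges of a java.text.AttributedString.
The names BreakAttributeSketch, BreakAttribute and BREAK_OPPORTUNITY
are hypothetical, and "break after every space" is only a stand-in
for the real UAX#14 rules.

```java
import java.text.AttributedCharacterIterator;
import java.text.AttributedString;

public class BreakAttributeSketch {

    /** Hypothetical attribute key; Batik's real keys are named differently. */
    static final class BreakAttribute extends AttributedCharacterIterator.Attribute {
        private static final long serialVersionUID = 1L;

        static final BreakAttribute BREAK_OPPORTUNITY =
                new BreakAttribute("break_opportunity");

        private BreakAttribute(String name) {
            super(name);
        }
    }

    /** Placeholder rule: mark each space as a break opportunity. */
    static AttributedString markBreaksAfterSpaces(String text) {
        AttributedString as = new AttributedString(text);
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ' ') {
                as.addAttribute(BreakAttribute.BREAK_OPPORTUNITY, Boolean.TRUE, i, i + 1);
            }
        }
        return as;
    }

    /** Reads the attribute back at a given character index. */
    static boolean hasBreakAt(AttributedString as, int index) {
        AttributedCharacterIterator it = as.getIterator();
        it.setIndex(index);
        return Boolean.TRUE.equals(it.getAttribute(BreakAttribute.BREAK_OPPORTUNITY));
    }

    public static void main(String[] args) {
        AttributedString as = markBreaksAfterSpaces("line breaking");
        System.out.println(hasBreakAt(as, 4)); // index 4 is the space
        System.out.println(hasBreakAt(as, 0)); // index 0 is 'l'
    }
}
```

   A caller holding only a plain String could wrap it with
new AttributedString(text) before handing it over, which is the kind
of thin wrapper mentioned in the quoted text above.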

> This probably strengthens the argument of making all of this part of 
> XMLGraphics Common....grumble...grumble...

   Yes, this is mostly why I mentioned it.  On the other hand the
code is not that large or really overly complex.

> My main reason for hesitation with the XMLGraphics Common approach is
> simple manpower. We need to set up the infrastructure (subversion,
> mailing lists, web site, etc.). We need to maintain this.

   Sure, some of this will happen anyway because of the current 
problems we have with the PDFTranscoder (Batik depends on FOP which
depends on Batik :( ).  Those dependencies need to be straightened
out.

> We would basically publish APIs currently internal to Batik
> and FOP, with all the resultant support headaches. For example,
> I would not like to see my time diluted at the moment by having
> to discuss API needs outside of FOP/Batik.

   Yes, this is the big issue: as soon as an API becomes public
it is a lot more work to maintain.

> Actually I am reluctant to even dive into the Batik code base
> at the moment. FOP is complicated enough to digest.

   The hope is that by exposing some of these APIs we will
attract some people as contributors who would otherwise be
'scared off' by the size and complexity of the FOP and
Batik code bases.