Posted to fop-dev@xmlgraphics.apache.org by "J.Pietschmann" <j3...@yahoo.de> on 2003/11/11 20:21:07 UTC

Re: cvs commit: xml-fop/src/java/org/apache/fop/layoutmgr TextLayoutManager.java

gmazza@apache.org wrote:
>   Hyphenation problem in Bug 23985

Actually, implementing UTR14 would solve the line breaking problem,
although not the URL breaking problem.

Points to discuss:
- JDK 1.4 has a java.text.BreakIterator, which implements UTR14, or
  at least the parts relevant for western languages (can't see them
  dealing with Thai properly).
  The questions:
   + Can somebody verify this class is already available in JDK 1.3?
   I already deleted the 1.3 docs, and can't be bothered to reinstall
   them.
   + Can this class really be leveraged? We probably need to supply
   a CharacterIterator which computes running line width into a global
   state, and check after each return of the iterator whether the line
   is full. This might fit well with getNextBreak(), but I have
   difficulty seeing how this would interact with hyphenation.
- Should we provide for custom line breaking algorithms?
  Some languages/scripts like Thai almost certainly require augmenting
  any stock line breaking algorithms. However, the problem seems to
  be more clever breaking of non-natural-language stuff, like URLs.
  We can leave this completely to the FO creators, forcing them to,
  for example,
   + use language="x-url" to turn off hyphenation locally
   + use glue characters like NBZWS to keep the stock line breaking
    algorithm to break after slashes
  The latter is quite intrusive.
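
As a rough illustration of the BreakIterator question above, the following
sketch only enumerates the break opportunities that the JDK reports for a
string. The class name BreakOpportunityDemo and the sample text are made up
for this example and are not FOP code; a layout manager would still have to
measure the running line width itself between two reported offsets.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of FOP: lists the offsets at which
// java.text.BreakIterator allows a line break.
final class BreakOpportunityDemo {

    static List<Integer> opportunities(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        // first() yields the start of the text; next() yields each
        // following boundary until DONE is returned.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "see vice-versa and/or http://xml.apache.org/fop/";
        System.out.println(opportunities(sample, Locale.ENGLISH));
    }
}
```

Running it on text with hyphens, slashes and a URL is a quick way to see
which of the cases discussed here the stock iterator handles on its own.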

I've got my own UTR14 implementation (simplified, of course), which
should appear on http://cvs.apache.org/~pietsch later this evening
for review. It uses a LineBreakStatus object for tracking the status,
which might be folded into the LayoutContext or a subclass used for
inline FOs and text.

Comments?

J.Pietschmann

Re: cvs commit: xml-fop/src/java/org/apache/fop/layoutmgr TextLayoutManager.java

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Glen Mazza wrote:
> Make sure you're separating the issues here...

Uh, sorry the issue hasn't really much to do with your patch except
that it is in roughly the same region of code... I should have started
a new thread.

> Probably more than just Western, here's a Japanese
> description: 
> 
> http://java.sun.com/j2se/1.3/ja/docs/ja/api/java/text/BreakIterator.html

Yeah, UTR-14 can handle Japanese. However, it states explicitly it
needs context in order to determine whether certain characters are
handled as "alphabetic" (no breaks in between) or "ideographic" (break
opportunity in between). I can't see how BreakIterator gets this.

> Yes.  See
> http://java.sun.com/j2se/1.3/docs/api/index.html

Darn! Didn't think of the online docs!

> Again, I don't know the code, but it may be a good
> idea.

Well, the problem is: BreakIterator returns on break opportunities.
How would this fit into the LM framework?

> Anything that is based on Sun Java code (and an
> official standard like UTR14) probably makes our life
> much easier--anyone has a complaint about the
> hyphenation decisions can go complain to Sun about
> them!
They don't deal with hyphenation, unfortunately.

J.Pietschmann

Re: cvs commit: xml-fop/src/java/org/apache/fop/layoutmgr TextLayoutManager.java

Posted by Glen Mazza <gr...@yahoo.com>.

--- "J.Pietschmann" <j3...@yahoo.de> wrote:
> gmazza@apache.org wrote:
> >   Hyphenation problem in Bug 23985
> 
> Actually, implementing UTR14 would solve the line
> breaking problem,
> although not the URL breaking problem.
> 

Make sure you're separating the issues here...the bug
in question involved "hard" hyphens and forward
slashes (already present in the text) like
"vice-versa" and "and/or".  This is what was fixed.

Regular hyphenation, i.e., breaking up of unhyphenated
words, happens, I guess, after the above types of
processing are done.  I'm not that familiar with
hyphenation yet.

> Points to discuss:
> - JDK 1.4 has a java.text.BreakIterator, which
> implements UTR14, or
>   at least the parts relevant for western languages
> (can't see them
>   dealing with Thai properly).

Probably more than just Western, here's a Japanese
description: 

http://java.sun.com/j2se/1.3/ja/docs/ja/api/java/text/BreakIterator.html

BreakIterator has a function getAvailableLocales()
that lists all languages they support.  We would
probably be OK with just supporting the languages that
Sun supports.
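
A minimal sketch of that check, assuming nothing beyond the JDK: the static
method BreakIterator.getAvailableLocales() is real, while the class name
BreakIteratorLocales and the supports() helper are invented for this
example.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not FOP code: asks the JDK which locales have
// localized BreakIterator instances.
final class BreakIteratorLocales {

    static boolean supports(Locale wanted) {
        for (Locale available : BreakIterator.getAvailableLocales()) {
            if (available.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(BreakIterator.getAvailableLocales().length + " locales available");
        System.out.println("Japanese listed: " + supports(Locale.JAPANESE));
    }
}
```

The result depends on the installed JDK, so it says what a given runtime
offers rather than what UTR14 itself covers.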

>   The questions
>    + Can somebody verify this class is already
> available in JDK 1.3?

Yes.  See
http://java.sun.com/j2se/1.3/docs/api/index.html

>    + Can this class really be leveraged? 

Again, I don't know the code, but it may be a good
idea. Anything that is based on Sun Java code (and an
official standard like UTR14) probably makes our life
much easier--anyone has a complaint about the
hyphenation decisions can go complain to Sun about
them!

Glen

Re: RT: line breaking

Posted by "Peter B. West" <pb...@powerup.com.au>.

J.Pietschmann wrote:
> Be careful with the various TRs: UTR14 does not deal with character
> (rather: grapheme) or word boundaries, that's UAX #29. Actually, we
> don't use the latter.
> Our line breaking should probably be done the following way (this
> implements the "naive" paragraph filling strategy)
>   loop
>     calculate line width if next character is added
>     check for a line breaking opportunity before the next character
>     if there is an opportunity
>       if the line is not full
>         discard the last saved opportunity and save this
>       else
>         try hyphenation on the string accumulated since the
>           last break opportunity (if enabled), save returned
>           opportunity if any
>         return saved line breaking opportunity
>       end if
>     end if
>   end loop
> 
> hyphenation of a string:
>  loop
>    skip non-word characters (for this hyphenator)
>    word = continuous run of word characters (for this hyphenator)
>    if the end of the word is past the end of the line
>      try hyphenating the word, generate new break opportunities
>      return best fitting line break opportunity or null
>    end if
>  end loop
> 
> There is the degenerate case if the line overflows and no line break
> opportunity is discovered at all.
> The TeX paragraph filling strategy has to detect line break opportunities
> the same way but selects the opportunities turning into actual line breaks
> in a more clever way. We could do that too.

In my own thinking about the process of line-breaking, I have always 
assumed that a (possibly recursive) block of text is a fixed resource; a 
superset of the fixed resource that is a single glyph/grapheme with 
given font attributes.  As such, it should be processed by a separate 
co-routine (to use the language of the Rec).  All of the information 
about the hierarchy of potential break positions is determined by the 
text itself.

As a first cut, I would determine all potential breaks, along 
with information relevant to later line-height calculations, at the time 
a block is first prepared for layout.  The co-routine (thread, 
whatever) that is grooming the text would then respond to enquiries 
about line-area possibilities, and eventually return contents for 
line-areas of particular dimensions.  All of this is tentative, and all 
of the calculated information about the block would have to be held 
until the layout of the block is finalised.
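
One possible reading of this idea, as a sketch only: break positions are
computed once when the text is prepared, and later enquiries are answered
from that stored list. The class PreparedText, its method furthestBreak()
and the "one character = one unit of width" measure are all assumptions
made for this example; breaks are allowed only after spaces to keep it
short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not FOP code: the text is "groomed" once and
// then answers enquiries about possible line ends.
final class PreparedText {
    private final String text;
    // Offsets at which a line may end, in ascending order.
    private final List<Integer> breaks = new ArrayList<>();

    PreparedText(String text) {
        this.text = text;
        for (int i = 1; i < text.length(); i++) {
            if (text.charAt(i - 1) == ' ' && text.charAt(i) != ' ') {
                breaks.add(i);
            }
        }
        breaks.add(text.length());
    }

    // Returns the furthest break after 'start' whose content fits in
    // maxChars, the first break after 'start' if even that overflows,
    // or -1 if no text remains after 'start'.
    int furthestBreak(int start, int maxChars) {
        int best = -1;
        for (int b : breaks) {
            if (b <= start) {
                continue;
            }
            int width = text.substring(start, b).trim().length();
            if (width <= maxChars) {
                best = b;
            } else {
                if (best == -1) {
                    best = b; // overflow: nothing fits, take the first break
                }
                break;
            }
        }
        return best;
    }
}
```

Because the list is kept until the block is finalised, the same object can
be asked again with a different width when layout backtracks.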

What "finalised" means depends on the complexity of the layout 
strategies employed, but at a minimum, it must be maintained until the 
last page containing text from the block, and the subsequent page (if 
any) have been laid out, to allow for backtracking during last-page 
processing.

Peter
-- 
Peter B. West <http://www.powerup.com.au/~pbwest/resume.html>

RT: line breaking

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Victor Mote wrote:
> I know of at least two line-breaking strategies that we probably want to
> have in our stock strategies: 1) the line-by-line method used right now, and
> 2) a TeX-like paragraph-oriented strategy, which AFAIK doesn't exist yet.

Ahem, that's not what I meant, or the scope of UTR14. UTR14 provides for
"line break opportunities", for example you can break foo-bar after the
hyphen but not 789-123. Which opportunities are used is another matter.
FOP's current algorithm for determining line break opportunities is utterly
simplistic, basically "possibly break before any breaking space, or after
a hyphen or slash"; the latter is done if hyphenation is enabled.
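
To make the difference concrete, here is a small sketch of the simplistic
rule as described in this paragraph, next to a variant with one context
check in the spirit of UTR14 (no break after a hyphen that is followed by a
digit). Both methods and the class name SimpleBreakRule are hypothetical;
this is neither FOP's actual code nor a UTR14 implementation.

```java
// Hypothetical illustration, not FOP code.
final class SimpleBreakRule {

    // Simplistic rule: a break is possible before a space, or (if
    // hyphenation is enabled) after a hyphen or slash.
    static boolean simpleBreakBefore(String text, int i, boolean hyphenationEnabled) {
        if (i <= 0 || i >= text.length()) {
            return false;
        }
        if (text.charAt(i) == ' ') {
            return true;
        }
        char prev = text.charAt(i - 1);
        return hyphenationEnabled && (prev == '-' || prev == '/');
    }

    // Same rule plus a single context check: a hyphen directly followed
    // by a digit does not create a break opportunity.
    static boolean contextAwareBreakBefore(String text, int i, boolean hyphenationEnabled) {
        if (!simpleBreakBefore(text, i, hyphenationEnabled)) {
            return false;
        }
        boolean hyphenBeforeDigit =
                text.charAt(i - 1) == '-' && Character.isDigit(text.charAt(i));
        return !hyphenBeforeDigit;
    }

    public static void main(String[] args) {
        System.out.println(simpleBreakBefore("789-123", 4, true));
        System.out.println(contextAwareBreakBefore("789-123", 4, true));
    }
}
```

The simple rule treats foo-bar and 789-123 alike, while the context-aware
variant keeps the number together.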

I omitted the forced line break issue, which is also in the UTR14 scope,
and hyphenation, which may lead to additional line break opportunities
but is outside of the UTR14 scope.

> In your URL example, couldn't FOP see the "x-url" language & automatically
> add or assume the glue characters for the user? That would perhaps make it
> less obtrusive (I assume that you meant for the user).

Well, yes.

> I don't see it there yet, but I am a little confused. It seems to me that
> line-breaking consists of at least these components: 1) character-based
> line-breaking opportunities (which UTR14 addresses), 2) word-based
> line-breaking opportunities (which hyphenation dictionaries and patterns
> address), and 3) some strategy for using these to find acceptable/optimal
> line breaks. It sounds like you have addressed at least 1 and 3 in your
> implementation.

Paragraph filling (your point 3) is not addressed.
Be careful with the various TRs: UTR14 does not deal with character
(rather: grapheme) or word boundaries, that's UAX #29. Actually, we
don't use the latter.
Our line breaking should probably be done the following way (this
implements the "naive" paragraph filling strategy)
   loop
     calculate line width if next character is added
     check for a line breaking opportunity before the next character
     if there is an opportunity
       if the line is not full
         discard the last saved opportunity and save this
       else
         try hyphenation on the string accumulated since the
           last break opportunity (if enabled), save returned
           opportunity if any
         return saved line breaking opportunity
       end if
     end if
   end loop

hyphenation of a string:
  loop
    skip non-word characters (for this hyphenator)
    word = continuous run of word characters (for this hyphenator)
    if the end of the word is past the end of the line
      try hyphenating the word, generate new break opportunities
      return best fitting line break opportunity or null
    end if
  end loop

There is the degenerate case if the line overflows and no line break
opportunity is discovered at all.
The TeX paragraph filling strategy has to detect line break opportunities
the same way but selects the opportunities turning into actual line breaks
in a more clever way. We could do that too.
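
The naive strategy above could be sketched in Java roughly as follows. The
names GreedyLineFiller, opportunityBefore() and nextBreak() are
hypothetical. The sketch assumes that every character has width 1 and that
breaks are allowed only after spaces, which stands in for a real UTR14
check; the hyphenation step is left out and only marked by a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the naive (greedy) strategy; not FOP code.
final class GreedyLineFiller {

    // Stand-in for a UTR14 check: break allowed before index i if the
    // previous character is a space and the current one is not.
    static boolean opportunityBefore(String text, int i) {
        return i > 0 && i < text.length()
                && text.charAt(i - 1) == ' ' && text.charAt(i) != ' ';
    }

    // Width of text[from, to) with trailing spaces ignored, 1 per char.
    static int width(String text, int from, int to) {
        int end = to;
        while (end > from && text.charAt(end - 1) == ' ') {
            end--;
        }
        return end - from;
    }

    // Returns the offset at which the line starting at lineStart ends.
    static int nextBreak(String text, int lineStart, int maxWidth) {
        int saved = -1;
        for (int i = lineStart + 1; i < text.length(); i++) {
            if (!opportunityBefore(text, i)) {
                continue;
            }
            if (width(text, lineStart, i) <= maxWidth) {
                saved = i; // line not full: replace the saved opportunity
            } else {
                // Line is full. Hyphenation of the text since 'saved'
                // would be tried here. Degenerate case: nothing saved,
                // so the line overflows up to this opportunity.
                return saved != -1 ? saved : i;
            }
        }
        boolean restFits = width(text, lineStart, text.length()) <= maxWidth;
        return (restFits || saved == -1) ? text.length() : saved;
    }

    static List<String> fill(String text, int maxWidth) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = nextBreak(text, start, maxWidth);
            String line = text.substring(start, end).trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
            start = end;
        }
        return lines;
    }
}
```

For example, GreedyLineFiller.fill("the quick brown fox", 10) yields the two
lines "the quick" and "brown fox". A TeX-like strategy would use the same
opportunity check but choose among the opportunities for the whole
paragraph instead of line by line.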

> This seems at least remotely related to fo.FOText.isWordChar(), which
> attempts to find breaks between words.

Actually, we don't need breaks between words. We need to identify line
breaking opportunities, words for the purpose of hyphenation, and
resizable spaces for justification.
That's why WordArea was such a bad name.

J.Pietschmann

RE: cvs commit: xml-fop/src/java/org/apache/fop/layoutmgr TextLayoutManager.java

Posted by Victor Mote <vi...@outfitr.com>.

J.Pietschmann wrote:

> gmazza@apache.org wrote:
> >   Hyphenation problem in Bug 23985
>
> Actually, implementing UTR14 would solve the line breaking problem,
> although not the URL breaking problem.
>
> Points to discuss:

...

> - Should we provide for custom line breaking algorithms?
>   Some languages/scripts like Thai almost certainly require augmenting
>   any stock line breaking algorithms. However, the problem seems to
>   be more clever breaking of non-natural-language stuff, like URL.
>   We can leave this completely to the FO creators, forcing them for
>   example
>    + use language="x-url" to turn off hyphenation locally
>    + use glue characters like NBZWS to keep the stock line breaking
>     algorithm to break after slashes
>   The latter is quite intrusive.

IMO, yes, we should allow for custom line-breaking, although it somewhat
depends on what level you are thinking. IIRC, this is the example used for
the GoF Strategy pattern. We have now implemented, in a simplistic (and,
so far, not very useful) way, the layout strategy concept. Any given layout
strategy can control how its line-breaking works. It could conceivably use
one of several "stock" strategies available, its own proprietary method, or
even allow the user to choose. In general, I hope that proprietary methods
can/will be extracted to stock strategies for others to use, but I suppose
that may not always be feasible.

I know of at least two line-breaking strategies that we probably want to
have in our stock strategies: 1) the line-by-line method used right now, and
2) a TeX-like paragraph-oriented strategy, which AFAIK doesn't exist yet.
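
A bare-bones sketch of how the Strategy pattern might look here. All names
(LineBreakStrategy, LineByLineStrategy, ParagraphLayout) are invented for
illustration and do not exist in FOP; widths are counted in characters and
words are split on whitespace only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: one method that turns a paragraph into lines.
interface LineBreakStrategy {
    List<String> breakLines(String paragraph, int maxChars);
}

// A "stock" line-by-line strategy over whitespace-separated words.
final class LineByLineStrategy implements LineBreakStrategy {
    @Override
    public List<String> breakLines(String paragraph, int maxChars) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : paragraph.trim().split("\\s+")) {
            // Start a new line if adding " word" would exceed the limit.
            if (line.length() > 0 && line.length() + 1 + word.length() > maxChars) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }
}

// A layout strategy holds whichever line-breaking strategy it chose.
final class ParagraphLayout {
    private final LineBreakStrategy strategy;

    ParagraphLayout(LineBreakStrategy strategy) {
        this.strategy = strategy;
    }

    List<String> layout(String paragraph, int maxChars) {
        return strategy.breakLines(paragraph, maxChars);
    }
}
```

A paragraph-oriented strategy would implement the same interface, so
ParagraphLayout would not need to change when it is swapped in.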

In your URL example, couldn't FOP see the "x-url" language & automatically
add or assume the glue characters for the user? That would perhaps make it
less obtrusive (I assume that you meant for the user).

> I've got my own UTR14 implementation (simplified, of course), which
> should appear on http://cvs.apache.org/~pietsch later this evening
> for review. It uses a LineBreakStatus object for tracking the status,
> which might be folded into the LayoutContext or a subclass used for
> inline FOs and text.
>
> Comments?

I don't see it there yet, but I am a little confused. It seems to me that
line-breaking consists of at least these components: 1) character-based
line-breaking opportunities (which UTR14 addresses), 2) word-based
line-breaking opportunities (which hyphenation dictionaries and patterns
address), and 3) some strategy for using these to find acceptable/optimal
line breaks. It sounds like you have addressed at least 1 and 3 in your
implementation. If the part related to item 1 is factored out for use/reuse,
that sure seems valuable. Then the part related to item 3 becomes (perhaps)
one of the line-breaking strategies available to layout strategies? Or maybe
I have underestimated the scope of UTR14?

This seems at least remotely related to fo.FOText.isWordChar(), which
attempts to find breaks between words.

Victor Mote