Posted to fop-dev@xmlgraphics.apache.org by Luca Furini <lf...@cs.unibo.it> on 2005/12/05 10:48:09 UTC

Re: DO NOT REPLY [Bug 37743] - exception: border-style (shorthand)

Manuel Mall wrote:

> After that we really need to redesign the line breaking stuff. Not the 
> Knuth approach (and the implemented algorithms related to that) but the 
> way we arrive at the Knuth sequences and iterate and process the text 
> elements. This needs to be done to be able to do white-space-treatment, 
> UAX#14 line breaking, start- /end- space resolution and generally to be 
> able to handle some more aspects of Unicode (e.g. glyph merging).

Just trying to write down a few thoughts and summarise what we have already 
said about this:

- the inline LMs can directly apply the linefeed-treatment property 
(ignoring, preserving or transforming the LF character) but have too 
limited a "view" to handle white-space-treatment and 
white-space-collapse correctly, or to count the number of letter spaces

- the LineLM has to collect the "text" from its descendant nodes: 
non-textual objects should be taken into account too, as, for example, a 
leader between two spaces should prevent them from being collapsed; if 
spaces collapse only if they come from sibling nodes, this could maybe be 
handled during the collection by the InlineStackingLM

- the LineLM should then mark spaces that must be removed because they are 
trailing / leading, glyphs that must be merged (but which LM will paint 
them if the characters come from different text nodes?) and find the 
breaking points according to the unicode rules

- the LineLM should somehow pass the computed information to the 
descendant LMs, which would then create the correct elements in one go

- the resulting sequences would be ready for the breaking phase, without 
further analysis / checks / substitutions / changes

The revised interface for inline LMs could then have (just a quick idea) a 
new appendText(StringBuffer) method and a modified version of 
getNextKnuthElements() having some extra parameter storing the information 
created by the LineLM; we should finally get rid of addALetterSpaceTo(), 
getWordChars(), hyphenate(), applyChanges() and getChangedKnuthElements().

WDYT?

Regards
     Luca

Re: DO NOT REPLY [Bug 37743] - exception: border-style (shorthand)

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

On 05.12.2005 15:53:51 Luca Furini wrote:
> First of all, thanks for your comments: I really tend to forget in a short 
> time all the details concerning white space!
> 
> Manuel Mall wrote:
> 
> > Glyphs are only allowed to be merged if they carry the same / matching 
> > set of property values. Personally I would not be concerned if we 
> > therefore limit that logic to within a LM. While it is possible that 
> > someone could write something like
> > <fo:block><fo:inline>a</fo:inline><fo:inline>&#x0308;</fo:inline>
> > and the a and &#x0308; could be combined into an &#x00e4; IMO this is a 
> > pretty degenerate case.
> 
> Seems reasonable: so, we can delete glyph substitution from the list of 
> things we must consider in this phase.
> 
> But, now I think of it, we must consider kerning too, so the list does not 
> get any thinner!

I may be able to make the list even longer: font-variant support,
font-selection-strategy="character-by-character". I don't want to make
this any harder but maybe it's worth having at least a short glance at
the two properties while you're at it. But I'm not sure if it really
plays into the current topic.

<snip/>


Jeremias Maerki

Re: DO NOT REPLY [Bug 37743] - exception: border-style (shorthand)

Posted by Manuel Mall <mm...@arcus.com.au>.

On Tue, 6 Dec 2005 03:36 am, Simon Pepping wrote:
> On Mon, Dec 05, 2005 at 09:10:06PM +0800, Manuel Mall wrote:
> > Luca,
> >
> > Glyphs are only allowed to be merged if they carry the same /
> > matching set of property values. Personally I would not be
> > concerned if we therefore limit that logic to within a LM. While it
> > is possible that someone could write something like
> > <fo:block><fo:inline>a</fo:inline><fo:inline>&#x0308;</fo:inline>
> > and the a and &#x0308; could be combined into an &#x00e4; IMO this
> > is a pretty degenerate case.
>
> One cannot write this. The Reader would probably combine >&#x0308;,
> after which a not-well-formed XML file would result. The text is not
> fully normalized, see
> http://www.w3.org/TR/2005/WD-charmod-norm-20051027/#sec-FullyNormalized.

Yes I agree that the text is not fully normalised as defined in 
WD-charmod-norm-20051027. But, neither the XSL-FO nor the XML spec 
require the input to be normalised.

>
> Simon

Manuel

Re: DO NOT REPLY [Bug 37743] - exception: border-style (shorthand)

Posted by Simon Pepping <sp...@leverkruid.nl>.

On Mon, Dec 05, 2005 at 09:10:06PM +0800, Manuel Mall wrote:
> Luca,
> 
> Glyphs are only allowed to be merged if they carry the same / matching 
> set of property values. Personally I would not be concerned if we 
> therefore limit that logic to within a LM. While it is possible that 
> someone could write something like
> <fo:block><fo:inline>a</fo:inline><fo:inline>&#x0308;</fo:inline>
> and the a and &#x0308; could be combined into an &#x00e4; IMO this is a 
> pretty degenerate case.

One cannot write this. The Reader would probably combine >&#x0308;,
after which a not-well-formed XML file would result. The text is not
fully normalized, see
http://www.w3.org/TR/2005/WD-charmod-norm-20051027/#sec-FullyNormalized.

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl

Re: DO NOT REPLY [Bug 37743] - exception: border-style (shorthand)

Posted by Manuel Mall <mm...@arcus.com.au>.

Luca,

great summary and I appreciate you taking a serious look into this 
problem. Some comments below.

On Mon, 5 Dec 2005 05:48 pm, Luca Furini wrote:
> Manuel Mall wrote:
> > After that we really need to redesign the line breaking stuff. Not
> > the Knuth approach (and the implemented algorithms related to that)
> > but the way we arrive at the Knuth sequences and iterate and
> > process the text elements. This needs to be done to be able to do
> > white-space-treatment, UAX#14 line breaking, start- /end- space
> > resolution and generally to be able to handle some more aspects of
> > Unicode (e.g. glyph merging).
>
> Just trying to write down a few thoughts and summarise what we have
> already said about this:
>
> - the inline LMs can directly apply the linefeed-treatment property
> (ignoring, preserving or transforming the LF character) but have too
> limited a "view" to handle white-space-treatment and
> white-space-collapse correctly, or to count the number of letter spaces
>

I would formulate the problem slightly differently:

The line breaking logic requires inspection of adjacent characters in 
the input, even if these characters are contained in different inline 
fo's, in the following cases:
a) To determine line break possibilities in accordance with the Unicode 
Annex UAX#14
b) To be able to apply the white-space-treatment property around FOP 
generated line breaks
c) To determine word boundaries, which are used
	i) to calculate the number of letter spaces in a word
	ii) to determine the actual words presented to the hyphenation 
algorithm
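
As an illustration of case c) only: the sketch below is not FOP code, and
every name in it (flatten, words, the nested tuple model of inline fo's) is
a hypothetical stand-in. It flattens nested inline content into one
character sequence and then finds words without regard to which inline fo
each character came from.

```python
# Illustrative sketch only, not FOP code: all names here are hypothetical.
from itertools import groupby


def flatten(node, owner="block"):
    """Yield (char, owner) pairs in document order.

    A node is either a str (plain text) or a (name, children) tuple
    standing in for an inline fo.
    """
    if isinstance(node, str):
        for ch in node:
            yield ch, owner
    else:
        name, children = node
        for child in children:
            yield from flatten(child, name)


def words(node):
    """Split the flattened text into words, ignoring inline boundaries."""
    result = []
    pairs = list(flatten(node))
    for is_space, run in groupby(pairs, key=lambda p: p[0].isspace()):
        if is_space:
            continue
        run = list(run)
        text = "".join(ch for ch, _ in run)
        owners = sorted({o for _, o in run})
        result.append((text, owners))
    return result


# "hyphenation" is split across two inline fo's but is still one word.
block = ("block", ["some ",
                   ("inline-1", ["hyphe"]),
                   ("inline-2", ["nation"]),
                   " here"])
```

Here words(block) reports "hyphenation" as a single word owned by both
inline-1 and inline-2, which is the unit that the letter-space count and
the hyphenation algorithm would need to see.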

> - the LineLM has to collect the "text" from its descendant nodes:
> non-textual objects should be taken into account too, as, for
> example, a leader between two spaces should prevent them from being
> collapsed; if spaces collapse only if they come from sibling nodes,
> this could maybe be handled during the collection by the
> InlineStackingLM

I have a slightly different view on the handling of spaces. We only need 
to be concerned about white-space-treatment around the line breaks we 
generate. Everything else has already been dealt with by the time the 
LMs are invoked. This in turn means, IMO, that we only need to know how 
"big" the glue element has to be, i.e. the element that is dropped if 
the Knuth algorithm actually generates a line break there. Determining 
the value for "big", however, means we need to consider adjacent spaces 
even if they are contained in different inline fo's.
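
A rough sketch of what determining "big" could look like, under loud
assumptions: this is not FOP code, the function name, the run model and the
widths are all made up, and it assumes that adjacent spaces are still
present when the LMs run (for example with white-space-collapse="false").

```python
# Illustrative sketch only, not FOP code: names and widths are hypothetical.
def glue_widths(runs):
    """Sum the widths of each maximal run of adjacent spaces.

    runs: list of (owner, text, space_width) in document order. Spaces
    that touch each other are merged even when they belong to different
    owners, i.e. different inline fo's.
    """
    widths = []
    current = 0
    in_spaces = False
    for _owner, text, space_width in runs:
        for ch in text:
            if ch == " ":
                current += space_width
                in_spaces = True
            elif in_spaces:
                widths.append(current)
                current = 0
                in_spaces = False
    if in_spaces:
        widths.append(current)
    return widths


# One space at the end of the block text, one at the start of an inline.
runs = [("block", "a ", 250), ("inline-1", " b", 300)]
```

With these made-up numbers the two spaces form a single glue of width 550,
although neither owner on its own could have known that.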

>
> - the LineLM should then mark spaces that must be removed because
> they are trailing / leading, glyphs that must be merged (but which LM
> will paint them if the characters come from different text nodes?)
> and find the breaking points according to the unicode rules

Glyphs are only allowed to be merged if they carry the same / matching 
set of property values. Personally I would not be concerned if we 
therefore limit that logic to within a LM. While it is possible that 
someone could write something like
<fo:block><fo:inline>a</fo:inline><fo:inline>&#x0308;</fo:inline>
and the a and &#x0308; could be combined into an &#x00e4; IMO this is a 
pretty degenerate case.
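
For reference, the combination in question is ordinary Unicode canonical
composition (NFC). The snippet below only demonstrates that fact with
Python's standard unicodedata module; it says nothing about how FOP does or
should implement glyph merging.

```python
import unicodedata

# 'a' followed by U+0308 COMBINING DIAERESIS: two code points.
decomposed = "a\u0308"

# NFC composes the pair into the single precomposed character U+00E4.
composed = unicodedata.normalize("NFC", decomposed)
```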

>
> - the LineLM should somehow pass the computed information to the
> descendant LMs, which would then create the correct elements in one
> go

Yes it could, but I am in two minds about whether this is the best 
approach, or whether the Line LM should create the Knuth sequences right 
away and store enough information in them so that the inline LMs can 
create the correct areas during the addAreas phase.

>
> - the resulting sequences would be ready for the breaking phase,
> without further analysis / checks / substitutions / changes
>
> The revised interface for inline LMs could then have (just a quick
> idea) a new appendText(StringBuffer) method and a modified version of
> getNextKnuthElements() having some extra parameter storing the
> information created by the LineLM; we should finally get rid of
> addALetterSpaceTo(), getWordChars(), hyphenate(), applyChanges() and
> getChangedKnuthElements().

Yes, we both seem to be looking for the same outcome. My (certainly not 
fully thought through) model was more along the lines of the iterator 
approach used by the fo's to iterate over their char sequences during 
refinement. However, the iterator should probably not just return a 
character but enough information for the Line LM to build the area info 
objects to attach to the Knuth elements, so that the add areas phase 
works correctly later.
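
To make the iterator idea a little more concrete, here is a minimal sketch.
It is not FOP code: CharInfo, char_iterator and area_infos are hypothetical
names, and the (owner, start, end) tuples merely stand in for whatever the
real area info objects would carry.

```python
# Illustrative sketch only, not FOP code: all names here are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class CharInfo:
    char: str
    owner: str   # hypothetical id of the LM / fo the character came from
    offset: int  # position of the character inside that owner's text


def char_iterator(fragments):
    """Iterate over (owner, text) fragments, yielding one CharInfo per char."""
    for owner, text in fragments:
        for offset, char in enumerate(text):
            yield CharInfo(char, owner, offset)


def area_infos(fragments):
    """Group consecutive non-space chars of one owner into (owner, start, end)."""
    infos = []
    for info in char_iterator(fragments):
        if info.char == " ":
            continue
        last = infos[-1] if infos else None
        if last and last[0] == info.owner and last[2] == info.offset:
            # Extend the current range by one character.
            infos[-1] = (last[0], last[1], info.offset + 1)
        else:
            infos.append((info.owner, info.offset, info.offset + 1))
    return infos


fragments = [("text-1", "ab c"), ("text-2", "d")]
```

Each resulting range names its owner, so in the add areas phase an owner
could be asked to create the area for exactly its own characters.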

>
> WDYT?

Very good discussion - my summary is:

a) We both seem to want the same outcome, that is, to add the required 
features and at the same time get rid of some of the workarounds 
currently used.

b) We both agree that the character-by-character analysis should be done 
at Line LM level.

c) Your initial thought is that the Line LM should then provide enough 
information to the LMs to generate their Knuth sequences while my 
initial thought is that the Line LM generates the Knuth sequences and 
provides enough information for the LMs to generate their areas.

If you agree with this summary, maybe we can concentrate on discussing 
the pros and cons of the two approaches mentioned in item c) above?

>
> Regards
>      Luca

Cheers

Manuel