Posted to fop-dev@xmlgraphics.apache.org by Manuel Mall <mm...@arcus.com.au> on 2006/02/06 08:17:11 UTC

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

> On Feb 5, 2006, at 14:13, bugzilla@apache.org wrote:
>
> Hi Manuel,
>
>> ------- Additional Comments From manuel@apache.org  2006-02-05
>> 14:13 -------
<snip/>
> A preserved carriage return can be treated the same way as a
> linefeed, under the very exceptional condition that it survives
> white-space handling:
>   * white-space-treatment="ignore-if-*"
>   * the CR does not follow/precede a linefeed
>   * it is the first character in a sequence of whitespace, so
>     it survives white-space-collapse
>

Shouldn't a CR always survive whitespace handling? For starters, it is
fairly difficult to get a CR out of an XML parser. Only if the CR is hidden
in an entity reference can it survive. Also, as Simon pointed out in some
other contribution, whitespace handling is designed to deal with whitespace
introduced by pretty printing and readable XML layout. A CR preserved by
the XML parser certainly does not fall into that category. I am also not
aware that the XSL-FO spec mentions CR as falling under whitespace. IMO
for whitespace handling CR is just a non-whitespace character.

So, we only need to consider what fop layout should do if it encounters a
CR. I would say, keep it simple, throw it away and log a warning.

> Now, what about a tab character under the same circumstances? Do we
> use an elastic width of X spaces optimum, where X is purely
> conventional?
>

Similar considerations as for CR apply to TAB.

Anyway, both CR and TAB have little to do with the problem at hand: NBSP
is not handled correctly.

<snip/>
>> The non breaking sequences are probably very simple:
>>
>> 1. Justified text: pen INF + elastic glue
>> 2. All other justification modes: either just a box of the width of
>> the space, or pen INF + fixed width glue.
>>
>> Curious what Luca and others think. Are the above two cases OK for
>> NBSP, or have I oversimplified and missed something? That is, for the
>> text-align values other than "justify" ("start", "center", "end"),
>> is it enough to just reserve a fixed width for the NBSP?
>
> Still depends on text-align-last, no?

Yes, correct, but even then, do the two rules above suffice? That is:
justification possibly required, Rule 1; no justification required, Rule 2.

> BTW, is this not one of those situations where it's possible that the
> used font contains a glyph for the NBSP character, so we should check
> that as well?

Yes, but again it has very little to do with the problem. If the font has a
glyph for NBSP, we should use that glyph's width and not the SP width in the
glue elements generated. That's all.
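
The two rules and the glyph-width remark can be put into a small sketch.
This is not FOP code: the class names, the INFINITE value, the millipoint
widths, and the stretch/shrink figures are all assumptions made for
illustration, and Rule 2 is shown in its "pen INF + fixed width glue" form.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

INFINITE = 1000  # assumed "never break here" penalty value


@dataclass(frozen=True)
class Box:
    width: int


@dataclass(frozen=True)
class Glue:
    width: int
    stretch: int
    shrink: int


@dataclass(frozen=True)
class Penalty:
    value: int


Element = Union[Box, Glue, Penalty]


def nbsp_width(glyph_widths: Dict[str, int]) -> int:
    """Use the font's NBSP glyph width if it has one, else the SP width."""
    return glyph_widths.get("\u00a0", glyph_widths[" "])


def may_justify(text_align: str, text_align_last: str) -> bool:
    """Justification is possible if either property asks for it."""
    return "justify" in (text_align, text_align_last)


def nbsp_elements(width: int, justify: bool,
                  stretch: int = 0, shrink: int = 0) -> List[Element]:
    """Return the non-breaking element sequence for one NBSP."""
    if justify:
        # Rule 1: infinite penalty forbids the break, glue stays elastic.
        return [Penalty(INFINITE), Glue(width, stretch, shrink)]
    # Rule 2: infinite penalty followed by fixed-width glue.
    return [Penalty(INFINITE), Glue(width, 0, 0)]
```

For example, nbsp_elements(nbsp_width(widths), may_justify("start",
"justify"), 1390, 927) yields the elastic form, because text-align-last
still calls for justification.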

>
> Cheers,
>
> Andreas
>

Manuel

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Andreas L Delmelle <a_...@pandora.be>.

On Feb 7, 2006, at 01:07, Manuel Mall wrote:

> On Tuesday 07 February 2006 01:11, Andreas L Delmelle wrote:
>> On Feb 6, 2006, at 08:17, Manuel Mall wrote:
>>
>>> For starters, it is fairly difficult to get a CR out of an XML
>>> parser.
>>
>> Difficult? It's simply a characters event, just like any other...
>>
>
> From the XML spec:
>
> <quote>
> To simplify the tasks of applications, the XML processor MUST behave as
> if it normalized all line breaks in external parsed entities (including
> the document entity) on input, before parsing, by translating both the
> two-character sequence #xD #xA and any #xD that is not followed by #xA
> to a single #xA character.
> <quote/>
>
> To me this means that unless you define an entity
> <!ENTITY cr "&#xD;" > and then later reference it as &cr;, you never
> get a CR out of an XML parser (even on Windows).

You're right! Makes our job much, much simpler...

Cheers,

Andreas

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Manuel Mall <mm...@arcus.com.au>.

On Tuesday 07 February 2006 01:11, Andreas L Delmelle wrote:
> On Feb 6, 2006, at 08:17, Manuel Mall wrote:
> >> [ME:]
> >
> > <snip/>
> >
> >> A preserved carriage return can be treated the same way as a
> >> linefeed, under the very exceptional condition that it survives
> >> white-space handling:
> >>   * white-space-treatment="ignore-if-*"
> >>   * the CR does not follow/precede a linefeed
> >>   * it is the first character in a sequence of whitespace, so
> >>     it survives white-space-collapse
> >
> > Shouldn't a CR always survive whitespace handling?
>
> Not always:
> If white-space-treatment="preserve" then any XML whitespace other
> than a linefeed is converted into a normal space. IMO, the editors
> put it this way because of the possibility of Windows-specific
> line-endings, where a CR is followed by a linefeed.
>
> > For starters, it is fairly difficult to get a CR out of an XML
> > parser.
>
> Difficult? It's simply a characters event, just like any other...
>

From the XML spec:

<quote>
S (white space) consists of one or more space (#x20) characters, 
carriage returns, line feeds, or tabs.
White Space
[3]   S ::= (#x20 | #x9 | #xD | #xA)+

Note:

The presence of #xD in the above production is maintained purely for 
backward compatibility with the First Edition. As explained in 2.11 
End-of-Line Handling, all #xD characters literally present in an XML 
document are either removed or replaced by #xA characters before any 
other processing is done. The only way to get a #xD character to match 
this production is to use a character reference in an entity value 
literal.

...

2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for 
editing convenience, are organized into lines. These lines are 
typically separated by some combination of the characters CARRIAGE 
RETURN (#xD) and LINE FEED (#xA).

To simplify the tasks of applications, the XML processor MUST behave as 
if it normalized all line breaks in external parsed entities (including 
the document entity) on input, before parsing, by translating both the 
two-character sequence #xD #xA and any #xD that is not followed by #xA 
to a single #xA character.
<quote/>

To me this means that unless you define an entity <!ENTITY cr "&#xD;" >
and then later reference it as &cr;, you never get a CR out of an XML
parser (even on Windows).
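
The normalization quoted above can be observed with any conforming parser.
Here is a minimal sketch using Python's standard xml.etree.ElementTree
(an arbitrary choice for illustration, not the parser FOP runs on). It
writes the CR as a character reference directly in the content, as in
Andreas's fo:block example, rather than going through a declared entity;
the element name and the helper function are made up.

```python
import xml.etree.ElementTree as ET


def parsed_text(xml_source: str) -> str:
    """Parse a tiny document and return the root element's character data."""
    return ET.fromstring(xml_source).text


# Literal CR LF and a lone literal CR are both line breaks in the source,
# so the parser hands the application a single LF for each of them.
literal = parsed_text("<block>one\r\ntwo\rthree</block>")

# A CR written as a character reference is not a literal line break,
# so it reaches the application as a real CR.
referenced = parsed_text("<block>one&#xD;two</block>")

print(repr(literal))
print(repr(referenced))
```

So a CR can only reach FOP's whitespace handling when the author went out
of their way to put one there.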

>
> Cheers,
>
> Andreas

Regards

Manuel

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Andreas L Delmelle <a_...@pandora.be>.

On Feb 6, 2006, at 18:11, Andreas L Delmelle wrote:

> A carriage-return can survive white-space-handling, for instance,  
> in the following case (suppose Mac-encoding):
>
> <fo:block>
>   First line, then a CR&#x0D;     some spaces, and more text
> </fo:block>

Cool! I just realized that this would be one way to preserve
'linefeeds' (= carriage-returns) without having to specify
linefeed-treatment="preserve". :-)

Cheers,

Andreas

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Andreas L Delmelle <a_...@pandora.be>.

On Feb 6, 2006, at 08:17, Manuel Mall wrote:

>> [ME:]
> <snip/>
>> A preserved carriage return can be treated the same way as a
>> linefeed, under the very exceptional condition that it survives
>> white-space handling:
>>   * white-space-treatment="ignore-if-*"
>>   * the CR does not follow/precede a linefeed
>>   * it is the first character in a sequence of whitespace, so
>>     it survives white-space-collapse
>>
>
> Shouldn't a CR always survive whitespace handling?

Not always:
If white-space-treatment="preserve" then any XML whitespace other
than a linefeed is converted into a normal space. IMO, the editors
put it this way because of the possibility of Windows-specific
line-endings, where a CR is followed by a linefeed.

> For starters, it is fairly difficult to get a CR out of an XML parser.

Difficult? It's simply a characters event, just like any other...

> Only if the CR is hidden in an entity reference can it survive.
> Also, as Simon pointed out in some other contribution, whitespace
> handling is designed to deal with whitespace introduced by pretty
> printing and readable XML layout. A CR preserved by the XML parser
> certainly does not fall into that category.

Oh yes it does... Remember that not all our users are unix/linux-based,
which means for Windows users, you're likely to get the sequence
'&#x0D;&#x0A;' as line-terminator, while Mac-users saving a source file
with native line-endings will simply get a '&#x0D;'. (UTF-8 encoding is
recommended, but not enforced... An XML file can be any encoding the
parser supports on top of the UTF-8 minimum.)

A carriage-return can survive white-space-handling, for instance, in  
the following case (suppose Mac-encoding):

<fo:block>
   First line, then a CR&#x0D;     some spaces, and more text
</fo:block>

The CR (which isn't necessarily a Numerical Character Reference, but
could be just the byte '0D') is not converted into a space
(white-space-treatment="ignore-if-surrounding-linefeed").
It does not precede or follow a linefeed.
It is the first character in a sequence of whitespace, so no matter
what the value of white-space-collapse, it will survive...
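
The three conditions can be written down as a toy check. This models only
the rules as stated in this thread, not the exact wording of the XSL-FO
Recommendation, and the function name and string-based representation are
invented for illustration.

```python
XML_WHITESPACE = {"\t", "\n", "\r", " "}


def cr_survives(text: str, index: int,
                white_space_treatment: str = "ignore-if-surrounding-linefeed") -> bool:
    """True only if the CR at text[index] is guaranteed to survive,
    whatever the value of white-space-collapse (thread's three conditions)."""
    if text[index] != "\r":
        raise ValueError("index must point at a carriage return")
    before = text[index - 1] if index > 0 else ""
    after = text[index + 1] if index + 1 < len(text) else ""
    # 1. Per the thread, "preserve" turns the CR into a normal space.
    if not white_space_treatment.startswith("ignore-if-"):
        return False
    # 2. The CR must not follow or precede a linefeed.
    if before == "\n" or after == "\n":
        return False
    # 3. It must be the first character of its whitespace run,
    #    otherwise white-space-collapse may drop it.
    if before in XML_WHITESPACE:
        return False
    return True


# The fo:block content above, with &#x0D; already resolved to a real CR.
sample = "First line, then a CR\r     some spaces, and more text"
print(cr_survives(sample, sample.index("\r")))
```

With the sample text the check passes; a CR that sits next to a linefeed,
or that follows a space, fails it.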

> I am also not aware that the XSL-FO spec mentions CR as falling under
> whitespace. IMO for whitespace handling CR is just a non-whitespace
> character.

Nope, it does fall into the category of XML whitespace. There are  
exactly four of those: &#x09; (tab), &#x0A; (linefeed), &#x0D;  
(carriage-return) and &#x20; (space). If you don't believe me, it's  
indeed not in the XSL-FO Rec, but you might want to check the XML  
Recommendation...

> So, we only need to consider what fop layout should do if it
> encounters a CR. I would say, keep it simple, throw it away and log a
> warning.
>
>> Now, what about a tab character under the same circumstances? Do we
>> use an elastic width of X spaces optimum, where X is purely
>> conventional?
>>
>
> Similar considerations as for CR apply to TAB.

...

Cheers,

Andreas