Posted to fop-dev@xmlgraphics.apache.org by bu...@apache.org on 2006/02/04 14:47:22 UTC

DO NOT REPLY [Bug 38507] New: - Non-breaking space in PDF title output

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG-RELATED
COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=38507>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=38507

           Summary: Non-breaking space in PDF title output
           Product: Fop
           Version: 0.91
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: pdf
        AssignedTo: fop-dev@xmlgraphics.apache.org
        ReportedBy: paroz@email.ch


In French, the <quote> XML tag is converted to a « followed by a non-breaking
space. In body (<para>) text, the non-breaking space seems to be respected.
But I have this sequence in a chapter title (e.g. <chapter><title>Some
<quote>text</quote></title></chapter>). In this case, the PDF output of the
title does not respect the non-breaking space and breaks the line just after the "«".
I've checked the FO source produced by the XSL and it really does contain the
non-breaking space sequence.

FYI, FOP 0.20.5 did this correctly.

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Andreas L Delmelle <a_...@pandora.be>.
On Feb 7, 2006, at 01:07, Manuel Mall wrote:

> On Tuesday 07 February 2006 01:11, Andreas L Delmelle wrote:
>> On Feb 6, 2006, at 08:17, Manuel Mall wrote:
>>
>>> For starters, it is fairly difficult to get a CR out of an XML
>>> parser.
>>
>> Difficult? It's simply a characters event, just like any other...
>>
>
> From the XML spec:
>
> <quote>
> To simplify the tasks of applications, the XML processor MUST behave as
> if it normalized all line breaks in external parsed entities (including
> the document entity) on input, before parsing, by translating both the
> two-character sequence #xD #xA and any #xD that is not followed by #xA
> to a single #xA character.
> <quote/>
>
> To me this means unless you define an entity <!ENTITY cr "&#xD;" > and
> then later reference it as &cr; you never get a CR out of an XML parser
> (even on Windows).

You're right! Makes our job much, much simpler...

Cheers,

Andreas

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Manuel Mall <mm...@arcus.com.au>.
On Tuesday 07 February 2006 01:11, Andreas L Delmelle wrote:
> On Feb 6, 2006, at 08:17, Manuel Mall wrote:
> >> [ME:]
> >
> > <snip/>
> >
> >> A preserved carriage return can be treated the same way as a
> >> linefeed, under the very exceptional condition that it survives
> >> white-space handling:
> >>   * white-space-treatment="ignore-if-*"
> >>   * the CR does not follow/precede a linefeed
> >>   * it is the first character in a sequence of whitespace, so
> >>     it survives white-space-collapse
> >
> > Shouldn't a CR always survive whitespace handling?
>
> Not always:
> If white-space-treatment="preserve" then any XML whitespace other
> than a linefeed is converted into a normal space. IMO, the editors
> put it this way because of the possibility of Windows-specific
> line-endings, where a CR is followed by a linefeed.
>
> > For starters, it is fairly difficult to get a CR out of an XML
> > parser.
>
> Difficult? It's simply a characters event, just like any other...
>

From the XML spec:

<quote>
S (white space) consists of one or more space (#x20) characters, 
carriage returns, line feeds, or tabs.
White Space
[3]   	S	   ::=   	(#x20 | #x9 | #xD | #xA)+

Note:

The presence of #xD in the above production is maintained purely for 
backward compatibility with the First Edition. As explained in 2.11 
End-of-Line Handling, all #xD characters literally present in an XML 
document are either removed or replaced by #xA characters before any 
other processing is done. The only way to get a #xD character to match 
this production is to use a character reference in an entity value 
literal.

...

2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for 
editing convenience, are organized into lines. These lines are 
typically separated by some combination of the characters CARRIAGE 
RETURN (#xD) and LINE FEED (#xA).

To simplify the tasks of applications, the XML processor MUST behave as 
if it normalized all line breaks in external parsed entities (including 
the document entity) on input, before parsing, by translating both the 
two-character sequence #xD #xA and any #xD that is not followed by #xA 
to a single #xA character.
<quote/>

To me this means unless you define an entity <!ENTITY cr "&#xD;" > and
then later reference it as &cr; you never get a CR out of an XML parser 
(even on Windows).
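
A quick way to check this behaviour is sketched below. It uses Python's
standard-library parser (which sits on top of Expat) rather than the Java
parsers FOP runs on, so treat it as an illustration of the rule quoted
above, not as a statement about any particular FOP setup. The element name
"block" and the helper name parse_text are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_text(doc: bytes) -> str:
    """Return the character data of the root element as the parser reports it."""
    return ET.fromstring(doc).text


# A literal CR LF pair, a literal lone CR, and one CR written as a
# character reference (&#xD;).
doc = b"<block>one\r\ntwo\rthree&#xD;four</block>"

# repr() makes the surviving control characters visible.
print(repr(parse_text(doc)))
```

Both literal line breaks should come back as #xA; only the CR that was
written as a character reference reaches the application as #xD.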

>
> Cheers,
>
> Andreas

Regards

Manuel

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Andreas L Delmelle <a_...@pandora.be>.
On Feb 6, 2006, at 18:11, Andreas L Delmelle wrote:

> A carriage-return can survive white-space-handling, for instance,  
> in the following case (suppose Mac-encoding):
>
> <fo:block>
>   First line, then a CR&#x0D;     some spaces, and more text
> </fo:block>

Cool! I just realized that this would be one way to preserve
'linefeeds' (= carriage-returns) without having to specify
linefeed-treatment="preserve". :-)

Cheers,

Andreas


Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Andreas L Delmelle <a_...@pandora.be>.
On Feb 6, 2006, at 08:17, Manuel Mall wrote:

>> [ME:]
> <snip/>
>> A preserved carriage return can be treated the same way as a
>> linefeed, under the very exceptional condition that it survives
>> white-space handling:
>>   * white-space-treatment="ignore-if-*"
>>   * the CR does not follow/precede a linefeed
>>   * it is the first character in a sequence of whitespace, so
>>     it survives white-space-collapse
>>
>
> Shouldn't a CR always survive whitespace handling?

Not always:
If white-space-treatment="preserve" then any XML whitespace other
than a linefeed is converted into a normal space. IMO, the editors
put it this way because of the possibility of Windows-specific
line-endings, where a CR is followed by a linefeed.

> For starters, it is fairly difficult to get a CR out of an XML parser.

Difficult? It's simply a characters event, just like any other...

> Only if the CR is hidden in an entity reference can it survive.
> Also, as Simon pointed out in some other contribution, whitespace
> handling is designed to deal with pretty printing and readable XML
> layout introduced whitespace. A CR preserved by the XML parser
> certainly does not fall into that category.

Oh yes it does... Remember that not all our users are
unix/linux-based, which means for Windows users, you're likely to get
the sequence '&#x0D;&#x0A;' as line-terminator, while Mac users saving
a source file with native line-endings will simply get a '&#x0D;'.
(UTF-8 encoding is recommended, but not enforced... An XML file can
be any encoding the parser supports on top of the UTF-8 minimum.)

A carriage-return can survive white-space-handling, for instance, in  
the following case (suppose Mac-encoding):

<fo:block>
   First line, then a CR&#x0D;     some spaces, and more text
</fo:block>

The CR (which isn't necessarily a Numerical Character Reference, but
could be just the byte '0D') is not converted into a space
(white-space-treatment="ignore-if-surrounding-linefeed").
It does not precede or follow a linefeed.
It is the first character in a sequence of whitespace, so no matter
what the value of white-space-collapse, it will survive...

> I am also not aware that the XSL-FO spec mentions CR as falling
> under whitespace. IMO for whitespace handling CR is just a
> non-whitespace character.

Nope, it does fall into the category of XML whitespace. There are  
exactly four of those: &#x09; (tab), &#x0A; (linefeed), &#x0D;  
(carriage-return) and &#x20; (space). If you don't believe me, it's  
indeed not in the XSL-FO Rec, but you might want to check the XML  
Recommendation...

> So, we only need to consider what fop layout should do if it
> encounters a CR. I would say, keep it simple, throw it away and log
> a warning.
>
>> Now, what about a tab character under the same circumstances? Do we
>> use an elastic width of X spaces optimum, where X is purely
>> conventional?
>>
>
> Similar considerations as for CR apply to TAB.

...

Cheers,

Andreas

Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Manuel Mall <mm...@arcus.com.au>.
> On Feb 5, 2006, at 14:13, bugzilla@apache.org wrote:
>
> Hi Manuel,
>
>> ------- Additional Comments From manuel@apache.org  2006-02-05
>> 14:13 -------
<snip/>
> A preserved carriage return can be treated the same way as a
> linefeed, under the very exceptional condition that it survives
> white-space handling:
>   * white-space-treatment="ignore-if-*"
>   * the CR does not follow/precede a linefeed
>   * it is the first character in a sequence of whitespace, so
>     it survives white-space-collapse
>

Shouldn't a CR always survive whitespace handling? For starters, it is
fairly difficult to get a CR out of an XML parser. Only if the CR is hidden
in an entity reference can it survive. Also, as Simon pointed out in some
other contribution, whitespace handling is designed to deal with whitespace
introduced by pretty printing and readable XML layout. A CR preserved by
the XML parser certainly does not fall into that category. I am also not
aware that the XSL-FO spec mentions CR as falling under whitespace. IMO
for whitespace handling CR is just a non-whitespace character.

So, we only need to consider what fop layout should do if it encounters a
CR. I would say, keep it simple, throw it away and log a warning.

> Now, what about a tab character under the same circumstances? Do we
> use an elastic width of X spaces optimum, where X is purely
> conventional?
>

Similar considerations as for CR apply to TAB.

Anyway, neither CR nor TAB has much to do with the problem at hand: NBSP
not being handled correctly.

<snip/>
>> The non-breaking sequences are probably very simple:
>>
>> 1. Justified text: pen INF + elastic glue
>> 2. All other justification modes: either just a box of the width of
>> the space or pen INF + fixed width glue.
>>
>> Curious what Luca and others think. Are the above two cases OK for
>> NBSP or have I oversimplified and missed something? That is, for the
>> text-align values other than "justify" ("start", "center", "end"),
>> is it enough to just reserve a fixed width for the NBSP?
>
> Still depends on text-align-last, no?

Yes, correct, but even then, do the two rules above suffice, i.e.
justification possibly required: Rule 1; no justification required: Rule 2?

> BTW, is this not one of those situations where it's possible that the
> used font contains a glyph for the NBSP character, so we should check
> that as well?

Yes, but again it has very little to do with the problem. If the font has a
glyph for NBSP we should use that glyph's width and not the SP width in the
glue elements generated. That's all.
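
As a minimal sketch of that width lookup (Python purely for illustration;
the dictionary of glyph widths and the function name nbsp_glue_width are
assumptions made up for this example, not FOP's font API):

```python
NBSP = "\u00a0"
SPACE = " "


def nbsp_glue_width(glyph_widths: dict) -> int:
    """Width to give the NBSP glue: the font's own NBSP glyph width if it
    has one, otherwise the width of the ordinary space glyph."""
    return glyph_widths.get(NBSP, glyph_widths[SPACE])


# Hypothetical fonts, widths in arbitrary units.
font_with_nbsp = {SPACE: 278, NBSP: 300}
font_without_nbsp = {SPACE: 278}

print(nbsp_glue_width(font_with_nbsp))
print(nbsp_glue_width(font_without_nbsp))
```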

>
> Cheers,
>
> Andreas
>

Manuel


Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by Andreas L Delmelle <a_...@pandora.be>.
On Feb 5, 2006, at 14:13, bugzilla@apache.org wrote:

Hi Manuel,

> ------- Additional Comments From manuel@apache.org  2006-02-05 14:13 -------
> Jeremias, no that is not it IMO. Knuth doesn't break between elements
> as such. The glue or penalty element itself is the break opportunity
> and is discarded when used as a break. Therefore, IMO we are not
> breaking before or after a space or NBSP but at the space/NBSP.

OK, IIC you're directing this at the wrong person... The last  
question was mine. :-)

> The problem is the coding model used for Knuth element generation for
> spaces is flawed. What is done is that the only difference between
> normal space and NBSP is an infinite penalty at the beginning of the
> sequence.

Yep. There are a few other gaps in that coding model that I'm currently
looking at. See my most recent commit, and the change to the white-space
Wiki. Created some nasty side-effects in exotic situations... currently
under investigation.
A preserved carriage return can be treated the same way as a
linefeed, under the very exceptional condition that it survives
white-space handling:
  * white-space-treatment="ignore-if-*"
  * the CR does not follow/precede a linefeed
  * it is the first character in a sequence of whitespace, so
    it survives white-space-collapse

Now, what about a tab character under the same circumstances? Do we  
use an elastic width of X spaces optimum, where X is purely  
conventional?

> However, some sequences are pretty long and involve multiple pen-glue
> combinations and therefore break opportunities further into the
> sequence. We probably need to separate this more cleanly. Have one
> function for non-breaking elastic elements (e.g. NBSP) and one
> function for breaking elastic elements (e.g. SPACE). The non-breaking
> sequences are probably very simple:
>
> 1. Justified text: pen INF + elastic glue
> 2. All other justification modes: either just a box of the width of
> the space or pen INF + fixed width glue.
>
> Curious what Luca and others think. Are the above two cases OK for
> NBSP or have I oversimplified and missed something? That is, for the
> text-align values other than "justify" ("start", "center", "end"),
> is it enough to just reserve a fixed width for the NBSP?

Still depends on text-align-last, no?
BTW, is this not one of those situations where it's possible that the  
used font contains a glyph for the NBSP character, so we should check  
that as well?

Cheers,

Andreas


DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by bu...@apache.org.

------- Additional Comments From manuel@apache.org  2006-02-05 14:13 -------
Jeremias, no that is not it IMO. Knuth doesn't break between elements as such. 
The glue or penalty element itself is the break opportunity and is discarded 
when used as a break. Therefore, IMO we are not breaking before or after a 
space or NBSP but at the space/NBSP.

The problem is the coding model used for Knuth element generation for
spaces is flawed. What is done is that the only difference between normal space
and NBSP is an infinite penalty at the beginning of the sequence. However, some
sequences are pretty long and involve multiple pen-glue combinations and
therefore break opportunities further into the sequence. We probably need to
separate this more cleanly. Have one function for non-breaking elastic elements
(e.g. NBSP) and one function for breaking elastic elements (e.g. SPACE). The
non-breaking sequences are probably very simple:

1. Justified text: pen INF + elastic glue
2. All other justification modes: either just a box of the width of the space
or pen INF + fixed width glue.

Curious what Luca and others think. Are the above two cases OK for NBSP or have
I oversimplified and missed something? That is, for the text-align values other
than "justify" ("start", "center", "end"), is it enough to just reserve
a fixed width for the NBSP?
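
The difference can be sketched with a toy model of Knuth elements. This is
Python for illustration only and is not taken from FOP's source: the class
names, the INF value of 1000, the widths, and the centered_space sequence
(the textbook Knuth-Plass sequence for a space in centred text) are all
assumptions. The break rule used is the standard one: a break is legal at a
penalty below INF, or at a glue that directly follows a box.

```python
from dataclasses import dataclass

INF = 1000  # assumed value for an "infinite" penalty


@dataclass
class Box:
    width: int


@dataclass
class Glue:
    width: int
    stretch: int = 0
    shrink: int = 0


@dataclass
class Penalty:
    value: int


def legal_breaks(elements):
    """Indices at which the Knuth-Plass rules allow a line break."""
    found = []
    for i, el in enumerate(elements):
        if isinstance(el, Penalty) and el.value < INF:
            found.append(i)
        elif isinstance(el, Glue) and i > 0 and isinstance(elements[i - 1], Box):
            found.append(i)
    return found


def nbsp_justified(width, stretch, shrink):
    # Rule 1: pen INF + elastic glue
    return [Penalty(INF), Glue(width, stretch, shrink)]


def nbsp_fixed(width):
    # Rule 2: pen INF + fixed width glue
    return [Penalty(INF), Glue(width)]


def centered_space(width, slack):
    # Hypothetical multi-element sequence for an ordinary space in
    # centred text; it contains its own break opportunity, Penalty(0).
    return [
        Glue(0, stretch=slack),
        Penalty(0),
        Glue(width, stretch=-2 * slack),
        Box(0),
        Penalty(INF),
        Glue(0, stretch=slack),
    ]


word = Box(3000)

# Current approach: ordinary-space sequence prefixed with pen INF.
flawed = [word, Penalty(INF)] + centered_space(500, 1000) + [word]
# Proposed approach: a dedicated non-breaking sequence.
fixed = [word] + nbsp_fixed(500) + [word]

print(legal_breaks(flawed))  # the inner Penalty(0) is still a legal break
print(legal_breaks(fixed))   # no legal break anywhere in the sequence
```

In the prefixed sequence the leading penalty only protects the first glue;
the zero penalty further in remains a legal break, which matches the
symptom in the report.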

DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by bu...@apache.org.

------- Additional Comments From paroz@email.ch  2006-02-04 14:51 -------
Created an attachment (id=17585)
 --> (http://issues.apache.org/bugzilla/attachment.cgi?id=17585&action=view)
fo source for reproducing bug


DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by bu...@apache.org.

------- Additional Comments From jeremias@apache.org  2006-02-04 18:39 -------
First step towards fixing the problem is adding a test case:
http://svn.apache.org/viewcvs?rev=374892&view=rev

If anyone is interested I've got a quick fix / hack for this. It doesn't fix the
test case as it should but it fixes the symptom in FOP Trunk. But better do it
right. That's why I didn't commit the fix.

DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by bu...@apache.org.

lfurini@cs.unibo.it changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Additional Comments From lfurini@cs.unibo.it  2006-02-08 09:19 -------
Fixed in revision 375585 (http://svn.apache.org/viewcvs?rev=375585&view=rev).

Thank you for finding and signalling this bug!

DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by bu...@apache.org.

------- Additional Comments From manuel@apache.org  2006-02-04 16:26 -------
Claude,

thank you for the problem report. I had a quick look at it and can confirm that
it appears to be a bug in FOP 0.91.

Technical background: The Knuth sequences generated for a NBSP do not actually
prevent (in some cases at least) the line from being broken at the NBSP. In
TextLayoutManager.createElementsForASpace when a NBSP is encountered a normal
sequence as for an ordinary space is generated and simply prefixed with an
infinite penalty. This is not enough to prevent breaks within some of the
sequences. However, it needs some more thought on how best to fix it.

DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

Posted by bu...@apache.org.

------- Additional Comments From a_l.delmelle@pandora.be  2006-02-05 12:59 -------
ad comment #2:
"... when a NBSP is encountered a normal sequence as for an ordinary space is
generated and simply prefixed with an infinite penalty."

So, does this 'prefixed' mean a NBSP currently only prevents breaking _before_
the space (breaking after it --or before the next glue/box-- could still be
considered 'possible/desirable')? If so, the effect we need to achieve would be
something analogous to having:
<fo:character character="&#xA0;" keep-with-next.within-line="always" />


