Posted to fop-dev@xmlgraphics.apache.org by bu...@apache.org on 2006/02/04 14:47:22 UTC
DO NOT REPLY [Bug 38507] New: - Non-breaking space in PDF title output
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG-RELATED
COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=38507>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
http://issues.apache.org/bugzilla/show_bug.cgi?id=38507
Summary: Non-breaking space in PDF title output
Product: Fop
Version: 0.91
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: pdf
AssignedTo: fop-dev@xmlgraphics.apache.org
ReportedBy: paroz@email.ch
The <quote> XML tag is converted in French to a « followed by a non-breaking
space. In the body (<para>) text, the non-breaking space seems to be respected.
But I have this sequence in a chapter title (e.g. <chapter><title>Some
<quote>text</quote></title></chapter>). In this case, the PDF output of the
title did not respect the non-breaking space and broke the line just after the «.
I've checked the FO source produced by XSL and it really does contain the
non-breaking space sequence.
FYI, FOP 0.20.5 did this correctly.
--
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by Andreas L Delmelle <a_...@pandora.be>.
On Feb 7, 2006, at 01:07, Manuel Mall wrote:
> On Tuesday 07 February 2006 01:11, Andreas L Delmelle wrote:
>> On Feb 6, 2006, at 08:17, Manuel Mall wrote:
>>
>>> For starters it is fairly difficult to get a CR out of an XML
>>> parser.
>>
>> Difficult? It's simply a characters event, just like any other...
>>
>
> From the XML spec:
>
> <quote>
> To simplify the tasks of applications, the XML processor MUST behave as
> if it normalized all line breaks in external parsed entities (including
> the document entity) on input, before parsing, by translating both the
> two-character sequence #xD #xA and any #xD that is not followed by #xA
> to a single #xA character.
> <quote/>
>
> To me this means unless you define an entity <!ENTITY cr "&#xD;"> and
> then later reference it as &cr; you never get a CR out of an XML
> parser (even on Windows).
You're right! Makes our job much, much simpler...
Cheers,
Andreas
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by Manuel Mall <mm...@arcus.com.au>.
On Tuesday 07 February 2006 01:11, Andreas L Delmelle wrote:
> On Feb 6, 2006, at 08:17, Manuel Mall wrote:
> >> [ME:]
> >
> > <snip/>
> >
> >> A preserved carriage return can be treated the same way as a
> >> linefeed, under the very exceptional condition that it survives
> >> white-
> >> space handling:
> >> * white-space-treatment="ignore-if-*"
> >> * the CR does not follow/precede a linefeed
> >> * it is the first character in a sequence of whitespace, so
> >> it survives white-space-collapse
> >
> > Shouldn't a CR always survive whitespace handling?
>
> Not always:
> If white-space-treatment="preserve" then any XML whitespace other
> than a linefeed is converted into a normal space. IMO, the editors
> put it this way because of the possibility of Windows-specific
> line-endings, where a CR is followed by a linefeed.
>
> > For starters it is fairly difficult to get a CR out of an XML
> > parser.
>
> Difficult? It's simply a characters event, just like any other...
>
From the XML spec:
<quote>
S (white space) consists of one or more space (#x20) characters,
carriage returns, line feeds, or tabs.
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA)+
Note:
The presence of #xD in the above production is maintained purely for
backward compatibility with the First Edition. As explained in 2.11
End-of-Line Handling, all #xD characters literally present in an XML
document are either removed or replaced by #xA characters before any
other processing is done. The only way to get a #xD character to match
this production is to use a character reference in an entity value
literal.
...
2.11 End-of-Line Handling
XML parsed entities are often stored in computer files which, for
editing convenience, are organized into lines. These lines are
typically separated by some combination of the characters CARRIAGE
RETURN (#xD) and LINE FEED (#xA).
To simplify the tasks of applications, the XML processor MUST behave as
if it normalized all line breaks in external parsed entities (including
the document entity) on input, before parsing, by translating both the
two-character sequence #xD #xA and any #xD that is not followed by #xA
to a single #xA character.
<quote/>
To me this means unless you define an entity <!ENTITY cr "&#xD;"> and
then later reference it as &cr; you never get a CR out of an XML parser
(even on Windows).
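
The normalization rule quoted above is easy to check with any conforming parser. The following is a minimal sketch using Python's standard-library ElementTree (an assumption made purely for illustration; FOP itself sits on a Java SAX parser). The helper name parsed_text is made up for this example. It feeds the parser a CR LF pair, a lone CR, and a CR written as a character reference directly in the content:

```python
import xml.etree.ElementTree as ET


def parsed_text(xml_source: bytes) -> str:
    """Return the character data of the root element as the parser reports it."""
    return ET.fromstring(xml_source).text


# Literal line endings in the input are normalized to a single LF.
windows = parsed_text(b"<t>a\r\nb</t>")  # CR LF pair in the file
old_mac = parsed_text(b"<t>a\rb</t>")    # lone CR in the file

# A character reference is resolved after normalization, so the CR survives.
char_ref = parsed_text(b"<t>a&#xD;b</t>")

for label, value in [("CR LF", windows), ("CR", old_mac), ("&#xD;", char_ref)]:
    print(f"{label:6} -> {value!r}")
```

Only the third case hands a real carriage return to the application, which is why a CR reaching layout can be treated as a rare, deliberate input.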
>
> Cheers,
>
> Andreas
Regards
Manuel
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by Andreas L Delmelle <a_...@pandora.be>.
On Feb 6, 2006, at 18:11, Andreas L Delmelle wrote:
> A carriage-return can survive white-space-handling, for instance,
> in the following case (suppose Mac-encoding):
>
> <fo:block>
> First line, then a CR&#xD; some spaces, and more text
> </fo:block>
Cool! I just realized that this would be one way to preserve
'linefeeds' (= carriage-returns) without having to specify linefeed-
treatment="preserve". :-)
Cheers,
Andreas
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by Andreas L Delmelle <a_...@pandora.be>.
On Feb 6, 2006, at 08:17, Manuel Mall wrote:
>> [ME:]
> <snip/>
>> A preserved carriage return can be treated the same way as a
>> linefeed, under the very exceptional condition that it survives
>> white-
>> space handling:
>> * white-space-treatment="ignore-if-*"
>> * the CR does not follow/precede a linefeed
>> * it is the first character in a sequence of whitespace, so
>> it survives white-space-collapse
>>
>
> Shouldn't a CR always survive whitespace handling?
Not always:
If white-space-treatment="preserve" then any XML whitespace other
than a linefeed is converted into a normal space. IMO, the editors
put it this way because of the possibility of Windows-specific
line-endings, where a CR is followed by a linefeed.
> For starters it is fairly difficult to get a CR out of an XML parser.
Difficult? It's simply a characters event, just like any other...
> Only if the CR is hidden in an entity reference can it survive.
> Also, as Simon pointed out in some other contribution, whitespace
> handling is designed to deal with whitespace introduced by pretty
> printing and readable XML layout. A CR preserved by the XML parser
> certainly does not fall into that category.
Oh yes it does... Remember that not all our users are unix/linux-based,
which means for Windows users, you're likely to get the sequence
'&#xD;&#xA;' as line-terminator, while Mac-users saving a source file
with native line-endings will simply get a '&#xD;'.
(UTF-8 encoding is recommended, but not enforced... An XML file can
be any encoding the parser supports on top of the UTF-8 minimum.)
A carriage-return can survive white-space-handling, for instance, in
the following case (suppose Mac-encoding):
<fo:block>
First line, then a CR&#xD; some spaces, and more text
</fo:block>
The CR (which isn't necessarily a Numerical Character Reference, but
could be just the byte '0D') is not converted into a space (white-
space-treatment="ignore-if-surrounding-linefeed").
It does not precede or follow a linefeed.
It is the first character in a sequence of whitespace, so no matter
what the value of white-space-collapse, it will survive...
> I am also not aware that the XSL-FO spec mentions CR as falling
> under whitespace. IMO for whitespace handling CR is just a
> non-whitespace character.
Nope, it does fall into the category of XML whitespace. There are
exactly four of those: &#x9; (tab), &#xA; (linefeed), &#xD;
(carriage-return) and &#x20; (space). If you don't believe me, it's
indeed not in the XSL-FO Rec, but you might want to check the XML
Recommendation...
> So, we only need to consider what fop layout should do if it
> encounters a
> CR. I would say, keep it simple, throw it away and log a warning.
>
>> Now, what about a tab character under the same circumstances? Do we
>> use an elastic width of X spaces optimum, where X is purely
>> conventional?
>>
>
> Similar considerations as for CR apply to TAB.
...
Cheers,
Andreas
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by Manuel Mall <mm...@arcus.com.au>.
> On Feb 5, 2006, at 14:13, bugzilla@apache.org wrote:
>
> Hi Manuel,
>
>> ------- Additional Comments From manuel@apache.org 2006-02-05
>> 14:13 -------
<snip/>
> A preserved carriage return can be treated the same way as a
> linefeed, under the very exceptional condition that it survives white-
> space handling:
> * white-space-treatment="ignore-if-*"
> * the CR does not follow/precede a linefeed
> * it is the first character in a sequence of whitespace, so
> it survives white-space-collapse
>
Shouldn't a CR always survive whitespace handling? For starters it is
fairly difficult to get a CR out of an XML parser. Only if the CR is hidden
in an entity reference can it survive. Also, as Simon pointed out in some
other contribution, whitespace handling is designed to deal with whitespace
introduced by pretty printing and readable XML layout. A CR preserved by
the XML parser certainly does not fall into that category. I am also not
aware that the XSL-FO spec mentions CR as falling under whitespace. IMO
for whitespace handling CR is just a non-whitespace character.
So, we only need to consider what fop layout should do if it encounters a
CR. I would say, keep it simple, throw it away and log a warning.
> Now, what about a tab character under the same circumstances? Do we
> use an elastic width of X spaces optimum, where X is purely
> conventional?
>
Similar considerations as for CR apply to TAB.
Anyway, neither CR nor TAB has much to do with the problem at hand: NBSP
not handled correctly.
<snip/>
>> The non-breaking sequences are probably very simple:
>>
>> 1. Justified text: pen INF + elastic glue
>> 2. All other justification modes: either just a box of the width of
>>    the space or pen INF + fixed width glue.
>>
>> Curious what Luca and others think. Are the above two cases OK for
>> NBSP or have I oversimplified and missed something? That is, for the
>> text-align values other than "justify", that is "start", "center",
>> "end", is it enough to just reserve a fixed width for the NBSP?
>
> Still depends on text-align-last, no?
Yes correct but even then do the two rules above suffice, i.e. possible
justification required: Rule 1; no justification required: Rule 2?
> BTW, is this not one of those situations where it's possible that the
> used font contains a glyph for the NBSP character, so we should check
> that as well?
Yes, but again it has very little to do with the problem. If the font has a
glyph for NBSP we should use that glyph's width and not the SP width in the
glue elements generated. That's all.
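
The width rule just described amounts to a lookup with a fallback. A tiny sketch, assuming a hypothetical glyph_widths map from character to advance width (this is not FOP's font API, just an illustration of the rule):

```python
SPACE = "\u0020"
NBSP = "\u00a0"


def nbsp_glue_width(glyph_widths: dict) -> int:
    """Width for the glue generated for an NBSP.

    glyph_widths is a hypothetical {character: advance width} map.
    Prefer the font's own NBSP glyph; fall back to the plain space width.
    """
    if NBSP in glyph_widths:
        return glyph_widths[NBSP]
    return glyph_widths[SPACE]


font_with_nbsp = {SPACE: 278, NBSP: 300}
font_without_nbsp = {SPACE: 278}
print(nbsp_glue_width(font_with_nbsp), nbsp_glue_width(font_without_nbsp))
```

The widths in the two sample maps are invented numbers, chosen only so that the two branches give visibly different results.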
>
> Cheers,
>
> Andreas
>
Manuel
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by Andreas L Delmelle <a_...@pandora.be>.
On Feb 5, 2006, at 14:13, bugzilla@apache.org wrote:
Hi Manuel,
> ------- Additional Comments From manuel@apache.org 2006-02-05
> 14:13 -------
> Jeremias, no that is not it IMO. Knuth doesn't break between elements
> as such. The glue or penalty element itself is the break opportunity
> and is discarded when used as a break. Therefore, IMO we are not
> breaking before or after a space or NBSP but at the space/NBSP.
OK, IIC you're directing this at the wrong person... The last
question was mine. :-)
> The problem is the coding model used for Knuth element generation for
> spaces is flawed. What is done is that the only difference between
> normal space and NBSP is an infinite penalty at the beginning of the
> sequence.
Yep. A few other gaps in that coding model, I'm currently looking at.
See my most recent commit, and change of the white-space Wiki.
Created some nasty side-effects in exotic situations... currently
under investigation.
A preserved carriage return can be treated the same way as a
linefeed, under the very exceptional condition that it survives white-
space handling:
* white-space-treatment="ignore-if-*"
* the CR does not follow/precede a linefeed
* it is the first character in a sequence of whitespace, so
it survives white-space-collapse
Now, what about a tab character under the same circumstances? Do we
use an elastic width of X spaces optimum, where X is purely
conventional?
> However, some sequences are pretty long and involve multiple pen-glue
> combinations and therefore break opportunities further into the
> sequence. We probably need to separate this more cleanly. Have one
> function for non-breaking elastic elements (e.g. NBSP) and one function
> for breaking elastic elements (e.g. SPACE). The non-breaking sequences
> are probably very simple:
>
> 1. Justified text: pen INF + elastic glue
> 2. All other justification modes: either just a box of the width of
>    the space or pen INF + fixed width glue.
>
> Curious what Luca and others think. Are the above two cases OK for
> NBSP or have I oversimplified and missed something? That is, for the
> text-align values other than "justify", that is "start", "center",
> "end", is it enough to just reserve a fixed width for the NBSP?
Still depends on text-align-last, no?
BTW, is this not one of those situations where it's possible that the
used font contains a glyph for the NBSP character, so we should check
that as well?
Cheers,
Andreas
DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by bu...@apache.org.
------- Additional Comments From manuel@apache.org 2006-02-05 14:13 -------
Jeremias, no that is not it IMO. Knuth doesn't break between elements as such.
The glue or penalty element itself is the break opportunity and is discarded
when used as a break. Therefore, IMO we are not breaking before or after a
space or NBSP but at the space/NBSP.
The problem is that the coding model used for Knuth element generation for
spaces is flawed. The only difference between a normal space and an NBSP is
an infinite penalty at the beginning of the sequence. However, some
sequences are pretty long and involve multiple pen-glue combinations, and
therefore offer break opportunities further into the sequence. We probably
need to separate this more cleanly: have one function for non-breaking elastic
elements (e.g. NBSP) and one function for breaking elastic elements (e.g.
SPACE). The non-breaking sequences are probably very simple:
1. Justified text: pen INF + elastic glue
2. All other justification modes: either just a box of the width of the space
   or pen INF + fixed width glue.
Curious what Luca and others think. Are the above two cases OK for NBSP, or have
I oversimplified and missed something? That is, for the text-align values other
than "justify" ("start", "center", "end"), is it enough to just reserve
a fixed width for the NBSP?
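
To make the proposed split concrete, here is a rough sketch in Python rather than FOP's Java. All names (Box, Glue, Penalty, breaking_space, non_breaking_space, legal_breaks) and the INFINITE value are assumptions made for this illustration, not FOP classes. It encodes the two rules above and checks them against the usual Knuth-Plass notion of a legal break: a penalty below infinity, or a glue that directly follows a box.

```python
from dataclasses import dataclass

INFINITE = 1000  # assumed "never break here" penalty value


@dataclass(frozen=True)
class Box:
    width: int


@dataclass(frozen=True)
class Glue:
    width: int
    stretch: int = 0
    shrink: int = 0


@dataclass(frozen=True)
class Penalty:
    value: int


def breaking_space(width, stretch, shrink):
    """Ordinary SPACE: a bare glue, which is a legal break after a box."""
    return [Glue(width, stretch, shrink)]


def non_breaking_space(width, stretch, shrink, justified):
    """NBSP, following the two rules from the message."""
    if justified:
        # Rule 1: pen INF + elastic glue
        return [Penalty(INFINITE), Glue(width, stretch, shrink)]
    # Rule 2: pen INF + fixed-width glue
    return [Penalty(INFINITE), Glue(width)]


def legal_breaks(elements):
    """Indices where a Knuth-Plass style breaker may end a line."""
    found = []
    for i, el in enumerate(elements):
        if isinstance(el, Penalty) and el.value < INFINITE:
            found.append(i)
        elif isinstance(el, Glue) and i > 0 and isinstance(elements[i - 1], Box):
            found.append(i)
    return found


def line(space):
    """Two words (boxes) separated by the given space sequence."""
    return [Box(30)] + space + [Box(40)]


print(legal_breaks(line(breaking_space(10, 5, 3))))
print(legal_breaks(line(non_breaking_space(10, 5, 3, justified=True))))
print(legal_breaks(line(non_breaking_space(10, 5, 3, justified=False))))
```

Because the glue in both NBSP variants follows a penalty rather than a box, neither sequence contains a legal break, while the ordinary space still does. Whether text-align-last should also steer the choice between the two rules is left open here, as in the thread.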
DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by bu...@apache.org.
------- Additional Comments From paroz@email.ch 2006-02-04 14:51 -------
Created an attachment (id=17585)
--> (http://issues.apache.org/bugzilla/attachment.cgi?id=17585&action=view)
fo source for reproducing bug
DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by bu...@apache.org.
------- Additional Comments From jeremias@apache.org 2006-02-04 18:39 -------
First step towards fixing the problem is adding a test case:
http://svn.apache.org/viewcvs?rev=374892&view=rev
If anyone is interested I've got a quick fix / hack for this. It doesn't fix the
test case as it should but it fixes the symptom in FOP Trunk. But better do it
right. That's why I didn't commit the fix.
DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by bu...@apache.org.
lfurini@cs.unibo.it changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Additional Comments From lfurini@cs.unibo.it 2006-02-08 09:19 -------
Fixed in the revision 375585 (http://svn.apache.org/viewcvs?rev=375585&view=rev).
Thank you for finding and signalling this bug!
DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by bu...@apache.org.
------- Additional Comments From manuel@apache.org 2006-02-04 16:26 -------
Claude,
thank you for the problem report. I had a quick look at it and can confirm that
it appears to be a bug in Fop 0.91.
Technical background: The Knuth sequences generated for an NBSP do not actually
prevent (in some cases at least) the line from being broken at the NBSP. In
TextLayoutManager.createElementsForASpace, when an NBSP is encountered, a normal
sequence as for an ordinary space is generated and simply prefixed with an
infinite penalty. This is not enough to prevent breaks within some of the
sequences. However, it needs some more thought on how best to fix it.
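
Why a single leading penalty is not enough can be shown with a small Python sketch. It does not reproduce TextLayoutManager: the space sequence used is the classic Knuth-Plass construction for centered text, taken here as a stand-in for "one of the longer sequences", and every name in it (centered_space, flawed_nbsp, legal_breaks, INFINITE) is an assumption made for the illustration.

```python
INFINITE = 1000  # assumed "never break here" penalty value


def box(width):
    return ("box", width)


def glue(width, stretch=0, shrink=0):
    return ("glue", width, stretch, shrink)


def penalty(value):
    return ("penalty", value)


def centered_space(w):
    """Knuth-Plass style space for centered text (stand-in for a long sequence)."""
    return [glue(0, 3 * w), penalty(0), glue(w, -6 * w),
            box(0), penalty(INFINITE), glue(0, 3 * w)]


def flawed_nbsp(w):
    """The approach described above: the ordinary sequence plus an INF prefix."""
    return [penalty(INFINITE)] + centered_space(w)


def legal_breaks(elements):
    """Indices where a line may end: a finite penalty, or a glue right after a box."""
    found = []
    for i, el in enumerate(elements):
        if el[0] == "penalty" and el[1] < INFINITE:
            found.append(i)
        elif el[0] == "glue" and i > 0 and elements[i - 1][0] == "box":
            found.append(i)
    return found


word_a, word_b = box(30), box(40)
space_line = [word_a] + centered_space(10) + [word_b]
nbsp_line = [word_a] + flawed_nbsp(10) + [word_b]

print(legal_breaks(space_line))  # break opportunities for an ordinary space
print(legal_breaks(nbsp_line))   # not empty: the inner penalty(0) remains
```

The prefix removes the break at the first glue, but the zero penalty further into the sequence is still a legal break, which matches the symptom in the report.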
DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Posted by bu...@apache.org.
------- Additional Comments From a_l.delmelle@pandora.be 2006-02-05 12:59 -------
ad comment #2:
"... when a NBSP is encountered a normal sequence as for an ordinary space is generated and simply
prefixed with an infinite penalty."
So, does this 'prefixed' mean a NBSP currently only prevents breaking _before_ the space (breaking after it
--or before the next glue/box-- could still be considered 'possible/desirable')? If so, the effect we need to
achieve would be something analogous to having:
<fo:character character=" " keep-with-next.within-line="always" />