Posted to fop-dev@xmlgraphics.apache.org by Manuel Mall <mm...@arcus.com.au> on 2005/11/01 15:52:33 UTC

zero width space

Currently if one puts a zero-width-space (U+200B) into an XSL-FO file 
(or specifies linefeed-treatment="treat-as-zero-width-space") it is 
rendered as a "missing character" in PDF. Is that correct, i.e. does 
this character have to exist in the font used or should the formatter 
or renderer simply remove this character? It is the second approach 
that both AntennaHouse and RenderX appear to have chosen.

Manuel

Re: zero width space

Posted by Andreas L Delmelle <a_...@pandora.be>.
On Nov 1, 2005, at 15:52, Manuel Mall wrote:

> Currently if one puts a zero-width-space (U+200B) into an XSL-FO file
> (or specifies linefeed-treatment="treat-as-zero-width-space") it is
> rendered as a "missing character" in PDF. Is that correct, i.e. does
> this character have to exist in the font used or should the formatter
> or renderer simply remove this character? It is the second approach
> that both AntennaHouse and RenderX appear to have chosen.

It's certainly not correct to render a missing glyph character, but  
it would also be wrong to remove it too early. The character doesn't  
take part in white-space treatment/collapsing, since it's not XML  
whitespace. It's somewhere in layout that the decision has to be made  
not to allocate space for this character, but it could play a part in  
line-building...
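
The point about white-space collapsing can be illustrated with a small Python sketch. It assumes the XML 1.0 definition of whitespace (space, tab, CR, LF); the helper names are made up for this example and are not taken from FOP. A zero-width space passes through collapsing untouched and is still available later for line-building:

```python
# XML 1.0 whitespace (production S) is exactly these four characters.
XML_WHITESPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}


def is_xml_whitespace(ch: str) -> bool:
    """True only for the four XML whitespace characters."""
    return ch in XML_WHITESPACE


def collapse_xml_whitespace(text: str) -> str:
    """Collapse runs of XML whitespace to one space; leave all else alone.

    Hypothetical, simplified stand-in for white-space collapsing.
    """
    out = []
    prev_ws = False
    for ch in text:
        if is_xml_whitespace(ch):
            if not prev_ws:
                out.append(" ")
            prev_ws = True
        else:
            out.append(ch)  # U+200B ends up here: it is not XML whitespace
            prev_ws = False
    return "".join(out)
```

So the U+200B has to be dealt with at a later stage than whitespace handling.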

My two cents.

Cheers,

Andreas


Re: zero width space

Posted by The Web Maestro <th...@gmail.com>.
On Nov 3, 2005, at 9:31 PM, Manuel Mall wrote:
> Thanks a lot Peter. Seems like the Unicode consortium did change their
> mind on U+200B between version 3.0 and 4.0. For the purpose of the
> current version of FOP which does not (yet) recognise scripts nor
> allows customisation of such behaviours we will then stick with ZWS not
> affecting justification I assume?
>
> Manuel

Just for clarity (and for the archives), does this mean FOP will be 
supporting Unicode version 3.0 or Unicode 4.0?

Regards,

Web Maestro Clay
-- 
<th...@gmail.com> - <http://homepage.mac.com/webmaestro/>
My religion is simple. My religion is kindness.
- HH The 14th Dalai Lama of Tibet


Re: zero width space

Posted by Manuel Mall <mm...@arcus.com.au>.
On Fri, 4 Nov 2005 01:27 pm, Peter S. Housel wrote:
> On Fri, 2005-11-04 at 11:55 +0800, Manuel Mall wrote:
> > On Fri, 4 Nov 2005 04:46 am, J.Pietschmann wrote:
> > > Manuel Mall wrote:
> > > > With respect to U+200B it says in
> > >
> > > [snip]
> > >
> > > > It therefore surprises me that you imply U+200B may expand in
> > > > justification.
> > >
> > > The Unicode 3.0 book explicitly mentions that ZWS may be
> > > expanded for justification, to my great surprise. The 2.0 book
> > > doesn't have any remarks in this direction. I don't have access
> > > to a book more recent than 3.0. Maybe they changed mind
> > > (again...).
> >
> > Anyone out there who has the 4.0 book and can shed some light on
> > this?
>
> It says that U+200B normally has no effect on letter spacing in most
> scripts, but only indicates a word boundary (and therefore a possible
> line break).  It also mentions that when letter-spacing Thai it may
> grow to have a non-zero width, but that is the exception. (Thai
> apparently doesn't put spaces between words, and uses U+200B as a
> word separator.)

Thanks a lot Peter. Seems like the Unicode consortium did change their 
mind on U+200B between version 3.0 and 4.0. For the purpose of the 
current version of FOP which does not (yet) recognise scripts nor 
allows customisation of such behaviours we will then stick with ZWS not 
affecting justification I assume?

Manuel

Re: zero width space

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Manuel Mall wrote:
>> What about character composition/decomposition?
> 
> Good question. Where is the answer?

Let's clarify the problem first. Let's say the input contains
the sequence U+0061 U+0308 (latin small a, combining diaeresis), and
the font has a glyph for U+00E4 but not U+0308. Obviously,
putting the precomposed character U+00E4 into the output is
a smart move. Where should this transformation occur: output
generation, renderer, layout stage? A slight problem is that
the width of U+00E4 may be different from U+0061.
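
As a rough illustration of the transformation described above, here is a Python sketch using the standard `unicodedata` module. The function name and the idea of modelling a font as a plain set of available characters are assumptions for the example only; it says nothing about where in FOP this should happen.

```python
import unicodedata


def compose_if_missing(text, font_glyphs):
    """Return the NFC-composed text if that makes it renderable.

    font_glyphs is a hypothetical stand-in for a font: the set of
    characters it has glyphs for.
    """
    if all(ch in font_glyphs for ch in text):
        return text  # everything is available, nothing to do
    composed = unicodedata.normalize("NFC", text)
    if all(ch in font_glyphs for ch in composed):
        return composed  # the precomposed form is fully covered by the font
    return text  # composition does not help, leave the input as it is


# A font with glyphs for 'a' and U+00E4 but not for U+0308.
font = {"a", "\u00e4"}
result = compose_if_missing("a\u0308", font)
```

Since the result is a different character with possibly a different width, a substitution like this would have to happen before widths are measured.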

J.Pietschmann

Re: zero width space

Posted by "Peter S. Housel" <ho...@acm.org>.
On Fri, 2005-11-04 at 11:55 +0800, Manuel Mall wrote:
> On Fri, 4 Nov 2005 04:46 am, J.Pietschmann wrote:
> > Manuel Mall wrote:
> > > With respect to U+200B it says in
> >
> > [snip]
> >
> > > It therefore surprises me that you imply U+200B may expand in
> > > justification.
> >
> > The Unicode 3.0 book explicitly mentions that ZWS may be expanded
> > for justification, to my great surprise. The 2.0 book doesn't have
> > any remarks in this direction. I don't have access to a book more
> > recent than 3.0. Maybe they changed mind (again...).
> >
> Anyone out there who has the 4.0 book and can shed some light on this?

It says that U+200B normally has no effect on letter spacing in most
scripts, but only indicates a word boundary (and therefore a possible
line break).  It also mentions that when letter-spacing Thai it may grow
to have a non-zero width, but that is the exception. (Thai apparently
doesn't put spaces between words, and uses U+200B as a word separator.)

-- 
Peter S. Housel <ho...@acm.org>

Re: zero width space

Posted by Manuel Mall <mm...@arcus.com.au>.
On Fri, 4 Nov 2005 04:46 am, J.Pietschmann wrote:
> Manuel Mall wrote:
> > With respect to U+200B it says in
>
> [snip]
>
> > It therefore surprises me that you imply U+200B may expand in
> > justification.
>
> The Unicode 3.0 book explicitly mentions that ZWS may be expanded
> for justification, to my great surprise. The 2.0 book doesn't have
> any remarks in this direction. I don't have access to a book more
> recent than 3.0. Maybe they changed mind (again...).
>
Anyone out there who has the 4.0 book and can shed some light on this?

> > Thanks for that list. With respect to the issue at hand, that is
> > which codepoints should be given to the renderers it seems there
> > are 3 types:
>
> ...
>
> > 2. Those we never give to the renderers, e.g. Soft Hyphen (it's
> > either suppressed or replaced by the proper hyphen), zero-width
> > joiners, ...
>
> In case of the hypothetical HTML renderer, you *want* to pass all
> these characters to the renderer.

I would see an HTML renderer more like the RTF renderer, which bypasses 
all the LayoutManager logic and is really only a 'simple' 
conversion. That is, XSL-FO formatting instructions are translated into 
HTML/CSS (or RTF) formatting instructions but no actual layout is 
performed (no page breaking, line breaking and the like). And yes, for 
those types of renderers all text would need to be preserved.

>
> > Is that a sensible grouping?
>
> Dunno.
> What about character composition/decomposition?

Good question. Where is the answer?
>
>
> J.Pietschmann
Manuel

Re: zero width space

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Manuel Mall wrote:
> With respect to U+200B it says in 
[snip]
> It therefore surprises me that you imply U+200B may expand in 
> justification.

The Unicode 3.0 book explicitly mentions that ZWS may be expanded
for justification, to my great surprise. The 2.0 book doesn't have
any remarks in this direction. I don't have access to a book more
recent than 3.0. Maybe they changed mind (again...).

> Thanks for that list. With respect to the issue at hand, that is which 
> codepoints should be given to the renderers it seems there are 3 types: 
...
> 2. Those we never give to the renderers, e.g. Soft Hyphen (it's either 
> suppressed or replaced by the proper hyphen), zero-width joiners, ...

In case of the hypothetical HTML renderer, you *want* to pass all these
characters to the renderer.

> Is that a sensible grouping?

Dunno.
What about character composition/decomposition?


J.Pietschmann

Re: zero width space

Posted by Manuel Mall <mm...@arcus.com.au>.
On Thu, 3 Nov 2005 05:57 am, J.Pietschmann wrote:
> Manuel Mall wrote:
> > That seems to be the consensus, that is consider ZWS for line
> > breaking but then discard and don't give it to the renderers.
>
> Renderers could deal with ZWS if the font would have a glyph for
> this character; unfortunately, that's not the case for the PDF
> standard fonts  :-)  Some fonts *do* have glyphs for various Unicode
> space characters, notably the fixed width spaces.
>
> This leads to the question: Is a space a character? What *is* a
> character? The Unicode people had endless discussions about this.
> Spaces are exactly in the gray area between "real characters"
> which leave marks and layout control.
>
> Handling space characters in layout and discarding them before
> rendering has the distinctive advantage that they work for
> any font in any renderer (which can handle variable space areas
> properly, of course). OTOH, renderers which output a format which
> can handle the spaces itself, like a hypothetical HTML renderer,
> would better get the original character.
>
Exactly this was actually discussed recently in an exchange between 
myself and Luca. Luca pointed out that leaving space characters out of 
a PDF would lead to copy/paste behaviour most likely contrary to user 
expectations. I thought that was a very important point.

> > Are there any other (unusual Unicode) characters which fall in the
> > same category that is they influence layout decisions but should
> > not be seen by the renderers?
>
> * Unicode spaces
>   + variable width spaces
>     - ordinary space U+0020
>     - ordinary non-breaking space U+00A0
>   + fixed width spaces; potentially available in fonts and *may*
>     be passed to renderers, *except* for U+200B
>     - zero width space U+200B, may expand in justification (not
>       implemented this way in FOP 0.20.5, which will haunt us)
>     - zero width non breaking space, aka byte order mark U+FEFF,
>       should now only be used as BOM (as the BOM is eaten by the
>       XML parser, FOP could emit a "deprecated" warning)
With respect to U+200B it says in 
http://www.unicode.org/Public/UNIDATA/UCD.html:
<quote>
White_Space: Those separator characters and control characters which 
should be treated by programming languages as "white space" for the 
purpose of parsing elements.

Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, 
since their functions are restricted to line-break control. Their names 
are unfortunately misleading in this respect.
</quote>
Also in UAX#14 it says:
<quote>
When expanding or compressing inter-word space according to common 
typographical practice, only the spaces marked by U+0020  SPACE, U+00A0  
NO-BREAK SPACE, and U+3000  IDEOGRAPHIC SPACE are subject to 
compression, and only spaces marked by U+0020 SPACE, U+00A0  NO-BREAK 
SPACE, and occasionally spaces marked by U+2009  THIN SPACE are subject 
to expansion. All other space characters normally have fixed width. 
When expanding or compressing inter-character space the presence of 
U+200B ZERO WIDTH SPACE or U+2060 WORD JOINER is always ignored.
</quote>

It therefore surprises me that you imply U+200B may expand in 
justification. However, I don't have the Unicode book (pretty 
expensive) and rely on the Internet for this sort of information. But I 
noticed that http://en.wikipedia.org/wiki/Space_character indicates 
U+200B can be used for justification. 
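
To make the UAX#14 passage quoted above concrete, here is a small Python sketch. The two sets are taken directly from the quoted text; the function name is made up, and treating U+2009 as always expandable is a simplification (the text only says "occasionally").

```python
# Spaces that may grow when a line is stretched for justification.
EXPANDABLE = {"\u0020", "\u00A0", "\u2009"}  # U+2009 only "occasionally"
# Spaces that may shrink when a line is compressed.
COMPRESSIBLE = {"\u0020", "\u00A0", "\u3000"}


def adjustable_positions(line, stretching):
    """Indexes of the characters whose width may be adjusted.

    Every other space character, U+200B included, keeps its fixed width.
    """
    allowed = EXPANDABLE if stretching else COMPRESSIBLE
    return [i for i, ch in enumerate(line) if ch in allowed]
```

Under this reading a U+200B never shows up as an adjustment point, neither for expansion nor for compression.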

>     - en quad U+2000, according to my Unicode book *identical* to
>       U+2002, *not* a 4en space (strange)
>     - em quad U+2001, similar to U+2000
>     - en space aka nut U+2002,
>     - em space aka mutton U+2003
>     - three-per-em space aka thick space (1/3 em width) U+2004
>     - four-per-em space aka mid space (1/4 em width) U+2005
>     - six-per-em space (generally 1/6 em width) U+2006
>     - figure space (font dependent) U+2007
>     - punctuation space (as wide as a dot or comma) U+2008
>     - thin space (1/5..1/8 em width) U+2009
>     - hair space (1/10..1/16 em width) U+200A
>     - narrow no-break space (probably 1/6 em width) U+202F
>     - mathematical space U+205F
>     - non breaking word joiner U+2060 replaces U+FEFF in text
>     - ideographic space U+3000
>     - OGHAM SPACE MARK U+1680 (odd stuff)
>     - Note: ETHIOPIC WORDSPACE U+1361 leaves marks and is therefore
>       not a space. At least I hope so.
>   + see also
>      http://en.wikipedia.org/wiki/Space_character
>      http://www.alistapart.com/stories/emen/
>
> * Other characters
>   + Character shaping hints; they do not cause line breaks.
>     - zero width joiner U+200D
>     - zero width non-joiner U+200C (may probably also hint at
>       preventing ligatures)
>     - see http://en.wikipedia.org/wiki/Zero-width_joiner et al.
>   + Soft hyphen U+00AD. Must be hidden if no line break follows.
>   + Formatting characters. I'd say these characters should not occur
>     in XSLFO source, because there are FO which represent the same
>     functionality.
>     - line separator U+2028, FOP 0.20.5 creates an unconditional line
>       break regardless of any FO properties
>     - paragraph separator U+2029
>     - bidi control characters 200E-200F, 202A-202E
>     - deprecated controls 206A-206F
>

Thanks for that list. With respect to the issue at hand, that is, which 
codepoints should be given to the renderers, it seems there are 3 types: 

1. Those we always give to the renderers even if they are not in the 
font (this is the default and applies to the vast majority)

2. Those we never give to the renderers, e.g. Soft Hyphen (it's either 
suppressed or replaced by the proper hyphen), zero-width joiners, ...

3. Those we replace (by another character or layout positioning) only if 
they are not in the font, e.g. fixed width spaces

Is that a sensible grouping?

Of course there are other modifications to codepoints not mentioned here 
like combining into ligatures, hyphenation combined with spelling 
changes, ....
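
The three-way grouping above could be sketched roughly like this in Python. The enum, the set names and the exact membership of the sets are assumptions made for the example (picked from the characters mentioned in this thread), not an agreed list and not FOP code.

```python
from enum import Enum


class Handling(Enum):
    ALWAYS_PASS = 1         # type 1: the default, given to the renderer as is
    NEVER_PASS = 2          # type 2: consumed by layout, never rendered
    REPLACE_IF_MISSING = 3  # type 3: replaced only if the font lacks a glyph


# Soft hyphen, zero width space, zero width non-joiner / joiner.
NEVER = {0x00AD, 0x200B, 0x200C, 0x200D}
# U+2000..U+200A plus a few other fixed width spaces from the list.
FIXED_WIDTH_SPACES = set(range(0x2000, 0x200B)) | {0x202F, 0x205F, 0x3000}


def classify(codepoint):
    """Map a codepoint to one of the three proposed handling types."""
    if codepoint in NEVER:
        return Handling.NEVER_PASS
    if codepoint in FIXED_WIDTH_SPACES:
        return Handling.REPLACE_IF_MISSING
    return Handling.ALWAYS_PASS


def needs_replacement(ch, font_glyphs):
    """True if a type 3 character is missing from the (hypothetical) font."""
    return (classify(ord(ch)) is Handling.REPLACE_IF_MISSING
            and ch not in font_glyphs)
```

Ligatures, hyphenation and the like are outside this simple per-codepoint view.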

>
> J.Pietschmann

Manuel

Re: zero width space

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Manuel Mall wrote:
> That seems to be the consensus, that is consider ZWS for line breaking 
> but then discard and don't give it to the renderers.


Renderers could deal with ZWS if the font would have a glyph for
this character; unfortunately, that's not the case for the PDF
standard fonts  :-)  Some fonts *do* have glyphs for various Unicode
space characters, notably the fixed width spaces.

This leads to the question: Is a space a character? What *is* a
character? The Unicode people had endless discussions about this.
Spaces are exactly in the gray area between "real characters"
which leave marks and layout control.

Handling space characters in layout and discarding them before
rendering has the distinctive advantage that they work for
any font in any renderer (which can handle variable space areas
properly, of course). OTOH, renderers which output a format which
can handle the spaces itself, like a hypothetical HTML renderer,
would better get the original character.

> Are there any other (unusual Unicode) characters which fall in the same 
> category that is they influence layout decisions but should not be seen 
> by the renderers?

* Unicode spaces
  + variable width spaces
    - ordinary space U+0020
    - ordinary non-breaking space U+00A0
  + fixed width spaces; potentially available in fonts and *may*
    be passed to renderers, *except* for U+200B
    - zero width space U+200B, may expand in justification (not
      implemented this way in FOP 0.20.5, which will haunt us)
    - zero width non breaking space, aka byte order mark U+FEFF,
      should now only be used as BOM (as the BOM is eaten by the
      XML parser, FOP could emit a "deprecated" warning)
    - en quad U+2000, according to my Unicode book *identical* to
      U+2002, *not* a 4en space (strange)
    - em quad U+2001, similar to U+2000
    - en space aka nut U+2002,
    - em space aka mutton U+2003
    - three-per-em space aka thick space (1/3 em width) U+2004
    - four-per-em space aka mid space (1/4 em width) U+2005
    - six-per-em space (generally 1/6 em width) U+2006
    - figure space (font dependent) U+2007
    - punctuation space (as wide as a dot or comma) U+2008
    - thin space (1/5..1/8 em width) U+2009
    - hair space (1/10..1/16 em width) U+200A
    - narrow no-break space (probably 1/6 em width) U+202F
    - mathematical space U+205F
    - non breaking word joiner U+2060 replaces U+FEFF in text
    - ideographic space U+3000
    - OGHAM SPACE MARK U+1680 (odd stuff)
    - Note: ETHIOPIC WORDSPACE U+1361 leaves marks and is therefore
      not a space. At least I hope so.
  + see also
     http://en.wikipedia.org/wiki/Space_character
     http://www.alistapart.com/stories/emen/

* Other characters
  + Character shaping hints; they do not cause line breaks.
    - zero width joiner U+200D
    - zero width non-joiner U+200C (may probably also hint at
      preventing ligatures)
    - see http://en.wikipedia.org/wiki/Zero-width_joiner et al.
  + Soft hyphen U+00AD. Must be hidden if no line break follows.
  + Formatting characters. I'd say these characters should not occur
    in XSLFO source, because there are FO which represent the same
    functionality.
    - line separator U+2028, FOP 0.20.5 creates an unconditional line
      break regardless of any FO properties
    - paragraph separator U+2029
    - bidi control characters 200E-200F, 202A-202E
    - deprecated controls 206A-206F
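
For the fixed width spaces in the list above, handling them in layout could amount to a simple lookup of the advance width. The Python sketch below covers only the entries whose width is a definite fraction of an em (an en being half an em); the table name, the function name and the use of millipoints are assumptions for the example.

```python
from fractions import Fraction

# Width as a fraction of one em. Font dependent or "range" entries
# (figure space, thin space, hair space, ...) are deliberately left out.
SPACE_WIDTH_EM = {
    0x2000: Fraction(1, 2),  # en quad, same as en space
    0x2001: Fraction(1),     # em quad, same as em space
    0x2002: Fraction(1, 2),  # en space
    0x2003: Fraction(1),     # em space
    0x2004: Fraction(1, 3),  # three-per-em space
    0x2005: Fraction(1, 4),  # four-per-em space
    0x2006: Fraction(1, 6),  # six-per-em space
    0x200B: Fraction(0),     # zero width space
}


def space_advance(codepoint, font_size_mpt):
    """Advance width in millipoints, or None if the font must decide."""
    fraction = SPACE_WIDTH_EM.get(codepoint)
    if fraction is None:
        return None
    return round(fraction * font_size_mpt)
```

For a 12pt font (12000 millipoints) a three-per-em space would then advance by 4000 millipoints, without any glyph being needed in the font.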


J.Pietschmann

Re: zero width space

Posted by Manuel Mall <mm...@arcus.com.au>.
On Wed, 2 Nov 2005 02:04 pm, Manuel Mall wrote:
> On Wed, 2 Nov 2005 01:03 am, Chris Bowditch wrote:
> > Manuel Mall wrote:
> > > Currently if one puts a zero-width-space (U+200B) into an XSL-FO
> > > file (or specifies
> > > linefeed-treatment="treat-as-zero-width-space") it is rendered as
> > > a "missing character" in PDF. Is that correct, i.e. does this
> > > character have to exist in the font used or should the formatter
> > > or renderer simply remove this character? It is the second
> > > approach that both AntennaHouse and RenderX appear to have
> > > chosen.
> >
> > I recommend that no character is output for a ZWS. The whole
> > purpose of placing a ZWS into the input XSL-FO is to give layout an
> > extra break opportunity, without changing the appearance of the
> > generated document.
>
> That seems to be the consensus, that is consider ZWS for line
> breaking but then discard and don't give it to the renderers.
>
> Are there any other (unusual Unicode) characters which fall in the
> same category that is they influence layout decisions but should not
> be seen by the renderers?
>

Possible candidates are all characters of General Category 'Cf' (Other, 
Format) and maybe 'Cc' (Other, Control)?
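
A quick way to check the General Category idea is Python's standard `unicodedata` module. The candidate list below is just the set of characters mentioned in this thread, and the categories are those of the Unicode data shipped with current Python versions.

```python
import unicodedata

# Characters from this thread that influence layout but leave no mark.
CANDIDATES = [0x00AD, 0x200B, 0x200C, 0x200D, 0x2028, 0x2029, 0x2060, 0xFEFF]


def is_format_char(ch):
    """True if the character has General Category Cf (Other, Format)."""
    return unicodedata.category(ch) == "Cf"


for cp in CANDIDATES:
    # Print each candidate with its category and Unicode name.
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.name(ch, '?')}")
```

Note that the line and paragraph separators U+2028 and U+2029 are in categories Zl and Zp, so a test on Cf alone would not catch them.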

> > Chris
>
> Manuel
Manuel

Re: zero width space

Posted by Manuel Mall <mm...@arcus.com.au>.
On Wed, 2 Nov 2005 05:27 pm, Jingjing Lee wrote:
> --- Manuel Mall <mm...@arcus.com.au> wrote:
> > On Wed, 2 Nov 2005 01:03 am, Chris Bowditch wrote:
> > > Manuel Mall wrote:
> > > > Currently if one puts a zero-width-space (U+200B) into an XSL-FO
> > > > file (or specifies linefeed-treatment="treat-as-zero-width-space")
> > > > it is rendered as a "missing character" in PDF. Is that correct,
> > > > i.e. does this character have to exist in the font used or should
> > > > the formatter or renderer simply remove this character? It is the
> > > > second approach that both AntennaHouse and RenderX appear to have
> > > > chosen.
> > >
> > > I recommend that no character is output for a ZWS. The whole purpose
> > > of placing a ZWS into the input XSL-FO is to give layout an extra
> > > break opportunity, without changing the appearance of the generated
> > > document.
> >
> > That seems to be the consensus, that is consider ZWS for line breaking
> > but then discard and don't give it to the renderers.
> >
> > Are there any other (unusual Unicode) characters which fall in the same
> > category that is they influence layout decisions but should not be seen
> > by the renderers?
> >
> > > Chris
> >
> > Manuel
>
> According to UAX#14, these characters are invisible too
> 00AD SOFT HYPHEN (SHY)
> 2060 WORD JOINER (WJ)
> FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP)
>
Yes, and I think they are all of General Category Cf, so maybe all Cf 
characters?

> Jingjing
>
Manuel

Re: zero width space

Posted by Jingjing Lee <ra...@yahoo.com>.

--- Manuel Mall <mm...@arcus.com.au> wrote:

> On Wed, 2 Nov 2005 01:03 am, Chris Bowditch wrote:
> > Manuel Mall wrote:
> > > Currently if one puts a zero-width-space (U+200B) into an XSL-FO
> > > file (or specifies linefeed-treatment="treat-as-zero-width-space")
> > > it is rendered as a "missing character" in PDF. Is that correct,
> > > i.e. does this character have to exist in the font used or should
> > > the formatter or renderer simply remove this character? It is the
> > > second approach that both AntennaHouse and RenderX appear to have
> > > chosen.
> >
> > I recommend that no character is output for a ZWS. The whole purpose
> > of placing a ZWS into the input XSL-FO is to give layout an extra
> > break opportunity, without changing the appearance of the generated
> > document.
> >
> That seems to be the consensus, that is consider ZWS for line breaking
> but then discard and don't give it to the renderers.
>
> Are there any other (unusual Unicode) characters which fall in the same
> category that is they influence layout decisions but should not be seen
> by the renderers?
>
> > Chris
>
> Manuel
> 

According to UAX#14, these characters are invisible too
00AD SOFT HYPHEN (SHY)
2060 WORD JOINER (WJ)
FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP)

Jingjing

Re: zero width space

Posted by Manuel Mall <mm...@arcus.com.au>.
On Wed, 2 Nov 2005 01:03 am, Chris Bowditch wrote:
> Manuel Mall wrote:
> > Currently if one puts a zero-width-space (U+200B) into an XSL-FO
> > file (or specifies linefeed-treatment="treat-as-zero-width-space")
> > it is rendered as a "missing character" in PDF. Is that correct,
> > i.e. does this character have to exist in the font used or should
> > the formatter or renderer simply remove this character? It is the
> > second approach that both AntennaHouse and RenderX appear to have
> > chosen.
>
> I recommend that no character is output for a ZWS. The whole purpose
> of placing a ZWS into the input XSL-FO is to give layout an extra
> break opportunity, without changing the appearance of the generated
> document.
>
That seems to be the consensus, that is consider ZWS for line breaking 
but then discard and don't give it to the renderers.

Are there any other (unusual Unicode) characters which fall in the same 
category that is they influence layout decisions but should not be seen 
by the renderers?

> Chris

Manuel

Re: zero width space

Posted by Chris Bowditch <bo...@hotmail.com>.
Manuel Mall wrote:

> Currently if one puts a zero-width-space (U+200B) into an XSL-FO file 
> (or specifies linefeed-treatment="treat-as-zero-width-space") it is 
> rendered as a "missing character" in PDF. Is that correct, i.e. does 
> this character have to exist in the font used or should the formatter 
> or renderer simply remove this character? It is the second approach 
> that both AntennaHouse and RenderX appear to have chosen.

I recommend that no character is output for a ZWS. The whole purpose of 
placing a ZWS into the input XSL-FO is to give layout an extra break 
opportunity, without changing the appearance of the generated document.
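
The behaviour recommended above (use the ZWS as a break opportunity, output nothing for it) can be sketched in a few lines of Python. This is a toy greedy line breaker that measures width in characters; the function name and the simplifications are assumptions for the example and have nothing to do with FOP's actual line breaking.

```python
import re

ZWS = "\u200b"


def break_lines(text, max_chars):
    """Greedy line filling with breaks allowed at spaces and at ZWS.

    A ZWS is used as a break opportunity only; it never reaches the output.
    """
    # Words end up at even indexes, the separators between them at odd ones.
    tokens = re.split("([ \u200b])", text)
    lines = []
    current = tokens[0]
    for sep, word in zip(tokens[1::2], tokens[2::2]):
        glue = "" if sep == ZWS else sep  # a ZWS adds no visible glue
        if len(current) + len(glue) + len(word) <= max_chars:
            current += glue + word
        else:
            lines.append(current)  # break here, the separator is discarded
            current = word
    lines.append(current)
    return lines
```

With enough room the two parts around a ZWS simply run together; when the line is too short the break happens at the ZWS and the appearance is otherwise unchanged.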

Chris