Posted to fop-dev@xmlgraphics.apache.org by Manuel Mall <mm...@arcus.com.au> on 2005/10/05 09:46:18 UTC

script property

While I am at it (this whole alignment stuff I mean) we may as well do
it properly. This would include support for the "script" property. The
allowed values for script are defined for example here:
http://www.unicode.org/iso15924/iso15924-codes.html.

I assume we don't bother to validate if a correct code has been
provided as we don't do that for the "country" and "language"
properties either (should we? If we do we need more external config
files or expand fop.xconf to hold those values as they tend to change
over time).

But what we do need is a mapping from scripts to default baselines for
these scripts. I haven't found a mapping list on the net. Has anyone come
across something like that? Otherwise we may have to make that up. That
means entries somewhere similar to: <script code="Guru"
baseline="hanging" />. Is the fop config file the right place for this
stuff? Any undefined scripts encountered in an fo file would map to
baseline="alphabetic" (maybe with a warning to the user?).
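
A minimal sketch of such a lookup, assuming the entries have already been
read from wherever they end up living. The class name ScriptBaselines and
its methods are hypothetical and not part of FOP; treating script codes as
case-insensitive is also an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a FOP class): ISO 15924 script code -> default baseline. */
final class ScriptBaselines {
    /** Fallback for scripts without an entry, as proposed in the message above. */
    static final String DEFAULT_BASELINE = "alphabetic";

    private final Map<String, String> baselines = new HashMap<>();

    /** Registers one entry, e.g. from <script code="Guru" baseline="hanging" />. */
    void register(String scriptCode, String baseline) {
        baselines.put(normalize(scriptCode), baseline);
    }

    /** Returns the registered baseline, or "alphabetic" plus a warning if none is defined. */
    String baselineFor(String scriptCode) {
        String baseline = baselines.get(normalize(scriptCode));
        if (baseline == null) {
            System.err.println("Warning: no baseline defined for script '"
                    + scriptCode + "', using " + DEFAULT_BASELINE);
            return DEFAULT_BASELINE;
        }
        return baseline;
    }

    /** Assumption: codes are compared case-insensitively. */
    private static String normalize(String code) {
        return code.trim().toLowerCase(Locale.ROOT);
    }
}
```

With register("Guru", "hanging") in place, baselineFor("Guru") yields
"hanging", and any script without an entry falls back to "alphabetic".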

What we also need for proper script support is a mapping from Unicode
code point to script. The mappings are for example defined here:
http://www.unicode.org/Public/UNIDATA/Scripts.txt.
How would one best process this (has this been done in FOP before?)?
Is there other Unicode stuff FOP needs which should be considered at the 
same time? 
Are we better off working with the "raw" Unicode data 
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)?
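
One possible way to process Scripts.txt, sketched under assumptions: each
data line has the form "START..END ; Script # comment" or
"CODE ; Script # comment", and code points not listed count as "Unknown".
The class ScriptTable is hypothetical, not existing FOP code, and it takes
the lines as strings so that the file reading is left to the caller.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical range table built from Scripts.txt style lines (not a FOP class). */
final class ScriptTable {
    /** One contiguous run of code points belonging to a single script. */
    private static final class Range {
        final int end;
        final String script;

        Range(int end, String script) {
            this.end = end;
            this.script = script;
        }
    }

    // Keyed by the first code point of each range, so floorEntry() finds the candidate range.
    private final TreeMap<Integer, Range> ranges = new TreeMap<>();

    /** Builds a table from lines such as "0041..005A    ; Latin # L&  [26] ...". */
    static ScriptTable fromLines(Iterable<String> lines) {
        ScriptTable table = new ScriptTable();
        for (String raw : lines) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue; // comment-only or blank line
            }
            String[] fields = line.split(";");
            if (fields.length < 2) {
                continue; // not a data line
            }
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            table.ranges.put(start, new Range(end, fields[1].trim()));
        }
        return table;
    }

    /** Returns the script name for a code point, or "Unknown" if no range covers it. */
    String scriptOf(int codePoint) {
        Map.Entry<Integer, Range> entry = ranges.floorEntry(codePoint);
        if (entry == null || codePoint > entry.getValue().end) {
            return "Unknown";
        }
        return entry.getValue().script;
    }
}
```

Each lookup costs one TreeMap search. Sorted int arrays with a binary
search would be more compact if memory or load time turns out to matter.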

Manuel

Re: script property

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Manuel Mall wrote:
> It doesn't quite solve all issues though I think:

Correct.

> May be a wrapper around this class to provide that functionality?

Given that we must get data from Unicode files anyway, we
might as well have our own implementation for everything.

J.Pietschmann

Re: script property

Posted by Manuel Mall <mm...@arcus.com.au>.
On Fri, 7 Oct 2005 03:30 am, J.Pietschmann wrote:
> Manuel Mall wrote:
> > What we also need for proper script support is a mapping from
> > Unicode code point to script.
>
> On a second thought: isn't this what Class Character.UnicodeBlock
> does?
>
Joerg,

Thank you - I didn't even know that this class existed.

It doesn't quite solve all issues though I think:

a) We need a mapping from the ISO 15924 four-letter codes to the 
Character.UnicodeBlock instances.

b) We need a mapping from the Character.UnicodeBlock to script 
properties (actually at this point in time the only property I am aware 
of is the default baseline for the script).

Maybe a wrapper around this class to provide that functionality?
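
A rough sketch of what such a wrapper might look like. The class
BlockScripts and the few table entries are purely illustrative
assumptions. Only Character.UnicodeBlock and its of(int) method come from
the JDK. Note that a block is not the same thing as a script (BASIC_LATIN
also holds digits and punctuation), so this is only an approximation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical wrapper around Character.UnicodeBlock (not a FOP class). */
final class BlockScripts {
    // a) block -> ISO 15924 code; only a few illustrative entries.
    private final Map<Character.UnicodeBlock, String> blockToScript = new HashMap<>();
    // b) ISO 15924 code -> default baseline; the Guru entry mirrors the thread's example.
    private final Map<String, String> scriptToBaseline = new HashMap<>();

    BlockScripts() {
        blockToScript.put(Character.UnicodeBlock.BASIC_LATIN, "Latn");
        blockToScript.put(Character.UnicodeBlock.GURMUKHI, "Guru");
        scriptToBaseline.put("Guru", "hanging");
    }

    /** Returns the ISO 15924 code for the code point's block, or null if unmapped. */
    String scriptOf(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == null ? null : blockToScript.get(block);
    }

    /** Returns the default baseline for the code point, falling back to "alphabetic". */
    String baselineOf(int codePoint) {
        String script = scriptOf(codePoint);
        String baseline = script == null ? null : scriptToBaseline.get(script);
        return baseline == null ? "alphabetic" : baseline;
    }
}
```

Here baselineOf(0x0A05) (GURMUKHI LETTER A) gives "hanging", while any
code point in an unmapped block gives "alphabetic".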

> J.Pietschmann

Manuel

Re: script property

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Manuel Mall wrote:
> What we also need for proper script support is a mapping from Unicode
> code point to script.

On second thought: isn't this what the class Character.UnicodeBlock
does?

J.Pietschmann

Re: script property

Posted by "Peter B. West" <li...@pbw.id.au>.
Manuel Mall wrote:
> On Wed, 5 Oct 2005 04:17 pm, Jeremias Maerki wrote:
> 
>>On 05.10.2005 09:46:18 Manuel Mall wrote:
>>
>>>While I am at it (this whole alignment stuff I mean) we may as well
>>>do it properly. This would include support for the "script"
>>>property. The allowed values for script are defined for example
>>>here:
>>>http://www.unicode.org/iso15924/iso15924-codes.html.
>>>
>>>I assume we don't bother to validate if a correct code has been
>>>provided as we don't do that for the "country" and "language"
>>>properties either (should we? If we do we need more external config
>>>files or expand fop.xconf to hold those values as they tend to
>>>change over time).
>>
>>We don't have to but we could. Since this is not something that
>>changes often I wouldn't put it into the config file, but in resource
>>files instead.
>>
> 
> OK - makes sense.
> 
Validation issues were considered in alt-design circa 2002. See 
CountryLanguageScript.java in the alt-design code for an attempt at 
this. It was generated from xml-lang.xml and xml-lang.xsl. It contains 
no baselines.
> 
Peter
-- 
Peter B. West <http://cv.pbw.id.au/>
Folio <http://defoe.sourceforge.net/folio/>

Re: script property

Posted by Manuel Mall <mm...@arcus.com.au>.
On Wed, 5 Oct 2005 04:17 pm, Jeremias Maerki wrote:
> On 05.10.2005 09:46:18 Manuel Mall wrote:
> > While I am at it (this whole alignment stuff I mean) we may as well
> > do it properly. This would include support for the "script"
> > property. The allowed values for script are defined for example
> > here:
> > http://www.unicode.org/iso15924/iso15924-codes.html.
> >
> > I assume we don't bother to validate if a correct code has been
> > provided as we don't do that for the "country" and "language"
> > properties either (should we? If we do we need more external config
> > files or expand fop.xconf to hold those values as they tend to
> > change over time).
>
> We don't have to but we could. Since this is not something that
> changes often I wouldn't put it into the config file, but in resource
> files instead.
>
OK - makes sense.

> > But what we do need is a mapping from scripts to default baselines
> > for these scripts. I haven't found a mapping list on the net. Any
> > one come across something like that?
>
> Nope.
>
> > Otherwise we may have to make that up. That
> > means entries somewhere similar to: <script code="Guru"
> > baseline="hanging" />. Is the fop config file the right place for
> > this stuff?
>
> Again, I'd put it in separate resource files as this is not going to
> change often and a rebuild of FOP is not the end of the world in this
> case.

My suggestion was based on the assumption that if we have to make up 
the mappings from script to baseline ourselves we may get them wrong. 
Therefore leave it up to the user to add the mappings for his/her 
language/script environment to the config file. Most users will deal 
with only very few scripts, so it's not a big deal.

>
> > Any not defined scripts encountered in an fo file would map to
> > baseline="alphabetic" (may be with a warning to the user?).
>
> Sure.
>
> > What we also need for proper script support is a mapping from
> > Unicode code point to script. The mappings are for example defined
> > here: http://www.unicode.org/Public/UNIDATA/Scripts.txt.
> > How would one best process this?
>
> <shrug/>
>
> > (has this been done in FOP before?)
>
> I don't think so.
>
See Joerg's response.

> > Is there other Unicode stuff FOP needs which should be considered
> > at the same time?
> > Are we better off working with the "raw" Unicode data
> > (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)?
>
> <shrug/>
Seems like line breaking (and hyphenation, e.g. script specific 
hyphenation character) may also need Unicode stuff (not necessarily 
from the raw data file though).

>
> We should simply make sure that this doesn't influence performance
> too much for the big majority of users happy to use latin scripts.
> After all, this looks like many lookups are necessary and all these
> maps have to be loaded at one point.
>
Yes, that is a valid consideration. Maybe it needs to be designed in 
such a way that these lookups can be disabled and replaced by defaults 
from the config file.
>
> Jeremias Maerki
Manuel

Re: script property

Posted by Manuel Mall <mm...@arcus.com.au>.
On Thu, 6 Oct 2005 04:23 am, J.Pietschmann wrote:
> Jeremias Maerki wrote:
> >> What we also need for proper script support is a mapping from
> >> Unicode code point to script.
>
> ...
>
> >> (has this been done in FOP before?)
> >
> > I don't think so.
>
> Have a look at
>   http://people.apache.org/~pietsch/linebreak.tar.gz
>
> Occasionally I've thought about some sort of Jakarta commons
> Unicode file component, but the guys there weren't all that
> enthusiastic about this, and I've not enough time to get
> the ball rolling all of my own.
>
Joerg,

thanks for that.

Do I understand correctly that you use a Java code generation approach 
here? That is, you generate Java source code from the Unicode text 
files, which is then compiled as part of the line breaking code?

I'm not so sure I like that, but then again, if it works, fine. To me 
this type of stuff feels more like pure data, but of course we don't 
want to parse these text files each time FOP loads. What about the 
hyphenation pattern approach? Store it as a serialized object and treat 
it more like a resource? Accessing that should be comparable in time to 
class loading (I think; I haven't ever tested that empirically).
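
To make the serialized-object idea concrete, here is a small sketch under
assumptions: the table is a plain TreeMap from start code point to script
name, and the class SerializedTable is hypothetical. It round-trips
through a byte array. In practice the bytes would presumably be written to
a file at build time and read back as a resource at startup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.TreeMap;

/** Hypothetical helper (not a FOP class) for storing a lookup table as a serialized object. */
final class SerializedTable {
    /** Build-time step: turn the parsed table into bytes that can be shipped as a resource. */
    static byte[] serialize(TreeMap<Integer, String> table) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(table);
            out.close();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException("Could not serialize table", e);
        }
    }

    /** Run-time step: restore the table without parsing the Unicode text files again. */
    @SuppressWarnings("unchecked")
    static TreeMap<Integer, String> deserialize(byte[] data) {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
            try {
                return (TreeMap<Integer, String>) in.readObject();
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not read serialized table", e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Serialized table has an unknown class", e);
        }
    }
}
```

Whether this really loads as fast as a generated class would have to be
measured. The sketch only shows the mechanics.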

I haven't studied your code in detail but could we / should we integrate 
this into the FOP trunk to support 'Unicode compliant' line breaking?

My main goal is still to make FOP happen, so I wouldn't like to dilute 
my effort / time by arguing for / establishing another commons 
subproject at the moment. How about we create an 
org.apache.fop.unicode package for the time being where we keep 
Unicode-specific support stuff? That can then be refactored into a 
commons subproject at a later stage if the time/will/energy is there.

> J.Pietschmann

Manuel

Re: script property

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Jeremias Maerki wrote:
>> What we also need for proper script support is a mapping from Unicode
>> code point to script.
...
>> (has this been done in FOP before?)
> 
> I don't think so.

Have a look at
  http://people.apache.org/~pietsch/linebreak.tar.gz

Occasionally I've thought about some sort of Jakarta commons
Unicode file component, but the guys there weren't all that
enthusiastic about this, and I don't have enough time to get
the ball rolling all on my own.

J.Pietschmann


Re: script property

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
On 05.10.2005 09:46:18 Manuel Mall wrote:
> While I am at it (this whole alignment stuff I mean) we may as well do
> it properly. This would include support for the "script" property. The
> allowed values for script are defined for example here:
> http://www.unicode.org/iso15924/iso15924-codes.html.
> 
> I assume we don't bother to validate if a correct code has been
> provided as we don't do that for the "country" and "language"
> properties either (should we? If we do we need more external config
> files or expand fop.xconf to hold those values as they tend to change
> over time).

We don't have to but we could. Since this is not something that changes
often I wouldn't put it into the config file, but in resource files
instead.

> But what we do need is a mapping from scripts to default baselines for
> these scripts. I haven't found a mapping list on the net. Any one come
> across something like that?

Nope.

> Otherwise we may have to make that up. That
> means entries somewhere similar to: <script code="Guru"
> baseline="hanging" />. Is the fop config file the right place for this
> stuff?

Again, I'd put it in separate resource files as this is not going to
change often and a rebuild of FOP is not the end of the world in this
case.

> Any not defined scripts encountered in an fo file would map to
> baseline="alphabetic" (may be with a warning to the user?).

Sure.

> What we also need for proper script support is a mapping from Unicode
> code point to script. The mappings are for example defined here:
> http://www.unicode.org/Public/UNIDATA/Scripts.txt.
> How would one best process this? 

<shrug/>

> (has this been done in FOP before?)

I don't think so.

> Is there other Unicode stuff FOP needs which should be considered at the 
> same time? 
> Are we better off working with the "raw" Unicode data 
> (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)?

<shrug/>

We should simply make sure that this doesn't influence performance too
much for the big majority of users happy to use Latin scripts. After all,
it looks like many lookups are necessary, and all these maps have to be
loaded at some point.


Jeremias Maerki