Posted to dev@cocoon.apache.org by Stefano Mazzocchi <st...@apache.org> on 2000/06/11 15:20:27 UTC

[RT] i18n in Cocoon and language independent semantic contexts

The problems i18n poses are big and it's the reason why both Java and
XML have Unicode support right from their core (a big advantage over
almost all other programming languages).

Cocoon = Java + XML, so this means we need to place i18n support right
into our core, or we'll be doomed by design limitations for the rest of
its lifetime (and force us to do a cocoon3 to fix design problems)

Let's see those problems:

1) internal messages: errors, logs, comments all should be driven by the
JVM locale. Normally this is done with Java ResourceBundles.

Is this enough? Should we create an XML version of those resource
bundles? Or is this following the golden-hammer antipattern of "do it
all with XML"?
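As a rough illustration of point 1, here is a minimal sketch of locale-driven messages with a fallback to a default language. The class names (Messages, Messages_it, InternalMessages) and the keys are hypothetical, and the bundles are instantiated directly only to keep the sketch self-contained; in real code ResourceBundle.getBundle() would normally do the lookup.

```java
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

// Default (reference) messages; keys and texts are made up for illustration.
class Messages extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] {
            {"error.404", "Page not found"},
            {"error.500", "Internal Server Error"},
        };
    }
}

// Italian messages; anything missing here falls back to the parent bundle.
class Messages_it extends ListResourceBundle {
    Messages_it() {
        setParent(new Messages());
    }

    @Override
    protected Object[][] getContents() {
        return new Object[][] {
            {"error.404", "Pagina non trovata"},
        };
    }
}

class InternalMessages {
    // Hand-written selection by language, standing in for getBundle().
    static ResourceBundle forLocale(Locale locale) {
        return "it".equals(locale.getLanguage()) ? new Messages_it() : new Messages();
    }
}
```

A caller would typically pass Locale.getDefault() to pick up the JVM locale, so InternalMessages.forLocale(Locale.getDefault()).getString("error.404") gives the localized text where one exists and the default text otherwise.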

2) uri space: good URIs don't change and are human readable. The sitemap
allows you to enforce the first (if you don't use extensions to indicate
your resources), and your URI-space design should enforce the second
one.

Be careful, something like "/news/today" is a perfectly designed URI for
a website and can stand for ages without needing to change. But it's not
human readable by non-English speakers. For them, the readable form
would be the Italian equivalent "/notizie/oggi".

This leads to something that was already expressed on the list: can the
sitemap enforce different views of the same URI space based on
i18n issues? What's the most manageable way to do this? Where does
separation of concerns come in here? What's the best way to scale such
a thing?

And, most important, is something like this worth the effort? (I've
never seen translated URI spaces, is there a web site that does this?)
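To make the "different views of the same URI space" idea concrete, here is a minimal sketch that maps a localized URI back to its reference form, one path segment at a time. The UriViews class and its segment table are assumptions made up for illustration; nothing like this exists in the sitemap today.

```java
import java.util.Map;

class UriViews {
    // Hypothetical per-language tables: localized segment -> reference segment.
    static final Map<String, Map<String, String>> SEGMENTS = Map.of(
        "it", Map.of("notizie", "news", "oggi", "today"));

    // Rewrites each known segment; unknown segments pass through unchanged.
    static String toReference(String lang, String uri) {
        Map<String, String> table = SEGMENTS.getOrDefault(lang, Map.of());
        StringBuilder out = new StringBuilder();
        for (String segment : uri.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            out.append('/').append(table.getOrDefault(segment, segment));
        }
        return out.length() == 0 ? "/" : out.toString();
    }
}
```

With this table, UriViews.toReference("it", "/notizie/oggi") yields "/news/today", so the matching rules only ever see the reference URI space. Whether per-segment translation is expressive enough is exactly the open question above.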

3) schemas: this is something I've been concerned about for quite some
time and maybe some of you who were into the SGML world before can give
us advice. A schema has one embedded natural language.

 <page xml:lang="en">
  <title>Hello World!</title>
  <paragraph>
   <bold>Hello World!</bold>
  </paragraph>
 </page>

can be translated into

 <page xml:lang="it">
  <title>Ciao a tutti!</title>
  <paragraph>
   <bold>Ciao a tutti!</bold>
  </paragraph>
 </page>

but this _requires_ authors to understand english to understand the
markup. The real translation is

 <pagina xml:lang="it">
  <titolo>Ciao a tutti!</titolo>
  <paragrafo>
   <grassetto>Ciao a tutti!</grassetto>
  </paragrafo>
 </pagina>

which could easily pass my "father's test" (he doesn't speak english),
while the previous one would not.

Are those pages different? No, they are different views of the same
information.

[Note: Ok, we made a very strong hypothesis: each natural language has
the same expressivity range. Many could argue this is far from being
true. For example, there is no Italian equivalent for the English word
"privacy" and there is no English equivalent for the word "pizza". Also,
everybody knows that many jokes lose their funny meaning if translated
(Italians use policemen like Americans use blondes). Many Italian
dialects contain expressions that would require pages of Italian to
convey the same feeling to the listener (Italian dialects are mostly
oral-only languages), Japanese embeds several language constructs to
indicate differences of social position, and so on.]

But it can be reasonably assumed that schemas contain the same amount of
information and expose themselves with different views. Natural
languages act as "knowledge representation styles" of an abstract
structured relationship between different semantic areas.

So, let us suppose there exists one schema and the reference schema is
written in english.

It should be possible to introduce a view of this schema by allowing
semantic inheritance of the elements.

Let's make an example:

 <page:page xml:lang="en" xmlns:page="urn:page" xmlns:style="urn:style">
  <page:title>Hello World!</page:title>
  <page:paragraph>
   <style:bold>Hello World!</style:bold>
  </page:paragraph>
 </page:page>

and we want to translate this into HTML so we need page->html and
style->html (supposing page doesn't contain the equivalent of "style"
semantic information)

Now we want this to be readable for Italians that don't know English, but
want to keep the same stylesheets. How could we achieve that?

I have a solution that requires (unfortunately) patching both the
namespace and XMLSchema specifications:

 <pagina:pagina xml:lang="it" 
    xmlns:pagina="urn:page" xmlns:pagina:lang="it" 
    xmlns:stile="urn:style" xmlns:stile:lang="it">
  <pagina:titolo>Ciao a tutti!</pagina:titolo>
  <pagina:paragrafo>
   <stile:grassetto>Ciao a tutti!</stile:grassetto>
  </pagina:paragrafo>
 </pagina:pagina>

where the XMLSchema should indicate that

 <pagina> -(equals)-> <page>
 <titolo> -(equals)-> <title>
 <paragrafo> -(equals)-> <paragraph>

and all create different natural languages views of the same namespace
(urn:page) while

 <grassetto> -(equals)-> <bold>

for the namespace (urn:style).

Then, it can be possible for XML parsers to map all those elements in
"language-neutral semantic equivalent classes" where XPaths can access
them independently of their natural language form.

For example, the XPath "/page/title" should return "Ciao a tutti!" if
applied to the Italian version of the page and "Hello World!" if applied
to the English version (version indicated with xml:lang), but should be
transparent on the language used to present the schema elements.
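Without patching any specification, the effect can be approximated in application code. The sketch below is only an emulation under stated assumptions: it ignores namespaces, and the SemanticView class and its name table are hypothetical. It renames the localized elements to their reference names and then evaluates the reference XPath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SemanticView {
    // Hypothetical equivalence table: Italian element name -> reference name.
    static final Map<String, String> IT_TO_REFERENCE = Map.of(
        "pagina", "page", "titolo", "title",
        "paragrafo", "paragraph", "grassetto", "bold");

    // Parses the document and renames every mapped element in place.
    static Document toReferenceView(String xml, Map<String, String> names) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Copy the live NodeList first so renaming cannot disturb iteration.
            NodeList live = doc.getElementsByTagName("*");
            List<Element> elements = new ArrayList<>();
            for (int i = 0; i < live.getLength(); i++) {
                elements.add((Element) live.item(i));
            }
            for (Element element : elements) {
                String reference = names.get(element.getTagName());
                if (reference != null) {
                    doc.renameNode(element, null, reference);
                }
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot build reference view", e);
        }
    }

    // Evaluates an XPath written against the reference (English) names.
    static String evaluate(Document doc, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }
}
```

The stylesheet author keeps writing "/page/title" while the document author writes pagina and titolo. The difference from the proposal is that here the mapping lives in the application, not in the schema where every XML tool could see it.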

This allows another level of separation of concerns, where the person
who creates the XSLT is an English designer and the person who writes
the XML document is an Italian journalist. (yes, the eurofootball.com
web site triggered many of these thoughts)

Today, XPath and XMLSchema create contracts on the "strings of unicode
chars" used to express semantic ideas. 

This is, IMO, a big limitation since what is "linked" is not the element
name but the semantic context it represents.

This would allow the creation of classes of equivalence for XML schemas,
each one representing a different view of the same language independent
semantic context they all share.

Where would something like this be useful in Cocoon?

For all schemas used to generate the resources (user level) and for
Cocoon's own schemas (mainly the sitemap and configurations).

For example, non-english-speakers could install and maintain Cocoon's
sitemaps or, sitemaps with localized schemas can be given to people with
different language skills.

Being completely "orthogonal" on the schema (this is why it needs to
patch both namespaces and schema capabilities), this would positively
impact on every XML usage.

                         ------------------ o ------------------

Ok, but what can we do inside Cocoon without having to proprietarily
extend the XML specifications?

Also, how can we simplify the sitemap evolution without compromising the
rest of the system?

I think a possible solution is sitemap pluggability and compilation.

You could think of the sitemap as a big XSP taglib that is responsible
for directly driving the execution of the resource creation pipelines.

It would also increase performance, since matching could be optimized
and what not.

It would also allow different sitemap schemas to be developed. In
theory, you could create your own sitemap schema.

Well, this collection of RT is admittedly wild.

Digest with caution but think about it extensively since I know many FS
hides between the lines.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

Re: [RT] i18n in Cocoon and language independent semantic contexts

Posted by Stefano Mazzocchi <st...@apache.org>.

"N. Sean Timm" wrote:
> 
> "Stefano Mazzocchi" <st...@apache.org> once wrote:
> <snip summary="Lots of good comments about i18n, which, while
> interesting, are not what my response is in regard to." />
> 
> > Ok, but what can we do inside Cocoon without having to proprietarily
> > extend the XML specifications?
> >
> > Also, how can we simplify the sitemap evolution without compromising the
> > rest of the system?
> >
> > I think a possible solution is sitemap pluggability and compilation.
> >
> > You could think of the sitemap as a big XSP taglib that is responsible
> > for directly driving the execution of the resource creation pipelines.
> >
> > It would also increase performance, since matching could be optimized
> > and what not.
> >
> > It would also allow different sitemap schemas to be developed. In
> > theory, you could create your own sitemap schema.
> >
> 
> YES YES YES!  We developed a custom sitemap processor for use with
> Cocoon 1, and I've been looking at what it would take to merge our custom
> stuff with the slight overlap that occurs in Cocoon 2.  I've been meaning to
> take some time to look at this before speaking up, but it looks like you've
> already come to the conclusion I would hope for.  :)

I still don't have a clear vision of this.

For example, stylebook used XSLT to simplify sitemap generation from a
simpler schema. This resulted in nobody using the powerful schema and
all the functionality was simply hidden underneath.

We can't afford to make the same mistake over again.

Also, it could be a pain to train people on "Cocoon" if everything can
change from site to site.

Sometimes flexibility equals weak contracts and they may result in
slower evolution due to those "floppy" contracts.

But I'm not sure....

Re: [RT] i18n in Cocoon and language independent semantic contexts

Posted by "N. Sean Timm" <st...@mailgo.com>.

"Stefano Mazzocchi" <st...@apache.org> once wrote:
<snip summary="Lots of good comments about i18n, which, while
interesting, are not what my response is in regard to." />

> Ok, but what can we do inside Cocoon without having to proprietarily
> extend the XML specifications?
>
> Also, how can we simplify the sitemap evolution without compromising the
> rest of the system?
>
> I think a possible solution is sitemap pluggability and compilation.
>
> You could think of the sitemap as a big XSP taglib that is responsible
> for directly driving the execution of the resource creation pipelines.
>
> It would also increase performance, since matching could be optimized
> and what not.
>
> It would also allow different sitemap schemas to be developed. In
> theory, you could create your own sitemap schema.
>

YES YES YES!  We developed a custom sitemap processor for use with
Cocoon 1, and I've been looking at what it would take to merge our custom
stuff with the slight overlap that occurs in Cocoon 2.  I've been meaning to
take some time to look at this before speaking up, but it looks like you've
already come to the conclusion I would hope for.  :)

- Sean T.

Re: ResourceBundles - was [RT] i18n

Posted by Berin Loritsch <bl...@infoplanning.com>.

Stefano Mazzocchi wrote:

> Hey, I like this and I think this is general enough to be incorporated
> into Avalon as a very common block (like logging, thread management,
> object storing and such).

It may end up being just an extension of the ResourceBundle.  The i18n
file loading and such might justify making it a block, I'll broach it with
Federico and get his input.  I'm still getting a handle on what should
architecturally be a component/block, and what should simply
be a class.  ResourceBundle implicitly has a lot of that functionality in
it--so it may not warrant being a block or even a component.

Then again, to fit in the framework, and have an XMLResourceBundle
be part of a generic i18n component it would make sense.

> Just one thing on the schema: the use of "id" could create naming
> problems since ids cannot be repeated inside the same file. So I'd
> suggest:
>
> <?xml version="1.0"?>
> <resources xml:lang="en">
>   <group name="error">
>     <resource name="500">Internal Server Error</resource>
>     <resource name="404">Page not found</resource>
>   </group>
>   <group name="uri">
>     <resource name="addUser">user/add</resource>
>     <resource name="killUser">user/obliterate</resource>
>   </group>
>   <group name="form">
>     <resource name="userName" value="Enter the user's name here"/>
>   </group>
> </resources>
>
> Note how the syntax
>
>  <resource name="xxx" value="yyy"/>
>
> is equivalent to
>
>  <resource name="xxx">yyy</resource>
>
> which is perfectly legal and allows you to use the markup you like the
> most.

I like that.
+1.

Re: ResourceBundles - was [RT] i18n

Posted by Mike Engelhart <me...@earthtrip.com>.

on 6/13/00 7:19 AM, Stefano Mazzocchi at stefano@apache.org wrote:

> 
> Hey, I like this and I think this is general enough to be incorporated
> into Avalon as a very common block (like logging, thread management,
> object storing and such).
Yeah, I use ResourceBundles for all sorts of uses in my app that are
unrelated to i18n.  They're very generic and useful when you call the
getBundle() methods that don't have a Locale parameter.

> 
> Note how the syntax
> 
> <resource name="xxx" value="yyy"/>
> 
> is equivalent to
> 
> <resource name="xxx">yyy</resource>
> 
> which is perfectly legal and allows you to use the markup you like the
> most.
I like this too 
+1


Mike

Re: ResourceBundles - was [RT] i18n

Posted by Stefano Mazzocchi <st...@apache.org>.

Berin Loritsch wrote:
> 
> Mike Engelhart wrote:
> 
> > I've been following the i18n thread and last night had a few minutes to whip
> > up some code.
> >
> > Here's my thoughts (I guess "random" thoughts :-))
> >
> > 1)  We all like XML for configuration.  Properties are OK but XML is better
> > and portable.
> > 2)  I've been using ResourceBundle's a lot and like them for several reasons
> > some of which are that they've already been developed (my personal
> > favorite), they're debugged, they're standard Java and really easy to use
> > and caching is handled automatically so they only get loaded once by the
> > classloader
> >
> > i've been using PropertyResourceBundle's in my code but have wanted an XML
> > based solution. My suggestion is this (please remove code and ideas that
> > suck and replace them with better code & ideas that don't suck).  Also be
> > advised that I have not even looked at C2 yet so my understanding of the
> > sitemap is basically non-existent so I didn't try and integrate this within
> > the C2 architecture.. :-(  This is solely to allow us to use XML documents
> > but still get all the ResourceBundle qualities that are cool.
> >
> 
> I like using existing code too.  No sense in reinventing the wheel if it
> meets your needs.
> 
> > Then use the following class, I'm calling it CocoonResourceBundle which
> > reads in XML files of the following format (again replace XML that sucks
> > with XML that doesn't suck).
> > <!-- lang_en.xml -->
> > <?xml version="1.0"?>
> > <document xml:lang="en">
> >     <word>
> >         <key>_DATE</key>
> >         <value>Date</value>
> >     </word>
> >     <word>
> >         <key>_TIME</key>
> >         <value>Time</value>
> >     </word>
> > </document>
> 
> I would propose a slightly different document style.  The style that you
> proposed offers no advantage over Property files--it's more verbose,
> not as easy to follow.  I would propose the following style that incorporates
> the concept of resource groups.
> 
> <?xml version="1.0"?>
> <resource xml:lang="en">
>   <group id="error">
>     <value id="500">Internal Server Error</value>
>     <value id="404">Page not found</value>
>   </group>
>   <group id="uri">
>     <value id="addUser">user/add</value>
>     <value id="killUser">user/obliterate</value>
>   </group>
>   <group id="form">
>     <value id="userName">Enter the user's name here</value>
>   </group>
> </resource>
> 
> That way, we have the ability to group the information into some form
> of context.  It is simple, straight-forward, and provides a benefit over
> simple property files.
> 
> Once we have our XMLResourceBundle object, we can get the resource
> by id.  Something like this would be possible:
> 
> ResourceBundle res;
> res = XMLResourceFactory.getResourceBundle(String role);
> res.getString(String key);

Hey, I like this and I think this is general enough to be incorporated
into Avalon as a very common block (like logging, thread management,
object storing and such).

Just one thing on the schema: the use of "id" could create naming
problems since ids cannot be repeated inside the same file. So I'd
suggest:

<?xml version="1.0"?>
<resources xml:lang="en">
  <group name="error">
    <resource name="500">Internal Server Error</resource>
    <resource name="404">Page not found</resource>
  </group>
  <group name="uri">
    <resource name="addUser">user/add</resource>
    <resource name="killUser">user/obliterate</resource>
  </group>
  <group name="form">
    <resource name="userName" value="Enter the user's name here"/>
  </group>
</resources>

Note how the syntax

 <resource name="xxx" value="yyy"/>

is equivalent to

 <resource name="xxx">yyy</resource>

which is perfectly legal and allows you to use the markup you like the
most.
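For what it's worth, a reader for this format could look roughly like the sketch below. It is only an illustration: the XmlResources class is hypothetical, and it loads the file into plain nested maps (group name, then resource name) without wrapping the result as a ResourceBundle. The point is to show the value attribute and the element text being treated as equivalent.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlResources {
    // Returns group name -> (resource name -> text) for one resources document.
    static Map<String, Map<String, String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Map<String, Map<String, String>> groups = new LinkedHashMap<>();
            NodeList groupNodes = doc.getElementsByTagName("group");
            for (int g = 0; g < groupNodes.getLength(); g++) {
                Element group = (Element) groupNodes.item(g);
                Map<String, String> entries = new LinkedHashMap<>();
                NodeList resources = group.getElementsByTagName("resource");
                for (int r = 0; r < resources.getLength(); r++) {
                    Element resource = (Element) resources.item(r);
                    // The value attribute wins if present, else the element text.
                    String value = resource.hasAttribute("value")
                        ? resource.getAttribute("value")
                        : resource.getTextContent().trim();
                    entries.put(resource.getAttribute("name"), value);
                }
                groups.put(group.getAttribute("name"), entries);
            }
            return groups;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse resource file", e);
        }
    }
}
```

Because names are scoped by their group, "500" under "error" and a "500" under some other group would not collide, which is the reason for preferring "name" over "id".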

Re: ResourceBundles - was [RT] i18n

Posted by Berin Loritsch <bl...@infoplanning.com>.

Mike Engelhart wrote:

> on 6/12/00 11:43 AM, Berin Loritsch at bloritsch@infoplanning.com wrote:
>
> > <?xml version="1.0"?>
> > <resource xml:lang="en">
> > <group id="error">
> > <value id="500">Internal Server Error</value>
> > <value id="404">Page not found</value>
> > </group>
> > <group id="uri">
> > <value id="addUser">user/add</value>
> > <value id="killUser">user/obliterate</value>
> > </group>
> > <group id="form">
> > <value id="userName">Enter the user's name here</value>
> > </group>
> > </resource>
> >
> > That way, we have the ability to group the information into some form
> > of context.  It is simple, straight-forward, and provides a benefit over
> > simple property files.
> >
> > Once we have our XMLResourceBundle object, we can get the resource
> > by id.  Something like this would be possible:
> >
> > ResourceBundle res;
> > res = XMLResourceFactory.getResourceBundle(String role);
> > res.getString(String key);
>
> I think I like this better too. Are you saying that each <group>'s id
> attribute constitutes a separate ResourceBundle behind the scenes then?
> So in your example, it would be like
>
> String role = "uri";
> ResourceBundle res = XMLResourceFactory.getResourceBundle(role);
> res.getString("addUser");
>
> Mike

That's the idea.

Re: ResourceBundles - was [RT] i18n

Posted by Mike Engelhart <me...@earthtrip.com>.

on 6/12/00 11:43 AM, Berin Loritsch at bloritsch@infoplanning.com wrote:

> <?xml version="1.0"?>
> <resource xml:lang="en">
> <group id="error">
> <value id="500">Internal Server Error</value>
> <value id="404">Page not found</value>
> </group>
> <group id="uri">
> <value id="addUser">user/add</value>
> <value id="killUser">user/obliterate</value>
> </group>
> <group id="form">
> <value id="userName">Enter the user's name here</value>
> </group>
> </resource>
> 
> That way, we have the ability to group the information into some form
> of context.  It is simple, straight-forward, and provides a benefit over
> simple property files.
> 
> Once we have our XMLResourceBundle object, we can get the resource
> by id.  Something like this would be possible:
> 
> ResourceBundle res;
> res = XMLResourceFactory.getResourceBundle(String role);
> res.getString(String key);

I think I like this better too. Are you saying that each <group>'s id
attribute constitutes a separate ResourceBundle behind the scenes then?
So in your example, it would be like

String role = "uri";
ResourceBundle res = XMLResourceFactory.getResourceBundle(role);
res.getString("addUser");

Mike

Re: ResourceBundles - was [RT] i18n

Posted by Berin Loritsch <bl...@infoplanning.com>.

Mike Engelhart wrote:

> I've been following the i18n thread and last night had a few minutes to whip
> up some code.
>
> Here's my thoughts (I guess "random" thoughts :-))
>
> 1)  We all like XML for configuration.  Properties are OK but XML is better
> and portable.
> 2)  I've been using ResourceBundle's a lot and like them for several reasons
> some of which are that they've already been developed (my personal
> favorite), they're debugged, they're standard Java and really easy to use
> and caching is handled automatically so they only get loaded once by the
> classloader
>
> i've been using PropertyResourceBundle's in my code but have wanted an XML
> based solution. My suggestion is this (please remove code and ideas that
> suck and replace them with better code & ideas that don't suck).  Also be
> advised that I have not even looked at C2 yet so my understanding of the
> sitemap is basically non-existent so I didn't try and integrate this within
> the C2 architecture.. :-(  This is solely to allow us to use XML documents
> but still get all the ResourceBundle qualities that are cool.
>

I like using existing code too.  No sense in reinventing the wheel if it
meets your needs.

> Then use the following class, I'm calling it CocoonResourceBundle which
> reads in XML files of the following format (again replace XML that sucks
> with XML that doesn't suck).
> <!-- lang_en.xml -->
> <?xml version="1.0"?>
> <document xml:lang="en">
>     <word>
>         <key>_DATE</key>
>         <value>Date</value>
>     </word>
>     <word>
>         <key>_TIME</key>
>         <value>Time</value>
>     </word>
> </document>

I would propose a slightly different document style.  The style that you
proposed offers no advantage over Property files--it's more verbose,
not as easy to follow.  I would propose the following style that incorporates
the concept of resource groups.

<?xml version="1.0"?>
<resource xml:lang="en">
  <group id="error">
    <value id="500">Internal Server Error</value>
    <value id="404">Page not found</value>
  </group>
  <group id="uri">
    <value id="addUser">user/add</value>
    <value id="killUser">user/obliterate</value>
  </group>
  <group id="form">
    <value id="userName">Enter the user's name here</value>
  </group>
</resource>

That way, we have the ability to group the information into some form
of context.  It is simple, straight-forward, and provides a benefit over
simple property files.

Once we have our XMLResourceBundle object, we can get the resource
by id.  Something like this would be possible:

// role names a <group>, e.g. "uri"; key names a <value> inside it
ResourceBundle res = XMLResourceFactory.getResourceBundle(role);
String text = res.getString(key);

ResourceBundles - was [RT] i18n

Posted by Mike Engelhart <me...@earthtrip.com>.

I've been following the i18n thread and last night had a few minutes to whip
up some code. 

Here's my thoughts (I guess "random" thoughts :-))

1)  We all like XML for configuration.  Properties are OK but XML is better
and portable.
2)  I've been using ResourceBundle's a lot and like them for several reasons
some of which are that they've already been developed (my personal
favorite), they're debugged, they're standard Java and really easy to use
and caching is handled automatically so they only get loaded once by the
classloader

I've been using PropertyResourceBundles in my code but have wanted an XML
based solution. My suggestion is this (please remove code and ideas that
suck and replace them with better code & ideas that don't suck).  Also be
advised that I have not even looked at C2 yet so my understanding of the
sitemap is basically non-existent so I didn't try and integrate this within
the C2 architecture.. :-(  This is solely to allow us to use XML documents
but still get all the ResourceBundle qualities that are cool.

Anyway, we define an i18n directory in the sitemap where language based XML
documents reside using standard naming conventions for the documents, e.g.,
lang.xml, lang_en.xml, lang_es.xml, etc. (the default file being lang.xml
with no suffix, which is used when a language is requested that doesn't
have an associated ResourceBundle).
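The naming convention above matches the lookup order the JDK already uses for bundles, so the candidate file names can be derived from ResourceBundle.Control. This is a minimal sketch; the LangFiles class is hypothetical, and the ".xml" suffix and the "lang" base name are simply the convention described in this mail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class LangFiles {
    // Lists the files to try for a locale, most specific first,
    // ending with the suffix-less default file.
    static List<String> candidates(String baseName, Locale locale) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> files = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            files.add(control.toBundleName(baseName, candidate) + ".xml");
        }
        return files;
    }
}
```

For a Spanish locale this lists lang_es.xml first and lang.xml last, so the first file that exists in the i18n directory would be the one to load.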

Then use the following class, I'm calling it CocoonResourceBundle which
reads in XML files of the following format (again replace XML that sucks
with XML that doesn't suck).
<?xml version="1.0"?>
<!-- lang_en.xml -->
<document xml:lang="en">
    <word>
        <key>_DATE</key>
        <value>Date</value>
    </word>
    <word>
        <key>_TIME</key>
        <value>Time</value>
    </word>
</document>

<?xml version="1.0"?>
<!-- lang_es.xml -->
<document xml:lang="es">
  <word>
    <key>_DATE</key>
    <value>Dia</value>
  </word>

  <word>
    <key>_TIME</key>
    <value>Tiempo</value>
  </word>
</document>

Here's the code for CocoonResourceBundle.  It extends
java.util.ListResourceBundle which itself subclasses
java.util.ResourceBundle.
This allows for us to customize the filling in of the "contents" array with
the XML based language file.



/** JDK classes **/
import java.io.IOException;
import java.util.ListResourceBundle;

/** W3C DOM classes **/
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
/** Xerces and SAX classes **/
import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.SAXException;

public class CocoonResourceBundle extends ListResourceBundle
{
    // Key/value pairs read from the XML file. Kept per instance so that
    // bundles for different languages do not overwrite each other.
    private Object[][] contents = new Object[0][2];

    public CocoonResourceBundle()
    {
        super();
        try
        {
            setContents();
        }
        catch(Exception e)
        {
            // The bundle stays empty, so lookups fail with
            // MissingResourceException instead of a NullPointerException.
            e.printStackTrace();
        }
    }
    
    public Object[][] getContents()
    {
        return this.contents;
    }

    // Parses the language file and fills in the "contents" array.
    private void setContents() throws IOException, SAXException
    {
        DOMParser parser = new DOMParser();
        // Hardcoded for now; see the note on per-language files below.
        parser.parse("lang.xml");
        Document doc = parser.getDocument();
        Element e = doc.getDocumentElement();
        NodeList nodes = e.getElementsByTagName("word");
        int numberNodes = nodes.getLength();

        Object[][] parsed = new Object[numberNodes][2];
        
        for (int i = 0; i < numberNodes; i++)
        {
            Element word = (Element) nodes.item(i);
            parsed[i][0] = getTextNodeData(word, "key");
            parsed[i][1] = getTextNodeData(word, "value");
        }
        contents = parsed;
    }
    
    // Returns the text of the first child element with the given name,
    // or "" if the element is missing or empty.
    public String getTextNodeData(Element element, String elementName)
    {
        NodeList nodes = element.getElementsByTagName(elementName);
        if (nodes.getLength() < 1)
            return "";
        Node node = nodes.item(0).getFirstChild();
        if (node != null && node.getNodeType() == Node.TEXT_NODE)
            return node.getNodeValue();
        else
            return "";
    }
}

Then in XSP code or anywhere else in C2 you can do these kinds of accesses
(pseudo code follows)

<xsp:page>
    <document>
        <xsp:logic>
            java.util.Locale loc = request.getLocale();
            java.util.ResourceBundle bundle =
                java.util.ResourceBundle.getBundle("org.apache.CocoonResourceBundle", loc);
        </xsp:logic>
        <!-- do something useful -->
        <p>The word for 'Date' in your browser's default locale is
           <xsp:expr>bundle.getString("_DATE")</xsp:expr></p>
    </document>
</xsp:page>

I'm not sure if this is possible but can we have Cocoon 2 (on startup) scan
the pre-defined language repository for files and then create classes on the
fly for each language?  Otherwise you need a separate class for each
language like CocoonResourceBundle_en.java, CocoonResourceBundle_es.java,
etc.   That's just how ResourceBundles are designed.  If we could do this
without hardcoding the path names to the lang.xml files, that would be
ideal.  Again, this is just a concept so there is more work to do to get
this to work within C2 but it seems like a relatively simple solution that
gives us the best of both worlds.

Mike

Re: [RT] i18n in Cocoon and language independent semantic contexts

Posted by Stefano Mazzocchi <st...@apache.org>.

Berin Loritsch wrote:

> As long as we can include different files that will specify different
> languages. Most i18n systems simply need a way of getting to equivalent
> resources. While XML is powerful, it might be overkill here. The main
> question here is how do we identify the resources. GNU gettext generates
> certain tag files that have constants associated with a resource. Whoever
> wants to translate the program that uses gettext simply takes this file
> and translates the phrases from one language to another. A properties
> file will work nicely for this type of application. Cocoon just needs to
> know which properties file to get. If we had a directory called "./i18n/"
> we can place the files within that directory in the format of the
> 2 letter language code followed by ".properties", and anyone who wants to
> provide a translation takes that file, changes the name to the proper
> language code and translates the messages in there.
> 
> Simply stated:
> ./i18n/en.properties
> 
> will be translated to become
> ./i18n/es.properties
> and so on.
> 
> The entries will look like:
> SERVER500=Server error.
> translated to:
> SERVER500=Error de Server.

This is exactly what Java ResourceBundles do, bit by bit. We just have
to use the facility.

> > 2) uri space: good URIs don't change and are human readable. The sitemap
> > allows you to enforce the first (if you don't use extensions to indicate
> > your resources), and your URI-space design should enforce the second
> > one.
> >
> > Be careful, something like "/news/today" is a perfectly designed URI for
> > a website and can stand ages without requiring to change. But it's  not
> > human readable by non-english speakers. So it would be the italian
> > equivalent "/notizie/oggi".
> 
> We could accomplish this with simple aliases.  We could also extend the
> previously stated (see #1) proposal to include a WEB-INF/i18n/en.properties
> suite of files to internationalize the URLs.  That way, we can provide a
> mechanism for site internationalization--not necessary for everyone, but
> a boon for whoever is willing to use it.  Such a directory should only be
> needed if some parameter is used in the sitemap.
> 
> That way, we can identify a new namespace so that we can access the
> site internationalization:
> <sitemap xmlns:i18n="http://xml.apache.org/cocoon/i18n">
>   <i18n:resource dir="WEB-INF/i18n/" lang="en"/>
>   <process i18n:uri="resourceName"/>
> </sitemap>
> 
> And i18n:uri, etc. would translate into whatever is the necessary attribute.
> 
> In this case, it would be uri="user/add" or something equivalent.  That
> way, if I don't want to go through the trouble of internationalizing my
> site I can still use the old sitemap schema.  If I find I want to do that,
> I have that ability using the namespace.

Good suggestion. I like this. Powerful yet hidden if not required.
 
> > And, most important, is something like this worth the effort? (I've
> > never seen translated URI spaces, is there a web site that does this?)
> 
> It may be the "wave of the future", it may be extra work, but the value
> of creating one XML document, and having the ability to perform
> translations easily is invaluable.  For Example:
> I specify a DTD that allows me to create a form like this:
> <form xmlns:i18n="http://xml.apache.org/cocoon/i18n">
>   <i18n:resource dir="WEB-INF/i18n/" lang="en"/>
>   <field name="user" type="drop-list">
>     <description>
>       <i18n:string resource="currentUser"/>
>     </description>
>     <selection>Stefano Mazzochi</selection>

It's "Mazzocchi" damn it! two "z"'s and two "c"'s :)

>     <selection>Berin Loritsch</selection>
>   </field>
> </form>
> 
> Using a simple mechanism like this is very powerful.  The ability
> to make this easily available to the site designer in multiple areas
> will make this an incredibly killer app--especially when we place
> the language detection in XSP, XSLT, or by the engine.  Basically,
> we would have one XML "form" representing the same information,
> displayed in the users native language.
> 
> C'est Manufique, non?

Oui, mais je dois te rappeler.... oh, sorry, wrong language :)

Yes, but we must not forget the schema you outlined above is
cocoon-aware... maybe an i18n filter could allow such processing to be
available without imposing cocoon-awareness of the schema.

Especially, the above doesn't work for schemas you can't control, for
example XForm, where it would be more portable to define namespaced
"attributes" instead of elements, following the XLink pattern.
 
> This approach works well as long as the resources are small.  If we
> have a press release or some other larger piece of information that
> is not a specific resource (the contents of the press release will be
> different for each release), then that would be best served by different
> XML files--one for each target language.
> 
> Forms and functional spaces on a web site would benefit from such
> a system.  Generic information, how-tos, etc. will not.

I agree, these are different things and should be handled differently.
 
> > 3) schemas: this is something I've been concerned about for quite some
> > time and maybe some of you who were into the SGML world before can give
> > us advices. Schema has one embedded natural language.
> >
> >  <page xml:lang="it">
> >   <title>Hello World!</title>
> >   <paragraph>
> >    <bold>Hello World!</bold>
> >   </paragraph>
> >  </page>
> >
> > can be translated into
> >
> >  <page xml:lang="it">
> >   <title>Ciao a tutti!</title>
> >   <paragraph>
> >    <bold>Ciao a tutti!</bold>
> >   </paragraph>
> >  </page>
> >
> > but this _requires_ authors to understand english to understand the
> > markup. The real translation is
> >
> >  <pagina xml:lang="it">
> >   <titolo>Ciao a tutti!</titolo>
> >   <paragrafo>
> >    <grassetto>Ciao a tutti!</grassetto>
> >   </paragrafo>
> >  </pagina>
> 
> AAAAAHHHHH! Noooo!

:)
 
> All markup should be done be the site designer.  If my native language
> is English (which it is), then I would use an English markup to my site.
> If it were Spanish (I'm only 30% mobile in that language), then I would
> use Spanish markup.  The end user should never see the actual markup.
> The goal of XML/XSL is to transform the information into a useable
> format for the client.  If this format is a graphical view of the
> information
> (which XSL:FO is designed to give), then the end user sees the information
> represented graphically.  If the format is a machine readable and
> processable format (i.e. Business to Business data exchange formats),
> then translating the tags is not only overkill, it will completely break
> the system.

You didn't get my point. I was _not_ concerned about the user, but about
the different concern areas on the cocoon-powered site.

For example, let's look at something like eurovolleyball.com: style
designers are Swedish, administrators are German and you have
journalists in all the European countries. Most of the journalists know
neither German nor Swedish, and little English. They happen to be very
good writers in their native languages and know a lot about volleyball.

Today, you have to do XSLT transformations to go from

 <pagina>
 </pagina>

to

 <page>
 </page>
 
True, this is a very simple XSLT template

 <xsl:template match="pagina">
  <page>
   <xsl:apply-templates/>
  </page>
 </xsl:template>

or you could write

 <translation>
  <rule lang="it" from="pagina" to="page"/>
 </translation>

then translate this (which is more manageable) into XSLT, then apply the
stylesheet.
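
A small Java sketch of that intermediate step, turning a table of from/to rules into XSLT templates. The class name TranslationRules is made up for illustration, and copying the attributes with xsl:copy-of is my own addition so that things like xml:lang survive the renaming; element names are assumed to be plain names that need no escaping.

```java
import java.util.Map;

// Hypothetical generator: one template per "from -> to" rule.
class TranslationRules {
    // Builds an XSLT 1.0 stylesheet that renames each element listed in
    // the rules and processes its children with the same rules.
    static String toXslt(Map<String, String> rules) {
        StringBuilder out = new StringBuilder();
        out.append("<xsl:stylesheet version=\"1.0\"\n")
           .append("  xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">\n");
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String from = rule.getKey();
            String to = rule.getValue();
            out.append(" <xsl:template match=\"").append(from).append("\">\n")
               .append("  <").append(to).append(">\n")
               .append("   <xsl:copy-of select=\"@*\"/>\n")
               .append("   <xsl:apply-templates/>\n")
               .append("  </").append(to).append(">\n")
               .append(" </xsl:template>\n");
        }
        out.append("</xsl:stylesheet>\n");
        return out.toString();
    }
}
```

Feeding it the rules pagina -> page and titolo -> title yields one template per rule, and the resulting stylesheet can then be applied like any hand-written one.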

But this is so mechanical that it should be handled at the specification
level.

> This type of thing will also violate the spirit of what the purpose of XML
> is to provide: standard useable information.  

I disagree. I proposed to unlock the semantic information with the
natural language used to translate that into schemas. Two schemas may
have totally different element sets but express the _exact_ same
semantic structure.

For example, DocBook in English and DocBook (DocLibro?) in Italian. Mind
you: I didn't say "for English or for Italian" but "_in_ English and _in_
Italian".

As far as XML is concerned, these are different schemas, even if each
element is the translation of the corresponding element in the other.

> To use Microsoft's case for
> XML, we have a robot that goes to a site to get whether information.
> With HTML we observe that the information is in the 2nd table, 3rd cell.
> If the site designer has too much cafiene one night, our precious info
> is now in the 1st div on the page.  If the site had an XML representation,
> we know that we are looking for the info in the <weather/> tag.  If we
> start internationalizing the tags, then the information may be in the
> <weather/> tag for some people, but in a different tag for another person.

Thanks, I think I know this :)

My point is: what if the weather information you get is marked up with
<tempo> instead of <weather>? How do you know it's still something
about weather?

I hear people saying RDF. Sure, that's it, RDF and RDFSchema. <tempo>
might be contained in an RDF statement and the RDFSchema says that it
extends <weather>, so both share the same semantic meaning.

But language identities are such a special case it should be made much
simpler than this. RDF is and will remain a pain in the ass.

> That would create more chaos than it would solve.  I would venture to
> say that if your father is anything like mine, that he would care less what
> the markup looks like.

This is probably true :)
 
> As far as the sitemap is concerned, I still think i18n on that is too much.
> The sitemap is necessary for Cocoon to read.  If it used tags like <s/>
> and <p/> for <sitemap/> and <process/>, Cocoon wouldn't care as long
> as it can read it.  The longer names are necessary as long as we don't
> have a GUI to control the setup of the sitemap.

A GUI offers a good filtering model for people who want this to be
i18n-ed... you're probably right that this flexibility is too much
and is asking for trouble.
 
> > This allows another level of separation of concern where who creates the
> > XSLT is a english designer and who writes the XML document is an italian
> > journalist. (yes, the eurofootball.com web site triggered many of these
> > thoughts)
> 
> What happens when the situations are reversed?  I still say that the i18n
> on the actual markup introduces too much complexity, too much ability
> for human error, and too much difficulty in tracking down where the
> error lies.  Not to mention slows down performance to a crawl.
> 
> Simple "resource" based i18n works wonderfully for most situations,
> and takes very little time to process--and could potentially be easy
> to implement.  Anything above this level of i18n becomes very complex
> and almost impossible to follow.
> 
> There is such a thing as taking a good idea too far.

Sure, I'm fully aware of this danger. This is why these are RTs, not
"wisdom fragments" :)
 
> >                          ------------------ o ------------------
> >
> > Ok, but what can we do inside Cocoon without having to proprietarely
> > extend the XML specifications?
> 
> Simple resource files.
> 
> > Also, how can we simplify the sitemap evolution without compromising the
> > rest of the system?
> 
> See #2 above.
> 
> > I think a possible solution is sitemap pluggability and compilation.
> >
> > You could think at the sitemap like a big XSP taglib that is responsible
> > to drive directly the execution of the resource creation pipelines.
> 
> Talk about learning curve.

No, nothing changes from the outside. The only thing is that we don't
write the sitemap interpreter, we write the sitemap compiler and keep
the sitemap pluggable as we do for XSP and generators.
 
> > It would also increase performance, since matching could be optimized
> > and what not.
> 
> It would?  How?

During compilation you have the whole sitemap at hand. You could
optimize paths, refactor pipelines, optimize conditionals, detect
sitemap mistakes and generate Java code that simply executes the
request/response for you, using the instructions in the sitemap as well
as in the components it uses.

At least during development this could be an invaluable feature to drive
evolution.
 
> > It would also allow different sitemap schemas to be developped. In
> > theory, you could create your own sitemap schema.
> 
> Danger, Will Robinson, Danger!

I know, I know. :)
 
> > Well, this collection of RT is admittedly wild.
> 
> Agreed :P
> 
> > Digest with caution but think about it extensively since I know many FS
> > hides between the lines.
> 
> I'll keep an open mind.
> 
> I have to remember, that sometimes small and lean doesn't always mean
> elegant and optimized.
> 
> To pull an example from the analog audio world about the design techniques
> used by people of different nationalities:  The American circuit designers
> believe that the shortest simplest path for the audio to travel is the best
> because every component introduced increases distortion.  British circuit
> designers, however, use as many components it takes to counter-act the
> distortion introduced by other components.  The end result is that British
> electronics sound warmer and more elegant while American electronics
> sound crisper and more sterile.  It is the difference between attempting
> for minimal distortion, and attempting to have the distortion pleasing to
> the ear.  This analogy applies to Pro electronics, I have no experience
> with British consumer gear.
> 
> The way it applies here is that with my American mentalities, I am looking
> for the simplest, cleanest method to accomplish the same goal.  Stephano

STEFANO, damn it!!! "f" not "ph". Second time in the same message :)

> with a different mindset is proposing something that to the user can be
> more elegant and friendly.

I don't know. I need more feedback to find out... this is why I express
my thoughts as soon as they pop up.

Sometimes they are plain silly, but at other times they have proved to be
useful.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

Re: [RT] i18n in Cocoon and language independent semantic contexts

Posted by Berin Loritsch <bl...@infoplanning.com>.

Stefano Mazzocchi wrote:

> The problems i18n poses are big and it's the reason why both Java and
> XML have Unicode support right from their core (a big advantage over
> almost all other programming languages).

Agreed.

> Cocoon = Java + XML, so this means we need to place i18n support right
> into our core, or we'll be doomed by design limitations for the rest of
> its lifetime (and force us to do a cocoon3 to fix design problems)

True that.

>
> Let's see those problems:
>
> 1) internal messages: errors, logs, comments all should be driven by the
> JVM locale. Normally this is performed with Java ResourceBoundles.
>
> Is this enough? Should we create an XML version of those resource
> boundles? is this a following the golden-hammer antipattern of "do it
> all with XML"?

As long as we can include different files that will specify different
languages. Most i18n systems simply need a way of getting to equivalent
resources. While XML is powerful, it might be overkill here. The main
question here is how we identify the resources. GNU gettext generates
certain tag files that have constants associated with a resource. Whoever
wants to translate the program that uses gettext simply takes this file
and translates the phrases from one language to another. A properties
file will work nicely for this type of application. Cocoon just needs to
know which properties file to get. If we had a directory called "./i18n/"
we could place the files within that directory, named with the two-letter
language code followed by ".properties", and anyone who wants to provide
a translation takes that file, changes the name to the proper language
code and translates the messages in there.

Simply stated:
./i18n/en.properties

will be translated to become
./i18n/es.properties
and so on.

The entries will look like:
SERVER500="Server error."
translated to:
SERVER500="Error de Server."

> 2) uri space: good URIs don't change and are human readable. The sitemap
> allows you to enforce the first (if you don't use extentions to indicate
> your resources), and your URI-space design should enforce the second
> one.
>
> Be careful, something like "/news/today" is a perfectly designed URI for
> a website and can stand ages without requiring to change. But it's  not
> human readable by non-english speakers. So it would be the italian
> equivalent "/notizie/oggi".

We could accomplish this with simple aliases.  We could also extend the
previously stated (see #1) proposal to include a WEB-INF/i18n/en.properties
suite of files to internationalize the URLs.  That way, we can provide a
mechanism for site internationalization--not necessary for everyone, but
a boon for whoever is willing to use it.  Such a directory should only be
needed if some parameter is used in the sitemap.

That way, we can identify a new namespace so that we can access the
site internationalization:
<sitemap xmlns:i18n="http://xml.apache.org/cocoon/i18n">
  <i18n:resource dir="WEB-INF/i18n/" lang="en"/>
  <process i18n:uri="resourceName"/>
</sitemap>

And i18n:uri, etc. would translate into whatever is the necessary attribute.

In this case, it would be uri="user/add" or something equivalent.  That
way, if I don't want to go through the trouble of internationalizing my
site I can still use the old sitemap schema.  If I find I want to do that,
I have that ability using the namespace.

> And, most important, is something like this worth the effort? (I've
> never seen translated URI spaces, is there a web site that does this?)

It may be the "wave of the future", it may be extra work, but the value
of creating one XML document, and having the ability to perform
translations easily is invaluable.  For Example:
I specify a DTD that allows me to create a form like this:
<form xmlns:i18n="http://xml.apache.org/cocoon/i18n">
  <i18n:resource dir="WEB-INF/i18n/" lang="en"/>
  <field name="user" type="drop-list">
    <description>
      <i18n:string resource="currentUser"/>
    </description>
    <selection>Stefano Mazzochi</selection>
    <selection>Berin Loritsch</selection>
  </field>
</form>

Using a simple mechanism like this is very powerful.  The ability
to make this easily available to the site designer in multiple areas
will make this an incredibly killer app--especially when we place
the language detection in XSP, XSLT, or by the engine.  Basically,
we would have one XML "form" representing the same information,
displayed in the users native language.

C'est magnifique, non?

This approach works well as long as the resources are small.  If we
have a press release or some other larger piece of information that
is not a specific resource (the contents of the press release will be
different for each release), then that would be best served by different
XML files--one for each target language.

Forms and functional spaces on a web site would benefit from such
a system.  Generic information, how-tos, etc. will not.

> 3) schemas: this is something I've been concerned about for quite some
> time and maybe some of you who were into the SGML world before can give
> us advices. Schema has one embedded natural language.
>
>  <page xml:lang="it">
>   <title>Hello World!</title>
>   <paragraph>
>    <bold>Hello World!</bold>
>   </paragraph>
>  </page>
>
> can be translated into
>
>  <page xml:lang="it">
>   <title>Ciao a tutti!</title>
>   <paragraph>
>    <bold>Ciao a tutti!</bold>
>   </paragraph>
>  </page>
>
> but this _requires_ authors to understand english to understand the
> markup. The real translation is
>
>  <pagina xml:lang="it">
>   <titolo>Ciao a tutti!</titolo>
>   <paragrafo>
>    <grassetto>Ciao a tutti!</grassetto>
>   </paragrafo>
>  </pagina>

AAAAAHHHHH! Noooo!

All markup should be done by the site designer.  If my native language
is English (which it is), then I would use an English markup for my site.
If it were Spanish (I'm only 30% mobile in that language), then I would
use Spanish markup.  The end user should never see the actual markup.
The goal of XML/XSL is to transform the information into a usable
format for the client.  If this format is a graphical view of the
information (which XSL:FO is designed to give), then the end user sees
the information represented graphically.  If the format is a machine
readable and processable format (e.g. Business to Business data exchange
formats), then translating the tags is not only overkill, it will
completely break the system.

This type of thing will also violate the spirit of what XML is meant
to provide: standard, usable information.  To use Microsoft's case for
XML, we have a robot that goes to a site to get weather information.
With HTML we observe that the information is in the 2nd table, 3rd cell.
If the site designer has too much caffeine one night, our precious info
is now in the 1st div on the page.  If the site had an XML representation,
we know that we are looking for the info in the <weather/> tag.  If we
start internationalizing the tags, then the information may be in the
<weather/> tag for some people, but in a different tag for another person.

That would create more chaos than it would solve.  I would venture to
say that if your father is anything like mine, he could not care less what
the markup looks like.

As far as the sitemap is concerned, I still think i18n on that is too much.
The sitemap is necessary for Cocoon to read.  If it used tags like <s/>
and <p/> for <sitemap/> and <process/>, Cocoon wouldn't care as long
as it can read it.  The longer names are necessary as long as we don't
have a GUI to control the setup of the sitemap.

> This allows another level of separation of concern where who creates the
> XSLT is a english designer and who writes the XML document is an italian
> journalist. (yes, the eurofootball.com web site triggered many of these
> thoughts)

What happens when the situations are reversed?  I still say that i18n
on the actual markup introduces too much complexity, too much room
for human error, and too much difficulty in tracking down where the
error lies.  Not to mention that it slows performance to a crawl.

Simple "resource" based i18n works wonderfully for most situations,
and takes very little time to process--and could potentially be easy
to implement.  Anything above this level of i18n becomes very complex
and almost impossible to follow.

There is such a thing as taking a good idea too far.

>                          ------------------ o ------------------
>
> Ok, but what can we do inside Cocoon without having to proprietarely
> extend the XML specifications?

Simple resource files.

> Also, how can we simplify the sitemap evolution without compromising the
> rest of the system?

See #2 above.

> I think a possible solution is sitemap pluggability and compilation.
>
> You could think at the sitemap like a big XSP taglib that is responsible
> to drive directly the execution of the resource creation pipelines.

Talk about learning curve.

> It would also increase performance, since matching could be optimized
> and what not.

It would?  How?

> It would also allow different sitemap schemas to be developped. In
> theory, you could create your own sitemap schema.

Danger, Will Robinson, Danger!

> Well, this collection of RT is admittedly wild.

Agreed :P

> Digest with caution but think about it extensively since I know many FS
> hides between the lines.

I'll keep an open mind.

I have to remember, that sometimes small and lean doesn't always mean
elegant and optimized.

To pull an example from the analog audio world about the design techniques
used by people of different nationalities:  The American circuit designers
believe that the shortest simplest path for the audio to travel is the best
because every component introduced increases distortion.  British circuit
designers, however, use as many components as it takes to counteract the
distortion introduced by other components.  The end result is that British
electronics sound warmer and more elegant while American electronics
sound crisper and more sterile.  It is the difference between aiming
for minimal distortion, and aiming to make the distortion pleasing to
the ear.  This analogy applies to Pro electronics; I have no experience
with British consumer gear.

The way it applies here is that with my American mentality, I am looking
for the simplest, cleanest method to accomplish the same goal.  Stephano
with a different mindset is proposing something that to the user can be
more elegant and friendly.


Re: [RT] i18n in Cocoon and language independent semantic contexts

Posted by Mark Washeim <es...@canuck.com>.

on 12/6/00 6:24 pm, Niclas Hedhman at niclas@l2w203.l2w.com wrote:

> Mark Washeim wrote:
> 
>> I really don't believe document editors are going to be plain text editors
>> for the ordinary users, rather, something akin to form editors...
> 
> Well said !!
> This thread had my head spin more than one turn, and I thought I totally lost
> my marbles and headed for an early retirement.
> You managed to drag the problem to the right table, and now the question to
> follow would be;
> 
> Where is the editor, capable of all that??
> 
> Niclas
> 

My company, Large Medium, has been working on one for the past 4-5 months. We
will release it to the community (for what that may be worth) once we've
removed some of the dependencies on servlets that now pertain (in the main,
because the editor is used in a distributed environment; currently, file
saving is dependent on servlets to monitor and complete transactions . . .)

-- 
Mark (Poetaster) Washeim

'On the linen wrappings of certain mummified remains
found near the Etrurian coast are invaluable writings
that await translation.

Quem colorem habet sapientia?'

Evan S. Connell

Re: [RT] i18n in Cocoon and language independent semantic contexts

Posted by Niclas Hedhman <ni...@l2w203.l2w.com>.

Mark Washeim wrote:

> I really don't believe document editors are going to be plain text editors
> for the ordinary users, rather, something akin to form editors...

Well said !!
This thread had my head spin more than one turn, and I thought I totally lost
my marbles and headed for an early retirement.
You managed to drag the problem to the right table, and now the question to
follow would be;

Where is the editor, capable of all that??

Niclas

Re: [RT] i18n in Cocoon and language independent semantic contexts

Posted by Stefano Mazzocchi <st...@apache.org>.

I was waiting for your comments, Mark

[...]

> Ok, I think you may be inventing where no invention is called for.

I'm having the same feeling...

[...]
 
> I really don't believe document editors are going to be plain text editors
> for the ordinary users, rather, something akin to form editors . . . and
> that brings me back to the combination of:
> <apinfo> to facilitate, in our case, instantiating a localized interface
> <documentation> to facilitate maintenance of the schema itself . . . by
> whomever... ( which is true of document management in both cases,
> eurofootball.com and in several probjects for saab automobile)

The XML authoring tool is the key to all this, I agree. 

It's like having different translations of an API for each language...
it might render the contract so soft it could easily break or fragment.

Ok, let's forget this language-abstracted view of the semantic
contexts.... at least for now :)
 
> My immediate feedback . . . back to eurofootball . . . have to bring up the
> full cocoon version, damn it!
> 
> As always, thanks for your thoughts.

And I thank you for balancing them with your shares of reality :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

Re: [RT] i18n in Cocoon and language independent semantic contexts

Posted by Mark Washeim <es...@canuck.com>.

on 11/6/00 3:20 pm, Stefano Mazzocchi at stefano@apache.org wrote:

> The problems i18n poses are big and it's the reason why both Java and
> XML have Unicode support right from their core (a big advantage over
> almost all other programming languages).
> 
> Cocoon = Java + XML, so this means we need to place i18n support right
> into our core, or we'll be doomed by design limitations for the rest of
> its lifetime (and force us to do a cocoon3 to fix design problems)
> 
> Let's see those problems:
> 
> 1) internal messages: errors, logs, comments all should be driven by the
> JVM locale. Normally this is performed with Java ResourceBoundles.
> 
> Is this enough? Should we create an XML version of those resource
> boundles? is this a following the golden-hammer antipattern of "do it
> all with XML"?

It does appear that, as we build up a set of tools for managing documents,
any that are directed at a 'reader' may as well be in xml. Of course, we
have to have decent editors :)

> 
> 2) uri space: good URIs don't change and are human readable. The sitemap
> allows you to enforce the first (if you don't use extentions to indicate
> your resources), and your URI-space design should enforce the second
> one.
> 
> Be careful, something like "/news/today" is a perfectly designed URI for
> a website and can stand ages without requiring to change. But it's  not
> human readable by non-english speakers. So it would be the italian
> equivalent "/notizie/oggi".
> 
> This leads to something that was already expressed on the list: can the
> sitemap allow to enforce different views of the same URI space based on
> i18n issues? What's the best manageable way to do this? Where does
> separation of concerns accounts here? What's the best way to scale such
> a thing?

We're working on 2 sites currently where this has been a fundamental issue...

currently, we have (for file based documents):

/root/hr/hr.xsd (schema) hr.xsl

/root/hr/se_SV/hr.xml
/root/hr/en_US/hr.xml

/root/pr/pr.xsd, pr.xsl
/root/pr/se_SV/pr.xml, instanceXXX.xml, instanceXXX.xml
/root/pr/en_US/pr.xml, instanceXXX.xml, instanceXXX.xml

In both cases, the xsd schema file is used to instantiate an editor... the
instantiated editor in turn reads the exemplar xml (for the sake of
instantiating with reasonable values for the document maintainer).

The reader's (browser's) view is mediated in much the same way as the site
map proposes. In this case, with the addition that index.html => index.xml
at the web server.

All index.xml requests (using the cocoon configs) are responded to by a
custom producer. The producer is file system based. It presents views of xml
documents in the file system. It may be passed a series of filters (for
instance, document must have a TITLE element in order to be displayed) and
generates an xml document representing that sub-set of files.

A request for /sverige/nyheter/(index.xml implicitly) is mapped (using a
site-map, of course :) ) only in so far as the file system producer creates
a view of documents that are available as per the map's base. Namely,
/root/pr/se_SV/ <=> /sverige/nyheter/
index.xml will contain the list of instanceXXX.xml available in
/root/pr/se_SV/ . . .
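
The mapping just described can be pictured with a small Java sketch. The class name UriSpaceMap and the way an empty remainder becomes index.xml are assumptions made for illustration, not our producer's actual code; only the one pair of bases mentioned above is filled in.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mapping from a translated public URI base to a document base.
class UriSpaceMap {
    // public base -> file system base, checked in insertion order
    static final Map<String, String> BASES = new LinkedHashMap<>();
    static {
        BASES.put("/sverige/nyheter/", "/root/pr/se_SV/");
    }

    // Translates a requested URI into the path of the document to serve.
    // A request for the bare base is treated as a request for index.xml.
    // Returns null when no base matches.
    static String toDocumentPath(String uri) {
        for (Map.Entry<String, String> base : BASES.entrySet()) {
            if (uri.startsWith(base.getKey())) {
                String rest = uri.substring(base.getKey().length());
                if (rest.isEmpty()) {
                    rest = "index.xml";
                }
                return base.getValue() + rest;
            }
        }
        return null;
    }
}
```

The public side can then be renamed per language without touching the documents themselves, since only the table of bases changes.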

We're doing something like this in the main to keep abreast of the cocoon
architectural changes. It seems to me this is not such a big problem, and
that both:
1. human readability and
2. immutability
obtain.

I think. :) We're in rather more of a hurry and, in fact, have two different
forms of site map. Necessity being the mother of invention, we have a
plethora of inventions :) But, we're longing for the day when cocoon 2 is
there . . .
(our second map has the following:

<PAGE REQUEST="login.xml" XSL="ff.xsl" SERVLET="ffLogin">
    <CASE REDIRECT="select_game.xml"
          PARAMS="MEMB_XML=file:///path/en_US/ff/registered_panel.xml">
        <RULE PARAM="UID_NO" OP="NULL_EMPTY" VALUE="FALSE"/>
        <RULE PARAM="UID_NO" OP="GT" VALUE="0"/>
    </CASE>
    <CASE REDIRECT="login.xml"
          PARAMS="MEMB_XML=file:///path/en_US/ff/register_panel.xml">
        <GROUP TYPE="OR">
            <RULE PARAM="UID_NO" OP="NULL_EMPTY" VALUE="TRUE"/>
            <RULE PARAM="UID_NO" OP="LTEQ" VALUE="0"/>
        </GROUP>
    </CASE>
</PAGE>

) YIKES :)

> And, most important, is something like this worth the effort? (I've
> never seen translated URI spaces, is there a web site that does this?)

In the main, we don't have any choice when working with dispersed marketing
departments (18 countries) for a global organisation but to accommodate the
uris (and domain names, for that matter. Luckily, we control the hosting . . .)...

> 3) schemas: this is something I've been concerned about for quite some
> time and maybe some of you who were into the SGML world before can give
> us advices. Schema has one embedded natural language.
> 
> <page xml:lang="it">
> <title>Hello World!</title>
> <paragraph>
> <bold>Hello World!</bold>
> </paragraph>
> </page>
> 
> can be translated into
> 
> <page xml:lang="it">
> <title>Ciao a tutti!</title>
> <paragraph>
> <bold>Ciao a tutti!</bold>
> </paragraph>
> </page>
> 
> but this _requires_ authors to understand english to understand the
> markup. The real translation is

This assumes something I believe to be false. Namely, that document authors
edit plain text mark-up. They don't in most cases. They use forms interfaces
or wysiwyg interfaces of some kind, but, more below . . .

> <pagina xml:lang="it">
> <titolo>Ciao a tutti!</titolo>
> <paragrafo>
> <grassetto>Ciao a tutti!</grassetto>
> </paragrafo>
> </pagina>
> 
> which could easily pass my "father's test" (he doesn't speak english),
> while the previous one would not.
> 
> Are those pages different? No, they are different views of the same
> information.

But they impose, by virtue of translating structure unnecessarily, an undue
burden of maintenance. In fact, an insufferable one, as far as I'm
concerned.

Not that I don't empathise with the reader of a dtd who doesn't grok the
language. I know this problem well. I've been reading the sgml and derived
xml of two data providers (and working directly with rdbms from the same
companies). One of the companies is Swedish, the other Dutch. I speak
English, German and French. Sigh. Reading Entity declarations is an ODD way
to learn a language.

Needless to say, I had to obtain some help in getting the meaning of
elements right. Well, the DTDs in question were not expressive enough where
localization is concerned. XML schema, however, IS! More, below . . .
....

> [Note: Ok, we made a very strong hypothesis: each natural language has
> the same expressivity range. Many could argue this is far from being
> true. For example, there is no italian equivalent for the english word
> "privacy" and there is no english equivalent for the word "pizza". Also,
> everybody knows that many jokes lose their funny meaning if translated
> (italians use policemen like americans use blondes). Many italian
> dialects contain expressions that would require pages of italian to express
> the same feeling to the listener (italian dialects are mostly oral-only
> languages), Japanese embeds several language constructs to indicate
> difference of social position and so on.]
> 
> But it can be reasonably assumed that schemas contain the same amount of
> information and expose themselves with different views. Natural
> languages as "knowledge representation styles" of abstract structured
> relationship between different semantic areas.
> 
> So, let us suppose there exists one schema and the reference schema is
> written in english.

Your example below is NOT of the schema (namely the abstraction, which could
as well be expressed in any language) but of an instance of that schema. I
mean this with reference to the W3C's specifications, which distinguish
a. an XML document
b. an XML schema defining and constraining said document

> It should be possible to introduce a view of this schema by allowing
> semantic inheritance of the elements.
> 
> Let's make an example:
> 
> <page:page xml:lang="en" xmlns:page="urn:page" xmlns:style="urn:style">
> <page:title>Hello World!</page:title>
> <page:paragraph>
> <style:bold>Hello World!</style:bold>
> </page:paragraph>
> </page:page>
> 
> and we want to translate this into HTML so we need page->html and
> markup->html (supposing page doesn't contain the equivalent of "style"
> semantic information)
> 
> Now we want this to be readable for italians that don't know english, but
> want to keep the same stylesheets. How could we achieve that?
> 
> I have a solution that requires (unfortunately) patching both the
> namespace and XMLSchema specifications:
> 
> <pagina:pagina xml:lang="it"
> xmlns:pagina="urn:page" xmlns:pagina:lang="it"
> xmlns:stile="urn:style" xmlns:stile:lang="it">
> <pagina:titolo>Ciao a tutti!</pagina:titolo>
> <pagina:paragrafo>
> <stile:grassetto>Ciao a tutti!</stile:grassetto>
> </pagina:paragrafo>
> </pagina:pagina>
> 
> where the XMLSchema should indicate that
> 
> <pagina> -(equals)-> <page>
> <titolo> -(equals)-> <title>
> <paragrafo> -(equals)-> <paragraph>
> 
> and all create different natural languages views of the same namespace
> (urn:page) while
> 
> <grassetto> -(equals)-> <bold>
> 
> for the namespace (urn:style).

Now, the maintenance and administration of the document AND the document
type depend on as many as THREE people! The two document editors in their
respective languages and the person responsible for the schema (XML Schema)
used to validate both types... I have a bad feeling about this . . .

> Then, it can be possible for XML parsers to map all those elements in
> "language-neutral semantic equivalent classes" where XPaths can access
> them independently of their natural language form.
> 
> For example, the XPath "/page/title" should return "Ciao a Tutti!" if
> applied to the italian version of the page and "Hello World!" if applied
> to the english version (version indicated with xml:lang), but should be
> transparent on the language used to present the schema elements.
> 
> This allows another level of separation of concern where who creates the
> XSLT is a english designer and who writes the XML document is an italian
> journalist. (yes, the eurofootball.com web site triggered many of these
> thoughts)
> 
> Today, XPath and XMLSchema create contracts on the "strings of unicode
> chars" used to express semantic ideas.
> 
> This is, IMO, a big limitation since what is "linked" is not the element
> name but the semantic context it represents.
> 
> This would allow the creation of classes of equivalence for XML schemas,
> each one representing a different view of the same language independent
> semantic context they all share.

Ok, in principle, it's a nice vision. In practice, I doubt it's supportable.
The journalist will never edit XML directly, and if they did, they would
constantly break the application. Hence, you create interfaces for them.
Hence, the semantic context is protected . . . where the interface itself is
concerned . . . below . . .
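
For what it's worth, the "equivalence class" idea quoted above can be
roughly approximated without patching any spec, by renaming localized
elements to their canonical names before a path is evaluated. A minimal
sketch in Python (purely illustrative, not anything Cocoon does), assuming a
hand-written Italian-to-English table. The table and the helper names are
hypothetical, and the un-prefixed <pagina> example is used because the
quoted xmlns:pagina:lang syntax would need the namespace spec to change:

```python
import xml.etree.ElementTree as ET

# Hypothetical equivalence table: localized element name -> canonical name.
ITALIAN_TO_CANONICAL = {
    "pagina": "page",
    "titolo": "title",
    "paragrafo": "paragraph",
    "grassetto": "bold",
}

ITALIAN_DOC = """<pagina xml:lang="it">
<titolo>Ciao a tutti!</titolo>
<paragrafo><grassetto>Ciao a tutti!</grassetto></paragrafo>
</pagina>"""

ENGLISH_DOC = """<page xml:lang="en">
<title>Hello World!</title>
<paragraph><bold>Hello World!</bold></paragraph>
</page>"""


def canonicalize(root, mapping):
    """Rename localized element tags in place; unknown tags are left alone."""
    for element in root.iter():
        element.tag = mapping.get(element.tag, element.tag)
    return root


def title_of(xml_text):
    """Evaluate the equivalent of /page/title on either language view."""
    root = canonicalize(ET.fromstring(xml_text), ITALIAN_TO_CANONICAL)
    if root.tag != "page":
        raise ValueError("not a page document: " + root.tag)
    # The root is <page>, so "title" here corresponds to /page/title.
    return root.findtext("title")


print(title_of(ITALIAN_DOC))
print(title_of(ENGLISH_DOC))
```

The same stylesheet or path then works on both views, but note that the
table is exactly the extra artefact somebody has to maintain.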

> Where would something like this be useful in Cocoon?
> 
> For all schemas used to generate the resources (user level) and for
> Cocoon's own schemas (mainly the sitemap and configurations).
> 
> For example, non-english-speakers could install and maintain Cocoon's
> sitemaps or, sitemaps with localized schemas can be given to people with
> different language skills.
> 
> Being completely "orthogonal" on the schema (this is why it needs to
> patch both namespaces and schema capabilities), this would positively
> impact on every XML usage.

Ok, I think you may be inventing where no invention is called for.

We're using schema annotations to provide the locale-specific 'translation'
of the structure (both machine and human parts) to alleviate this problem.
That is, we maintain the semantic context, as you put it. Of course, we are
taking risks in using schema, but, what the hell . . .

The structure is usually (not always) in english, but is annotated. There's
no other way that doesn't produce more labour and confusion . . .

The point I'm making, below, is simple. XML Schema is already expressive
enough to yield all that you require. The real problem is that people need
to be trained to use it. We're building applications that will use the schema
to make the editor easy to use (<appinfo> for labels), so that should keep
the ordinary editor in the clear. It's the person responsible for the schema
in the first place that may be a problem.... but, an example . . .

Part of a schema which is used to:
1. instantiate an editor
2. constrain the validity of the document....

<xsd:annotation name="JOBINFORMATIONTYPE">
  <documentation xml:lang="en_US">
    <name>Job Information</name>
  </documentation>
<SNIP reason="sake of brevity"/>
  <appinfo xml:lang="en_US">
    <label>Job Information</label>
  </appinfo>
</xsd:annotation>

<xsd:complexType name="JOBINFORMATIONTYPE" >
  <xsd:element name="JOBTITLE"    type="xsd:string"      />
  <xsd:element name="LOCATION"    type="xsd:string"      />
  <xsd:element name="DEPARTMENT"  type="DEPARTMENTTYPE"  />
  <xsd:element name="DESCRIPTION" type="DESCRIPTIONTYPE" />
  <xsd:element name="CONTACTLIST" type="CONTACTLISTTYPE" />
  <xsd:element name="REFNUMBER"   type="xsd:integer"     />
  <xsd:element name="HOWTOAPPLY"  type="xsd:string"      />
  <xsd:element name="CONTACT"     type="CONTACTTYPE"     />
  <xsd:element name="CLOSINGDATE" type="CLOSINGDATETYPE" />
</xsd:complexType>

and the much less happy making:

<xsd:annotation name="DEPARTMENTTYPE">
  <documentation xml:lang="en">
    <name>Department Type</name>
    <values>
       <value> Development</value>
       <value> Finance </value>
       <value> Marketing </value>
       <value> Procurement </value>
       <value> Production </value>
       <value> Other </value>
    </values>
  </documentation>
  <documentation xml:lang="de_DE">
    <name>Department Type</name>
    <values>
       <value> Entwicklung </value>
       <value> Finanzen </value>
       <value> Marketing </value>
       <value> Beschaffung </value>
       <value> Produktion </value>
       <value> Anderes </value>
    </values>
  </documentation>
<SNIP reason="sake of brevity"/>
  <appinfo xml:lang="en_US">
    <label>Department Type</label>
    <values>
       <value> Development</value>
       <value> Finance </value>
       <value> Marketing </value>
       <value> Procurement </value>
       <value> Production </value>
       <value> Other </value>
    </values>
  </appinfo>
  <appinfo xml:lang="en_UK">
    <label>Department</label>
    <values>
       <value> Development</value>
       <value> Finance </value>
       <value> Marketing </value>
       <value> Procurement </value>
       <value> Production </value>
       <value> Other </value>
    </values>
  </appinfo>
  <appinfo xml:lang="de_DE">
    <label>Abteilung</label>
    <values>
       <value> Entwicklung </value>
       <value> Finanzen </value>
       <value> Marketing </value>
       <value> Beschaffung </value>
       <value> Produktion </value>
       <value> Anderes </value>
    </values>
  </appinfo>
<SNIP reason="sake of brevity"/>
</xsd:annotation>

<xsd:simpleType name="DEPARTMENTTYPE" base="xsd:string" >
 <xsd:enumeration value="Development" />
 <xsd:enumeration value="Finance" />
 <xsd:enumeration value="Marketing" />
 <xsd:enumeration value="Procurement" />
 <xsd:enumeration value="Production" />
 <xsd:enumeration value="Other" />
</xsd:simpleType>

Ok. So we lost your father. We also lost most of the employees of the
company in question. Sigh. But, while the above schema is getting verbose,
one CAN decipher it much more easily than was the case with the DTDs I was
referring to earlier. Document editors need never decipher it at all... in
our context, but I believe that's what applications are for...
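
To make the "that's what applications are for" part concrete, here is a
minimal sketch (Python, purely illustrative) of how an editor might pick the
label and the display values for a field from the annotation by locale. It
assumes a well-formed, trimmed variant of the DEPARTMENTTYPE annotation
above. The <label>/<values> layout inside <appinfo> is our own convention,
not something the XML Schema spec defines, and the function name and the
en_US fallback are hypothetical:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed, well-formed stand-in for the annotation shown above (assumption).
ANNOTATION = """
<annotation name="DEPARTMENTTYPE">
  <appinfo xml:lang="en_US">
    <label>Department Type</label>
    <values><value> Development </value><value> Finance </value></values>
  </appinfo>
  <appinfo xml:lang="de_DE">
    <label>Abteilung</label>
    <values><value> Entwicklung </value><value> Finanzen </value></values>
  </appinfo>
</annotation>
"""


def localized_field(annotation_xml, locale, fallback="en_US"):
    """Return (label, display values) for a locale, else for the fallback."""
    root = ET.fromstring(annotation_xml)
    by_locale = {info.get(XML_LANG): info for info in root.findall("appinfo")}
    info = by_locale.get(locale, by_locale[fallback])
    label = info.findtext("label").strip()
    values = [value.text.strip() for value in info.findall("values/value")]
    return label, values


print(localized_field(ANNOTATION, "de_DE"))
print(localized_field(ANNOTATION, "fr_FR"))  # no French block: falls back
```

The values stored in the document stay those of the xsd:enumeration
("Development", "Finance", ...); only what the editor displays changes, so
the display values have to be kept in the same order as the enumeration.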

I understand you're trying to work at the level of the element tag itself.
However, I don't think this is an issue. Namely, if the application being
developed is initiated in Italy, where the production facilities are staffed
by Italians, it's very likely that the schema and documents will be marked
up in Italian (as in the case of the SGML I've been reading in Swedish). As
long as they provide annotations, as need be, there really isn't a problem.
If I need to develop XSL, there will be a reference... If all I get is the
XML, of course, I concede your point. But, then, I also can't validate their
documents, nor is there any 'reasonable' way to create an editor for those
documents. So, they fall into the domain of the 'unregulated'. Or, if I'm
lucky, literature :) In the latter case, I'll haul out my dictionary :)

...

In my experience using columns from dbs, it's the same story. I just need a
map. I don't have a problem using the column names as they are, and don't
see a justification for translation that isn't outweighed by the maintenance
cost. While I'm not fond of working under pressure to develop apps that use
SQL statements in which I'm obliged to decipher Dutch, I'll live with it, if
only I get decent documentation.

...

When it comes to the vast majority of documents, it's arguable that their
VALUE is not so great as to justify translating their structure! Facilitating
the translation of their content, on the other hand, is our responsibility.

I really don't believe document editors are going to be plain text editors
for the ordinary users, rather, something akin to form editors . . . and
that brings me back to the combination of:
<appinfo> to facilitate, in our case, instantiating a localized interface
<documentation> to facilitate maintenance of the schema itself . . . by
whomever... (which is true of document management in both cases,
eurofootball.com and in several projects for Saab Automobile)

My immediate feedback . . . back to eurofootball . . . have to bring up the
full cocoon version, damn it!

As always, thanks for your thoughts.

> ------------------ o ------------------
> 
> Ok, but what can we do inside Cocoon without having to proprietarely
> extend the XML specifications?
> 
> Also, how can we simplify the sitemap evolution without compromising the
> rest of the system?
> 
> I think a possible solution is sitemap pluggability and compilation.
> 
> You could think at the sitemap like a big XSP taglib that is responsible
> to drive directly the execution of the resource creation pipelines.
> 
> It would also increase performance, since matching could be optimized
> and what not.
> 
> It would also allow different sitemap schemas to be developped. In
> theory, you could create your own sitemap schema.
> 
> Well, this collection of RT is admittedly wild.
> 
> Digest with caution but think about it extensively since I know many FS
> hides between the lines.

-- 
Mark (Poetaster) Washeim

'On the linen wrappings of certain mummified remains
found near the Etrurian coast are invaluable writings
that await translation.

Quem colorem habet sapientia?'

Evan S. Connell