Posted to users@cocoon.apache.org by Fred Vos <fr...@fredvos.org> on 2006/02/14 23:51:35 UTC

Dynamically generating arabic texts with svg2png serializer

Hello,

Last September I started a course in the Arabic language. Now I want to present
Arabic texts on a website, using Cocoon. Normal Arabic text is supported in
both Mozilla and Konqueror under Linux without a problem. Using entity-encoded
Unicode strings like &#x0628;&#x062d;&#x0631; , browsers will present the Arabic
characters Beh, Hah and Reh from right to left. No problem.

But for beginners, Arabic script offers vowel marks such as Fatha, Damma and
Kasra, which make it easier to understand how to pronounce the text. You
can find these signs in the Unicode table as combining characters. To present
the above word as BaHRoenn (=sea), you can add the combining characters Fatha,
Sukun and Dammatan: &#x0628;&#x064e;&#x062d;&#x0652;&#x0631;&#x064c;
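
To make the difference between base letters and combining characters
concrete, here is a minimal Java sketch. The class and method names are made
up for illustration. It relies only on the standard Character.getType
classification and counts the non-spacing marks in the two strings above.

```java
final class CombiningMarks {
    // Counts the code points that Unicode classifies as non-spacing marks (category Mn).
    static int countMarks(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        String plain = "\u0628\u062d\u0631";                    // Beh, Hah, Reh
        String vowelled = "\u0628\u064e\u062d\u0652\u0631\u064c"; // with Fatha, Sukun, Dammatan
        System.out.println(countMarks(plain));
        System.out.println(countMarks(vowelled));
    }
}
```

The plain word contains no such marks, while the vowelled word contains one
mark after each of its three letters.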

Try this under Mozilla or Konqueror and a strange thing happens: it is
presented from left to right and becomes unreadable, even for an Arab. I don't
know if IE does this right.

The only renderer that seems to work here is Batik. If I enter the above text
in an SVG file and convert it into a PNG file with the Batik rasterizer
(command line interface), it is presented correctly, from right to left and
with the combining characters.

Now my plan is as follows. I enter my texts including the combining characters
in an XML file and transform these texts by removing the forbidden
characters. I use the following XSL/XPath construct to remove the combining
characters:

<xsl:for-each select="str:tokenize(string(@ar),
'&#x064c;&#x064e;&#x064f;&#x0650;&#x0651;&#x0652;')">
  <xsl:value-of select="." />
</xsl:for-each>

(where @ar contains the string to convert)
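
For comparison, the same stripping step can be sketched outside XSLT. The
following Java fragment is only an illustration with made-up names. It removes
the same six combining characters that are listed in the tokenize call above
and nothing else.

```java
final class VowelMarkStripper {
    // The six marks from the XSLT above: Dammatan, Fatha, Damma, Kasra, Shadda, Sukun.
    private static final String MARKS = "[\u064c\u064e\u064f\u0650\u0651\u0652]";

    // Returns the text with every listed combining character removed.
    static String strip(String text) {
        return text.replaceAll(MARKS, "");
    }

    public static void main(String[] args) {
        String vowelled = "\u0628\u064e\u062d\u0652\u0631\u064c";
        // Prints the number of characters before and after stripping.
        System.out.println(vowelled.length() + " -> " + strip(vowelled).length());
    }
}
```

Like the XSLT version, this leaves any mark that is not in the list untouched,
so the list has to cover every mark that actually occurs in the texts.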

This gives me Arabic text without the vowel marks. Any browser will present this
text correctly, from right to left. To present the text with vowel marks, I
want to convert it using a dynamically generated SVG file and the
svg2png serializer.

For western texts, things are easy. Using a basic SVG file for the generator,
I can transform this document with an XSL transformer, using the wildcard in
the matcher as a parameter to the transformer. The transformer adds the
parameter as text. This creates the SVG document including the text. Using the
svg2png serializer, I can get a PNG document containing my dynamic text.

Unfortunately this doesn't work for Arabic text, even without the combining
characters.

Here's the matcher in the sitemap:

      <map:match pattern="arab/artrans-*">
        <map:generate type="file" src="style/artrans.svg"/>
        <map:transform type="xslt" src="style/artranssvg.xsl">
          <map:parameter name="text" value="{1}"/>
        </map:transform>
        <map:serialize type="svg2png"/>
      </map:match>

If I try to use http://host:port/.../arab/artrans-<Arabic text for BaHRoenn
without vowels is pasted here> in my browser (Mozilla), the URL is converted
into http://host:port/.../arab/artrans-%D8%A8%D8%AD%D8%B1 and the picture
contains garbled text.

Does anyone here have any idea how I can successfully use the Batik rasterizer
in the Cocoon environment for dynamically generating PNG or JPEG pictures with
Arabic texts?

Thanks in advance for your attention.

Fred

-- 
|E  R
| D  F
|
|fred at fredvos dot org
|5235 DG 52 NL +31 73 6411833

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: Dynamically generating arabic texts with svg2png serializer

Posted by Bruno Dumon <br...@outerthought.org>.
Hi Fred,

thanks for confirming that this works, and for reminding me about this issue,
which I had almost forgotten.

On Tue, 2006-02-28 at 23:17 +0100, Fred Vos wrote:
> [...]

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org                          bruno@apache.org


Re: Dynamically generating arabic texts with svg2png serializer

Posted by Fred Vos <fr...@fredvos.org>.
Hello Bruno,

Thank you so much for your help. I've got it working now! Sorry I didn't
respond earlier.

On Wed, Feb 15, 2006 at 09:28:17AM +0100, Bruno Dumon wrote:

[...]

> 
> I have had the same experience. While request parameters and post-bodies
> are decoded correctly, the URL path itself is not.

After reading this, I thought I could solve the problem by moving the Arabic
text to a request parameter. You also suggested this solution in a private
message to me. But this didn't work.

> 
> This can be fixed though.
> 
> If you are running Jetty, supply the following parameter to the java
> command line:
> -Dorg.mortbay.util.URI.charset=UTF-8
> 
> If you are running Tomcat, you can do the same by editing
> conf/server.xml, and on the Connector element (for http), add the
> attribute URIEncoding="UTF-8".

I'm using Tomcat 5.0. I have multiple connectors for virtual hosts. This
attribute must be set for every connector.

> 
> Now, this will ensure that URL paths are correctly decoded as UTF-8.
> However, this also means that request parameters will be decoded as
> UTF-8, while Cocoon normally assumes the servlet container decodes them
> as ISO-8859-1 and then corrects this itself.
> 
> The solution I have is to add a servlet filter which will set the
> character encoding to UTF-8. Here's the source for such a filter:
> 
> package my;
> 
> import javax.servlet.*;
> import java.io.IOException;
> 
> public class CharacterEncodingFilter implements Filter {
>     private String encoding;
> 
>     public void init(FilterConfig filterConfig) throws ServletException {
>         encoding = filterConfig.getInitParameter("encoding");
>     }
> 
>     public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException {
>         if (servletRequest.getCharacterEncoding() == null && this.encoding != null) {
>             servletRequest.setCharacterEncoding(this.encoding);
>         }
>         filterChain.doFilter(servletRequest, servletResponse);
>     }
> 
>     public void destroy() {
>     }
> }
> 
> Compile this, put it in a jar, and put the jar in WEB-INF/lib. Edit the web.xml
> file and add the following before the opening <servlet> element:
> 
>   <filter>
>     <filter-name>encoding-filter</filter-name>
>     <filter-class>my.CharacterEncodingFilter</filter-class>
>     <init-param>
>       <param-name>encoding</param-name>
>       <param-value>UTF-8</param-value>
>     </init-param>
>   </filter>
> 
>   <filter-mapping>
>     <filter-name>encoding-filter</filter-name>
>     <url-pattern>/*</url-pattern>
>   </filter-mapping>
> 
> In the same web.xml file, adjust both the form-encoding and
> container-encoding parameters to be UTF-8 (these elements are already
> there, don't add new ones):
> 
>     <init-param>
>       <param-name>container-encoding</param-name>
>       <param-value>UTF-8</param-value>
>     </init-param>
> 
>     <init-param>
>       <param-name>form-encoding</param-name>
>       <param-value>UTF-8</param-value>
>     </init-param>

At first Cocoon didn't work after this. But your suggestion in a private
message to me was right. The class wasn't located in the right directory in
the jar. I use javac and jar here, not Eclipse or Maven.
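
For anyone else building the jar by hand: the class file has to sit under a
directory that matches its package, so my.CharacterEncodingFilter must appear
in the jar as my/CharacterEncodingFilter.class. A small Java sketch to check
this follows. The class name JarLayoutCheck and its methods are made up for
illustration, and the jar path is whatever you pass on the command line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.JarFile;

final class JarLayoutCheck {
    // Maps a fully qualified class name to the entry path expected inside a jar.
    static String entryPathFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true if the jar at jarPath contains the class at its expected path.
    static boolean containsClass(String jarPath, String className) {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryPathFor(className)) != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java JarLayoutCheck <path-to-jar>");
            return;
        }
        System.out.println(containsClass(args[0], "my.CharacterEncodingFilter"));
    }
}
```

If this prints false, the jar was most likely created from inside the package
directory instead of from its parent.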

> 
> (The container-encoding is now UTF-8 since the filter has instructed the
> container to decode everything as UTF-8, while by default it will use
> ISO-8859-1. This is needed because we otherwise can't distinguish
> between the UTF-8 decoded URL and the ISO-8859-1 decoded post body)
> 
> And this should make everything work correctly.

It does!

> 
> BTW, I have found out all this only very recently and will take up the
> discussion on the dev list to make this the default in Cocoon.

Please do. I think Cocoon should use UTF-8 wherever possible.

&#x0634;&#x0643;&#x0631;&#x0627;&#x064b;
= Shukran
= Thank you 

Fred Vos

-- 
|E  R
| D  F
|
|fred at fredvos dot org
|5235 DG 52 NL +31 73 6411833


Re: Dynamically generating arabic texts with svg2png serializer

Posted by Bruno Dumon <br...@outerthought.org>.
On Tue, 2006-02-14 at 23:51 +0100, Fred Vos wrote:
> [...]

salamu habibi,
(the only Arabic I know)

If I understand correctly, the problem here is not that Batik works
differently when used inside Cocoon, but that the characters in the URL
are not decoded correctly.

I have had the same experience. While request parameters and post-bodies
are decoded correctly, the URL path itself is not.
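
To illustrate what goes wrong, here is a minimal Java sketch that decodes the
escaped path segment from the original message with two different charsets.
The class name is made up, and java.net.URLDecoder only stands in for whatever
the servlet container does internally. Note that URLDecoder also turns '+'
into a space, which does not matter for this sample.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PathDecodingDemo {
    // Decodes a percent-escaped string, interpreting the bytes in the given charset.
    static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String escaped = "%D8%A8%D8%AD%D8%B1";
        // Decoded as UTF-8, the six bytes form the three Arabic letters.
        System.out.println(decode(escaped, "UTF-8").length());
        // Decoded as ISO-8859-1, every byte becomes its own Latin-1 character.
        System.out.println(decode(escaped, "ISO-8859-1").length());
    }
}
```

The six Latin-1 characters from the second call are the kind of garbage that
ends up in the picture.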

This can be fixed though.

If you are running Jetty, supply the following parameter to the java
command line:
-Dorg.mortbay.util.URI.charset=UTF-8

If you are running Tomcat, you can do the same by editing
conf/server.xml, and on the Connector element (for http), add the
attribute URIEncoding="UTF-8".

Now, this will ensure that URL paths are correctly decoded as UTF-8.
However, this also means that request parameters will be decoded as
UTF-8, while Cocoon normally assumes the servlet container decodes them
as ISO-8859-1 and then corrects this itself.
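
The kind of correction meant here can be sketched as follows. This is not
Cocoon's actual code, just an illustration of the general technique with
made-up names: take a string that was decoded as ISO-8859-1, get the original
bytes back, and decode them again as UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Correction {
    // Undoes an ISO-8859-1 decoding and reinterprets the same bytes as UTF-8.
    static String redecodeAsUtf8(String decodedAsLatin1) {
        byte[] raw = decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The bytes D8 A8 D8 AD D8 B1 as a container would hand them over in Latin-1.
        String fromContainer = "\u00d8\u00a8\u00d8\u00ad\u00d8\u00b1";
        System.out.println(redecodeAsUtf8(fromContainer).equals("\u0628\u062d\u0631"));

        // Applying the same correction to text that is already correct destroys it,
        // because Arabic letters cannot be represented in ISO-8859-1.
        System.out.println(redecodeAsUtf8("\u0628\u062d\u0631"));
    }
}
```

The second call shows why the correction must not be applied to values the
container has already decoded as UTF-8.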

The solution I have is to add a servlet filter which will set the
character encoding to UTF-8. Here's the source for such a filter:

package my;

import javax.servlet.*;
import java.io.IOException;

/** Sets a default request character encoding when the request does not declare one. */
public class CharacterEncodingFilter implements Filter {
    // Encoding name read from the "encoding" init-param in web.xml
    private String encoding;

    public void init(FilterConfig filterConfig) throws ServletException {
        encoding = filterConfig.getInitParameter("encoding");
    }

    public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse,
                         FilterChain filterChain) throws IOException, ServletException {
        // Only apply the configured encoding if the request did not specify one itself
        if (servletRequest.getCharacterEncoding() == null && this.encoding != null) {
            servletRequest.setCharacterEncoding(this.encoding);
        }
        filterChain.doFilter(servletRequest, servletResponse);
    }

    public void destroy() {
        // Nothing to clean up
    }
}

Compile this, put it in a jar, and put the jar in WEB-INF/lib. Edit the web.xml
file and add the following before the opening <servlet> element:

  <filter>
    <filter-name>encoding-filter</filter-name>
    <filter-class>my.CharacterEncodingFilter</filter-class>
    <init-param>
      <param-name>encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>
  </filter>

  <filter-mapping>
    <filter-name>encoding-filter</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>

In the same web.xml file, adjust both the form-encoding and
container-encoding parameters to be UTF-8 (these elements are already
there, don't add new ones):

    <init-param>
      <param-name>container-encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>

    <init-param>
      <param-name>form-encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>

(The container-encoding is now UTF-8 since the filter has instructed the
container to decode everything as UTF-8, while by default it will use
ISO-8859-1. This is needed because we otherwise can't distinguish
between the UTF-8 decoded URL and the ISO-8859-1 decoded post body)

And this should make everything work correctly.

BTW, I have found out all this only very recently and will take up the
discussion on the dev list to make this the default in Cocoon.

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org                          bruno@apache.org

