Posted to users@cocoon.apache.org by Patrick Schlaepfer <pa...@schlaepfer.com> on 2004/03/31 16:23:44 UTC

Unicode Umlauts/SQLTransformer

I noticed that SQLTransformer doesn't care
much about character encoding:

String retval = SQLTransformer.getStringValue(rs.getObject(i));
and then returns a new String((byte[]) object)

Would it make any sense to introduce there a
new String((byte[]) object, "CHARACTER_SET"), where the character set could be UTF8?

Just a suggestion, as I wrote a simple class testing the
connection to a MySQL 4.1.1 DB with UTF8 setting. And
there I used

byte[] col_1_b = rs.getBytes(1);
new String(col_1_b, "UTF8");
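
The difference between the two String constructors can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and StandardCharsets needs Java 7 or later (on the JDK 1.4 used in this thread, the charset-name form shown above would be used instead).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Cocoon: decodes raw column bytes with an
// explicitly chosen charset instead of relying on the platform default.
final class ExplicitDecodingSketch {

    static String decode(byte[] raw, Charset charset) {
        // new String(raw) would use the platform default charset;
        // passing the charset makes the result independent of the host setup.
        return raw == null ? null : new String(raw, charset);
    }

    // The two bytes 0xC3 0xBC are the UTF-8 encoding of "ü" (U+00FC).
    static final byte[] U_UMLAUT_UTF8 = {(byte) 0xC3, (byte) 0xBC};

    static String decodedAsUtf8() {
        return decode(U_UMLAUT_UTF8, StandardCharsets.UTF_8);
    }

    static String decodedAsLatin1() {
        // Decoding the same bytes as ISO-8859-1 yields two characters, not one.
        return decode(U_UMLAUT_UTF8, StandardCharsets.ISO_8859_1);
    }
}
```

Decoding the same two bytes as ISO-8859-1 gives the familiar two-character garbage in place of the umlaut, which is the kind of symptom a wrong default charset would produce.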

Any comments?
Thanks
Patrick



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: Unicode Umlauts/SQLTransformer

Posted by Yves Vindevogel <yv...@implements.be>.
Alex, Bertrand,

This is the same problem we're facing too. I am not that familiar with
Java, so could someone check the sources for this?
It's apparently not only the SQLTransformer, but also the ESQL
generator.

Yves

On 31 Mar 2004, at 16:23, Patrick Schlaepfer wrote:

> I noticed that SQLTransformer doesn't care
> much about character encoding:
>
> String retval = SQLTransformer.getStringValue(rs.getObject(i));
> and then returns a new String((byte[]) object)
>
> Would it make any sense to introduce there a
> new String((byte[]) object, "CHARACTER_SET"), where the character set could be UTF8?
>
> Just a suggestion, as I wrote a simple class testing the
> connection to a MySQL 4.1.1 DB with UTF8 setting. And
> there I used
>
> byte[] col_1_b = rs.getBytes(1);
> new String(col_1_b, "UTF8");
>
> Any comments?
> Thanks
> Patrick
>
>
Kind regards,

Yves Vindevogel
Implements

Mail: yves.vindevogel@implements.be  - Mobile: +32 (478) 80 82 91

Kempische Steenweg 206 - 3500 Hasselt - Tel-Fax: +32 (11) 43 55 76
Markt 18c  -  9700 Oudenaarde  -  Tel: +32 (55) 30 55 76

Web: http://www.implements.be

First they ignore you.  Then they laugh at you.  Then they fight you.  
Then you win.
Mahatma Gandhi.

Re: Unicode Umlauts/SQLTransformer/mySQL

Posted by Patrick Schlaepfer <pa...@schlaepfer.com>.
For my part it is working now: I patched SQLTransformer so that
everything comes back as byte[], and now the UTF-8 characters
are decoded correctly. I know it's a hack, but if
someone is interested I could send my setup or post it somewhere.

Cheers
Patrick





Re: AW: Unicode Umlauts/SQLTransformer

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On 1 Apr 2004, at 08:11, Patrick Schlaepfer wrote:

> Have a look at the example I have provided. There
> I use Connector/J V3.0.11-stable - which is the latest.
> If you use useUnicode=true and characterEncoding=UTF8,
> getString does not return the correct string,
> at least on jdk1.4.1_02 on a Solaris host. Any other
> observations?...

I haven't done these sort of things with MySQL so I cannot make 
specific comments. But I've seen weird things with other databases, 
where data was stored in the wrong encoding after having been 
transferred between systems, and you wouldn't notice when using the 
database's native tools.

So (and assuming ResultSet.getString() is expected to handle encoding 
correctly, which I only assume), in your case it could be either the 
driver, the database configuration or the actual data that causes the 
problem.

> ...OTOH, with getBytes and new String(bytes, characterEncoding) it
> does. So it might be a problem with the JDBC driver. But
> it's certainly more difficult to "fix" it in the driver
> than in the Cocoon source..

Sure, but adding JDBC driver workarounds to the Cocoon CVS for specific 
drivers does not seem right. What we could do is to add a configurable 
encoding that SQLTransformer would use to create Strings from Objects 
(or byte[]), but it still sounds like a hack.
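
A rough sketch of what such a configurable conversion might look like is given below. It is not the actual SQLTransformer code: the class name, the method name, the null handling and the idea of passing the encoding as a plain String are all assumptions made for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical sketch, not the real SQLTransformer.getStringValue():
// converts a column value to a String, decoding byte[] values with a
// configurable encoding rather than the platform default.
final class ConfigurableStringValue {

    static String toStringValue(Object value, String encoding) {
        if (value == null) {
            return ""; // assumption: a null column is rendered as an empty string
        }
        if (value instanceof byte[]) {
            byte[] raw = (byte[]) value;
            // No encoding configured: fall back to the platform default charset.
            return encoding == null
                    ? new String(raw)
                    : new String(raw, Charset.forName(encoding));
        }
        // Any other type is left to its own toString().
        return value.toString();
    }
}
```

With no encoding configured, the sketch behaves like new String(byte[]), so hosts that do not set the parameter would see no change.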

-Bertrand




AW: Unicode Umlauts/SQLTransformer

Posted by Patrick Schlaepfer <pa...@schlaepfer.com>.
Have a look at the example I have provided. There
I use Connector/J V3.0.11-stable, which is the latest.
If you use useUnicode=true and characterEncoding=UTF8,
getString does not return the correct string,
at least on jdk1.4.1_02 on a Solaris host. Any other
observations?

OTOH, with getBytes and new String(bytes, characterEncoding) it
does. So it might be a problem with the JDBC driver. But
it's certainly more difficult to "fix" it in the driver
than in the Cocoon source.
That's what I think, and that approach does work.

Patrick

> -----Original Message-----
> From: Bertrand Delacretaz [mailto:bdelacretaz@codeconsult.ch]
> Sent: Thursday, 1 April 2004 08:06
> To: users@cocoon.apache.org
> Subject: Re: AW: Unicode Umlauts/SQLTransformer
>
>
> On 1 Apr 2004, at 07:52, Patrick Schlaepfer wrote:
>
> > I wrote a small standalone class, to test
> >   .getString()
> > vs
> >   .getBytes()
> > and .getString doesn't handle the UTF8 characters
> > correctly....
>
> hmmm..are you sure that your JDBC driver is configured with the correct
> encoding?
>
> I might be wrong, but I think getString() should use the driver's
> encoding configuration to correctly interpret the database settings and
> convert the data to a correct String.
>
> Using getBytes() might work if you know how the characters are encoded
> in the database, but I don't think it is a general solution: with
> getBytes() the raw bytes need to be decoded by the application
> (=Cocoon), but I think this decoding should take place in the JDBC
> driver.
>
> IMHO the correct way would be to find out how to setup the JDBC driver
> so that getString() returns a correct value, and probably change the
> SQLTransformer to use getString() for string values.
>
> -Bertrand


Re: AW: Unicode Umlauts/SQLTransformer

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On 1 Apr 2004, at 07:52, Patrick Schlaepfer wrote:

> I wrote a small standalone class, to test
>   .getString()
> vs
>   .getBytes()
> and .getString doesn't handle the UTF8 characters
> correctly....

hmmm..are you sure that your JDBC driver is configured with the correct 
encoding?

I might be wrong, but I think getString() should use the driver's 
encoding configuration to correctly interpret the database settings and 
convert the data to a correct String.

Using getBytes() might work if you know how the characters are encoded 
in the database, but I don't think it is a general solution: with 
getBytes() the raw bytes need to be decoded by the application 
(=Cocoon), but I think this decoding should take place in the JDBC 
driver.

IMHO the correct way would be to find out how to set up the JDBC driver
so that getString() returns a correct value, and probably change the 
SQLTransformer to use getString() for string values.
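
For reference, the driver-side configuration under discussion could be expressed roughly like this. The useUnicode and characterEncoding properties are the ones Patrick mentions for Connector/J; the class name, host, port and database name are placeholders, and whether getString() then behaves correctly with a given driver version is exactly what is in question in this thread.

```java
// Hypothetical sketch: builds a MySQL JDBC URL that asks Connector/J to
// handle the character encoding itself, so that getString() can be used.
final class MysqlUnicodeUrlSketch {

    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                // Connection properties named in the thread.
                + "?useUnicode=true&characterEncoding=UTF8";
    }
}
```

The resulting string would be passed to java.sql.DriverManager.getConnection(url, user, password) with the Connector/J jar on the classpath.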

-Bertrand




AW: Unicode Umlauts/SQLTransformer

Posted by Patrick Schlaepfer <pa...@schlaepfer.com>.
I wrote a small standalone class, to test
  .getString()
vs
  .getBytes()
and .getString doesn't handle the UTF8 characters
correctly.
You can download the source code at
http://patrick.schlaepfer.com/TestUTF8.tar.gz

With MySQL, getObject returns an Object and
not a byte[], which makes sense. So the UTF8
encoding gets lost there.

So I changed, in
cocoon-2.1.4/src/blocks/database/java/org/apache/cocoon/transformation/SQLTransformer.java,
the lines
// String retval =  SQLTransformer.getStringValue( rs.getObject( i ) );

String retval =  SQLTransformer.getStringValue( rs.getBytes( i ) );

and

// String retval =  SQLTransformer.getStringValue( rs.getObject( name ) );
String retval =  SQLTransformer.getStringValue( rs.getBytes( name ) );

and
retString = "B "+new String( (byte[]) object, "UTF8" );
(B is only for debugging)

And now the characters are decoded correctly.

I have no idea whether this is also the case with other databases,
but at least with MySQL 4.1.1 it works.

Any comments are welcome
Patrick

> -----Original Message-----
> From: Bertrand Delacretaz [mailto:bdelacretaz@codeconsult.ch]
> Sent: Thursday, 1 April 2004 07:24
> To: users@cocoon.apache.org
> Subject: Re: Unicode Umlauts/SQLTransformer
>
>
> On 31 Mar 2004, at 16:23, Patrick Schlaepfer wrote:
>
> > I noticed that SQLTransformer doesn't care
> > much about character encoding:
> >
> > String retval = SQLTransformer.getStringValue(rs.getObject(i));
> > and then returns a new String((byte[]) object)
>
> According to the Java API, this "Constructs a new String by decoding
> the specified array of bytes using the platform's default charset.".
>
> IIUC the platform's default charset is what can be set with the
> -Dfile.encoding parameter, so things should be fine *if* the encoding
> is correctly handled all the way down the pipeline. I don't know if
> this is the case though, you might want to test it by dumping the
> String at various stages or starting with minimal pipelines.
>
> OTOH I'm wondering if the use of rs.getObject(i) as opposed to
> rs.getString() isn't a problem regarding encoding. It would be
> interesting to compare the two, either in a simple test program outside
> of Cocoon, or by modifying the SQLTransformer to use rs.getString() if
> rs.getMetaData().getColumnType(i) says this is a String column.
>
> -Bertrand


Re: Unicode Umlauts/SQLTransformer

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On 31 Mar 2004, at 16:23, Patrick Schlaepfer wrote:

> I noticed that SQLTransformer doesn't care
> much about character encoding:
>
> String retval = SQLTransformer.getStringValue(rs.getObject(i));
> and then returns a new String((byte[]) object)

According to the Java API, this "Constructs a new String by decoding 
the specified array of bytes using the platform's default charset.".

IIUC the platform's default charset is what can be set with the  
-Dfile.encoding parameter, so things should be fine *if* the encoding 
is correctly handled all the way down the pipeline. I don't know if 
this is the case though, you might want to test it by dumping the 
String at various stages or starting with minimal pipelines.

OTOH I'm wondering if the use of rs.getObject(i) as opposed to 
rs.getString() isn't a problem regarding encoding. It would be 
interesting to compare the two, either in a simple test program outside 
of Cocoon, or by modifying the SQLTransformer to use rs.getString() if 
rs.getMetaData().getColumnType(i) says this is a String column.
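
The second option could look roughly like the sketch below. It is an untested illustration, not a patch against SQLTransformer: the class and method names are made up, and which java.sql.Types constants a given driver reports for its text columns is an assumption that would need checking.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Types;

// Hypothetical sketch: use getString() for character columns so the JDBC
// driver does the decoding, and fall back to getObject() for everything else.
final class ColumnValueSketch {

    /** True for the JDBC type codes treated here as character data. */
    static boolean isTextType(int sqlType) {
        switch (sqlType) {
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR:
                return true;
            default:
                return false;
        }
    }

    static String readColumn(ResultSet rs, int i) throws SQLException {
        if (isTextType(rs.getMetaData().getColumnType(i))) {
            return rs.getString(i);
        }
        // Binary columns (byte[]) would need their own handling; omitted here.
        Object value = rs.getObject(i);
        return value == null ? null : value.toString();
    }
}
```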

-Bertrand

