
Posted to dev@hc.apache.org by Sung-Gu <je...@apache.org> on 2003/02/04 11:37:21 UTC

Re: The use of URIUtil.toUsingCharset?

Hi Oleg,
Again... well..
Ok... let me try to make you understand it again.  HmmHmm...

BTW, sorry that I couldn't get you to understand it right away
last time, even with a diagram, and still can't...  :(

Actually, that's very easy...
And not that important unless it's going to support multilingual use.

As you can see in the diagram, the byte information created from the original charset
should be restored.  That's all.

There isn't any single universal charset that supports all the various charsets. (Take that as given!)
Then, once it has been transformed, it should be transformed back to the original.
That turns the transformed one back into the original one.

If you understand this theory, you should try to understand why there would
need to be some container for protocol processing and the document browser
thingy.

Am I supposed to find you a way to transform from and to Unicode?
Was that your question?
But I really don't get what you are asking me.
Because that's not what it's for. There isn't that kind of thing.
(You should assume that and agree with me!)

Then... why is that required?  Please guess what for.
I said that it's required to support multiple languages!
Then why don't you try to do that?  It's simple.

Well, I'm afraid you still don't understand it...

Sung-Gu

P.S.: This time I left all the technical issues out of my message.
  I hope that helps you...


> ----- Original Message -----
> From: "Kalnichevski, Oleg" <ol...@bearingpoint.com>
> To: "Commons HttpClient Project" <co...@jakarta.apache.org>
> Sent: Tuesday, January 28, 2003 5:37 PM
> Subject: RE: The use of URIUtil.toUsingCharset?
>
>
> Sung-Gu
> You are right. The examples I presented are meaningless. They are
> meaningless because the URIUtil.toUsingCharset method is meaningless in the
> first place. I did my best to explain why.
>
> Again, please give me an example (or better, a unit test) demonstrating a
> meaningful transformation of one Unicode string into another Unicode string
> using the method in question.
>
> Oleg
>
> -----Original Message-----
> From: Sung-Gu [mailto:jericho@apache.org]
> Sent: Monday, January 27, 2003 06:01
> To: Commons HttpClient Project
> Subject: Re: The use of URIUtil.toUsingCharset?
>
>
> Hi,
>
> I'm sorry that I didn't get your point...
> You're interested only in single-byte encodings with Unicode.
> I hadn't realized that...
>
> That's why you haven't seen the correct use and display of that method.
> I guessed as much, though. (So I tried to display the byte code values.)
>
> And I'd like to point out that your examples below are not
> a correct use...   They're meaningless...
> For display (which is what you want, I guess), you should use a code set
> or charset supported by your operating system, or ISO-8859-1.
> UTF-8 is meant to be used only for transformation
> for storage and transmission.
> If you want to use Unicode for display, ISO-10646 is
> fully supported, and the transformation to UTF-8 should be applied
> from UCS....
>
> I put it in as a TODO comment for simple diagram 2 in the text file.
>  It was not really my issue previously.
> (As you know, I'm interested in double-byte encodings...
>  and that would be the general way to solve character encoding.)
> I'll do it sometime later...
>
> Sung-Gu
>
> ----- Original Message -----
> From: <o....@dplanet.ch>
> Subject: Re: The use of URIUtil.toUsingCharset?
>
>
> Please take no offense, but URIUtil.toUsingCharset method still does not
> make even slightest sense to me. Your example shows how to invoke this
> method but does not explain what it is useful for, apart from garbling
> unicode strings
>
> Have a look at a simpler example. Here I attempt to (supposedly) convert
> "Zurich" from one encoding into another. However, as you can see
> URIUtil.toUsingCharset() always produces garbage
>
> ===================================================================
> public static void main(String[] args) throws Exception
> {
>   System.out.println(
>     URIUtil.toUsingCharset("Zurich", "UTF-8", "US-ASCII"));
>   System.out.println(
>     URIUtil.toUsingCharset("Zurich", "ASCII", "UTF-8"));
>   System.out.println(
>     URIUtil.toUsingCharset("Zurich", "UTF-8", "ISO-8859-1"));
>   System.out.println(
>     URIUtil.toUsingCharset("Zurich", "ISO-8859-1", "UTF-8"));
> }
>
>
> Output:
>
> Zi¿½i¿½rich
> Z?rich
> ZA&#131;A¼rich
> Zi¿½
>
> =================================================================
>
> Java uses 16 bits to represent characters. Therefore the concept of character
> encoding is only applicable when working with arrays of bytes, 8-bit units,
> that represent a sequence of characters. One indeed needs to take character
> encoding into account when converting from byte[] to String or vice versa.
> However, converting from Unicode String to an array of bytes to a Unicode
> String using different encoding (especially in one method call), in my
> opinion, does not produce any sensible results.
>
> If you see things differently, please help me understand what
> URIUtil.toUsingCharset() can be USEFUL for
>
> Cheers
>
> Oleg

Re: The use of URIUtil.toUsingCharset?

Posted by Oleg Kalnichevski <o....@dplanet.ch>.

Hi Sung-Gu

On Tue, 2003-02-04 at 11:37, Sung-Gu wrote:
> Hi Oleg,
> Again... well..
> Ok... let me try to make you understand it again.  HmmHmm...
> 

Let's assume I am stupid

> BTW, sorry that I couldn't get you to understand it right away
> last time, even with a diagram, and still can't...  :(
> 

Let's assume I am VERY stupid

> Actually, that's very easy...
> And not that important unless it's going to support multilingual use.
> 

C'mon, Java uses Unicode natively to represent strings. I'd like to hope
you are familiar with the concept of Unicode. Unicode automatically
enables multilingual support for all Java String objects. The concept of
character encoding is applicable only to String to byte[] or byte[] to
String transformations.

Think it over

Oleg
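
A minimal Java sketch of the point made in the message above, namely that a charset only matters at the String/byte[] boundary. The class and method names (EncodingBoundary, roundTripUtf8) are made up for this illustration and are not part of HttpClient; java.nio.charset.StandardCharsets requires Java 7 or later, which postdates this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingBoundary {
    /** Encodes text as UTF-8 and decodes the bytes with the same charset. */
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Zürich"; the escape keeps this source file ASCII-only
        String city = "Z\u00fcrich";

        byte[] utf8 = city.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = city.getBytes(StandardCharsets.ISO_8859_1);

        // Same characters, different byte sequences
        System.out.println(utf8.length);                 // 7: u-umlaut takes two bytes in UTF-8
        System.out.println(latin1.length);               // 6: one byte per character in ISO-8859-1
        System.out.println(Arrays.equals(utf8, latin1)); // false

        // Decoding with the charset that produced the bytes restores the String
        System.out.println(roundTripUtf8(city).equals(city)); // true
    }
}
```

The String itself never carries a charset; only the byte arrays do, so the charset has to be remembered alongside the bytes.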

Re: The use of URIUtil.toUsingCharset?

Posted by Ortwin Glück <or...@nose.ch>.

Sung-Gu wrote:
> ----- Original Message -----
> From: "Ortwin Glück" <or...@nose.ch>
> 
> Arrrg...  again...  :(
> Not surprising though...  :(((

Sung-Gu, I don't want to upset you. I just want to understand the 
problem that you are trying to solve with toUsingCharset. Your 
explanations did not help so far. Call me stupid but I guess I am not 
the only one here who doesn't understand the problem. (if I am wrong 
could someone else please tell me)

>>You speak of "transformation". What sort of transformation is that? The
> import sun.nio.cs.StandardCharsets;

Maybe you could just answer the following questions with yes or no each:

1. Is the problem related to characters that have no Unicode code 
assigned?

2. Is the problem that you want to pass non ISO-8859-1 data in POST or 
GET parameters?

3. Is a String object capable of containing characters that have no 
Unicode representation?

4. Is a byte[] capable of containing characters that have no Unicode 
representation?

---------
         CharsetProvider standardProvider = new StandardCharsets();
         for (Iterator i = standardProvider.charsets(); i.hasNext();) {
             System.out.println(i.next());
         }

What do you get from it?
And what can you do with them?
Could you please explain that to me?
------

A Charset instance can convert String objects to byte[] and vice versa 
using a specific encoding. Charset instances are created by the 
CharsetProvider. These classes are new as of JDK 1.4. In earlier JDKs 
these interfaces were buried deep inside the Sun implementation and not 
for public use.

HTH

Odi
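
As a rough illustration of the Charset description above, here is a small Java sketch using only the public java.nio.charset API. The class name CharsetDemo and its two helper methods are invented for this example. It lists charsets through Charset.availableCharsets(), the public route, rather than through a provider class.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsetDemo {
    /** Encodes text into bytes with the named charset. */
    static byte[] encode(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        ByteBuffer buf = cs.encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    /** Decodes bytes back into a String with the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        CharBuffer chars = cs.decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }

    public static void main(String[] args) {
        // Every charset this JVM offers, by canonical name
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }

        // "Glück": five characters, six bytes in UTF-8
        byte[] bytes = encode("Gl\u00fcck", "UTF-8");
        System.out.println(bytes.length); // 6
        System.out.println(decode(bytes, "UTF-8").equals("Gl\u00fcck")); // true
    }
}
```

The list printed by the loop depends on the JVM; only a handful of charsets such as UTF-8, US-ASCII and ISO-8859-1 are guaranteed to be present.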

Re: The use of URIUtil.toUsingCharset?

Posted by Sung-Gu <je...@apache.org>.

----- Original Message -----
From: "Ortwin Glück" <or...@nose.ch>

Arrrg...  again...  :(
Not surprising though...  :(((

> by the String class. You must use byte[] in this case.
It was...

> You speak of "transformation". What sort of transformation is that? The

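import sun.nio.cs.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Iterator;

public class ListCharsets {
    public static void main(String[] args) {
        // sun.nio.cs.StandardCharsets is a JDK-internal provider (JDK 1.4)
        CharsetProvider standardProvider = new StandardCharsets();
        // Print every charset that the provider supplies
        for (Iterator i = standardProvider.charsets(); i.hasNext();) {
            System.out.println(i.next());
        }
    }
}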

What do you get from it?
And what can you do with them?
Could you please explain that to me?

Sung-Gu

P.S.: BTW, it's almost time to go home...

Re: The use of URIUtil.toUsingCharset?

Posted by Ortwin Glück <or...@nose.ch>.

Sung-Gu wrote:
> There isn't any single universal charset that supports all the various charsets. (Take that as given!)
> Then, once it has been transformed, it should be transformed back to the original.
> That turns the transformed one back into the original one.

Sung-Gu,

I have problems understanding your English and I can only guess what you 
want to say.

Do you mean that there are characters that have no representation in 
Unicode? Your method uses String objects, which means Unicode! If there 
are characters not present in the Unicode set, they can not be handled 
by the String class. You must use byte[] in this case.

You speak of "transformation". What sort of transformation is that? The 
only "transformation" your method does is, it replaces some characters 
with '?'.

Odi

Re: The use of URIUtil.toUsingCharset?

Posted by Ortwin Glück <or...@nose.ch>.

Thanks Laura for this excellent explanation. This really helps to clear 
things up! I am glad to have you and your in-depth Unicode knowledge on 
the list.

I always thought you could roundtrip any charset to Unicode and get the 
same thing back. This is obviously wrong. It should be easy to write a 
test case for this once we have some of those characters.

Sung-Gu: Could you please post some of these problematic characters (hex 
values in different encodings and Unicode)? You are probably the only 
one who has knowledge of Asian languages here.

Hopefully we can find an adequate solution for the problem now.

Cheers

Odi
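
One possible shape for the round-trip test case mentioned above, sketched in Java. No problematic characters have been posted in this thread, so the example stands in with a byte that US-ASCII cannot decode; it does not demonstrate an actual Han unification case. The class name RoundTripCheck and the method roundTrips are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    /** Returns true if decoding the bytes and re-encoding them reproduces the input exactly. */
    static boolean roundTrips(byte[] original, Charset charset) {
        String decoded = new String(original, charset);
        byte[] reencoded = decoded.getBytes(charset);
        return Arrays.equals(original, reencoded);
    }

    public static void main(String[] args) {
        // "Zürich" as ISO-8859-1 bytes; 0xFC is the u-umlaut
        byte[] bytes = { 'Z', (byte) 0xFC, 'r', 'i', 'c', 'h' };

        // ISO-8859-1 maps every byte to a character and back
        System.out.println(roundTrips(bytes, StandardCharsets.ISO_8859_1)); // true

        // US-ASCII cannot decode 0xFC; it becomes U+FFFD and is re-encoded as '?'
        System.out.println(roundTrips(bytes, StandardCharsets.US_ASCII));   // false
    }
}
```

Given real sample bytes in Big-5 or a JIS charset, the same helper could be called with Charset.forName(...) for that charset, provided the JVM ships it.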

Re: The use of URIUtil.toUsingCharset?

Posted by Oleg Kalnichevski <o....@dplanet.ch>.

Laura

Finally, there's someone who can read Sung-Gu's mind! 

All right. A simple phrase "There are charsets that are not adequately
represented in Unicode" by Sung-Gu would have put the discussion into a
completely different perspective. And of course, Sung-Gu's stoical
refusal to provide a test case for the method did not help either. 

Many thanks

Oleg 



On Tue, 2003-02-04 at 22:51, Laura Werner wrote:
> Hi Sung-Gu,
> 
> >Actually, that's very easy...
> >And not that important unless it's going to support multilingual use.
> >
> >As you can see in the diagram, the byte information created from the original charset
> >should be restored.  That's all.
> >  
> >
> My understanding of what you're saying is that if someone constructs a 
> URI using escaped characters in a particular charset (e.g. Big-5), using 
> the URI(char[] escaped) constructor, then URI needs to preserve those 
> characters.  If someone asks for the URI back as an escaped string in 
> the original charset (e.g. Big-5 again), we need to give them the 
> *exact* original string; it's not good enough to transcode from the 
> escaped Big-5 string to Unicode and back to Big-5.  Is this correct?
> 
> If this is true, I have a few comments on why this matters...
> 
> -- First, for those who don't understand why you can't just convert 
> everything to Unicode and stop worrying, there is some sense behind 
> this.  When Unicode was invented, the far-east languages were "Unified" 
> into the Han block of Unicode.  Some characters that have distinct codes 
> in the native double-byte character sets were mapped to single Unicode 
> characters.  This meant that some native character sets wouldn't round 
> trip to Unicode and back.  It was essentially a political compromise -- 
> the Unicode folks needed to save space in the 64k base plane, so they 
> merged Han characters that meant very similar things and looked almost 
> exactly the same.  (Emphasis "similar" and "almost".)  But in native 
> charsets that didn't need to have room for Korean and Cyrillic and all 
> the other stuff that's in Unicode, there's room to split out multiple 
> versions of these characters that are merged together.
> 
> -- There are also a few new character sets like JIS-212 that contain 
> characters (like Japanese dental symbols, believe it or not) that 
> haven't been encoded in Unicode yet.  Presumably we'd want to keep the 
> encoded URI string around so that we can preserve this kind of character.
> 
> (In a past life I managed the Unicode group at IBM, and I remember far 
> more of this stuff than I thought I did.)
> 
> A few comments on URI.java and URIUtil.java
> 
> -- I think the comments need to be greatly improved.  It's very hard to 
> figure out what many of the methods do.  In the cases where I can figure 
> out what they do, it's hard to figure out *why*. 
> 
> -- It would be nice if the documentation explained the charset concepts: 
> What is a document charset and a protocol charset and so on.  A 
> reference to the RFC is nice, but a more concise explanation in the 
> JavaDoc would be better.
> 
> Laura, hoping I helped answer part of the "why" here, at least
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-httpclient-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-httpclient-dev-help@jakarta.apache.org
>

Re: The use of URIUtil.toUsingCharset?

Posted by Laura Werner <la...@lwerner.org>.

Oleg Kalnichevski wrote:

>I apologize for restarting this conversation, but I have to confess I
>found myself not intelligent enough to be able to grasp grand designs of
>the URIUtil#toUsingCharset method
>

Not a problem.  And it's not intelligence; a) URI and URIUtil are not 
well documented, and b) character sets are a very messy area.  Life 
would be a lot easier if everyone just switched to Unicode, but there's 
a lot of resistance (IMHO mostly political) to doing so.

>Ok. If I understand you right, you are saying that there are charsets
>that are inadequately represented in Unicode or not represented at all.
>
Yes.  This explains why URI preserves the original string you pass in, 
with the escape sequences in it.  For example, if someone passes in a 
URI with % escapes in it, and the URI charset is JIS, you'd want to 
provide a way of accessing the original escaped string, and that's 
probably what you'd want to pass to the web server.

>Absolutely fine with me. So, URIUtil#toUsingCharset is supposedly needed
>to help preserve those characters when performing charset translations.
>
I've never been able to figure out *what* most of the charset methods in 
URIUtil are supposed to do, actually.  The bit that confuses me is the 
same thing that's confusing you, I think...

>    return new String(target.getBytes(fromCharset), toCharset);
>  
>
That's the crux of the matter right there.  What the 
target.getBytes(fromCharset) does is ask the original "target" Unicode 
String (presumably containing % escapes) to convert itself to its byte 
representation in the original charset.  Then "new String(..., 
toCharset)" creates a new Unicode string while pretending those very same 
bytes we just created are in "toCharset", which is presumably a 
different charset.  Any Unicode characters that have different encodings 
in those two character sets will end up changing in the second string, 
because the bytes will be written into the byte array using one 
character set, and then interpreted using another character set.  And 
since some character set encodings are stateful, it's conceivable that 
you could even have "fromCharset" and "toCharset" values that caused the 
new String construction to blow up because the byte array was invalid 
for the toCharset decoder.

The part I'm having trouble with is *why* you'd want to do this.  The 
whole point of Unicode (or one of them) is so that you don't have to 
remember what charset your byte arrays are in.  Once you convert from a 
String to a byte array, you need to preserve the charset of that byte 
array.  Suddenly pretending it's in a different charset is just going to 
screw things up.

I think I need to go read RFCs 1738 and 1808 and see if they're at all 
enlightening on this subject.

-- Laura
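
To make the behaviour described above concrete, here is a small Java sketch that repeats the encode-then-reinterpret step with standard charsets. The class Reinterpret and both methods are illustrative names only and are not HttpClient code. The strict variant uses a CharsetDecoder configured to report errors, which is one way, assumed here, to surface the "invalid byte array" case as an exception; the String(byte[], Charset) constructor substitutes a replacement character instead of throwing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Reinterpret {
    /** Same idea as the quoted method body: encode with one charset, decode with another. */
    static String reinterpret(String target, Charset from, Charset to) {
        return new String(target.getBytes(from), to);
    }

    /** Strict variant: throws instead of substituting when the bytes are invalid for 'to'. */
    static String reinterpretStrict(String target, Charset from, Charset to)
            throws CharacterCodingException {
        return to.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(target.getBytes(from)))
                .toString();
    }

    public static void main(String[] args) {
        String city = "Z\u00fcrich"; // "Zürich"

        // The UTF-8 bytes C3 BC are read back as two ISO-8859-1 characters
        String garbled = reinterpret(city, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());                    // 7
        System.out.println(garbled.equals("Z\u00c3\u00bcrich")); // true

        // The ISO-8859-1 byte FC is not valid UTF-8, so the strict decoder rejects it
        try {
            reinterpretStrict(city, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("invalid for target charset");
        }
    }
}
```

Neither variant recovers anything the input String did not already contain; the output is still an ordinary Unicode String.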

Re: The use of URIUtil.toUsingCharset?

Posted by Oleg Kalnichevski <o....@dplanet.ch>.

Laura

I apologize for restarting this conversation, but I have to confess I
found myself not intelligent enough to be able to grasp grand designs of
the URIUtil#toUsingCharset method

Sung-Gu apparently is too proud or too busy to spend his precious time
on such trifles as writing test cases or talking to such
primitive-minded fellas as me. I have no other choice but to turn to you
for guidance.

Ok. If I understand you right, you are saying that there are charsets
that are inadequately represented in Unicode or not represented at all.
Absolutely fine with me. So, URIUtil#toUsingCharset is supposedly needed
to help preserve those characters when performing charset translations.
Do I get it right?

Please have a look at the source code, though.

public static String toUsingCharset(
  String target, String fromCharset, String toCharset)
   throws URIException {
  try {
    //=======================================================
    return new String(target.getBytes(fromCharset), toCharset);
    //=======================================================
  } catch (UnsupportedEncodingException error) {
    throw new URIException(URIException.UNSUPPORTED_ENCODING,
     error.getMessage());
  }
}

As far as I can interpret these statements, a Unicode string is given as
input and another Unicode string is given back as output. 

My apologies, but was not the main thesis here that certain characters
simply cannot be represented in Unicode and therefore
URIUtil#toUsingCharset was intended to address the problem?

Help! I must be really stupid, but I can't see how a direct translation
(not URLEncoding!!!!) of a Unicode string to byte array and back to a
Unicode string is supposed to help here.

I REALLY want to understand. Please help me

Oleg





Re: The use of URIUtil.toUsingCharset?

Posted by Sung-Gu <je...@apache.org>.

----- Original Message -----
From: "Laura Werner" <la...@lwerner.org>


> Hi Sung-Gu,
>
> >Actually, that's very easy...
> >And not that important unless it's going to support multilingual use.
> >
> >As you can see in the diagram, the byte information created from the original charset
> >should be restored.  That's all.
> >
> >
> My understanding of what you're saying is that if someone constructs a
> URI using escaped characters in a particular charset (e.g. Big-5), using
> the URI(char[] escaped) constructor, then URI needs to preserve those
> characters.  If someone asks for the URI back as an escaped string in
> the original charset (e.g. Big-5 again), we need to give them the
> *exact* original string; it's not good enough to transcode from the
> escaped Big-5 string to Unicode and back to Big-5.  Is this correct?
>
> If this is true, I have a few comments on why this matters...
>
> -- First, for those who don't understand why you can't just convert
> everything to Unicode and stop worrying, there is some sense behind
> this.  When Unicode was invented, the far-east languages were "Unified"
> into the Han block of Unicode.  Some characters that have distinct codes
> in the native double-byte character sets were mapped to single Unicode
> characters.  This meant that some native character sets wouldn't round
> trip to Unicode and back.  It was essentially a political compromise --
> the Unicode folks needed to save space in the 64k base plane, so they
> merged Han characters that meant very similar things and looked almost
> exactly the same.  (Emphasis "similar" and "almost".)  But in native
> charsets that didn't need to have room for Korean and Cyrillic and all
> the other stuff that's in Unicode, there's room to split out multiple
> versions of these characters that are merged together.
>
> -- There are also a few new character sets like JIS-212 that contain
> characters (like Japanese dental symbols, believe it or not) that
> haven't been encoded in Unicode yet.  Presumably we'd want to keep the
> encoded URI string around so that we can preserve this kind of character.
>
> (In a past life I managed the Unicode group at IBM, and I remember far
> more of this stuff than I thought I did.)

Excellent explanation!
It is also described at a URL that I pointed to on this mailing list
before.
I think yours is much nicer! ;)

> A few comments on URI.java and URIUtil.java
>
> -- I think the comments need to be greatly improved.  It's very hard to


Just adding comments isn't enough... I think...
Some explanation of this is already written in the URI class for people
to notice...    and something is still left to do... as you say...

> figure out what many of the methods do.  In the cases where I can figure
> out what they do, it's hard to figure out *why*.

>
> -- It would be nice if the documentation explained the charset concepts:
> What is a document charset and a protocol charset and so on.  A
> reference to the RFC is nice, but a more concise explanation in the
> JavaDoc would be better.

Actually, my problem is that I just know how to do it, I guess.
It's hard for me to understand people who haven't experienced that....
I think I will get a chance to do it sometime later...

> Laura, hoping I helped answer part of the "why" here, at least

Thank you very much, Laura! ;)

Sung-Gu

Re: The use of URIUtil.toUsingCharset?

Posted by Laura Werner <la...@lwerner.org>.

Hi Sung-Gu,

>Actually, that's very easy...
>And not that important unless it's going to support multilingual use.
>
>As you can see in the diagram, the byte information created from the original charset
>should be restored.  That's all.
>  
>
My understanding of what you're saying is that if someone constructs a 
URI using escaped characters in a particular charset (e.g. Big-5), using 
the URI(char[] escaped) constructor, then URI needs to preserve those 
characters.  If someone asks for the URI back as an escaped string in 
the original charset (e.g. Big-5 again), we need to give them the 
*exact* original string; it's not good enough to transcode from the 
escaped Big-5 string to Unicode and back to Big-5.  Is this correct?

If this is true, I have a few comments on why this matters...

-- First, for those who don't understand why you can't just convert 
everything to Unicode and stop worrying, there is some sense behind 
this.  When Unicode was invented, the far-east languages were "Unified" 
into the Han block of Unicode.  Some characters that have distinct codes 
in the native double-byte character sets were mapped to single Unicode 
characters.  This meant that some native character sets wouldn't round 
trip to Unicode and back.  It was essentially a political compromise -- 
the Unicode folks needed to save space in the 64k base plane, so they 
merged Han characters that meant very similar things and looked almost 
exactly the same.  (Emphasis "similar" and "almost".)  But in native 
charsets that didn't need to have room for Korean and Cyrillic and all 
the other stuff that's in Unicode, there's room to split out multiple 
versions of these characters that are merged together.

-- There are also a few new character sets like JIS-212 that contain 
characters (like Japanese dental symbols, believe it or not) that 
haven't been encoded in Unicode yet.  Presumably we'd want to keep the 
encoded URI string around so that we can preserve this kind of character.

(In a past life I managed the Unicode group at IBM, and I remember far 
more of this stuff than I thought I did.)

A few comments on URI.java and URIUtil.java

-- I think the comments need to be greatly improved.  It's very hard to 
figure out what many of the methods do.  In the cases where I can figure 
out what they do, it's hard to figure out *why*. 

-- It would be nice if the documentation explained the charset concepts: 
What is a document charset and a protocol charset and so on.  A 
reference to the RFC is nice, but a more concise explanation in the 
JavaDoc would be better.

Laura, hoping I helped answer part of the "why" here, at least
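
A minimal Java sketch of the "keep the escaped string around" idea from the message above. EscapedText is a hypothetical holder written for this illustration; it is not the HttpClient URI class and makes no claim about how URI works internally. ISO-8859-1 stands in for Big-5 or a JIS charset because those are not guaranteed to be available in every JVM.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder: the escaped form is stored untouched and never regenerated. */
class EscapedText {
    private final String escaped; // exactly what the caller passed in

    EscapedText(String escaped) {
        this.escaped = escaped;
    }

    /** The original escaped string, returned character for character as given. */
    String getEscaped() {
        return escaped;
    }

    /** Resolves %XX escapes to raw bytes; all other characters are assumed to be ASCII. */
    byte[] rawBytes() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                // Throws NumberFormatException if the two characters are not hex digits
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    /** Decoded view for display; possibly lossy, so it never replaces the escaped form. */
    String decoded(Charset charset) {
        return new String(rawBytes(), charset);
    }

    public static void main(String[] args) {
        EscapedText text = new EscapedText("Z%FCrich");
        System.out.println(text.decoded(StandardCharsets.ISO_8859_1).equals("Z\u00fcrich")); // true
        System.out.println(text.getEscaped()); // Z%FCrich
    }
}
```

Because getEscaped() hands back the stored string rather than re-encoding the decoded view, a charset that does not round trip through Unicode cannot alter what is eventually sent to the server.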