Posted to log4cxx-user@logging.apache.org by shadow king <sh...@gmail.com> on 2009/08/21 04:06:27 UTC

this is a bug in the encoding procedure

Hi,

I am Chinese, and I use log4cxx as the logging facility in my project (the
locale on my Linux server is set to "zh_CN.GBK").

When I switched to the 0.10.0 release (I used version 0.97 beta before), I ran
into a problem: none of the Chinese log messages produced by my program were
displayed correctly.

So I examined the source and found what I suspect is the cause of my
problem. The suspect code is:

std::string Transcoder::encodeCharsetName(const LogString& val) {
    char asciiTable[] = { ' ', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/',
                          '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?',
                          '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
                          'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_',
                          '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
                          'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', ' ' };
    std::string out;
    for(LogString::const_iterator iter = val.begin();
        iter != val.end();
        iter++) {
        if (*iter >= 0x30 && *iter < 0x7F) {
            out.append(1, asciiTable[*iter - 0x30]); // this is the problematic line of code for me.
        } else {
            out.append(1, LOSSCHAR);
        }
    }
    printf(out.c_str());
    return out;
}

I replaced the line "out.append(1, asciiTable[*iter - 0x30]);" with
"out.append(1, *iter);", and my problem was solved. (The input argument of
this function is "GBK" on my system. Before I changed the code, the function
returned "12;"; after the change, it returns "GBK", which is the result I
want.)

I don't understand why the charset name needs to be changed at all (for
fear of non-ASCII charset names? Even with that fear, I can't see the need
to change "GBK" into "12;").

Re: this is a bug in the encoding procedure

Posted by Curt Arnold <ca...@apache.org>.

On Aug 21, 2009, at 4:32 AM, shadow king wrote:

> Thanks for your reply.
>
> By the way, is there a convenient way to switch off all the "decoding &
> encoding" completely? I don't want the performance penalty imposed by
> those functions.
>
> I can't see the benefit of using Unicode as the internal charset in
> log4cxx; I just want log4cxx to log the messages without any charset
> conversion.
>
> On Fri, Aug 21, 2009 at 12:00 PM, Curt Arnold <ca...@apache.org> wrote:
>> I'm thinking the constant should be 0x20, not 0x30.  The code was an
>> attempt to be able to handle non-ASCII platforms like EBCDIC but
>> looks like it was mangled and was done without access to a non-ASCII
>> platform.  Was just trying to do enough decoding to get the encoding
>> name to load a full charset.
>

You can hardwire the assumed encoding with one of:

./configure --with-charset=utf-8
./configure --with-charset=usascii
./configure --with-charset=iso-8859-1

All three will replace conversion with glorified copy operations.

Specifying usascii will replace all non-ASCII characters with a loss
character ('?'), but if you specify a particular encoding for a file,
the resulting file will be valid.

utf-8 will blast characters directly into the internal representation.  If
you do not specify an encoding on any file appenders, the output file
will have the same charset as the platform.  Filters, XML files,
SocketAppenders, and anything with a specified encoding may be
invalid.  However, if you are just going straight through to a file, you
won't end up with loss characters.

iso-8859-1 will convert characters to utf-8.  If you do not specify
any encoding, the output file will have the same encoding as the
platform.  It won't result in illegal byte sequences like specifying
utf-8, but any explicit encoding may result in character substitution.
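
As a rough picture of what a "glorified copy" with loss characters could look
like for the usascii case, here is a minimal sketch. It is an illustration
only, not the log4cxx implementation: the function name copyAsUsAscii is
hypothetical, it works on plain std::string bytes rather than LogString, and
it assumes the loss character is '?'.

```cpp
#include <string>

// Hypothetical illustration of a usascii-style "copy": 7-bit bytes pass
// through unchanged, every other byte becomes the loss character '?'.
std::string copyAsUsAscii(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        // Bytes 0x80 and above are not ASCII, so they are replaced.
        out.push_back(c < 0x80 ? static_cast<char>(c) : '?');
    }
    return out;
}
```

Because this sketch works byte by byte, a two-byte GBK character comes out
as two '?' characters, while plain ASCII text is copied as-is.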

Re: this is a bug in the encoding procedure

Posted by shadow king <sh...@gmail.com>.

Thanks for your reply.

By the way, is there a convenient way to switch off all the "decoding &
encoding" completely? I don't want the performance penalty imposed by
those functions.

I can't see the benefit of using Unicode as the internal charset in
log4cxx; I just want log4cxx to log the messages without any charset
conversion.

On Fri, Aug 21, 2009 at 12:00 PM, Curt Arnold <ca...@apache.org> wrote:

> I'm thinking the constant should be 0x20, not 0x30.  The code was an
> attempt to be able to handle non-ASCII platforms like EBCDIC but looks like
> it was mangled and was done without access to a non-ASCII platform.  Was
> just trying to do enough decoding to get the encoding name to load a full
> charset.

Re: this is a bug in the encoding procedure

Posted by Curt Arnold <ca...@apache.org>.

I'm thinking the constant should be 0x20, not 0x30.  The code was an  
attempt to be able to handle non-ASCII platforms like EBCDIC but looks  
like it was mangled and was done without access to a non-ASCII  
platform.  Was just trying to do enough decoding to get the encoding  
name to load a full charset.
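
To make the effect of the constant concrete, here is a minimal standalone
sketch of the lookup. It is not the log4cxx source: encodeName is a
hypothetical stand-in for Transcoder::encodeCharsetName, std::string replaces
LogString, LOSSCHAR is assumed to be '?', and the offset is a parameter only
so that both constants can be compared. It also assumes the constant changes
in both the range check and the index.

```cpp
#include <string>

// Hypothetical stand-in for Transcoder::encodeCharsetName.
// The table starts at ' ' (0x20), so the correct index of a printable
// ASCII character c is (c - 0x20). `offset` is the constant subtracted
// before the lookup.
std::string encodeName(const std::string& val, int offset) {
    static const char asciiTable[] =
        " !\"#$%&'()*+,-./0123456789:;<=>?"
        "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
        "`abcdefghijklmnopqrstuvwxyz{|}~ ";
    const char lossChar = '?';  // assumed value of LOSSCHAR
    std::string out;
    for (std::string::const_iterator it = val.begin(); it != val.end(); ++it) {
        const int c = static_cast<unsigned char>(*it);
        if (c >= offset && c < 0x7F) {
            // With offset 0x30 the lookup lands 0x10 entries too early.
            out.append(1, asciiTable[c - offset]);
        } else {
            out.append(1, lossChar);
        }
    }
    return out;
}
```

With the table as posted, encodeName("GBK", 0x30) works out to "72;", since
each letter is shifted down by 0x10 into the digits and punctuation (the same
kind of mangling as the "12;" reported). encodeName("GBK", 0x20) returns
"GBK" unchanged.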

