Posted to log4cxx-user@logging.apache.org by Marshall Powers <mp...@appsecinc.com> on 2007/06/25 18:20:33 UTC

Problem with iconv charsets...

I'm trying to use APR-1.2.7 in Log4Cxx 0.10 on AIX 5.3. When I run my
program, I get an exception "APR_LOCALE_CHARSET" in the createDefaultEncoder
method. I think this problem is related to the iconv that is installed on
this machine. When I run iconv -l, among the various charsets I see
"ISO8859-1". However, in the source for Log4Cxx and APR, the only string
literals I see are for "ISO-8859-1" (note the extra dash). Is there any
simple way to work around this problem? Is this potentially a portability
issue with APR/log4cxx (that is, if I distribute some app that uses APR, and
my user doesn't have "ISO-8859-1" in their iconv, is my app going to crash?)


Thanks,
Marshall Powers

Re: Problem with iconv charsets...

Posted by Martin Sebor <se...@roguewave.com>.

William A. Rowe, Jr. wrote:
> William A. Rowe, Jr. wrote:
>> Some thoughts;
>>
>>  * At run-time this should probably be determined by parsing first the
>>    LC_CTYPE, or LC_ALL in its absence, or the fallback to the LANG
>>    envvar if neither LC_ variable is defined.  The codepage follows
>>    the period, e.g. LANG=en_US.UTF-8 would be parsed as 'UTF-8'.
> 
> FYI - I pondered LC_COLLATE, but it didn't seem to particularly apply.
> 
> The obvious question, if LC_CTYPE specifies a language/no charset, then
> do we drill down to LC_ALL, LANG etc?

The character set of a locale is determined by the LC_CTYPE
category. On POSIX platforms it can be retrieved by passing
the CODESET constant to nl_langinfo().
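
For illustration only, here is a minimal sketch of that lookup. The
function name codeset_from_user_locale is made up for this example and is
not part of APR or log4cxx; setlocale() and nl_langinfo() are the standard
POSIX calls.

```c
#include <langinfo.h>
#include <locale.h>

/* Adopt the locale named by the environment for character handling, then
 * report its codeset (for example "UTF-8" or "ISO8859-1").
 * Hypothetical helper: the name is an assumption, not an APR function.
 * The returned string belongs to the C library and may be overwritten by
 * later calls to setlocale() or nl_langinfo(). */
static const char *codeset_from_user_locale(void)
{
    /* If the environment names a locale the system does not support,
     * setlocale() returns NULL and the current locale is left unchanged. */
    (void)setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

Note that setlocale() changes process-wide state, so a library would
normally leave that call to the application and only call nl_langinfo().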

Martin

Re: Problem with iconv charsets...

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

William A. Rowe, Jr. wrote:
> Some thoughts;
> 
>  * At run-time this should probably be determined by parsing first the
>    LC_CTYPE, or LC_ALL in its absence, or the fallback to the LANG
>    envvar if neither LC_ variable is defined.  The codepage follows
>    the period, e.g. LANG=en_US.UTF-8 would be parsed as 'UTF-8'.

FYI - I pondered LC_COLLATE, but it didn't seem to particularly apply.

The obvious question, if LC_CTYPE specifies a language/no charset, then
do we drill down to LC_ALL, LANG etc?

Re: Problem with iconv charsets...

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Jeff Trawick wrote:
> 
> apr_os_default_encoding() needs to return something that can be passed
> to apr_xlate_open() on the current platform in order to translate
> compiled-in strings.

Similarly, it's used to present filesystem names.  So perhaps this needs
to be taken a step further: multiple apr_os_*_encoding() results, one per
purpose?

Re: Problem with iconv charsets...

Posted by Jeff Trawick <tr...@gmail.com>.

On 6/26/07, William A. Rowe, Jr. <wr...@rowe-clan.net> wrote:
> Eric Covener wrote:
> > On 6/25/07, William A. Rowe, Jr. <wr...@rowe-clan.net> wrote:
> >>  * At run-time this should probably be determined by parsing first the
> >>    LC_CTYPE, or LC_ALL in its absence, or the fallback to the LANG
> >>    envvar if neither LC_ variable is defined.  The codepage follows
> >>    the period, e.g. LANG=en_US.UTF-8 would be parsed as 'UTF-8'.
> >
> > Wouldn't runtime checks mean xlate/xlate.c needs to find a new
> > way to figure out what the codepage of the source code was (to
> > translate compiled-in strings)?
> >
> > Perhaps APR_DEFAULT_CHARSET could be split into two different
> > identifiers APR_CURRENT_CHARSET/APR_BUILD_CHARSET that xlate callers
> > would have to think about.
>
> I'm confused.  APR messages are all English (regrettably) in US-ASCII.

Taking that and massaging just a bit: APR strings in the source code
are either in US-ASCII or EBCDIC (simplifying just a bit on the
latter).

apr_os_default_encoding() needs to return something that can be passed
to apr_xlate_open() on the current platform in order to translate
compiled-in strings.

Re: Problem with iconv charsets...

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Eric Covener wrote:
> On 6/25/07, William A. Rowe, Jr. <wr...@rowe-clan.net> wrote:
>>  * At run-time this should probably be determined by parsing first the
>>    LC_CTYPE, or LC_ALL in its absence, or the fallback to the LANG
>>    envvar if neither LC_ variable is defined.  The codepage follows
>>    the period, e.g. LANG=en_US.UTF-8 would be parsed as 'UTF-8'.
> 
> Wouldn't runtime checks mean xlate/xlate.c needs to find a new
> way to figure out what the codepage of the source code was (to
> translate compiled-in strings)?
> 
> Perhaps APR_DEFAULT_CHARSET could be split into two different
> identifiers APR_CURRENT_CHARSET/APR_BUILD_CHARSET that xlate callers
> would have to think about.

I'm confused.  APR messages are all English (regrettably) in US-ASCII.
clib-errstring messages should respect LC_CTYPE for most modern, dynamic
c libraries, no?

Re: Problem with iconv charsets...

Posted by Eric Covener <co...@gmail.com>.

On 6/25/07, William A. Rowe, Jr. <wr...@rowe-clan.net> wrote:
>  * At run-time this should probably be determined by parsing first the
>    LC_CTYPE, or LC_ALL in its absence, or the fallback to the LANG
>    envvar if neither LC_ variable is defined.  The codepage follows
>    the period, e.g. LANG=en_US.UTF-8 would be parsed as 'UTF-8'.

Wouldn't runtime checks mean xlate/xlate.c needs to find a new
way to figure out what the codepage of the source code was (to
translate compiled-in strings)?

Perhaps APR_DEFAULT_CHARSET could be split into two different
identifiers APR_CURRENT_CHARSET/APR_BUILD_CHARSET that xlate callers
would have to think about.

-- 
Eric Covener
covener@gmail.com

Re: Problem with iconv charsets...

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Marshall Powers wrote:
> The string literal "ISO-8859-1" appears in APR and log4cxx source code. For
> example, from apr-1.2.7/misc/unix/charset.c:
> 
> APR_DECLARE(const char*) apr_os_default_encoding (apr_pool_t *pool)
> {
> #ifdef __MVS__
> #    ifdef __CODESET__
>         return __CODESET__;
> #    else
>         return "IBM-1047";
> #    endif
> #endif
> 
>     if ('}' == 0xD0) {
>         return "IBM-1047";
>     }
> 
>     if ('{' == 0xFB) {
>         return "EDF04";
>     }
> 
>     if ('A' == 0xC1) {
>         return "EBCDIC"; /* not useful */
>     }
> 
>     if ('A' == 0x41) {
>         return "ISO-8859-1"; /* not necessarily true */
>     }
> 
> Are these files generated by configure scripts/ant build files? It doesn't
> seem like they are...

Nope.  That is raw, native hackery in an effort not to think through the
problem set.  As with all APR code, patches are welcome.

Some thoughts;

 * At run-time this should probably be determined by parsing first the
   LC_CTYPE, or LC_ALL in its absence, or the fallback to the LANG
   envvar if neither LC_ variable is defined.  The codepage follows
   the period, e.g. LANG=en_US.UTF-8 would be parsed as 'UTF-8'.

 * It's reasonably trivial, if iconv is present, to validate the -fallback-
   charset name against iconv within autoconf, presuming this even should
   be ISO-8859-1.
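
A rough sketch of the environment parsing from the first bullet, to make
the idea concrete. Both function names are invented for this example and
are not APR code. One deliberate difference from the order given above:
POSIX gives LC_ALL priority over LC_CTYPE, and LANG is only the last
resort, so the sketch checks the variables in that order. It also stops at
the first variable that is set and non-empty, because that variable alone
selects the locale; if it names no codeset the sketch returns NULL rather
than drilling down further, and the caller has to fall back to something
else.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Copy the codeset part of a locale name such as "en_US.UTF-8" or
 * "de_DE.ISO-8859-15@euro" into buf. Returns buf, or NULL if the name has
 * no ".codeset" part or the codeset does not fit into buf.
 * Hypothetical helper, not an APR function. */
static const char *codeset_from_locale_name(const char *name,
                                            char *buf, size_t buflen)
{
    const char *start, *end;
    size_t len;

    if (name == NULL || (start = strchr(name, '.')) == NULL) {
        return NULL;
    }
    start++;                      /* the codeset follows the period */
    end = strchr(start, '@');     /* an optional "@modifier" ends it */
    len = end ? (size_t)(end - start) : strlen(start);
    if (len == 0 || len >= buflen) {
        return NULL;
    }
    memcpy(buf, start, len);
    buf[len] = '\0';
    return buf;
}

/* Consult LC_ALL, LC_CTYPE and LANG in POSIX precedence order. The first
 * variable that is set and non-empty decides; NULL means it named no
 * codeset, or that none of the three variables is set. */
static const char *codeset_from_environment(char *buf, size_t buflen)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && value[0] != '\0') {
            return codeset_from_locale_name(value, buf, buflen);
        }
    }
    return NULL;
}
```

With LANG=en_US.UTF-8 and neither LC_ variable set,
codeset_from_environment() yields "UTF-8". The result is only the name as
the user spelled it, so it would still need checking against iconv.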

Comments?

RE: Problem with iconv charsets...

Posted by Marshall Powers <mp...@appsecinc.com>.

The string literal "ISO-8859-1" appears in APR and log4cxx source code. For
example, from apr-1.2.7/misc/unix/charset.c:

APR_DECLARE(const char*) apr_os_default_encoding (apr_pool_t *pool)
{
#ifdef __MVS__
#    ifdef __CODESET__
        return __CODESET__;
#    else
        return "IBM-1047";
#    endif
#endif

    if ('}' == 0xD0) {
        return "IBM-1047";
    }

    if ('{' == 0xFB) {
        return "EDF04";
    }

    if ('A' == 0xC1) {
        return "EBCDIC"; /* not useful */
    }

    if ('A' == 0x41) {
        return "ISO-8859-1"; /* not necessarily true */
    }

    return "unknown";
}




Also, in log4cxx/src/charsetencoder.cpp:

CharsetEncoderPtr CharsetEncoder::getEncoder(const std::string& charset) {
    if (StringHelper::equalsIgnoreCase(charset, "US-ASCII", "us-ascii") ||
        StringHelper::equalsIgnoreCase(charset, "ISO646-US", "iso646-US") ||
        StringHelper::equalsIgnoreCase(charset, "ANSI_X3.4-1968",
                                       "ansi_x3.4-1968")) {
        return new USASCIICharsetEncoder();
    } else if (StringHelper::equalsIgnoreCase(charset, "ISO-8859-1",
                                              "iso-8859-1") ||
        StringHelper::equalsIgnoreCase(charset, "ISO-LATIN-1",
                                       "iso-latin-1")) {
        return new ISOLatinCharsetEncoder();
    } else if (StringHelper::equalsIgnoreCase(charset, "UTF-8", "utf-8")) {
        return new UTF8CharsetEncoder();
    } else if (StringHelper::equalsIgnoreCase(charset, "UTF-16BE",
                                              "utf-16be")
        || StringHelper::equalsIgnoreCase(charset, "UTF-16", "utf-16")) {
        return new UTF16BECharsetEncoder();
    } else if (StringHelper::equalsIgnoreCase(charset, "UTF-16LE",
                                              "utf-16le")) {
        return new UTF16LECharsetEncoder();
    }
#if defined(_WIN32)
    throw IllegalArgumentException(charset);
#else
    return new APRCharsetEncoder(charset.c_str());
#endif
}
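
As far as this excerpt shows, the built-in branches only match the dashed
spellings, so a name like "ISO8859-1" would fall through to the
APRCharsetEncoder branch. Purely as a sketch of one possible workaround
(this is not log4cxx code, and charset_names_match is a made-up helper), a
looser comparison could ignore case and punctuation when matching names:

```c
#include <ctype.h>

/* Compare two charset names, ignoring case and every character that is
 * not a letter or digit, so that "ISO8859-1", "iso-8859-1" and
 * "ISO_8859-1" are all treated as the same name.
 * Returns 1 on a match and 0 otherwise. Hypothetical helper. */
static int charset_names_match(const char *a, const char *b)
{
    for (;;) {
        /* Skip punctuation such as '-', '_' or '.' on both sides. */
        while (*a != '\0' && !isalnum((unsigned char)*a)) {
            a++;
        }
        while (*b != '\0' && !isalnum((unsigned char)*b)) {
            b++;
        }
        if (*a == '\0' || *b == '\0') {
            return *a == *b;      /* match only if both names ended */
        }
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
}
```

The trade-off is that such a comparison is deliberately sloppy: it would
also accept odd spellings like "ISO-88591", which may or may not be
acceptable.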


Are these files generated by configure scripts/ant build files? It doesn't
seem like they are...




-----Original Message-----
From: dev-return-18563-mpowers=appsecinc.com@apr.apache.org
[mailto:dev-return-18563-mpowers=appsecinc.com@apr.apache.org] On Behalf Of
William A. Rowe, Jr.
Sent: 2007-Jun-25 Mon 12:33 PM
To: Marshall Powers
Cc: dev@apr.apache.org; 'Log4CXX User'
Subject: Re: Problem with iconv charsets...

Marshall Powers wrote:
> I'm trying to use APR-1.2.7 in Log4Cxx 0.10 on AIX 5.3. When I run my
> program, I get an exception "APR_LOCALE_CHARSET" in the createDefaultEncoder
> method. I think this problem is related to the iconv that is installed on
> this machine. When I run iconv -l, among the various charsets I see
> "ISO8859-1". However, in the source for Log4Cxx and APR, the only string
> literals I see are for "ISO-8859-1" (note the extra dash). Is there any
> simple way to work around this problem? Is this potentially a portability
> issue with APR/log4cxx (that is, if I distribute some app that uses APR, and
> my user doesn't have "ISO-8859-1" in their iconv, is my app going to crash?)

Unfortunately, aliases are within the domain of iconv.

The question is, where did it pull out "ISO-8859-1" from as the default
locale on your box?  If *that* is from apr-util, we need to unwind where
it was resolved.  If that was an envvar set on your login, well, that would
be called shooting oneself in one's foot.

Re: Problem with iconv charsets...

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Marshall Powers wrote:
> I'm trying to use APR-1.2.7 in Log4Cxx 0.10 on AIX 5.3. When I run my
> program, I get an exception "APR_LOCALE_CHARSET" in the createDefaultEncoder
> method. I think this problem is related to the iconv that is installed on
> this machine. When I run iconv -l, among the various charsets I see
> "ISO8859-1". However, in the source for Log4Cxx and APR, the only string
> literals I see are for "ISO-8859-1" (note the extra dash). Is there any
> simple way to work around this problem? Is this potentially a portability
> issue with APR/log4cxx (that is, if I distribute some app that uses APR, and
> my user doesn't have "ISO-8859-1" in their iconv, is my app going to crash?)

Unfortunately, aliases are within the domain of iconv.

The question is, where did it pull out "ISO-8859-1" from as the default
locale on your box?  If *that* is from apr-util, we need to unwind where
it was resolved.  If that was an envvar set on your login, well, that would
be called shooting oneself in one's foot.
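
Since the aliases live in iconv, one way an application could cope is to
probe the local iconv with several spellings and keep the first one it
accepts. The following is only a sketch under that assumption:
first_iconv_name is a hypothetical helper, not something APR or log4cxx
provides, while iconv_open() and iconv_close() are the standard POSIX
calls.

```c
#include <iconv.h>
#include <stddef.h>

/* Return the first entry of candidates[0..count-1] that the local iconv
 * accepts as a source encoding for conversion to target, or NULL if it
 * accepts none of them. Hypothetical helper. */
static const char *first_iconv_name(const char *const *candidates,
                                    size_t count, const char *target)
{
    size_t i;

    for (i = 0; i < count; i++) {
        /* iconv_open() takes the destination encoding first. */
        iconv_t cd = iconv_open(target, candidates[i]);
        if (cd != (iconv_t)-1) {
            iconv_close(cd);      /* only probing, so release it again */
            return candidates[i];
        }
    }
    return NULL;
}
```

Called with the two spellings from this thread, "ISO-8859-1" and
"ISO8859-1", and a target such as "UTF-8", it would return whichever one
the installed iconv knows. On systems where iconv is a separate library
the program may need to be linked with -liconv.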