Posted to users@httpd.apache.org by mickg <mi...@mickg.net> on 2006/11/08 06:48:39 UTC

Re: [users@httpd] Question about mod_charset_light and mod_proxy_html (Solved!)

Just to put my money where my mouth is, I have implemented a (stupid) prototype:
if the detected charset is not one that libxml2 handles natively, a recompiled
version of mod_proxy_html now uses iconv (reached via the
xmlFindCharEncodingHandler function) to convert from the source encoding to UTF-8.

If no encoding info is specified, it assumes windows-1251 (yes, stupid, but still).
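
The fallback behaviour just described can be sketched outside Apache. This is
a minimal Python illustration, not the module's code: the to_utf8 name and its
parameters are hypothetical, and Python's codec registry stands in for the
libxml2/iconv handler lookup.

```python
import codecs
from typing import Optional


def to_utf8(data: bytes, declared: Optional[str] = None,
            default: str = "windows-1251") -> bytes:
    """Re-encode data as UTF-8, assuming `default` when no charset is declared.

    On an unknown charset or undecodable input the bytes are passed through
    unchanged, mirroring the prototype's "log and give up" behaviour.
    """
    name = declared or default
    try:
        codecs.lookup(name)          # analogous to looking up an encoding handler
    except LookupError:
        return data                  # no handler: hand the input back untouched
    try:
        return data.decode(name).encode("utf-8")
    except UnicodeDecodeError:
        return data                  # conversion failed: fall back to the input
```

Taking the default as a parameter keeps the windows-1251 assumption in one
place, which is what a configurable default would need anyway.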

The main work is done by adding a field
const char * enc_from
to ctxt; it names the source encoding in iconv-compatible terms.

sniff_encoding is modified to return 0 when it encounters a non-native encoding,
and to set ctxt->enc_from (ctxt is added as a parameter to it).

The function:
size_t ConvertCtxtBuffer(const char * buf, char ** newbuf, size_t bytes, saxctxt *ctxt, ap_filter_t *f) {
         xmlCharEncodingHandlerPtr handler;
         int len;

         /* Default result: hand the input back unchanged. */
         *newbuf=(char *) buf;
         if (!ctxt->enc_from)
                 return bytes;

         handler=xmlFindCharEncodingHandler(ctxt->enc_from);
         if (!handler) {
                 ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,"ConvertInput: no encoding handler found for '%s'", ctxt->enc_from);
                 return bytes;
         }
         /* Only the existence check is needed here; release the handler again. */
         xmlCharEncCloseFunc(handler);

         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,"ConvertInput: bytes: %" APR_SIZE_T_FMT, bytes);
         /* Signed on purpose: a size_t can never compare below zero. */
         len=(int) ConvertInput(buf,newbuf,(int) bytes,f->r,ctxt->enc_from);
         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,"ConvertInput: len: %d", len);
         if (len<0) {
                 ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,"ConvertInput: conversion failed from '%s'", ctxt->enc_from);
                 *newbuf=(char *) buf;
                 return bytes;
         }
         /* Log the encoding name, not the converted buffer. */
         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,"ConvertInput: converted from '%s'", ctxt->enc_from);
         return (size_t) len;
}

calls the actual conversion.

The function
int
ConvertInput(const char *in, char ** newbuf, int size, request_rec * r, const char *encoding)
{
   xmlChar *out;
   int ret;
   int out_size;
   int temp;
   xmlCharEncodingHandlerPtr handler;

   *newbuf = NULL;
   if (in == 0)
     return -1;

   handler = xmlFindCharEncodingHandler(encoding);
   /* Check the handler before touching any of its fields. */
   if (!handler) {
     ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,"ConvertInput: no encoding handler found for '%s'",
            encoding ? encoding : "");
     return -1;
   }

   /* Assumed worst case: three UTF-8 bytes per input byte, plus a terminator. */
   out_size = size * 3;
   out = (xmlChar *) xmlMalloc((size_t) out_size + 1);
   if (out == 0) {
     ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,"ConvertInput: no memory") ;
     xmlCharEncCloseFunc(handler);
     return -1;
   }

   temp = size ;
   if (handler->input) {
     /* Native handler: on return out_size holds the number of bytes written. */
     ret = handler->input(out, &out_size, (const unsigned char *) in, &temp);
   }
   else {
     /* iconv advances its pointers and counts down size_t values, so use copies. */
     char *src = (char *) in;
     char *dst = (char *) out;
     size_t src_left = (size_t) size;
     size_t dst_left = (size_t) out_size;
     ret = (iconv(handler->iconv_in,&src,&src_left,&dst,&dst_left) == (size_t) -1) ? -1 : 0;
     out_size = (int) (dst - (char *) out);
   }
   /* iconv-backed handlers are allocated per lookup; release this one. */
   xmlCharEncCloseFunc(handler);

   if (ret < 0) {
     ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,"ConvertInput: conversion from '%s' wasn't successful", encoding) ;
     xmlFree(out);
     return -1;
   }

   out[out_size] = 0;  /* null terminating out */
   ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,"ConvertInput: wrote %d bytes", out_size) ;
   /* The buffer is still never freed: the caller would have to xmlFree() it. */
   *newbuf = (char *) out;
   return out_size;
}

does the actual conversion. It currently outputs a bit too much log info, and I
suspect a memory leak from xmlMalloc. I honestly do not know enough about Apache
to figure out when to free the buffer (especially at 1AM).

Oh, also, the proxy_html_filter function is modified at 4 points. At each of them
bytes=ConvertCtxtBuffer(buf,&buf,bytes,ctxt,f);
is called, so that the conversion actually takes place, and when
sniff_... returns 0, the return value is replaced with XML_CHAR_ENCODING_UTF8.
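
One thing to watch when converting buffer by buffer: a filter may be handed a
character split across two buffers. That is harmless for a single-byte charset
such as windows-1251, but not for multibyte source encodings. A hedged Python
sketch of a converter that carries the incomplete tail over to the next call
(the ChunkConverter class is hypothetical, not part of the module):

```python
import codecs


class ChunkConverter:
    """Convert a stream of byte chunks to UTF-8, keeping split characters intact."""

    def __init__(self, source_encoding: str):
        # The incremental decoder buffers any incomplete trailing sequence.
        self._decoder = codecs.getincrementaldecoder(source_encoding)()

    def convert(self, chunk: bytes, final: bool = False) -> bytes:
        # final=True makes a dangling partial character an error at end of input.
        return self._decoder.decode(chunk, final).encode("utf-8")


if __name__ == "__main__":
    raw = "Привет".encode("utf-16-le")
    conv = ChunkConverter("utf-16-le")
    # Split in the middle of the second character on purpose.
    out = conv.convert(raw[:3]) + conv.convert(raw[3:], final=True)
    print(out.decode("utf-8"))
```

With raw iconv the same situation shows up as an incomplete-sequence error on
the first buffer, so the leftover bytes would have to be kept by hand.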



******************************************************************************
*              !!!THIS CODE IS *NOT* PRODUCTION QUALITY!!!                   *
*IT HAS AT LEAST ONE MEMORY LEAK, AND LOGS WAY TOO MUCH TO THE ERROR LOG.    *
*Also, I am not sure of the security implications of passing the decoding off*
*to iconv (Are there any buffer overflows in it? Could it be exploited by a  *
*specially crafted file in a particular encoding?)                           *
******************************************************************************

Also, I am not sure what this code will do to get&put method data.

It does work on my _own_ website, where it quite happily converts win-1251 to
utf-8. Once I fix the memory leak (any help appreciated), I'll be happy.


And a great many thanks to Nick Kew for getting me off my lazy ... to start
coding  (which, honestly, I am better at than administering systems).

Hopefully this helps someone.


BTW, I still have no clue why I cannot do this with mod_charset_lite.



mickg wrote:
> Nick Kew wrote:
>> On Tue, 07 Nov 2006 17:49:25 -0500
>> mickg <mi...@mickg.net> wrote:
>>
>>
>>> 2 questions:
>>>> I think I'd have to play with that hands-on to figure it out
>>>> with your attempted configuration.  
>>> Was that an offer :) If yes, please say so, and shell account will be
>>> provided. (As the system is a VM, I will just clone it, and give
>>> access to that, so, if you mess it up, no problem).
>>
>> Well it could be, if you have the budget for my time.
>> That's your most expensive option.
>>
> Understood :)
>>>> It might be worth trying
>>>> mod_line_edit instead of mod_proxy_html.  You sacrifice the
>>>> markup support, but in your case the markup isn't properly
>>>> supported anyway, and you probably benefit from the fact that
>>>> it is also unaware of charsets.
>>>>
>>> Hmm. Did not know about that module. Any idea where I can get
>>> the .so ?
>>
>> Same place you get the mod_proxy_html.so.  Except I guess you
>> got that from a third-party package.  I supply binaries and
>> basic support to registered users.
>>
>>> Or an ubuntu package?
>>>
>>> Or how to compile the source, given a development environment?
>>
>> Read the apache docs on apxs.  You'll probably need an apache-dev
>> package on ubuntu.  It's simpler than mod_proxy_html, because it
>> doesn't rely on additional libraries.
>>
> Understood, will do. Thank you!
>> I should add that today's correspondence has prompted me to blog
>> about mod_proxy_html 3.0, which will enable you to fix that
>> charset problem by aliasing an unsupported charset to a similar
>> supported one (windows cyrillic is probably similar enough to
>> ISO cyrillic - aka ISO-8859-5 - for that to work).  I'm inviting
>> blog comments from anyone with great ideas for the next major
>> release of mod_proxy_html.
>>
> Actually, I think the characters are different in the upper register.
> 
> What about letting mod_proxy do its own transcoding, via iconv or
> some such?
> Maybe even a filter-architecture of its own?
> As in, given a match, apply this filter to it?
> Although, that may be overkill for a simple matcher.
> 
> 
> 
> mickg
> 
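
The point about the two Cyrillic charsets is easy to check. A small Python
sketch (the helper name is made up for illustration) that decodes bytes from
one charset as if they were the aliased one and compares the result:

```python
def alias_roundtrip_ok(text: str, real: str = "windows-1251",
                       alias: str = "iso-8859-5") -> bool:
    """True if bytes written as `real` read back unchanged when treated as `alias`."""
    return text.encode(real).decode(alias, errors="replace") == text


if __name__ == "__main__":
    for sample in ("Hello", "Привет"):
        print(sample, alias_roundtrip_ok(sample))
```

ASCII text survives the aliasing because both charsets share the lower half;
Cyrillic letters sit at different byte values, so they do not.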
> 


---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org


Re: [users@httpd] Question about mod_charset_light and mod_proxy_html (Solved!)

Posted by mickg <mi...@mickg.net>.
Nick Kew wrote:
> On Wed, 08 Nov 2006 12:56:28 -0500
> mickg <mi...@mickg.net> wrote:
> 
> 
>> Do you want the full working code once I clean up the memory problem?
>> It is, after all, GPL, so it would be in good spirit for me to release
>> the modified source. :)
> 
> Yes please.
> 
> I haven't thought through whether to incorporate this or something
> similar.  If I do, I'll want to base it on apr_iconv, rather than
> native iconv.  But having your code there to look at can't hurt,
> regardless of what I end up doing.
> 
Attached.
Code compiles on Ubuntu, assuming apache-dev, libxml2-dev, and a
	ln -s /usr/include/libxml2/libxml /usr/include/libxml

apxs2 -i -c mod_proxy_html.c
No warnings on the new functions are emitted.

I am now using it on a webserver, and will say tomorrow whether there
are any major memory leaks (A decent amount of traffic is going through
it).

Essential piece still missing:
Rewriting of GET & POST request data.

The reason for using native iconv, and not Apache's iconv (apr_iconv):
libxml already opens the iconv handle during initialization.
Might as well use it.

Standard disclaimers apply.
Code is GPL, my modifications are, for WebThing's use, BSDed.

TODO list:
Add rewriting of POST/GET requests.
Add directive to set default encoding if none is available
	(once I figure out how to add directives).
Add directive to set output encoding (and convert to it)
	(once I figure out how to modify data post-processing)
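
For the output-encoding item, a rough Python sketch of the idea, assuming the
processed document is UTF-8 at that point. The from_utf8 name is hypothetical.
Characters the target charset cannot represent are written as numeric
character references, which HTML allows:

```python
def from_utf8(data: bytes, output_encoding: str) -> bytes:
    """Convert UTF-8 bytes to output_encoding, escaping unrepresentable characters."""
    text = data.decode("utf-8")
    # xmlcharrefreplace turns e.g. U+65E5 into "&#26085;" instead of failing.
    return text.encode(output_encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    page = "Привет 日".encode("utf-8")
    print(from_utf8(page, "windows-1251"))
```

This only makes sense for markup text; raw binary responses would have to be
left alone.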

Maybe make a mod_charset_libxml charset converter,
as mod_charset_lite is not working, and I am not sure I want to fix that.


(
  For the record, *why oh why* are we doing text munging in C/C++ ?
  As someone who coded in C a long, long time ago ,
  I find I am much more productive in various HLLs, such as Python.
  This, of course, excepts kernel code.
  I have half a mind to make a Python, Perl, or Lisp-based filter.
)

mickg

Re: [users@httpd] Question about mod_charset_light and mod_proxy_html (Solved!)

Posted by Nick Kew <ni...@webthing.com>.
On Wed, 08 Nov 2006 12:56:28 -0500
mickg <mi...@mickg.net> wrote:


> Do you want the full working code once I clean up the memory problem?
> It is, after all, GPL, so it would be in good spirit for me to release
> the modified source. :)

Yes please.

I haven't thought through whether to incorporate this or something
similar.  If I do, I'll want to base it on apr_iconv, rather than
native iconv.  But having your code there to look at can't hurt,
regardless of what I end up doing.

-- 
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/



Re: [users@httpd] Question about mod_charset_light and mod_proxy_html (Solved!)

Posted by mickg <mi...@mickg.net>.
Nick Kew wrote:
> On Wed, 08 Nov 2006 00:48:39 -0500
> mickg <mi...@mickg.net> wrote:
> 
>> Just to put my money where my mouth is, I have implemented a (stupid)
>> prototype that does: If no known charset is native to libxml2
>> detected , a recompiled version of mod_proxy_html now uses iconv
>> (eventually via the xmlFindCharEncodingHandler function) to convert
>> from the source encoding to UTF-8.
> 
> Interesting.  You've gone one up on my aliasing proposal, for
> what looks like rather less work than I thought that would take.
> I might snarf the basic idea for Version 3.
Do you want the full working code once I clean up the memory problem?
It is, after all, GPL, so it would be in good spirit for me to release
the modified source. :)

Although, to be truly honest, what the thing is doing IS somewhat backwards.

The dataflow should be as follows (I am more familiar with Python code, as the
next snippet will show).

data comes in
if ctxt.encoder is None:
	obtain charset
	if iconv is needed to convert charset:
		ctxt.encoder = charset
		return enc = UTF-8
	else:
		return enc
prior to processing buf:
	# convert only if an encoder is set (non-null)
	if ctxt.encoder is not None:
		convert(buf)
	
This guarantees that the data is either in an encoding known to libxml, or was
UTF-8 to begin with, or was converted to UTF-8, or the conversion failed
miserably (and the miserable failure was logged).
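
The two-stage flow above can be written out as runnable Python. This is only a
sketch under stated assumptions: NATIVE is a made-up stand-in for the set of
encodings libxml2 handles itself, and Ctxt, sniff_encoding and prepare are
illustrative names rather than the module's real structures.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a stand-in for the encodings the parser handles natively.
NATIVE = {"utf-8", "utf-16", "iso-8859-1", "ascii"}


@dataclass
class Ctxt:
    enc_from: Optional[str] = None   # set only when external conversion is needed


def sniff_encoding(ctxt: Ctxt, declared: Optional[str],
                   default: str = "windows-1251") -> str:
    """Return the encoding the parser should be told; record the real one if foreign."""
    name = (declared or default).lower()
    if name in NATIVE:
        return name
    ctxt.enc_from = name             # remember what to convert from
    return "utf-8"                   # the parser will only ever see UTF-8


def prepare(ctxt: Ctxt, buf: bytes) -> bytes:
    """Convert buf to UTF-8 if sniffing decided it was needed, else pass it through."""
    if ctxt.enc_from is None:
        return buf
    try:
        return buf.decode(ctxt.enc_from).encode("utf-8")
    except (LookupError, UnicodeDecodeError):
        return buf                   # failure: hand back the input (and log it)
```

Keeping the decision in sniff_encoding and the work in prepare means every
buffer takes the same path, whichever of the four call sites it comes from.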


> 
>> If no encoding info is specified, it assumes windows-1251 (yes,
>> stupid, but still).
> 
> But not stupid if we make it a configurable default!
> 

Yeah, preferably via a directive such as HTMLSourceDefaultEnc windows-1251
or some such.

> 
>> It does work on my _own_ website, where it quite happily converts
>> win-1251 to utf-8. Once I fix the memory leak (any help appreciated),
>> I'll be happy.
> 
> See http://www.apachetutor.org/dev/pools for an easy way to
> deal with the memory.
> 
>> And a great many thanks to Nick Kew for getting me off my lazy ... to
>> start coding  (which, honestly, I am better at than administering
>> systems).
> 
> :-)
> 
>> BTW, I still have no clue why I cannot do this with mod_charset_lite.
> 
> Neither do I.  But a closer look at mod_charset_lite has been on
> my TODO list for so long it's probably on a permanent back-burner.
> Did you also look at the full mod_charset?   AIUI it was written by
> Russian developers, so cyrillic was presumably important to them.
> 

The thing about mod_charset is that it assumes no iconv and does all
translation internally, with translation settings and weird maps where
needed. This seems a bit insane to me, unless it is really needed.
I believe the reason was that we had:
	win1251 read as koi8, transcoded into LATIN1
Now, we need to make sense of *that*.
Also, it does not cleanly support utf8 translation (it does not support
translation back from utf8). iconv does.



Honestly, remaking mod_proxy_html into mod_proxy_charset_convert would
be trivial now, IMO.
And maybe that's the better idea. Although that does duplicate
mod_charset_lite, at least I know it'll work.
And it would use libxml2 where possible, not iconv.




mickg




Re: [users@httpd] Question about mod_charset_light and mod_proxy_html (Solved!)

Posted by Nick Kew <ni...@webthing.com>.
On Wed, 08 Nov 2006 00:48:39 -0500
mickg <mi...@mickg.net> wrote:

> Just to put my money where my mouth is, I have implemented a (stupid)
> prototype that does: If no known charset is native to libxml2
> detected , a recompiled version of mod_proxy_html now uses iconv
> (eventually via the xmlFindCharEncodingHandler function) to convert
> from the source encoding to UTF-8.

Interesting.  You've gone one up on my aliasing proposal, for
what looks like rather less work than I thought that would take.
I might snarf the basic idea for Version 3.

> If no encoding info is specified, it assumes windows-1251 (yes,
> stupid, but still).

But not stupid if we make it a configurable default!



> It does work on my _own_ website, where it quite happily converts
> win-1251 to utf-8. Once I fix the memory leak (any help appreciated),
> I'll be happy.

See http://www.apachetutor.org/dev/pools for an easy way to
deal with the memory.

> And a great many thanks to Nick Kew for getting me off my lazy ... to
> start coding  (which, honestly, I am better at than administering
> systems).

:-)

> BTW, I still have no clue why I cannot do this with mod_charset_lite.

Neither do I.  But a closer look at mod_charset_lite has been on
my TODO list for so long it's probably on a permanent back-burner.
Did you also look at the full mod_charset?   AIUI it was written by
Russian developers, so cyrillic was presumably important to them.

-- 
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/
