Posted to dev@httpd.apache.org by John Dougrez-Lewis <jl...@lightblue.com> on 2015/10/20 21:23:02 UTC

Apache Module Development Query on character encodings.

Hi,

I'm considering writing an Apache Module that will involve using libxml2 and
CLucene.

I need to be able to service and respond to requests as follows:

For processing with libxml2:

Request Input character encoding => UTF-8
processing of UTF-8
UTF-8 => Response output character encoding

For processing with CLucene:

Request Input character encoding => UCS-2 (fixed width 16-bit Unicode)
processing of UCS-2
UCS-2 => Response output character encoding

I'm new to Apache Modules, but have read Nick Kew's book (which is very
good).

The input and output buffers appear to be 8-bit char* based but I can't see
any references to specific encodings.

How do I go about massaging the input & output into UTF-8 and fixed width
16-bit Unicode?

Are there any good references on how to achieve this?

Regards,

John

--

John Dougrez-Lewis

Re: Apache Module Development Query on character encodings.

Posted by Nick Kew <ni...@webthing.com>.

On Wed, 21 Oct 2015 07:04:27 +0100
"John Dougrez-Lewis" <jl...@lightblue.com> wrote:

> Hi Nick,
> 
> > Hi, are you by any chance the Raving Loony I once knew at Cambridge?
> 
> Yes indeed - that must be 35 years ago now - these days I'm a bit more
> sensible (although the legacy of the OMRLP lives on).

OMRLP?  It was ∇ ∇ back then (CURLS, if email screws that up).


> > Basically there are three parts to working with character encodings:
> >  * Detecting them in incoming data.
> >  * Converting them to order.
> >  * Correctly labelling outgoing data.
> 
> > mod_xml2enc will do all that for libxml2-based filters, and could easily
> > be tweaked to drop the libxml2-specific optimisations for general-purpose
> > use.  Alternatively the charset-detection from mod_xml2enc could probably
> > be folded into mod_charset_lite.
> 
> So basically mod_xml2enc will detect the incoming encoding (whatever it may
> be)?

Rather than debating it here, I suggest you take a look at it.
Start with the docs, and then move on to the code if necessary.

-- 
Nick Kew

RE: Apache Module Development Query on character encodings.

Posted by John Dougrez-Lewis <jl...@lightblue.com>.

Hi Nick,

> Hi, are you by any chance the Raving Loony I once knew at Cambridge?

Yes indeed - that must be 35 years ago now - these days I'm a bit more
sensible (although the legacy of the OMRLP lives on).


> Basically there are three parts to working with character encodings:
>  * Detecting them in incoming data.
>  * Converting them to order.
>  * Correctly labelling outgoing data.

> mod_xml2enc will do all that for libxml2-based filters, and could easily
> be tweaked to drop the libxml2-specific optimisations for general-purpose
> use.  Alternatively the charset-detection from mod_xml2enc could probably
> be folded into mod_charset_lite.

So basically mod_xml2enc will detect the incoming encoding (whatever it may
be)?


Are there not HTTP headers which give a good indication of the input format
(albeit that you have to detect the format and read the stream to confirm
it)?


I'm new to Apache coding/configuration - how would xml2enc/mod_charset_lite
input & output modules/filters be set up in configuration and/or chained in
code?


Do you have any views on libxml2 suitability for use within Apache module
code?

It appears to have good all-round performance compared to other XML
libraries. I note that it has a C++ wrapper which is LGPL'ed, so there are
likely to be licensing/distribution issues if I ever decided to try to
release code under an Apache License.



Regards,

John

Re: Apache Module Development Query on character encodings.

Posted by Nick Kew <ni...@apache.org>.

On Tue, 20 Oct 2015 20:23:02 +0100
"John Dougrez-Lewis" <jl...@lightblue.com> wrote:

> Hi,

Hi, are you by any chance the Raving Loony I once knew at Cambridge?

> I need to be able to service and respond to requests as follows:

Basically there are three parts to working with character encodings:
 * Detecting them in incoming data.
 * Converting them to order.
 * Correctly labelling outgoing data.

mod_xml2enc will do all that for libxml2-based filters,
and could easily be tweaked to drop the libxml2-specific
optimisations for general-purpose use.  Alternatively
the charset-detection from mod_xml2enc could probably
be folded into mod_charset_lite.
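
As a rough illustration of those three parts (detect, convert, label), here
is a minimal sketch in Python, chosen only for brevity; a real module would
do the equivalent in C. It is not how mod_xml2enc is implemented. The
function names, the ISO-8859-1 fallback and the text/xml media type are all
assumptions made for the example. It produces UTF-8 for a libxml2-style
consumer and fixed-width 16-bit units for a CLucene-style consumer.

```python
import re

def charset_from_content_type(content_type, default="iso-8859-1"):
    """Step 1 (detect): read the charset parameter from a Content-Type value.

    The default is an assumption for this sketch, not a recommendation.
    """
    match = re.search(r'charset\s*=\s*"?([^";\s]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else default

def to_utf8(body, content_type):
    """Step 2 (convert): decode the request body and re-encode it as UTF-8."""
    text = body.decode(charset_from_content_type(content_type))
    return text.encode("utf-8")

def to_ucs2(body, content_type):
    """Step 2 (convert): produce fixed-width 16-bit units.

    UCS-2 can only hold characters in the Basic Multilingual Plane, so
    anything above U+FFFF is rejected rather than emitted as a surrogate pair.
    """
    text = body.decode(charset_from_content_type(content_type))
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the BMP cannot be stored in UCS-2")
    return text.encode("utf-16-le")  # same bytes as little-endian UCS-2 for BMP text

def label_response(utf8_body, out_charset, media_type="text/xml"):
    """Step 3 (label): transcode for output and build a matching Content-Type."""
    out_body = utf8_body.decode("utf-8").encode(out_charset)
    return out_body, f"{media_type}; charset={out_charset}"
```

For example, to_utf8(b"caf\xe9", "text/xml; charset=ISO-8859-1") yields the
UTF-8 bytes for "café", and label_response() turns processed UTF-8 back into
whatever charset the response is declared to use.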

> The input and output buffers appear to be 8-bit char* based but I can't see
> any references to specific encodings.
>
> How do I go about massaging the input & output into UTF-8 and fixed width
> 16-bit Unicode?
>
> Are there any good references on how to achieve this?

It's a bit of a mess, because there are several different
standards (HTTP, XML and HTML), and in real life those are
sometimes in conflict.  The detection in mod_xml2enc has
been fine-tuned over the years and test-driven on a wide
range of scripts, including non-Latin charsets such
as Russian/Cyrillic and Arabic.
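
To make the "several different standards" point concrete, the following
Python sketch checks the HTTP, XML and HTML sources of charset information
in one fixed order and reports which one supplied the answer. The precedence
shown, the 1024-byte sniffing window, the regular expressions and the
ISO-8859-1 fallback are assumptions for illustration only; this is not the
tuned detection logic of mod_xml2enc.

```python
import codecs
import re

# Byte order marks checked by this sketch (UTF-32 is deliberately left out).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_charset(content_type, body, default="iso-8859-1"):
    """Return (charset, source), where source names the rule that matched."""
    # 1. HTTP: charset parameter of the Content-Type header.
    m = re.search(r'charset\s*=\s*"?([^";\s]+)', content_type or "", re.I)
    if m:
        return m.group(1).lower(), "http"
    # 2. XML: a byte order mark, then the encoding in the XML declaration.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name, "bom"
    head = body[:1024]
    m = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', head)
    if m:
        return m.group(1).decode("ascii").lower(), "xml"
    # 3. HTML: <meta charset=...> or <meta http-equiv=... content="...charset=...">.
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', head, re.I)
    if m:
        return m.group(1).decode("ascii").lower(), "html"
    # 4. Nothing declared: fall back to a configured default.
    return default, "default"
```

The conflicts show up when a page is served as "text/html; charset=KOI8-R"
while its own <meta> element claims something else: with this ordering the
header wins, and a different ordering would give a different answer.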

-- 
Nick Kew

Re: Apache Module Development Query on character encodings.

Posted by Paul Spangler <pa...@ni.com>.

On 10/20/2015 2:23 PM, John Dougrez-Lewis wrote:
> How do I go about massaging the input & output into UTF-8 and fixed
> width 16-bit Unicode?
>
> Are there any good references on how to achieve this?

I can't speak for whether or not it's a good reference, but 
mod_authnz_ldap can do the "Request Input character encoding => UTF-8" 
part of what you described to convert usernames and passwords to UTF-8 
before passing them to LDAP.

It uses the apr_xlate API along with a file specified by the 
AuthLDAPCharsetConfig directive to map the Accept-Language header to a 
character encoding. I believe the xlate API is using iconv under the 
hood, which can be its own source of problems if the server environment 
isn't set up properly.
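
The idea of mapping Accept-Language to a charset and then converting to
UTF-8 can be sketched as follows. This is Python for brevity, not the
mod_authnz_ldap code, and the table is a hypothetical stand-in for the file
named by AuthLDAPCharsetConfig; its entries, the function names and the
ISO-8859-1 fallback are assumptions. Language tags are tried in header order
and q weights are ignored.

```python
# Hypothetical language-to-charset table; the entries are illustrative only.
LANGUAGE_CHARSETS = {
    "ru": "iso-8859-5",
    "el": "iso-8859-7",
    "ja": "shift_jis",
}

def charset_for(accept_language, default="iso-8859-1"):
    """Pick a charset from the first language tag that has a table entry."""
    for item in accept_language.split(","):
        tag = item.split(";")[0].strip().lower()  # drop any ;q= weight
        primary = tag.split("-")[0]               # "ru-ru" -> "ru"
        for key in (tag, primary):
            if key in LANGUAGE_CHARSETS:
                return LANGUAGE_CHARSETS[key]
    return default

def credentials_to_utf8(raw, accept_language):
    """Decode bytes in the guessed charset and re-encode them as UTF-8."""
    return raw.decode(charset_for(accept_language)).encode("utf-8")
```

Note that this is a guess based on the user's language, not a declaration of
the encoding actually used, so a wrong table entry silently produces wrong
text rather than an error.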

I'd also be interested to hear if there are other ways for modules to
handle character encodings, which is never an easy topic. I suppose
ideally the protocol would dictate a required encoding that the client
must use (via Content-Type), which simplifies things quite a bit.
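
A minimal sketch of that "protocol dictates the encoding" approach, in
Python for brevity: the handler accepts only requests that declare the
required charset and whose body really is valid in it. The choice of UTF-8,
the function name and the specific status codes returned are assumptions
for the example.

```python
import re

REQUIRED_CHARSET = "utf-8"  # assumption: the protocol mandates UTF-8

def validate_request(content_type, body):
    """Return an HTTP status code for a request body and its Content-Type."""
    m = re.search(r'charset\s*=\s*"?([^";\s]+)', content_type or "", re.I)
    if m is None or m.group(1).lower() != REQUIRED_CHARSET:
        return 415  # Unsupported Media Type: charset missing or not the required one
    try:
        body.decode(REQUIRED_CHARSET)
    except UnicodeDecodeError:
        return 400  # Bad Request: the label is right but the bytes are not
    return 200
```

With a rule like this there is no detection step at all; the cost is that
clients which cannot or do not label their requests are turned away.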

Regards,

Paul Spangler
LabVIEW R&D
National Instruments