Posted to dev@subversion.apache.org by Mark Phippard <Ma...@softlanding.com> on 2005/02/10 19:38:15 UTC

Porting Subversion to EBCDIC

In writing a reply to Julian about our use of escape characters in our 
EBCDIC port, it seemed like a good time to offer up a more detailed 
explanation of what we are doing and where we are.

You can download our current patch against the 1.1.x branch here:

http://support.softlanding.com/ebcdic.diff

The patch is 362K

We are working against the branch because we needed something stable to 
work off and we also plan on releasing this pretty soon to OS/400 users. 
Currently, we use svk to keep a mirror of the Subversion repository that 
we can commit against and take advantage of Subversion for managing this 
work.  It would be great if we could get an "ebcdic" branch in the real 
repository where we could start hosting this code.  It would make it 
easier for us when we need to catch up to trunk as I am currently only 
mirroring the branch.

The patch is still a work in progress.  Our immediate goal is only to 
allow a Subversion server to be hosted on OS/400.  With that in mind, we 
have not touched ra_dav, or most of the client or working copy code.  We 
also have not ported BDB, and are just planning on using fsfs.  The patch 
is complete for svnadmin and svnserve.  We are still in the process of 
getting mod_dav_svn working, although we have made a lot of progress in 
recent weeks.  It is by far the most challenging part of the port as so 
much of the code is out of our control.

With all of that out of the way, here is our attempt to explain what we 
are doing, and the issues we have had to solve in porting to EBCDIC.  One 
final thing to keep in mind: on OS/400, Apache and APR are supplied by 
IBM and are not completely open source.  We have to live with what they 
give us and how they have implemented things.

Our patch follows a few general assumptions to run in an ebcdic 
environment:

A) We separate the code into four logical groupings:
 
   MOD_DAV_SVN: \subversion\mod_dav_svn\*.c

   SVN: All subversion code that's not in \subversion\mod_dav_svn\

   APACHE: Mod_dav and Apache code.

   APR: APR and C Standard Library functions.

B) MOD_DAV_SVN - Here we assume all strings/chars are ebcdic.

C) SVN - Here we assume all strings/chars are utf-8.

D) APACHE & APR - Here we assume all strings/chars are ebcdic.  Strings 
passed to these functions may need to be in ebcdic if the semantics of the 
string matter.  e.g. calling atoi(const char *str) needs an ebcdic encoded 
str to work properly, but strchr(const char *str, int c) is just searching 
for a byte pattern and doesn't care what encoding is used. 

E) Strings passed between these groups may need conversion:

   ebcdic      ------------> utf-8
   MOD_DAV_SVN --> calls --> SVN

   utf-8       ------------> ebcdic
   SVN         --> calls --> APR
 
   "Conversion" may involve strings passed as arguments, strings returned 
by the function, or char ** args.  You may be asking, "How does one 
convert non-ascii utf-8 to ebcdic without losing information?"  IBM uses a 
"utf-8esque" encoding scheme similar to unicode's utf-ebcdic 
specification. 

F) Strings passed between these groups should share the same encoding and 
need no special handling:
 
   ebcdic      ------------> ebcdic
   APACHE      --> calls --> MOD_DAV_SVN
   MOD_DAV_SVN --> calls --> APACHE
   MOD_DAV_SVN --> calls --> APR
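
To make assumption D a little more concrete, here is a minimal sketch (not 
code from the patch) of why the "same" characters need conversion before 
something like atoi can understand them on an ebcdic system.  It assumes 
EBCDIC code page 37 and handles only digits, letters and space; the names 
ascii_to_ebcdic37 and ascii_to_ebcdic37_buf are hypothetical, and the port 
itself relies on svn_utf_cstring_from_utf8 and IBM's converters rather 
than a hand-written table.

```c
#include <stddef.h>

/* Map one 7-bit ASCII byte to its EBCDIC (code page 37) byte.
   Illustration only: digits, letters and space are handled; every other
   byte maps to 0x6F, the EBCDIC '?'. */
static unsigned char ascii_to_ebcdic37(unsigned char c)
{
  if (c >= 0x30 && c <= 0x39) return (unsigned char)(0xF0 + (c - 0x30)); /* 0-9 */
  if (c >= 0x61 && c <= 0x69) return (unsigned char)(0x81 + (c - 0x61)); /* a-i */
  if (c >= 0x6A && c <= 0x72) return (unsigned char)(0x91 + (c - 0x6A)); /* j-r */
  if (c >= 0x73 && c <= 0x7A) return (unsigned char)(0xA2 + (c - 0x73)); /* s-z */
  if (c >= 0x41 && c <= 0x49) return (unsigned char)(0xC1 + (c - 0x41)); /* A-I */
  if (c >= 0x4A && c <= 0x52) return (unsigned char)(0xD1 + (c - 0x4A)); /* J-R */
  if (c >= 0x53 && c <= 0x5A) return (unsigned char)(0xE2 + (c - 0x53)); /* S-Z */
  if (c == 0x20) return 0x40;                                            /* space */
  return 0x6F;
}

/* Convert N bytes of ASCII text in SRC into EBCDIC bytes in DST. */
static void ascii_to_ebcdic37_buf(const char *src, unsigned char *dst, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    dst[i] = ascii_to_ebcdic37((unsigned char)src[i]);
}
```

The utf-8 bytes for "42" are 0x34 0x32, while an ebcdic atoi expects 
0xF4 0xF2, which is why a string crossing from SVN to APR may need 
converting when its meaning matters.  strchr, by contrast, only compares 
bytes and works on either form.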
__________________________________________________

To meet these assumptions we use five core approaches:

1) "Global" symbolic constants in svn_utf.h for commonly used char and 
string literals, where the literal is a hex-escaped ascii value:
 
   e.g. #define SVN_UTF8_FSLASH      '\x2F' /* '/' */
        #define SVN_UTF8_FSLASH_STR  "\x2F" /* "/" */
 
   At the time this seemed the most logical place to put these, but 
svn_ctype.h probably makes more sense.  Where the code implicitly assumes 
ascii values when using char or string literals, these would be used 
instead.

   e.g. in path.c's const char *svn_path_internal_style (const char *path, 
apr_pool_t *pool):
        - if ('/' != SVN_PATH_LOCAL_SEPARATOR)
        + if (SVN_UTF8_FSLASH != SVN_PATH_LOCAL_SEPARATOR)

2) Also in svn_utf.h, ascii aware macros to replace apr_isalpha, 
apr_isdigit, apr_isspace, apr_isxdigit, and tolower if compiled on an 
ebcdic system (determined by the value of APR_CHARSET_EBCDIC from apr.h).

   e.g. #if !APR_CHARSET_EBCDIC
          #define APR_IS_ASCII_DIGIT(x) apr_isdigit(x)
        #else
          #define APR_IS_ASCII_DIGIT(x) ( (unsigned char)(x) >= SVN_UTF8_0 && \
                                          (unsigned char)(x) <= SVN_UTF8_9 )
        #endif
 
   Where the code calls these functions, the apr_* call is replaced with 
the macro.

3) "Private" symbolic constants in *.c files for commonly used string 
literals in that file, where the literal is a hex-escaped ascii value:

   e.g. In fs_fs.c:
        /* Names of special files and file extensions for transactions */
        #define PATH_CHANGES \
        "\x63\x68\x61\x6e\x67\x65\x73"
        /* "changes" - Records changes made so far */
 
   We didn't put these in svn_utf.h per approach 1 as it seemed the list 
would become absurdly large.  Nor did we want to have multi-line hex 
escapes cluttering the code.

4) Large blocks of string literals are converted to utf-8 with IBM's 
convert pragma.

   e.g. #if APR_CHARSET_EBCDIC
        #pragma convert(1208)
        #endif
         static const char * const readme_contents =
         "This is a Subversion repository; use the 'svnadmin' tool to examine"
         APR_EOL_STR
         ...
         "Visit http://subversion.tigris.org/ for more information."
         APR_EOL_STR;
        #if APR_CHARSET_EBCDIC
        #pragma convert(37)
        #endif 

5) APR_CHARSET_EBCDIC dependent code blocks in the subversion code convert 
strings where assumption 'E' is relevant.

   e.g. In fs_fs.c's read_rep_offsets (representation_t **rep_p, char 
*string, const char *txn_id, svn_boolean_t mutable_rep_truncated, 
apr_pool_t *pool), SVN_STR_TO_REV (which is just atol) needs an ebcdic 
string:
        ...
          str = apr_strtok (string, SVN_UTF8_SPACE_STR, &last_str);
          if (str == NULL)
            return svn_error_create (SVN_ERR_FS_CORRUPT, NULL,
                                     _("Malformed text rep offset line in node-rev"));
        #if APR_CHARSET_EBCDIC
          SVN_ERR (svn_utf_cstring_from_utf8 (&str, str, pool));
        #endif 
          rep->revision = SVN_STR_TO_REV (str);
        ...
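
For readers who want to see approaches 1 and 2 working together, here is a 
small sketch, not taken from the patch.  SVN_UTF8_0 and SVN_UTF8_9 are the 
constant names used above, with the values I would expect them to have 
(the ascii codes for '0' and '9'); the ASCII_* macro names and 
ascii_digits_to_long are hypothetical and only illustrate the pattern.

```c
/* Hex-escaped constants: the same byte values on ascii and ebcdic
   compilers.  The values are assumed to match those in the patch. */
#define SVN_UTF8_0 '\x30'
#define SVN_UTF8_9 '\x39'

/* Encoding-independent classification of ascii/utf-8 bytes.
   Note that these macros evaluate their argument more than once. */
#define ASCII_IS_DIGIT(c) ((unsigned char)(c) >= SVN_UTF8_0 && \
                           (unsigned char)(c) <= SVN_UTF8_9)
#define ASCII_IS_UPPER(c) ((unsigned char)(c) >= 0x41 && \
                           (unsigned char)(c) <= 0x5A)
#define ASCII_IS_LOWER(c) ((unsigned char)(c) >= 0x61 && \
                           (unsigned char)(c) <= 0x7A)
#define ASCII_IS_ALPHA(c) (ASCII_IS_UPPER(c) || ASCII_IS_LOWER(c))
#define ASCII_IS_SPACE(c) ((unsigned char)(c) == 0x20 || \
                           ((unsigned char)(c) >= 0x09 && \
                            (unsigned char)(c) <= 0x0D))
#define ASCII_TO_LOWER(c) (ASCII_IS_UPPER(c) ? (c) + 0x20 : (c))

/* Parse a leading run of ascii digits without calling atol, so no
   conversion to the native encoding is needed first. */
static long ascii_digits_to_long(const char *s)
{
  long value = 0;
  while (ASCII_IS_DIGIT(*s))
    {
      value = value * 10 + (*s - SVN_UTF8_0);
      s++;
    }
  return value;
}
```

A helper like ascii_digits_to_long is only an alternative for simple 
cases; the patch itself converts the string and then calls 
SVN_STR_TO_REV, as shown in approach 5.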

To answer some of Julian's specific questions:

> How extensive is this?  Did you just need to do this for a few odd characters 
> here and there, or does this involve replacing hundreds of literal strings and 
> characters all over the code base?

1 through 4 are fairly extensive, but they are straightforward and less 
intrusive in the sense that understanding what they are doing is easy. 
Number 5 is more intrusive, but maybe not as bad as imagined.  Code that 
sits "between" the groups described in A has a lot of APR_CHARSET_EBCDIC 
dependent blocks; e.g. fs_fs.c has 36 blocks.  Code that operates within a 
group has few, if any; e.g. tree.c has none.  These are off-the-cuff 
examples; we haven't done an in-depth statistical analysis or anything. 

> Bear in mind that I don't know whether 
> EBCDIC has any overlap with ASCII, or what your other options are (such as 
> controls to make certain parts be compiled with ASCII as the execution 
> character set).

We have explored, and continue to explore, other approaches, but the 
above, as intrusive as it may seem, has shown the most promise so far. Our 
elementary problem is that IBM Apache/MOD_DAV sends MOD_DAV_SVN a 
request_rec with ebcdic strings and wants ebcdic strings sent back to it 
(which are converted to utf-8 before being sent out on the wire).  On the 
other hand we have repository files that contain utf-8 content.  Barring 
some way of making IBM Apache run in a utf-8 environment (which has been a 
dead end thus far) somewhere between the two we need to convert strings. 
We are very open to better ideas on where to do this and welcome any 
feedback or suggestions, but this is where we stand today.

We have actually gone pretty far on several different approaches to this 
port, including "building a wall" around Subversion that made it think it 
was just working on a UTF-8 system.  This approach was less intrusive on 
the code, but added a lot of extra string conversion and also completely 
fell apart when we got to mod_dav_svn.  Brane suggested we just let 
Subversion do the conversion and that inspired us to start over with the 
above approach, which has yielded much better results.

We would certainly welcome any feedback on the approach; I realize it is a 
lot to review.  Also, I will just come back to whether it would be 
possible to establish an ebcdic branch where we could work on this, and 
whether it seems like now would be a good time to start that process.

Thanks

Mark








_____________________________________________________________________________
Scanned for SoftLanding Systems, Inc. by IBM Email Security Management Services powered by MessageLabs. 
_____________________________________________________________________________

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Porting Subversion to EBCDIC

Posted by "C. Michael Pilato" <cm...@collab.net>.
Mark Phippard <Ma...@softlanding.com> writes:

> In writing a reply to Julian about our use of escape characters in
> our EBCDIC port, it seemed like a good time to offer up a more
> detailed explanation of what we are doing and where we are.

[...]

> It would be great if we could get an "ebcdic" branch in the real
> repository where we could start hosting this code.  It would make it
> easier for us when we need to catch up to trunk as I am currently
> only mirroring the branch.

I have no problem with setting up such a branch for you, as long as
"catching up" is done in reasonably sized chunks of work, not one or
two strings at a time.


Re: Porting Subversion to EBCDIC

Posted by Branko Čibej <br...@xbc.nu>.
Mark Phippard wrote:

>In writing a reply to Julian about our use of escape characters in our 
>EBCDIC port, it seemed like a good time to offer up a more detailed 
>explanation of what we are doing and where we are.
>
>You can download our current patch against the 1.1.x branch here:
>
>http://support.softlanding.com/ebcdic.diff
>
>The patch is 362K
>  
>
It sure is...

Well, as expected, string literals are the biggest problem and the 
biggest change. I think that, if we ever want to merge EBCDIC support 
into the mainline -- and I think we do in the long run, because 
supporting two branches would be too much work -- then something has to 
be done about that. Using character escapes like that is simply not 
maintainable.

But I think this can be solved by inventing some kind of string-literal 
conversion policy similar to what we're doing on Windows with the 
console charset.

Today, the SVN libraries juggle with four different encodings (and 
character sets):

    * Internal: the encoding expected by most public APIs. This is (and
      will most probably remain) UTF-8.
    * Native: the encoding of string literals, program arguments, etc.
      99% of the code today assumes this to be a strict (7-bit) subset
      of UTF-8.
    * APR: the encoding that APR (and Apache) functions expect. On Unix
      and Win9x, this is the same as Native; on WinNT, it's the same as
      Internal, i.e., UTF-8.
    * Console: the encoding used for writing to the console and reading
      from the console. On Unix, this is the same as Native. On Windows,
      it's something else (usually some kind of OEM crap).

So for example, converting a string from internal to APR encoding and 
back is a no-op on WinNT.

In order to support EBCDIC, we have to remove the second assumption 
(Native is a subset of Internal). Where character literals are involved, 
defining char escapes is viable since there aren't that many of them. 
String literals are a bigger problem, though, because as I said, seeing 
"\x64\x61\x76" instead of "dav" in the code is an instant turn-off.

It seems to me that if we strictly follow the string conversion rules we 
already have in place (something we don't do, IIRC, at least in 
mod_dav_svn and probably a few other places), everything _except_ 
handling of string literals would be solved in a satisfactory way (read: 
mergeable-to-trunk).

For string literals, what we want is a solution that

    * leaves readable string literals in the code;
    * allows static initialisation of struct members with string literals;
    * magics the literals to be in a UTF-8 subset at runtime.

By this time, the words "source pre-processor" should be ringing between 
your ears. I propose a filter that converts string literals in source 
files to ASCII-based char escapes before sending them to the compiler, 
of course inserting appropriate #line directives so that debuggers still 
show the original source. This filter could be inserted into the build 
on all platforms where the "native" encoding isn't ASCII.

That would make the EBCDIC patch much smaller, and correct on all 
platforms. As a bonus, if the filter were to recognise character 
escapes, too, we could rely on those being ASCII at runtime, too, and 
eliminate the character constant defines. The only remaining problem is 
literals that don't come from Subversion, e.g., APR_EOL_STR; but we can 
always define and use SVN_EOL_STR instead.
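
As a rough sketch of the core step such a filter would perform (an 
illustration only, not an existing tool), the function below turns the 
text of one string literal into a literal made entirely of hex escapes.  
The name escape_literal is hypothetical, and the sketch assumes it runs 
where the source bytes are already ASCII; on an EBCDIC build host the 
bytes would have to be mapped to ASCII first.  Finding the literals in a 
source file and emitting #line directives are left out.

```c
#include <stdio.h>
#include <string.h>

/* Write TEXT as a quoted C string literal consisting only of \xNN
   escapes.  OUT needs room for 4 * strlen(TEXT) + 3 bytes.
   Returns 0 on success, -1 if OUT is too small. */
static int escape_literal(const char *text, char *out, size_t out_size)
{
  size_t len = strlen(text);
  size_t i;
  char *p = out;

  if (out_size < 4 * len + 3)
    return -1;

  *p++ = '"';
  for (i = 0; i < len; i++)
    p += sprintf(p, "\\x%02x", (unsigned char)text[i]);
  *p++ = '"';
  *p = '\0';
  return 0;
}
```

Because every character becomes an escape, each \xNN is followed by 
either another backslash or the closing quote, so the compiler cannot 
absorb a following hex digit into the escape.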

-- Brane



Re: Porting Subversion to EBCDIC

Posted by Paul Burba <Pa...@softlanding.com>.
Julian Foad <ju...@btopenworld.com> wrote on 02/10/2005 06:42:16 PM:

> Mark Phippard wrote:
> > In writing a reply to Julian about our use of escape characters in our 
> > EBCDIC port, it seemed like a good time to offer up a more detailed 
> > explanation of what we are doing and where we are.
> 
> Thanks.  This will be very interesting.
> 
> > You can download our current patch against the 1.1.x branch here:
> > 
> > http://support.softlanding.com/ebcdic.diff
> 
> While looking at your patch, I couldn't help reviewing it a bit...

Hi Julian, thanks for taking a quick look at the patch, appreciate the 
comments. 
 
> > Index: libsvn_subr/io.c
> > ===================================================================
> [...]
> > @@ -1968,7 +2088,8 @@
> >      return SVN_NO_ERROR;
> > 
> >    err = file_name_get (&name, file, pool);
> > -  name = (! err && name) ? apr_psprintf (pool, "file '%s'", name) : "stream";
> > +  name = (! err && name) ? apr_psprintf (pool, FILE_STR SVN_UTF8_SPACE_STR 
> > +                                         "\x27%s\x27", name) : STREAM_STR;
> >    svn_error_clear (err);
> > 
> >    return svn_error_wrap_apr (status, "Can't %s %s", op, name);
> 
> Here you are passing a native format-string ("Can't %s %s") but UTF8 arguments 
> ("name", at least) to svn_error_wrap_apr.

The explanation that we wrote up and Mark recently posted is somewhat of a 
simplification.  Our intent was to give a high level view of the problems 
we face and our current solutions.  So I left out some of the details...or 
less charitably, I forgot about them :) 

When creating the patch we accidentally left out two new files, which are 
essential to this tale of ebcdic induced woe:

  /subversion/include/svn_ebcdic.h
  /subversion/libsvn_subr/svn_ebcdic.c

This is a temporary "holding pen" for functions we use to solve 
ebcdic/iSeries related issues, but are unsure about where in the 
subversion code base they ultimately belong or if we will even need them 
in the end.  One of these functions is:
 
  char *svn_ebcdic_pvsprintf (apr_pool_t *p, const char *fmt, va_list ap)

This works like apr_pvsprintf except that it assumes any %c or %s variable 
args are utf-8 encoded and converts them to ebcdic when building the 
return string.  svn_error_createf and svn_error_wrap_apr both call 
svn_ebcdic_pvsprintf rather than apr_pvsprintf if APR_CHARSET_EBCDIC is 
true. 

Given our assumption "C" (that all strings/chars in the non-mod_dav_svn 
subversion code are utf-8) the numerous calls to svn_error_createf and 
svn_error_wrap_apr pass utf-8 char/string var args.  In an exception to 
our assumption C though, we leave the format strings as ebcdic, as they 
need to eventually be native for APR to use.  The above approach saved us 
having to convert every string or char var arg to ebcdic prior to 
creating/wrapping errors - which would make for a much larger patch.  This 
reveals our "secret" assumption G) Don't muck up the code with endless 
APR_CHARSET_EBCDIC dependent conversions if there is a way around it!
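
For anyone trying to picture this without the missing files, here is a 
heavily simplified sketch of the idea, not the actual svn_ebcdic_pvsprintf.  
It handles only %s and %%, uses malloc instead of a pool, takes variable 
args directly rather than a va_list, and uses a stand-in converter 
(to_native_stub, which just upper-cases a-z so the effect is visible on 
any machine) where the real code would convert utf-8 to ebcdic.  All 
names in it are hypothetical.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the utf-8 -> native conversion.  Caller frees. */
static char *to_native_stub(const char *utf8)
{
  size_t n = strlen(utf8), i;
  char *out = malloc(n + 1);
  if (!out)
    return NULL;
  for (i = 0; i < n; i++)
    out[i] = (utf8[i] >= 'a' && utf8[i] <= 'z')
             ? (char)(utf8[i] - 'a' + 'A') : utf8[i];
  out[n] = '\0';
  return out;
}

/* Append N bytes of S to a growing, NUL-terminated buffer. */
static int append(char **buf, size_t *len, size_t *cap,
                  const char *s, size_t n)
{
  if (*len + n + 1 > *cap)
    {
      size_t new_cap = (*len + n + 1) * 2;
      char *tmp = realloc(*buf, new_cap);
      if (!tmp)
        return -1;
      *buf = tmp;
      *cap = new_cap;
    }
  memcpy(*buf + *len, s, n);
  *len += n;
  (*buf)[*len] = '\0';
  return 0;
}

/* Format FMT, which stays in the native encoding, converting each %s
   argument on the way in.  Returns a malloc'd string, or NULL on error. */
static char *format_converting_strings(const char *fmt, ...)
{
  va_list ap;
  char *buf = NULL;
  size_t len = 0, cap = 0;
  const char *p;
  int failed = (append(&buf, &len, &cap, "", 0) != 0);

  va_start(ap, fmt);
  for (p = fmt; *p && !failed; p++)
    {
      if (p[0] == '%' && p[1] == 's')
        {
          char *native = to_native_stub(va_arg(ap, const char *));
          failed = (!native
                    || append(&buf, &len, &cap, native, strlen(native)) != 0);
          free(native);
          p++;
        }
      else if (p[0] == '%' && p[1] == '%')
        {
          failed = (append(&buf, &len, &cap, "%", 1) != 0);
          p++;
        }
      else
        failed = (append(&buf, &len, &cap, p, 1) != 0);
    }
  va_end(ap);

  if (failed)
    {
      free(buf);
      return NULL;
    }
  return buf;
}
```

The point of the design is that callers keep passing utf-8 arguments and 
native format strings exactly as they do today, and the conversion 
happens in one place.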

> I haven't reviewed carefully.  I just noticed these while flicking through it.
> 
> - Julian

Mark mentioned earlier that our biggest challenge is getting IBM 
Apache/MOD_DAV to work correctly with MOD_DAV_SVN.  As a result we tried 
to solve the "simpler" issues and move along quickly to the mod_dav_svn 
crux, leaving some of the early work in a less than polished state.  I 
mention this for the benefit of anyone who does do a thorough review on 
the current patch: we are aware it's not quite ready for prime time and 
will inspect and clean it up prior to any commit to the proposed ebcdic 
branch.

Thanks again,

Paul B.
SoftLanding Systems, Inc.


Re: Porting Subversion to EBCDIC

Posted by Julian Foad <ju...@btopenworld.com>.
Mark Phippard wrote:
> In writing a reply to Julian about our use of escape characters in our 
> EBCDIC port, it seemed like a good time to offer up a more detailed 
> explanation of what we are doing and where we are.

Thanks.  This will be very interesting.

> You can download our current patch against the 1.1.x branch here:
> 
> http://support.softlanding.com/ebcdic.diff

While looking at your patch, I couldn't help reviewing it a bit...

> Index: include/svn_path.h
> ===================================================================
[...]
> +/** Same as svn_path_basename but operates on ebcdic encoded @a path
> + */
> +char *svn_path_join_ebcdic (const char *base,
> +                            const char *component,
> +                            apr_pool_t *pool);

Should the comment say "Same as svn_path_join" instead of "svn_path_basename"?

> Index: libsvn_subr/io.c
> ===================================================================
[...]
> @@ -1968,7 +2088,8 @@
>      return SVN_NO_ERROR;
>  
>    err = file_name_get (&name, file, pool);
> -  name = (! err && name) ? apr_psprintf (pool, "file '%s'", name) : "stream";
> +  name = (! err && name) ? apr_psprintf (pool, FILE_STR SVN_UTF8_SPACE_STR 
> +                                         "\x27%s\x27", name) : STREAM_STR;
>    svn_error_clear (err);
>  
>    return svn_error_wrap_apr (status, "Can't %s %s", op, name);

Here you are passing a native format-string ("Can't %s %s") but UTF8 arguments 
("name", at least) to svn_error_wrap_apr.

> Index: libsvn_ra_local/split_url.c
> ===================================================================
[...]
> @@ -38,7 +54,7 @@
>    /* Verify that the URL is well-formed (loosely) */
>  
>    /* First, check for the "file://" prefix. */
> -  if (strncmp (URL, "file://", 7) != 0)
> +  if (strncmp (URL, FILE_PREFIX_STR, 7) != 0)
>      return svn_error_createf 
>        (SVN_ERR_RA_ILLEGAL_URL, NULL, 
>         _("Local URL '%s' does not contain 'file://' prefix"), URL);

Same here, I think, and in many other error messages with arguments.

> Index: mod_dav_svn/log.c
> ===================================================================
[...]
>    if (msg)
> +#if APR_CHARSET_EBCDIC
> +    SVN_ERR (svn_utf_utfcstring_from_utf8(&msg, msg, pool));        
> +#endif   
>      SVN_ERR( send_xml(lrb, "<D:comment>%s</D:comment>" DEBUG_CR,
>                        apr_xml_quote_string(pool, msg, 0)) );

It looks like you need braces around those two SVN_ERR statements to keep them 
both under the "if".
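
A stripped-down illustration of the problem, with stand-in macros rather 
than the real SVN_ERR and send_xml (STEP and CONVERT_FIRST are 
hypothetical names; CONVERT_FIRST plays the role of APR_CHARSET_EBCDIC):

```c
#include <stddef.h>

static int calls;            /* counts how many statements actually ran */

#define STEP() (calls++)     /* stand-in for an SVN_ERR (...) statement */
#define CONVERT_FIRST 1      /* stand-in for APR_CHARSET_EBCDIC */

/* Shape of the patched code: once the conditional block is compiled in,
   only the first statement is governed by the "if". */
static void without_braces(const char *msg)
{
  if (msg)
#if CONVERT_FIRST
    STEP();                  /* conversion: guarded by the if */
#endif
    STEP();                  /* send: now runs even when msg is NULL */
}

/* With braces, both statements stay under the "if" whether or not the
   conditional block is compiled in. */
static void with_braces(const char *msg)
{
  if (msg)
    {
#if CONVERT_FIRST
      STEP();
#endif
      STEP();
    }
}
```

With a NULL msg, without_braces still runs the second statement once, 
while with_braces runs nothing.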

I haven't reviewed carefully.  I just noticed these while flicking through it.

- Julian
