Posted to dev@subversion.apache.org by Mark Phippard <Ma...@softlanding.com> on 2005/02/10 19:38:15 UTC
Porting Subversion to EBCDIC
In writing a reply to Julian about our use of escape characters in our
EBCDIC port, it seemed like a good time to offer up a more detailed
explanation of what we are doing and where we are.
You can download our current patch against the 1.1.x branch here:
http://support.softlanding.com/ebcdic.diff
The patch is 362K
We are working against the branch because we needed something stable to
work off and we also plan on releasing this pretty soon to OS/400 users.
Currently, we use svk to keep a mirror of the Subversion repository that
we can commit against and take advantage of Subversion for managing this
work. It would be great if we could get an "ebcdic" branch in the real
repository where we could start hosting this code. It would make it
easier for us when we need to catch up to trunk as I am currently only
mirroring the branch.
The patch is still a work in progress. Our immediate goal is only to
allow a Subversion server to be hosted on OS/400. With that in mind, we
have not touched ra_dav, or most of the client or working copy code. We
also have not ported BDB, and are just planning on using fsfs. The patch
is complete for svnadmin and svnserve. We are still in the process of
getting mod_dav_svn working, although we have made a lot of progress in
recent weeks. It is by far the most challenging part of the port as so
much of the code is out of our control.
With all of that out of the way, here is our attempt to explain what we
are doing, and the issues we have had to solve in porting to EBCDIC. One
final thing to keep in mind: on OS/400, Apache and APR are supplied by
IBM and are not completely open source. We have to live with what they
give us and how they have implemented things.
Our patch follows a few general assumptions to run in an ebcdic
environment:
A) We separate the code into four logical groupings:
MOD_DAV_SVN: \subversion\mod_dav_svn\*.c
SVN: All subversion code that's not in \subversion\mod_dav_svn\
APACHE: Mod_dav and Apache code.
APR: APR and C Standard Library functions.
B) MOD_DAV_SVN - Here we assume all strings/chars are ebcdic.
C) SVN - Here we assume all strings/chars are utf-8.
D) APACHE & APR - Here we assume all strings/chars are ebcdic. Strings
passed to these functions may need to be in ebcdic if the semantics of the
string matter. e.g. calling atoi(const char *str) needs an ebcdic encoded
str to work properly, but strchr(const char *str, int c) is just searching
for a byte pattern and doesn't care what encoding is used.
E) Strings passed between these groups may need conversion:
ebcdic ------------> utf-8
MOD_DAV_SVN --> calls --> SVN
utf-8 ------------> ebcdic
SVN --> calls --> APR
"Conversion" may involve strings passed as arguments, strings returned
by the function, or char ** args. You may be asking, "How does one
convert non-ascii utf-8 to ebcdic without losing information?" IBM uses a
"utf-8esque" encoding scheme similar to unicode's utf-ebcdic
specification.
F) Strings passed between these groups should share the same encoding and
need no special handling:
ebcdic ------------> ebcdic
APACHE --> calls --> MOD_DAV_SVN
MOD_DAV_SVN --> calls --> APACHE
MOD_DAV_SVN --> calls --> APR
__________________________________________________
To meet these assumptions we use five core approaches:
1) "Global" symbolic constants in svn_utf.h for commonly used char and
string literals, where the literal is a hex-escaped ascii value:
e.g. #define SVN_UTF8_FSLASH '\x2F' /* '/' */
#define SVN_UTF8_FSLASH_STR "\x2F" /* "/" */
At the time this seemed the most logical place to put these, but
svn_ctype.h probably makes more sense. Where the code implicitly assumes
ascii values when using char or string literals, these would be used
instead.
e.g. in path.c's const char *svn_path_internal_style (const char *path,
apr_pool_t *pool):
- if ('/' != SVN_PATH_LOCAL_SEPARATOR)
+ if (SVN_UTF8_FSLASH != SVN_PATH_LOCAL_SEPARATOR)
2) Also in svn_utf.h, ascii aware macros to replace apr_isalpha,
apr_isdigit, apr_isspace, apr_isxdigit, and tolower if compiled on an
ebcdic system (determined by the value of APR_CHARSET_EBCDIC from apr.h).
e.g. #if !APR_CHARSET_EBCDIC
#define APR_IS_ASCII_DIGIT(x) apr_isdigit(x)
#else
#define APR_IS_ASCII_DIGIT(x) ( (unsigned char)x >= SVN_UTF8_0 && \
(unsigned char)x <= SVN_UTF8_9 )
#endif
Where the code calls these functions, the apr_* call is replaced with
the macro.
3) "Private" symbolic constants in *.c files for commonly used string
literals in that file, where the literal is a hex-escaped ascii value:
e.g. In fs_fs.c:
/* Names of special files and file extensions for transactions */
#define PATH_CHANGES \
"\x63\x68\x61\x6e\x67\x65\x73"
/* "changes" - Records changes made so far */
We didn't put these in svn_utf.h per approach 1 as it seemed the list
would become absurdly large. Nor did we want to have multi-line hex
escapes cluttering the code.
4) Large blocks of string literals are converted to utf-8 with IBM's
convert pragma.
e.g. #if APR_CHARSET_EBCDIC
#pragma convert(1208)
#endif
static const char * const readme_contents =
"This is a Subversion repository; use the 'svnadmin' tool to examine"
APR_EOL_STR
...
"Visit http://subversion.tigris.org/ for more information."
APR_EOL_STR;
#if APR_CHARSET_EBCDIC
#pragma convert(37)
#endif
5) APR_CHARSET_EBCDIC dependent code blocks in the subversion code convert
strings where assumption 'E' is relevant.
e.g. In fs_fs.c's read_rep_offsets (representation_t **rep_p, char
*string, const char *txn_id, svn_boolean_t mutable_rep_truncated,
apr_pool_t *pool), SVN_STR_TO_REV (which is just atol) needs an ebcdic
string:
...
str = apr_strtok (string, SVN_UTF8_SPACE_STR, &last_str);
if (str == NULL)
return svn_error_create (SVN_ERR_FS_CORRUPT, NULL,
_("Malformed text rep offset line in node-rev"));
#if APR_CHARSET_EBCDIC
SVN_ERR (svn_utf_cstring_from_utf8 (&str, str, pool));
#endif
rep->revision = SVN_STR_TO_REV (str);
...
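To make approaches 1 and 2 a little more concrete, here is a minimal, self-contained sketch. It is not taken from the patch: the EX_* constants and ex_* functions are hypothetical names chosen for illustration, standing in for the SVN_UTF8_* constants and the ASCII-aware macros described above. The last function shows an alternative to the conversion in approach 5 (parsing the ASCII digits directly rather than converting to ebcdic and calling atol); that is a swapped-in technique, not something the patch does, and it ignores overflow and sign handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants, hex-escaped so they hold ASCII values even when
   the compiler's execution character set is EBCDIC. */
#define EX_ASCII_0       '\x30'   /* '0' */
#define EX_ASCII_9       '\x39'   /* '9' */
#define EX_ASCII_UPPER_A '\x41'   /* 'A' */
#define EX_ASCII_UPPER_Z '\x5A'   /* 'Z' */

/* Return non-zero if the byte C is an ASCII digit, whatever the native
   character set is. */
static int ex_is_ascii_digit(char c)
{
  unsigned char u = (unsigned char)c;
  return u >= (unsigned char)EX_ASCII_0 && u <= (unsigned char)EX_ASCII_9;
}

/* Lower-case an ASCII upper-case letter; any other byte is returned
   unchanged. */
static char ex_ascii_tolower(char c)
{
  unsigned char u = (unsigned char)c;
  if (u >= (unsigned char)EX_ASCII_UPPER_A
      && u <= (unsigned char)EX_ASCII_UPPER_Z)
    return (char)(u + 0x20);
  return c;
}

/* Parse the leading run of ASCII digits in the UTF-8 string S.
   No overflow or sign handling: this is only a sketch. */
static long ex_ascii_to_long(const char *s)
{
  long value = 0;
  assert(s != NULL);
  while (ex_is_ascii_digit(*s))
    value = value * 10 + (*s++ - EX_ASCII_0);
  return value;
}
```

Because every comparison is against a hex-escaped byte value, the same source should behave identically on an ASCII and an EBCDIC compiler, which is the point of replacing the apr_is* calls.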
To answer some of Julian's specific questions:
> How extensive is this? Did you just need to do this for a few odd characters
> here and there, or does this involve replacing hundreds of literal strings and
> characters all over the code base?
1 through 4 are fairly extensive, but they are straightforward and less
intrusive in the sense that understanding what they are doing is easy.
Number 5 is more intrusive, but maybe not as bad as imagined. Code that
sits "between" the groups described in A has a lot of APR_CHARSET_EBCDIC
dependent blocks; e.g. fs_fs.c has 36 blocks. Code that operates within a
group has few, if any; e.g. tree.c has none. These are off-the-cuff
examples; we haven't done an in-depth statistical analysis or anything.
> Bear in mind that I don't know whether
> EBCDIC has any overlap with ASCII, or what your other options are (such as
> controls to make certain parts be compiled with ASCII as the execution
> character set).
We have explored, and continue to explore, other approaches, but the
above, as intrusive as it may seem, has shown the most promise so far. Our
elementary problem is that IBM Apache/MOD_DAV sends MOD_DAV_SVN a
request_rec with ebcdic strings and wants ebcdic strings sent back to it
(which are converted to utf-8 before being sent out on the wire). On the
other hand we have repository files that contain utf-8 content. Barring
some way of making IBM Apache run in a utf-8 environment (which has been a
dead end thus far) somewhere between the two we need to convert strings.
We are very open to better ideas on where to do this and welcome any
feedback or suggestions, but this is where we stand today.
We have actually gone pretty far on several different approaches to this
port, including "building a wall" around Subversion that made it think it
was just working on a UTF-8 system. This approach was less intrusive on
the code, but added a lot of extra string conversion and also completely
fell apart when we got to mod_dav_svn. Brane suggested we just let
Subversion do the conversion and that inspired us to start over with the
above approach, which has yielded much better results.
We would certainly welcome any feedback on the approach; I realize it is a
lot to review. Also, I will just come back to whether it would be
possible to establish an ebcdic branch where we could work on this, and
whether it seems like now would be a good time to start that process.
Thanks
Mark
_____________________________________________________________________________
Scanned for SoftLanding Systems, Inc. by IBM Email Security Management Services powered by MessageLabs.
_____________________________________________________________________________
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Re: Porting Subversion to EBCDIC
Posted by "C. Michael Pilato" <cm...@collab.net>.
Mark Phippard <Ma...@softlanding.com> writes:
> In writing a reply to Julian about our use of escape characters in
> our EBCDIC port, it seemed like a good time to offer up a more
> detailed explanation of what we are doing and where we are.
[...]
> It would be great if we could get an "ebcdic" branch in the real
> repository where we could start hosting this code. It would make it
> easier for us when we need to catch up to trunk as I am currently
> only mirroring the branch.
I have no problem with setting up such a branch for you, as long as
"catching up" is done in reasonably sized chunks of work, not one or
two strings at a time.
Re: Porting Subversion to EBCDIC
Posted by Branko Čibej <br...@xbc.nu>.
Mark Phippard wrote:
>In writing a reply to Julian about our use of escape characters in our
>EBCDIC port, it seemed like a good time to offer up a more detailed
>explanation of what we are doing and where we are.
>
>You can download our current patch against the 1.1.x branch here:
>
>http://support.softlanding.com/ebcdic.diff
>
>The patch is 362K
>
>
It sure is...
Well, as expected, string literals are the biggest problem and the
biggest change. I think that, if we ever want to merge EBCDIC support
into the mainline -- and I think we do in the long run, because
supporting two branches would be too much work -- then something has to
be done about that. Using character escapes like that is simply not
maintainable.
But I think this can be solved by inventing some kind of string-literal
conversion policy similar to what we're doing on Windows with the
console charset.
Today, the SVN libraries juggle with four different encodings (and
character sets):
* Internal: the encoding expected by most public APIs. This is (and
will most probably remain) UTF-8.
* Native: the encoding of string literals, program arguments, etc.
99% of the code today assumes this to be a strict (7-bit) subset
of UTF-8.
* APR: the encoding that APR (and Apache) functions expect. On Unix
and Win9x, this is the same as Native; on WinNT, it's the same as
Internal, i.e., UTF-8.
* Console: the encoding used for writing to the console and reading
from the console. On Unix, this is the same as Native. On Windows,
it's something else (usually some kind of OEM crap).
So for example, converting a string from internal to APR encoding and
back is a no-op on WinNT.
In order to support EBCDIC, we have to remove the second assumption
(Native is a subset of Internal). Where character literals are involved,
defining char escapes is viable since there aren't that many of them.
String literals are a bigger problem, though, because as I said, seeing
"\x64\x61\x76" instead of "dav" in the code is an instant turn-off.
It seems to me that if we strictly follow the string conversion rules we
already have in place (something we don't do, IIRC, at least in
mod_dav_svn and probably a few other places), everything _except_
handling of string literals would be solved in a satisfactory way (read:
mergeable-to-trunk).
For string literals, what we want is a solution that
* leaves readable string literals in the code;
* allows static initialisation of struct members with string literals;
* magics the literals to be in a UTF-8 subset at runtime.
By this time, the words "source pre-processor" should be ringing between
your ears. I propose a filter that converts string literals in source
files to ASCII-based char escapes before sending them to the compiler,
of course inserting appropriate #line directives so that debuggers still
show the original source. This filter could be inserted into the build
on all platforms where the "native" encoding isn't ASCII.
That would make the EBCDIC patch much smaller, and correct on all
platforms. As a bonus, if the filter were to recognise character
escapes, too, we could rely on those being ASCII at runtime, too, and
eliminate the character constant defines. The only remaining problem is
literals that don't come from Subversion, e.g., APR_EOL_STR; but we can
always define and use SVN_EOL_STR instead.
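As a rough sketch of the core step such a filter would perform, the function below rewrites the contents of one string literal as hex escapes. It is only an illustration under stated assumptions: the name ex_escape_literal is hypothetical, the filter is assumed to run on a host where the source text is already ASCII (on an EBCDIC host each byte would first have to be mapped to its ASCII value), and it does not parse C source, emit #line directives, or treat existing escapes and format specifiers specially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write the bytes of TEXT into OUT as C hex escapes, e.g. the three bytes
   of "dav" become the twelve characters \x64\x61\x76.
   Returns the number of characters written (excluding the terminating
   NUL), or -1 if OUT (of size OUTSIZE) is too small.
   Assumption: TEXT is ASCII-encoded on the machine running the filter. */
static int ex_escape_literal(const char *text, char *out, size_t outsize)
{
  size_t used = 0;

  assert(text != NULL && out != NULL);

  for (; *text != '\0'; text++)
    {
      /* Each byte needs four characters plus room for the final NUL. */
      if (outsize - used < 5)
        return -1;
      snprintf(out + used, outsize - used, "\\x%02x",
               (unsigned int)(unsigned char)*text);
      used += 4;
    }

  if (used >= outsize)
    return -1;
  out[used] = '\0';
  return (int)used;
}
```

Escaping every byte, rather than only some, sidesteps one C pitfall: a hex escape keeps consuming hex digits, so "\x64e" would be read as a single escape, whereas "\x64\x65" is unambiguous.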
-- Brane
Re: Porting Subversion to EBCDIC
Posted by Paul Burba <Pa...@softlanding.com>.
Julian Foad <ju...@btopenworld.com> wrote on 02/10/2005 06:42:16 PM:
> Mark Phippard wrote:
> > In writing a reply to Julian about our use of escape characters in our
> > EBCDIC port, it seemed like a good time to offer up a more detailed
> > explanation of what we are doing and where we are.
>
> Thanks. This will be very interesting.
>
> > You can download our current patch against the 1.1.x branch here:
> >
> > http://support.softlanding.com/ebcdic.diff
>
> While looking at your patch, I couldn't help reviewing it a bit...
Hi Julian, thanks for taking a quick look at the patch, appreciate the
comments.
> > Index: libsvn_subr/io.c
> > ===================================================================
> [...]
> > @@ -1968,7 +2088,8 @@
> > return SVN_NO_ERROR;
> >
> > err = file_name_get (&name, file, pool);
> > - name = (! err && name) ? apr_psprintf (pool, "file '%s'", name) : "stream";
> > + name = (! err && name) ? apr_psprintf (pool, FILE_STR SVN_UTF8_SPACE_STR
> > + "\x27%s\x27", name) : STREAM_STR;
> > svn_error_clear (err);
> >
> > return svn_error_wrap_apr (status, "Can't %s %s", op, name);
>
> Here you are passing a native format-string ("Can't %s %s") but UTF8 arguments
> ("name", at least) to svn_error_wrap_apr.
The explanation that we wrote up and Mark recently posted is somewhat of a
simplification. Our intent was to give a high level view of the problems
we face and our current solutions. So I left out some of the details...or
less charitably, I forgot about them :)
When creating the patch we accidentally left out two new files, which are
essential to this tale of ebcdic induced woe:
/subversion/include/svn_ebcdic.h
/subversion/libsvn_subr/svn_ebcdic.c
This is a temporary "holding pen" for functions we use to solve
ebcdic/iSeries related issues, but are unsure about where in the
subversion code base they ultimately belong or if we will even need them
in the end. One of these functions is:
char *svn_ebcdic_pvsprintf (apr_pool_t *p, const char *fmt, va_list ap)
This works like apr_pvsprintf except that it assumes any %c or %s variable
args are utf-8 encoded and converts them to ebcdic when building the
return string. svn_error_createf and svn_error_wrap_apr both call
svn_ebcdic_pvsprintf rather than apr_pvsprintf if APR_CHARSET_EBCDIC is
true.
Given our assumption "C" (that all strings/chars in the non-mod_dav_svn
subversion code are utf-8) the numerous calls to svn_error_createf and
svn_error_wrap_apr pass utf-8 char/string var args. In an exception to
our assumption C though, we leave the format strings as ebcdic, as they
need to eventually be native for APR to use. The above approach saved us
having to convert every string or char var arg to ebcdic prior to
creating/wrapping errors - which would make for a much larger patch. This
reveals our "secret" assumption G) Don't muck up the code with endless
APR_CHARSET_EBCDIC dependent conversions if there is a way around it!
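For readers without the two missing files, the idea can be sketched roughly as follows. This is not the real svn_ebcdic_pvsprintf: every ex_* name is hypothetical, the output goes to a fixed buffer instead of an APR pool, only %s and %c are interpreted (everything else in the format is copied through verbatim), and the conversion is a per-byte callback, which is a simplification since a real utf-8 to ebcdic conversion is not byte-for-byte for non-ascii text. The stand-in converter merely upper-cases ASCII letters so the effect is visible on an ordinary machine.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical per-byte converter from the internal to the native
   encoding. */
typedef char (*ex_convert_fn)(char c);

/* Stand-in converter for demonstration only: upper-cases ASCII lower-case
   letters. A real port would map utf-8 to ebcdic here. */
static char ex_demo_convert(char c)
{
  unsigned char u = (unsigned char)c;
  return (u >= 0x61 && u <= 0x7A) ? (char)(u - 0x20) : c;
}

/* Append C to OUT if there is room, always leaving space for the NUL. */
static void ex_put(char *out, size_t outsize, size_t *used, char c)
{
  if (*used + 1 < outsize)
    out[(*used)++] = c;
}

/* Build a string from the native-encoded FMT, converting each %s or %c
   argument with CONVERT as it is appended. Returns the length written. */
static size_t ex_vformat(char *out, size_t outsize, ex_convert_fn convert,
                         const char *fmt, va_list ap)
{
  size_t used = 0;

  assert(out != NULL && outsize > 0 && fmt != NULL && convert != NULL);

  for (; *fmt != '\0'; fmt++)
    {
      if (fmt[0] == '%' && fmt[1] == 's')
        {
          const char *arg = va_arg(ap, const char *);
          for (; *arg != '\0'; arg++)
            ex_put(out, outsize, &used, convert(*arg));
          fmt++;                      /* skip the 's' */
        }
      else if (fmt[0] == '%' && fmt[1] == 'c')
        {
          ex_put(out, outsize, &used, convert((char)va_arg(ap, int)));
          fmt++;                      /* skip the 'c' */
        }
      else
        ex_put(out, outsize, &used, *fmt);  /* format text: already native */
    }

  out[used] = '\0';
  return used;
}

/* Variadic convenience wrapper around ex_vformat. */
static size_t ex_format(char *out, size_t outsize, ex_convert_fn convert,
                        const char *fmt, ...)
{
  va_list ap;
  size_t n;

  va_start(ap, fmt);
  n = ex_vformat(out, outsize, convert, fmt, ap);
  va_end(ap);
  return n;
}
```

The design point is the same as in the description above: the format string is left alone and only the arguments are converted, so callers do not need a conversion block around every error they create.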
> I haven't reviewed carefully. I just noticed these while flicking through it.
>
> - Julian
Mark mentioned earlier that our biggest challenge is getting IBM
Apache/MOD_DAV to work correctly with MOD_DAV_SVN. As a result we tried
to solve the "simpler" issues and move along quickly to the mod_dav_svn
crux, leaving some of the early work in a less than polished state. I
mention this for the benefit of anyone who does do a thorough review of
the current patch: we are aware it's not quite ready for prime time and
will inspect and clean it up prior to any commit to the proposed ebcdic
branch.
Thanks again,
Paul B.
SoftLanding Systems, Inc.
Re: Porting Subversion to EBCDIC
Posted by Julian Foad <ju...@btopenworld.com>.
Mark Phippard wrote:
> In writing a reply to Julian about our use of escape characters in our
> EBCDIC port, it seemed like a good time to offer up a more detailed
> explanation of what we are doing and where we are.
Thanks. This will be very interesting.
> You can download our current patch against the 1.1.x branch here:
>
> http://support.softlanding.com/ebcdic.diff
While looking at your patch, I couldn't help reviewing it a bit...
> Index: include/svn_path.h
> ===================================================================
[...]
> +/** Same as svn_path_basename but operates on ebcdic encoded @a path
> + */
> +char *svn_path_join_ebcdic (const char *base,
> + const char *component,
> + apr_pool_t *pool);
Should the comment say "Same as svn_path_join" instead of "svn_path_basename"?
> Index: libsvn_subr/io.c
> ===================================================================
[...]
> @@ -1968,7 +2088,8 @@
> return SVN_NO_ERROR;
>
> err = file_name_get (&name, file, pool);
> - name = (! err && name) ? apr_psprintf (pool, "file '%s'", name) : "stream";
> + name = (! err && name) ? apr_psprintf (pool, FILE_STR SVN_UTF8_SPACE_STR
> + "\x27%s\x27", name) : STREAM_STR;
> svn_error_clear (err);
>
> return svn_error_wrap_apr (status, "Can't %s %s", op, name);
Here you are passing a native format-string ("Can't %s %s") but UTF8 arguments
("name", at least) to svn_error_wrap_apr.
> Index: libsvn_ra_local/split_url.c
> ===================================================================
[...]
> @@ -38,7 +54,7 @@
> /* Verify that the URL is well-formed (loosely) */
>
> /* First, check for the "file://" prefix. */
> - if (strncmp (URL, "file://", 7) != 0)
> + if (strncmp (URL, FILE_PREFIX_STR, 7) != 0)
> return svn_error_createf
> (SVN_ERR_RA_ILLEGAL_URL, NULL,
> _("Local URL '%s' does not contain 'file://' prefix"), URL);
Same here, I think, and in many other error messages with arguments.
> Index: mod_dav_svn/log.c
> ===================================================================
[...]
> if (msg)
> +#if APR_CHARSET_EBCDIC
> + SVN_ERR (svn_utf_utfcstring_from_utf8(&msg, msg, pool));
> +#endif
> SVN_ERR( send_xml(lrb, "<D:comment>%s</D:comment>" DEBUG_CR,
> apr_xml_quote_string(pool, msg, 0)) );
It looks like you need braces around those two SVN_ERR statements to keep them
both under the "if".
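To spell out the concern: without braces, "if" governs only the next statement, so in an EBCDIC build the conversion would be conditional while the send would run even when msg is NULL. The toy below shows the braced shape I have in mind. It is a self-contained illustration, not the real hunk: EX_ERR, ex_convert_step, ex_send_step and ex_send_comment are hypothetical stand-ins for SVN_ERR, the conversion call and send_xml.

```c
#include <stddef.h>

/* Hypothetical stand-in for SVN_ERR: return early on a non-zero error. */
#define EX_ERR(expr)                \
  do {                              \
    int ex_err__ = (expr);          \
    if (ex_err__)                   \
      return ex_err__;              \
  } while (0)

/* Counts how many steps ran, so the control flow can be observed. */
static int ex_calls = 0;

/* Stand-ins for the conversion and the XML send; both always succeed. */
static int ex_convert_step(void) { ex_calls++; return 0; }
static int ex_send_step(void)    { ex_calls++; return 0; }

/* With the braces, both statements run only when MSG is non-NULL. */
static int ex_send_comment(const char *msg)
{
  if (msg)
    {
      EX_ERR(ex_convert_step());
      EX_ERR(ex_send_step());
    }
  return 0;
}
```

Calling ex_send_comment(NULL) leaves ex_calls untouched, while a non-NULL message runs both steps.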
I haven't reviewed carefully. I just noticed these while flicking through it.
- Julian