You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Charles Bailey <ba...@gmail.com> on 2005/04/21 19:43:21 UTC

Re: [PATCH] Include offending XML in "Malformed XML" error message

On 2/28/05, Peter N. Lundblad <pe...@famlundblad.se> wrote:
> 
> Not its output, but the strings internally. With few exceptions our APIs
> work with UTF8 internally. The problem with adding "raw data" is that the
> error message will be invalid UTF8 and you will have recoding errors
> later. Our own error output routines will escape the whole string then,
> but that might not other libraries do. So, yes, keeping our strings valid
> UTF8 is a goal.
> 
> > If it is a general policy to convert to UTF-8, should I code this as a
> > separate function, rather than putting the logic into parse_xml?
> >
> You can put it in a separate function. Keep it internal to the file,
> though, until we see another use case for it. We can export it if that
> happens.

Well, after umpteen interrupts from the rest of life,I finally got a
few hours to look at this again.    In checking was was already
available, I found a handful of "string escaping" function in various
places which perform similar tasks (at least one with the comment
"this should share code with other_string_escaping_routine()").  Since
I'd have to add ya such function, I thought I'd try to abstract it a
bit, with the hope that similar routines could use a common base. 
I've appended a short proposal at the bottom of this messages,
containing a common "engine" and an example implementation for
creating a UTF-8-safe version of an arbitrary string.

While it's certainly more complex than just coding another
fuzzy_escape_foo(), I think it'd be a win in the long run to minimize
the number of near-parallel routines in subversion.  Of course,
there'll still have to be multiple glue routines for C strings,
counted strings, etc.

BTW, on the topic of common behaviors, what led to the use of decimal
"?\uuu" as the common representation of "illegal" bytes?  I don't see
a technical problem; I'm just asking because my brain thinks aout bit
patterns more easily in "?\xx".

Comments welcomed, before I invest any more time along this path.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

/*
 * Quick prototype of "string escaping" common routine and UTF-8 task.
 * Builds with
 *    gcc -I/usr/local/include/subversion-1 -I/usr/local/apr/include/apr-0 \
 *           -L/usr/local/apr/lib -lsvn_subr-1 -lapr-0 -o esctest  esctest.c
 */

#include <stdio.h>
#include <string.h>
#include <assert.h>

#include "svn_pools.h"
#include "svn_string.h"

/** Scan @a instr of length @a len bytes, copying to stringbuf @a outsbuf,
 * escaping bytes as indicated by the lookup array @isok and the mapping
 * function @mapper.  Memory is allocated from @a pool.
 * 
 * This is designed to be the common pathway for various string "escaping"
 * functions scattered through subversion.  The basic approach is to scan
 * an input and decide whether each byte is OK as it stands, needs to be
 * "escaped" using subversion's "?\uuu" default format, or needs to be
 * transformed in some other way.  The decision is made using a two step
 * process, which is designed to handle the simple cases quickly but allow
 * for more complex mappings.  Since the typical string will (we hope)
 * comprise mostly simple cases, this shouldn't require much code
 * complexity or loss of efficiency.  The two steps used are:
 *
 * 1. The value of a byte from the input string ("test byte") is used as an
 *    index into a (usually 255 byte) array passed in by the caller.
 *      - If the value of the appropriate array element is 0xff, 
 *        then the test byte is escaped as a "?\uuu" string in the output.
 *      - If the value of the appropriate element is otherwise non-zero,
 *        that many bytes are copied verbatim from the input to the output.
 * 2. If the array yields a 0 value, then a mapping function provided by
 *    the caller is used to allow for more complex evaluation.  This function
 *    receives five arguments:
 *      - a pointer to the pointer used by svn_do_char_escape to
 *        mark the test byte in the input string
 *      - a pointer to the start of the input string
 *      - the length of the input string
 *      - a pointer to the output stringbuf
 *      - the ever-helpful pool.
 *    The mapping function may return an nonzero value, which is interpreted
 *    as described in step 1 above, or zero, indicating that the test byte
 *    should be ignored.  In the latter case, this is generally because the
 *    mapping function has done the necessary work itself; it's free to
 *    modify the output stringbuf and adjust the pointer to the test byte
 *    as it sees fit (within the bounds of the input string).  At a minimum,
 *    it should at least increment the pointer to the test byte before
 *    returning 0, in order to avoid an infinite loop.
 */
static void
svn_do_char_escape (svn_stringbuf_t **outsbuf,
		    unsigned char *instr,
		    apr_size_t len,
		    unsigned char *isok,
		    int (*mapper)(),
		    apr_pool_t *pool)
{
  unsigned char *base, *c;

  if (*outsbuf == NULL)
    *outsbuf = svn_stringbuf_create ("", pool);
 
  for (c = base = instr; c < instr + len; ) {
    apr_size_t count = isok[*c];
    if (count == 0) {
      if (c > base)
	svn_stringbuf_appendbytes (*outsbuf, base, c - base);
      count = mapper (&c,instr,len,*outsbuf,pool);
    }
    if (count == 255) {
      char esc[6];

      if (c > base)
	svn_stringbuf_appendbytes (*outsbuf, base, c - base);
      sprintf (esc,"?\\%03u",*c);
      svn_stringbuf_appendcstr (*outsbuf, esc);
      c++;
      base = c;
    }
    else c += count;
  }
  if (c > base)
    svn_stringbuf_appendbytes (*outsbuf, base, c - base);
}

/** Determine whether the (presumably high-half) byte pointed to by
 * @a *cur is the start of a legal UTF-8 sequence in @a str, and tell
 * the caller to either copy the legal sequence or escape the current
 * byte as illegal.
 */
static int
svn_utf8_mapper (char **cur,
		 unsigned char *str,
		 apr_size_t len,
		 svn_stringbuf_t *target,
		 apr_pool_t *pool)
{
  unsigned char *c, *end = str + len;

  if (!cur || !*cur) return 255;  /* Can't help you; sorry */
  c = *cur;
  if (c[0] < 0x80) return 1; /* Shouldn't happen */
  if ( (c[0] >= 0xc2 && c[0] < 0xdf)   &&
       (c + 1 <= end)                  &&
       (c[1] >= 0x80 && c[1] <= 0xbf) )
    return 2;
  if (c[0] >= 0xe0 && c[0] <= 0xef) {
    if ( (c + 2 > end)                 ||
	 (c[1] <  0x80 || c[1] > 0xbf) ||
	 (c[2] <  0x80 || c[2] > 0xbf) ||
	 (c[0] == 0xe0 && c[1] < 0xa0) ||
	 (c[0] == 0xed && c[1] > 0x9f) )
      return 255;
    return 3;
  }
  if (c[0] >= 0xf0 && c[0] <= 0xf4) {
    if ( (c + 3 > end)                 ||
	 (c[1] <  0x80 || c[1] > 0xbf) ||
	 (c[2] <  0x80 || c[2] > 0xbf) ||
	 (c[3] <  0x80 || c[3] > 0xbf) ||
	 (c[0] == 0xf0 && c[1] < 0x90) ||
	 (c[0] == 0xf4 && c[1] > 0x8f) )
      return 255;
    return 4;
  }
  return 255;  /* Shouldn't happen */
}

/** Copy @a str of length @a len to @a *outsbuf, escaping bytes which are
 * not legal UTF-8 sequences.  Memory is allocated from @a pool as needed.
 * If @a *outsbuf is NULL, a new stringbuf is created.
 *
 * This is the "public" entry point to the common escaping engine for
 * building UTF-8-safe strings.
 */
static void
svn_escape_utf8 (svn_stringbuf_t **outsbuf,
		 unsigned char *str,
		 apr_size_t len,
		 apr_pool_t *pool)
{
  static unsigned char utf8safe[255];

  if (!utf8safe[0]) {
    memset (utf8safe, 1, 0x7f);         /* 0x00 - 0x7f always legal */
    memset (utf8safe + 0x80, 0, 0x7f);  /* 0x80 - 0xff need check, except */
    memset (utf8safe + 0xc0, 255, 2);   /* 0xc0 - 0xc1 always illegal */
    memset (utf8safe + 0xf5, 255, 11);  /* oxf5 - 0xff always illegal */
  }

  svn_do_char_escape (outsbuf, str, len, utf8safe, svn_utf8_mapper, pool);
}


/* Quick hack to run some simple tests */
int main() {
  char test[4][6] = { { 'A', 'S', 'C', 'I', 'I', '\0'  },
		      { 'B', 'E', 'L', 0x07, '!', '\0' },
		      { 0xe2, 0x98, 0xba, 'O', 'K', '\0' },
		      { 0xe2, 0xff, 0xba, 'N', 'O', '\0' }, };

  int i;
  svn_stringbuf_t *outsbuf;
  apr_pool_t *pool;

  apr_initialize();
  apr_pool_create(&pool,NULL);

  for (i = 0; i < 4; i++) {
    outsbuf = NULL;
    printf("Original: \"%s\"\n", test[i]);
    svn_escape_utf8(&outsbuf, test[i], 5, pool);
    printf("Escaped: \"%s\"\n", outsbuf->data);
  }
  return 0;
}

/*  End of message */

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Michael W Thelen <mi...@pietdepsi.com>.

kfogel@collab.net wrote:
> Charles Bailey <ba...@gmail.com> writes:
>>>So, after another too-long hiatus, here's a patch which implements a
>>>"common" string escaping function , uses it for UTF-8 escaping, and
>>>uses that to sanitize the offending XML, which is then output in the
>>>error message that Jack built^W^Wstarted this thread.
>>
>>I'm new enough that I'm not certain of list etiquette at this point:
>>the patch has sat without comment for about two weeks.  Does this
>>indicate that it should be allowed to expire quietly, saving someone
>>the chore of saying, "I wouldn't put this twisted mess into my
>>competitor's code, much less my own!"?  Should it be dumped into an
>>issue that someone might review someday when they've had a free time
>>infusion?  Should it just bide its time here?
> 
> Usually our redoubtable patch manager makes an issue for it, however,
> he may have forgotten this time, in which case your reminder is a good
> idea...

Oh ye of little faith! :-)  Actually, the patch is on my list, and if no
one responds to this within a day or so, I'll file it as a new issue.
Thank you, Charles, for following up!

-- 
Michael W Thelen
It is a mistake to think you can solve any major problems just with
potatoes.       -- Douglas Adams

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Charles Bailey <ba...@gmail.com> writes:
> > So, after another too-long hiatus, here's a patch which implements a
> > "common" string escaping function , uses it for UTF-8 escaping, and
> > uses that to sanitize the offending XML, which is then output in the
> > error message that Jack built^W^Wstarted this thread.
> 
> I'm new enough that I'm not certain of list etiquette at this point:
> the patch has sat without comment for about two weeks.  Does this
> indicate that it should be allowed to expire quietly, saving someone
> the chore of saying, "I wouldn't put this twisted mess into my
> competitor's code, much less my own!"?  Should it be dumped into an
> issue that someone might review someday when they've had a free time
> infusion?  Should it just bide its time here?

Usually our redoubtable patch manager makes an issue for it, however,
he may have forgotten this time, in which case your reminder is a good
idea...

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 6/6/05, Charles Bailey <ba...@gmail.com> wrote:
> 
> So, after another too-long hiatus, here's a patch which implements a
> "common" string escaping function , uses it for UTF-8 escaping, and
> uses that to sanitize the offending XML, which is then output in the
> error message that Jack built^W^Wstarted this thread.

I'm new enough that I'm not certain of list etiquette at this point:
the patch has sat without comment for about two weeks.  Does this
indicate that it should be allowed to expire quietly, saving someone
the chore of saying, "I wouldn't put this twisted mess into my
competitor's code, much less my own!"?  Should it be dumped into an
issue that someone might review someday when they've had a free time
infusion?  Should it just bide its time here?

No hurry in any event -- I've certainly not had all that much time to
write it, and certainly can't expect someone to have all that much
time to read it.  I just want to make sure I shouldn't be doing
something at this point to movit along/out of the way.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Michael W Thelen <mi...@pietdepsi.com>.

Charles Bailey wrote:
> On Wed, 23 Feb 2005, Charles Bailey wrote:> > Attached is a patchlet that...

It looks like something bad happened to your message, as if something
stripped out all the newlines.  Would you mind resending the patch?

-- 
Michael W Thelen
It is a mistake to think you can solve any major problems just with
potatoes.       -- Douglas Adams

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Michael W Thelen <mi...@pietdepsi.com>.

Charles Bailey tried to respond to my message, but for some reason it
hasn't shown up on the list.  He asked me to forward it, so here it is:

Forwarded message:
--On Jun 6, 2005 12:34 PM, Michael W Thelen <mi...@pietdepsi.com> wrote:

>
> It looks like something bad happened to your message, as if something
> stripped out all the newlines.  Would you mind resending the patch?

Sorry; I think I've been Gmailed again; let me try from Mulberry.
Here's the leading text:

On Wed, 23 Feb 2005, Charles Bailey wrote:

>
> Attached is a patchlet that, when expat fails to parse an hunk of XML,
> appends at least part of the offending hunk to the error message.  It

which led to an exchange regarding the need to make strings
UTF-8-safe.  After too long a haitus, I posted for comment:

On 4/21/05, Charles Bailey <ba...@gmail.com> wrote:

>
> Well, after umpteen interrupts from the rest of life,I finally got a
> few hours to look at this again.    In checking was was already
> available, I found a handful of "string escaping" function in various
> places which perform similar tasks (at least one with the comment
> "this should share code with other_string_escaping_routine()").  Since
> I'd have to add ya such function, I thought I'd try to abstract it a
> bit, with the hope that similar routines could use a common base.
> I've appended a short proposal at the bottom of this messages,
> containing a common "engine" and an example implementation for
> creating a UTF-8-safe version of an arbitrary string.

Julian Foad was kind enough to point out a dumb thinko, but no other
comments were forthcoming, possibly because the core developers were
busy with pre-1.2 cleanup.

So, after another too-long hiatus, here's a patch which implements a
"common" string escaping function , uses it for UTF-8 escaping, and
uses that to sanitize the offending XML, which is then output in the
error message that Jack built^W^Wstarted this thread.

I've interspersed my comments in the code, since there's imho zero
chance that this version of the patch will be
substantially/stylistically suitable for committing.  They're far from
exhaustive, but this message is long enough already.

Conceptual "Log message":
[[[
Add function that escapes illegal UTF-8 characters, along the way
refactoring core of
string-escaping routines, and insure that illegal XML error message
outputs legal UTF-8.
### Probably best applied as several patches, but collected here for review.

* subversion/libsvn_subr/escape.c:
  New file
  (svn_subr__escape_string): Final-common-path function for escaping
strings.

* subversion/libsvn_subr/escape_impl.h:
  New file, declaring svn_subr__escape_string and convenience macros.
  ### Logical candidate for consolidation with utf_impl.h, perhaps as
subr_impl.h

* subversion/libsvn_subr/utf.c:
  (fuzzy_escape): Renamed to ascii_fuzzy_escape, and rewritten to use
   svn_subr__escape_string.
  (svn_utf__stringbuf_escape_utf8_fuzzy): New function which escapes illegal
   UTF-8 in a string, returning the escaped string in a stringbuf.
  (utf8_escape_mapper): Helper function for
svn_utf__stringbuf_escape_utf8_fuzzy.

* subversion/libsvn_subr/utf_impl.h:
  Add prototype for svn_utf__stringbuf_escape_utf8_fuzzy.
  (svn_utf__cstring_escape_utf8_fuzzy):  Macro implementing variant
of above that
   returns NUL-terminated string.

* subversion/libsvn_subr/xml.c:
  (svn_xml_parse): If parse fails, print (sanitized) (part of) offending XML
   with error message.

* subversion/tests/libsvn_subr/utf-test.c:
  (utf_escape): New function testing UTF-8 string-escaping functions.

* subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,
 subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,
 subversion/po/pt_BR.po, subversion/po/sv.po,
 subversion/po/zh_CN.po, subversion/po/zh_TW.po:
 Courtesy to translators, since I've changed a localized string.
]]]

The patch, with interspersed comments, is appended as an attachment.

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 23 Jun 2005 09:43:08 -0500, kfogel@collab.net <kf...@collab.net> wrote:
> Daniel Berlin <db...@dberlin.org> writes:
> > You've asked two different questions.
> > The answer to the first is no, the second is yes.
> > :)
> > If you ask "has he assigned copyright to CollabNet", the answer is no.
> > Assignment of copyright is a "term of art", and has very specific meaning.
> >
> > If you ask "would this make any difference in a court of law given
> > what he has done", the answer is probably also no, because he'd
> > probably be estopped from asserting copyright infringement claims.
> >
> > However, not having assigned the code does mean he doesn't have to
> > worry about how CollabNet is going to license it, because he can still
> > do what he wants.
> 
> Thanks, perfectly clear.

Likewise, and fine with me.  What caught my attention the first time
out was that the stock statement in the source files says "Copyright
(c) 2000-2004 CollabNet.", which it wouldn't actually be without my
assigning copyright.   (I'd guess one could argue that by placing that
notice in text I submitted I was assigning copyright, but that'd be
hard to establish formally.)  I'm happy to use the boilerplate
(assuming, of course, that the patch is worth adding :-)) as current
convention, or to simply state that a perpetual license is granted to
CollabNet, or any user, to use the code with Subversion.

Even apart copyright assignment, I agree (as a non-attorney) with Mr.
Berlin that it'd be hard for someone who submitted a patch to
Subversion through the current review process to argue that there was
no intent for it to be incorporated into Subversion under its current
license for distribution, which is what I think it'd take to assert
copyright infringement based on inclusion in Subversion.  There might
be more of an argument over its inclusion in dervative works (since
the Subversion COPYING license isn't really clear about revocation and
severability), and moreso about inclusion in broadly unrelated works
by cutting-and-pasting from Subversion's sources, but I'd guess these
aren't things about which CollabNet's too worried.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Daniel Berlin <db...@dberlin.org> writes:
> You've asked two different questions.
> The answer to the first is no, the second is yes.
> :)
> If you ask "has he assigned copyright to CollabNet", the answer is no.
> Assignment of copyright is a "term of art", and has very specific meaning.
> 
> If you ask "would this make any difference in a court of law given
> what he has done", the answer is probably also no, because he'd
> probably be estopped from asserting copyright infringement claims.
> 
> However, not having assigned the code does mean he doesn't have to
> worry about how CollabNet is going to license it, because he can still
> do what he wants.

Thanks, perfectly clear.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Daniel Berlin <db...@dberlin.org>.

On Thu, 22 Jun 2005 kfogel@collab.net wrote:

> Daniel Berlin <db...@dberlin.org> writes:
>>> Does
>>> that suffice to Collabnet as assignment of copyright?
>>
>> No, at least in the US, copyright can only be assigned in writing.
>
> Wait a minute: I thought in the past you'd said that email and bona
> fide belief of author's intent was sufficient to assign copyright?  Or
> maybe you meant, sufficient for CollabNet to proceed as if copyright
> had been assigned?

You've asked two different questions.
The answer to the first is no, the second is yes.
:)
If you ask "has he assigned copyright to CollabNet", the answer is no.
Assignment of copyright is a "term of art", and has very specific meaning.

If you ask "would this make any difference in a court of law given what he 
has done", the answer is probably also no, because he'd probably be 
estopped from asserting copyright infringement claims.

However, not having assigned the code does mean he doesn't have to 
worry about how CollabNet is going to license it, because he can still do 
what he wants.

>
> -K
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Brian Behlendorf <br...@collab.net>.

On Wed, 22 Jun 2005 kfogel@collab.net wrote:
> Daniel Berlin <db...@dberlin.org> writes:
>>> Does
>>> that suffice to Collabnet as assignment of copyright?
>>
>> No, at least in the US, copyright can only be assigned in writing.
>
> Wait a minute: I thought in the past you'd said that email and bona
> fide belief of author's intent was sufficient to assign copyright?  Or
> maybe you meant, sufficient for CollabNet to proceed as if copyright
> had been assigned?

We're not worrying about assigning copyright; for the same reasons Apache 
does not.  All we legally need is the right to sublicense the code under 
the relevant license of the whole code base, in this case the Apache 1.1 
license.  We're considering submission of a patch or idea to this list or 
the bug databases, etc., as an expression of such a license.  We've known 
that we can and we should get a more formal CLA like Apache has, to 
protect against someone posting something here and then claiming later 
that we don't have the right to integrate what they posted into 
Subversion.

 	Brian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Daniel Berlin <db...@dberlin.org> writes:
> > Does
> > that suffice to Collabnet as assignment of copyright? 
> 
> No, at least in the US, copyright can only be assigned in writing.

Wait a minute: I thought in the past you'd said that email and bona
fide belief of author's intent was sufficient to assign copyright?  Or
maybe you meant, sufficient for CollabNet to proceed as if copyright
had been assigned?

-K

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Daniel Berlin <db...@dberlin.org>.

> Right.  I'd noticed that, but since I don't know Collabnet's policy
> here I didn't want
> to presume.  Is it acceptable to clone the copyright from an existing
> file?  

Yes, but the notice still needs to be made proper in terms of dates.

> Does
> that suffice to Collabnet as assignment of copyright? 

No, at least in the US, copyright can only be assigned in writing.

>  Do I have a pledge from
> Collabnet that any code whose copyright is assigned in this way will
> remain freely
> available in perpetuity?

You haven't assigned copyright, so you can still do what you want with
your code.





---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 22 Jun 2005 15:14:51 -0500, kfogel@collab.net <kf...@collab.net> wrote:
> Charles Bailey <ba...@gmail.com> writes:
> > I've no problems with that.  I would have problems with CollabNet
> > asserting next Tuesday that Subversion was (t)henceforth a proprietary
> > product, and any further use would require payment of a license fee.
> > I assume that's not the case (and it would arguably be hard to do in
> > any event), but thought it prudent to ask, since I'm new to this
> > process.
> 
> Just to make absolutely clear here, in case others are worried by
> these questions:
> 
> CollabNet cannot do that.  The "latest free copy of the code" would
> then become the starting point for a new free fork, and every
> developer here would simply follow that fork.

Thanks; that sounds great.  I'm not a lawyer (really!) nor am I trying
to create a problem where none exists.  I did want to make sure I
understood the standards of the Subversion community -- and don't want
to deal with grumpy folks at the university that employs me if it
turned out I was writing code (solely) for someone else's commercial
venture.

We now return to our regularly scheduled discussion of string escaping.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Charles Bailey <ba...@gmail.com> writes:
> I've no problems with that.  I would have problems with CollabNet
> asserting next Tuesday that Subversion was (t)henceforth a proprietary
> product, and any further use would require payment of a license fee. 
> I assume that's not the case (and it would arguably be hard to do in
> any event), but thought it prudent to ask, since I'm new to this
> process.

Just to make absolutely clear here, in case others are worried by
these questions:

CollabNet cannot do that.  The "latest free copy of the code" would
then become the starting point for a new free fork, and every
developer here would simply follow that fork.

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 6/21/05, Philip Martin <ph...@codematters.co.uk> wrote:
> Charles Bailey <ba...@gmail.com> writes:
> 
> > On 6/20/05, Philip Martin <ph...@codematters.co.uk> wrote:
> >> Charles Bailey <ba...@gmail.com> writes:
> >>
> >> > --- /dev/null Mon Jun  6 11:06:27 2005
> >> > +++ subversion/libsvn_subr/escape.c   Fri Jun  3 19:16:09 2005
> >>
> >> New files need a copyright notice.
> >
> > Right.  I'd noticed that, but since I don't know Collabnet's policy
> > here I didn't want
> > to presume.  Is it acceptable to clone the copyright from an existing
> > file?  Does
> > that suffice to Collabnet as assignment of copyright?
> 
> I think so.

OK, will do.

> >  Do I have a pledge from
> > Collabnet that any code whose copyright is assigned in this way will
> > remain freely
> > available in perpetuity?  (By "freely available", I mean terms
> > substantially identical
> > to Subversion's current licensing; I'm not trying to  start a skirmish
> > over software
> > licensing.)
> 
> I can't speak for CollabNet, but I'll point out that the Subversion
> licence is not a copyleft licence and so anyone, including CollabNet,
> can make a proprietary version.

I've no problems with that.  I would have problems with CollabNet
asserting next Tuesday that Subversion was (t)henceforth a proprietary
product, and any further use would require payment of a license fee. 
I assume that's not the case (and it would arguably be hard to do in
any event), but thought it prudent to ask, since I'm new to this
process.

> >> Those big macros also make me uneasy.
> >
> > Is the concern that it'll choke come compilers, or that they're hard
> > to maintain?
> 
> I don't think compilers will have a problem, I think it makes it
> harder for humans.

OK.  Paradoxically, I'm trying to make it easier for humans, but
letting them just use a "function call" to perform a common task,
without incurring the runtime overhead of the function call.

> > This may just be a stylistic issue: I tend to favor macros for
> > repeated tasks, on the theory that as long as one debugs it carefully,
> > the benefit of inlining is worth the effort of coding.
> 
> The macros are only used once in your patch...

Yep.  I noted in the comments that I'd included them as macros because
they seemed to be fairly common operations that might come up in other
contexts.  If that's not the case, I can just inline them and other
folks can copy-paste or reimplement them as needed.

> >> > +  if (!asciinonul[0]) {
> >> > +    asciinonul[0] = 255;                   /* NUL's not allowed */
> >>
> >> That doesn't look threadsafe.
> >
> > It is safe, though it doesn't prevent duplication.
> 
> No, it's not safe.  Consider:
> 
> - asciinonul is static initialised to zero
> - thread 1 tests asciinonul[0] and gets zero
> - thread 1 sets asciinonul[0] to 255, the rest of asciinonul is still zero
> - thread 2 tests asciinonul[0] and gets 255
> - thread 2 uses the rest of asciinonul although it is still zero, oops!

Good point; I should've spotted that.  I'd been sitting on the fence
between this approach (runtime initialization, slower but taking less
space in the source) and static initialization, but looks like the
latter is definitely the way to go here.

> >> Overall this patch makes me uneasy, I'd prefer a patch that builds on
> >> our existing ctype and/or utf-8 code.  However as I haven't tried to
> >> implement such a patch I don't know whether our existing code is
> >> a suitable starting point.
> >
> > That's certainly possible.  I'd actually started out that way, and
> > switched to the above when it seemed I was about to add ya parallel
> > routine to escape a string in a slightly different way.
> >
> > Thanks for the review.  If you or one of the other core developers
> > could answer some of the "philosophic" questions above, I can revise
> > the patch appropriately.
> 
> Well, that's a bit tricky.  Your patch looked overly complicated to
> me, but then I haven't tried to write an alternative so I don't know
> whether there is a better way.

My sense is that there is a simpler approach to the specific task of
making a string "UTF8-safe", but when I also consider related tasks
like escaping XML (single bytes escaped to entities like & => &amp;)
or potentially converting multibyte codes to alternative multibyte
codes, I can't whittle it down much further.

I think this approach is a net win because of reduction in parallel
code and simplification of maintenance, but that's an opinion rather
than verifiable fact.  It may produce some small savings in space,
though that's arguable given the use of 256-byte screening arrays. 
It's amost certainly not faster than a set of optimized custom
routines, though it's probably not much slower, either; the specifics
will likely depend on compiler and environment.  Whether, in the end,
the balance favors this "common pathway" approach or custom functions
for each task is the crucial question here, I guess.

On a practical level, how should I proceed?  Is it worth my making the
minor fixes you suggest and posting another patch for further
discussion, or would it be better to let it drop and see whether any
other opinions are forthcoming?

Thanks.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Charles Bailey wrote:
>  I'd hoped to encourage review, rather than put it off.

Clearly, you succeeded, given Philip's in-depth response :-)!

Philip Martin <ph...@codematters.co.uk> writes:
> Charles Bailey <ba...@gmail.com> writes:
> > to presume.  Is it acceptable to clone the copyright from an existing
> > file?  Does
> > that suffice to Collabnet as assignment of copyright?
> 
> I think so.

Yes, it is.  That's how we've done it in the past.

> >  Do I have a pledge from
> > Collabnet that any code whose copyright is assigned in this way will
> > remain freely
> > available in perpetuity?  (By "freely available", I mean terms
> > substantially identical
> > to Subversion's current licensing; I'm not trying to  start a skirmish
> > over software
> > licensing.)
> 
> I can't speak for CollabNet, but I'll point out that the Subversion
> licence is not a copyleft licence and so anyone, including CollabNet,
> can make a proprietary version.

Neither CollabNet nor anyone else can ever remove this code from the
open-source world, once it's under this license.  Someone could take a
*copy* of the code and put it into a closed project, but that wouldn't
affect the continued freedom of the original from which the copy came.

So I think you don't have anything to worry about.

Best,
-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Philip Martin <ph...@codematters.co.uk>.

Charles Bailey <ba...@gmail.com> writes:

> On 6/20/05, Philip Martin <ph...@codematters.co.uk> wrote:
>> Charles Bailey <ba...@gmail.com> writes:
>> 
>> > --- /dev/null Mon Jun  6 11:06:27 2005
>> > +++ subversion/libsvn_subr/escape.c   Fri Jun  3 19:16:09 2005
>> 
>> New files need a copyright notice.
>
> Right.  I'd noticed that, but since I don't know Collabnet's policy
> here I didn't want
> to presume.  Is it acceptable to clone the copyright from an existing
> file?  Does
> that suffice to Collabnet as assignment of copyright?

I think so.

>  Do I have a pledge from
> Collabnet that any code whose copyright is assigned in this way will
> remain freely
> available in perpetuity?  (By "freely available", I mean terms
> substantially identical
> to Subversion's current licensing; I'm not trying to  start a skirmish
> over software
> licensing.)

I can't speak for CollabNet, but I'll point out that the Subversion
licence is not a copyleft licence and so anyone, including CollabNet,
can make a proprietary version.

[...]

>> The patch is mangled here, and in several other places.
>
> Right.  When Michael Thelen pointed that out, I resent it as an attachment;
> I hope that came through in better shape.

I had one other copy in a forwarded message, it was also mangled.

[...]

>> Those big macros also make me uneasy.
>
> Is the concern that it'll choke come compilers, or that they're hard
> to maintain?

I don't think compilers will have a problem, I think it makes it
harder for humans.

> This may just be a stylistic issue: I tend to favor macros for
> repeated tasks, on the theory that as long as one debugs it carefully,
> the benefit of inlining is worth the effort of coding.

The macros are only used once in your patch...

[...]

>> > -fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
>> > +ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
>> >  {
>> > -  const char *src_orig = src, *src_end = src + len;
>> > -  apr_size_t new_len = 0;
>> > -  char *new;
>> > -  const char *new_orig;
>> > +  static unsigned char asciinonul[256];
>> > +  svn_stringbuf_t *result = NULL;
>> >
>> > -  /* First count how big a dest string we'll need. */
>> > -  while (src < src_end)
>> > -    {
>> > -      if (! svn_ctype_isascii (*src) || *src == '\0')
>> > -        new_len += 5;  /* 5 slots, for "?\XXX" */
>> > -      else
>> > -        new_len += 1;  /* one slot for the 7-bit char */
>> > +  if (!asciinonul[0]) {
>> > +    asciinonul[0] = 255;                   /* NUL's not allowed */
>> 
>> That doesn't look threadsafe.
>
> It is safe, though it doesn't prevent duplication.

No, it's not safe.  Consider:

- asciinonul is static initialised to zero
- thread 1 tests asciinonul[0] and gets zero
- thread 1 sets asciinonul[0] to 255, the rest of asciinonul is still zero
- thread 2 tests asciinonul[0] and gets 255
- thread 2 uses the rest of asciinonul although it is still zero, oops!

You could try to fix it by changing the code to set asciinonul[0]
later, but that's still not threadsafe:

- An optimising compiler can reorder the memory writes.
- The CPU pipeline can reorder the memory writes.
- Even if the writes are not reordered, on a multi-CPU system there is
  no guarantee that writes by one CPU will become visible to another CPU
  in the same order.

[...]

>> Overall this patch makes me uneasy, I'd prefer a patch that builds on
>> our existing ctype and/or utf-8 code.  However as I haven't tried to
>> implement such a patch I don't know whether our existing code is
>> a suitable starting point.
>
> That's certainly possible.  I'd actually started out that way, and
> switched to the above when it seemed I was about to add ya parallel
> routine to escape a string in a slightly different way.
>
> Thanks for the review.  If you or one of the other core developers
> could answer some of the "philosophic" questions above, I can revise
> the patch appropriately.

Well, that's a bit tricky.  Your patch looked overly complicated to
me, but then I haven't tried to write an alternative so I don't know
whether there is a better way.

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 6/20/05, Philip Martin <ph...@codematters.co.uk> wrote:> Charles Bailey <ba...@gmail.com> writes:> > > I've interspersed my comments in the code, since there's imho zero> > chance that this version of the patch will be> > substantially/stylistically suitable for committing.> > That doesn't encourage review!  Anyway, if the comments are necessary> to understand the code then they should be part of the code.
Hmm.  Perhaps I'm phrasing poorly.  My intent was to say that the patch wasfunctionally correct, but there are enough issues not only of style (ala HACKING)but of "project philosophy", for lack of a better term -- whethermacros are preferredto repeated code, framing of comments, calling sequence conventions --that it wasunlikely to be in final form.  I'd hoped to encourage review, ratherthan put it off.
Those comments which I intended to document the code are present as C-stylecomments in the patch.  It's the "philosophical" questions I've interspersed.
I hope this helps make my intent clearer.
> >  They're far from> > exhaustive, but this message is long enough already.> >> > Conceptual "Log message":> > [[[> > Add function that escapes illegal UTF-8 characters, along the way> > "valid" rather than "legal", "invalid" rather than "illegal".  That> applies to function names, variables, comments, etc.
OK, I'm happy to follow that convention.
> > refactoring core of> > string-escaping routines, and insure that illegal XML error message> > outputs legal UTF-8.> > ### Probably best applied as several patches, but collected here for review.> > If you think it should be several patches then submit several> patches.  On the whole it looks like something that should be one> patch.
Thanks.  This is another place where I was unsure of the philosophy. If the policy is one-change-per-commit, then I can frame it as addingthe driver in one patch, using the driver to escape strings to UTF8 ina second, and using that to fix the XML error message in a third.  Ifthe policy is one-problem-per-patch, then I agree it makes sense as asingle patch.   (I'd still keep refactoring other string escapingroutines to use the driver as a separate patch, no?)
> > * subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,> >   subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,> >   subversion/po/pt_BR.po, subversion/po/sv.po,> >   subversion/po/zh_CN.po, subversion/po/zh_TW.po:> >   Courtesy to translators, since I've changed a localized string.> > I wouldn't apply patches to PO files unless they come from someone who> claims to be fluent.
OK; I can drop them.
> > --- /dev/null Mon Jun  6 11:06:27 2005> > +++ subversion/libsvn_subr/escape.c   Fri Jun  3 19:16:09 2005> > New files need a copyright notice.
Right.  I'd noticed that, but since I don't know Collabnet's policyhere I didn't wantto presume.  Is it acceptable to clone the copyright from an existingfile?  Doesthat suffice to Collabnet as assignment of copyright?  Do I have a pledge fromCollabnet that any code whose copyright is assigned in this way willremain freelyavailable in perpetuity?  (By "freely available", I mean termssubstantially identicalto Subversion's current licensing; I'm not trying to  start a skirmishover softwarelicensing.)
> > > @@ -0,0 +1,58 @@> > +/*> > + * escape.c: common code for cleaning up unwanted bytes in strings> > + */> > +> > +#include "escape_impl.h"> > +> > +#define COPY_PREFIX \> > +  if (c > base) { \> > +    svn_stringbuf_appendbytes (out, base, c - base); \> > +    base = c; \> > +  }> > I don't think I'd have used a macro, it makes the calling code harder> to understand, but I'd prefer a macro with parameters and it needs> documentation.
Hmm.  The only purpose of the macro here is to make the calling code easierto understand; if it's not helping, I can just inline the prefix checkin the threeplaces where it's required.
> > +> > +svn_stringbuf_t *> > +svn_subr__escape_string (svn_stringbuf_t **outsbuf,> > +                      const unsigned char *instr,> > +                      apr_size_t len,> > +                      const unsigned char *isok,> > +                      unsigned char (*mapper) (unsigned char **,> > +                                               const unsigned char *,> > +                                               apr_size_t,> > +                                               const svn_stringbuf_t *,> > +                                               void *,> > +                                               apr_pool_t *),> > mapper should be a typedef.
OK.
> > +                      void *mapper_baton,> > +                      apr_pool_t *pool)> > +{> > +  unsigned char *base, *c;> > +  svn_stringbuf_t *out;> > +> > +  if (outsbuf == NULL || *outsbuf == NULL) {> > +    out = svn_stringbuf_create ("", pool);> > It should probably be created with at least len bytes capacity.
Sounds fair.  Would it be better to create it with as many bytes asthe input string length?
> > +    if (outsbuf)> > +      *outsbuf = out;> > +  }> > +  else> > +    out = *outsbuf;> > +> > +  for (c = base = (unsigned char *) instr; c < instr + len; ) {> > Try to avoid casting away const.
Agreed.  I think I tripped a compiler warning trying to leave everything (instr, base, and c) const, though it should be legal --  I'll haveanother look.
> > +    apr_size_t count = isok ? isok[*c] : 0;> > +    if (count == 0) {> > +      COPY_PREFIX;> > +      count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;> > +    }> > +    if (count == 255) {> > +      char esc[6];> > +> > +      COPY_PREFIX;> > +      sprintf (esc,"?\\%03u",*c);> > +      svn_stringbuf_appendcstr (out, esc);> > +      c++;> > +      base = c;> > +    }> > +    else c += count;> > +  }> > +  COPY_PREFIX;> > +  return out;> > This function doesn't follow the project's indentation/formatting> guidelines.
Got it.  I reflexively cuddled the braces, but can push them down.
> > +}> > +> >> >> > ### Comments are pretty self-explanatory.> > ### Docs are as doxygen; will need to be downgraded to plaintext since it's> > ### an internal header.> > ### As noted above, it makes sense to combine this with utf_impl.h.> > If it makes sense to combine it then why is it separate?
I didn't want on a first outing to be rearranging existing files.  Atsome level, this isalso a question of philosophy.  It seems to me that one "private"header file foreverything local to libsvn_subr is the best way to go.  If, however,project policyis one header per topic, then I'd use a separate escape.h.

> > --- /dev/null Mon Jun  6 11:35:47 2005> > +++ subversion/libsvn_subr/escape_impl.h      Thu Jun  2 18:44:05 2005> > Missing copyright notice.> > > @@ -0,0 +1,147 @@> > +/*> > + * escape_impl.h :  private header for string escaping function.> > + */> > +> > +> > +> > +#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H> > +#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H> > +> > +> > +#include "svn_pools.h"> > +#include "svn_string.h"> > +> > +#ifdef __cplusplus> > +extern "C" {> > +#endif /* __cplusplus */> > +> > +> > +/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,> > + * escaping bytes as indicated by the lookup array @a isok and the mapping> > + * function @a mapper. Memory is allocated from @a pool.  You may provide> > + * any extra information needed by @a mapper in @a mapper_baton.> > + * Returns a pointer to the stringbuf containing the escaped string.> > + *> > + * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; its> > address is> > The patch is mangled here, and in several other places.
Right.  When Michael Thelen pointed that out, I resent it as an attachment;I hope that came through in better shape.
> > + * placed in @a outsbuf unless that argument is NULL.> > + * If @a isok is NULL, then @a mapper is used exclusively.> > + * If @ mapper is NULL, then a single character is escaped every time @a mapper> > + * would have been called.> > + *> > + * This is designed to be the common pathway for various string "escaping"> > + * functions across subversion.  The basic approach is to scan> > + * the input and decide whether each byte is OK as it stands, needs to be> > + * "escaped" using subversion's "?\uuu" default format, or needs to be> > + * transformed in some other way.  The decision is made using a two step> > + * process, which is designed to handle the simple cases quickly but allow> > + * for more complex mappings.  Since the typical string will (we hope)> > + * comprise mostly simple cases, this shouldn't require much code> > + * complexity or loss of efficiency.  The two steps used are:> > The question is do we need such a general function?  One the one hand> it looks like it is more complicated then necessary just to handle> escaping, on the other hand it doesn't look general enough to be used,> say, as a routine to separate multibyte codepoints.> > I think I'd prefer a simpler solution, although I haven't tried to> implement one.  Perhaps based on the existing validation functions?> (Although since I wrote those I might be biased.)
I think this is the crucial philosophic question.  For the specificproblem with whichI started (given a string from I-know-not-where, and hence inI-know-not-what-encoding,emit a string that is valid UTF8), it's certainly easy enough to writeanother escaping function, similar to the ones already present.  Thatapproach seemed less than ideal to me, since it would introduce yetanother function to be maintained in parallel, etc.  The conceptualsimplicity may make it most useful in the end, though.
I tried to write the driver to be general enough to handle differenttasks, but reasonably fast for common tasks.  Since most of thestrings appearing within Subversion are UTF8, I chose a byte-orientedscreen.  (I'm also trying to take advantage of the nice property ofUTF8 that one can identify multibyte codepoints from the first byte,iff the string is guaranteed to be valid.)  I'm using the mappingfunction as a "trap door" to permit more involved (includingmultibyte) testing; in this case, there's no benefit in speed, butperhaps a common point for the actual escaping of bytes.  More generalsolutions (of which I could conceive) seemed to me to entail yet morecomplexity; I chose this level as the appropriate tradeoff.  It maywell be that the core developers choose a different level.  I can workwith that.
> > + *> > + * 1. The value of a byte from the input string ("test byte") is used as an> > + *    index into a (usually 256 byte) array passed in by the caller.> > + *      - If the value of the appropriate array element is 0xff,> > + *        then the test byte is escaped as a "?\uuu" string in the output.> > + *      - If the value of the appropriate element is otherwise non-zero,> > + *        that many bytes are copied verbatim from the input to the output.> > + * 2. If the array yields a 0 value, then a mapping function provided by> > + *    the caller is used to allow for more complex evaluation.  This function> > + *    receives five arguments:> > Five? I see six in the code.
Drat.  I thought I'd added the baton to the docment.  My omission.
> > + *      - a pointer to the pointer used by svn__do_char_escape() to> > + *        mark the test byte in the input string> > + *      - a pointer to the start of the input string> > + *      - the length of the input string> > + *      - a pointer to the output stringbuf> > + *      - the ever-helpful pool.> > + * The mapping function may return a (positive) nonzero value,> > + * which is interpreted * as described in step 1 above, or zero,> > + * indicating that the test byte * should be ignored.  In the latter> > + * case, this is generally because the * mapping function has done the> > + * necessary work itself; it's free to * modify the output stringbuf and> > + * adjust the pointer to the test byte * as it sees fit (within the> > + * bounds of the input string).  At a minimum, * it should at least> > + * increment the pointer to the test byte before * returning 0, in order> > + * to avoid an infinite loop.> > Are the '*' characters within the comment lines just fallout from> paragraph filling?
Yes.
> > + */> > +> > +svn_stringbuf_t *> > +svn_subr__escape_string (svn_stringbuf_t **outsbuf,> > +                      const unsigned char *instr,> > +                      apr_size_t len,> > +                      const unsigned char *isok,> > +                      unsigned char (*mapper) (unsigned char **,> > +                                               const unsigned char *,> > +                                               apr_size_t,> > +                                               const svn_stringbuf_t *,> > +                                               void *,> > +                                               apr_pool_t *),> > Should be a typedef, the typedef should document the parameters.
OK.
> > +                      void *mapper_baton,> > +                      apr_pool_t *pool);> > +> > +> > +> > +/** Initializer for a basic screening matrix suitable for use with> > + *  #svn_subr__escape_string to escape non-UTF-8 bytes.> > + *  We provide this since "UTF-8-safety" is a common denominator for> > + *  most string escaping in Subversion, so this matrix makes a good> > + *  starting point for more involved schemes.> > + */> > +#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}> > That looks a bit like the svn_ctype stuff, but it's a separate> implementation.  That makes me uneasy.
Fair enough.  The underlying problem is that different situationsrequire slightly different behaviors, of course.  I'm not sure whetherthe best solution is to start with a single framework and tweak it atruntime as needed, or to set up the framework for each case separatelyin the interest of speed over space.
> > +> > +/** Given pointer @a c into a string which ends at @a e, figure out> > + *  whether (*c) starts a valid UTF-8 sequence, and if so, how many bytes> > + *  it includes.  Return 255 if it's not valid UTF-8.> > + *  For a more detailed description of the encoding rules, see the UTF-8> > + *  specification in section 3-9 of the Unicode standard 4.0 (e.g. at> > + *  http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),> > + *  with special attention to Table 3-6.> > + *  This macro is also provided as a building block for mappers used by> > + *  #svn_subr__escape_string that want to check for UTF-8-safety in> > + *  addition to other tasks.> > + */> > +#define SVN_ESCAPE_UTF8_MAPPING(c,e)                                        \> > +  ( (c)[0] < 0x80                                  ? /* ASCII */            \> > +    1 :                                                   /* OK, 1 byte */          \> > +    ( ( ((c)[0] > 0xc2 && (c)[0] < 0xdf)          && /* 2-byte char */              \> > +     ((c) + 1 <= (e))                          && /* Got 2 bytes */         \> > +     ((c)[1] >= 0x80 && (c)[1] <= 0xbf))        ? /* Byte 2 legal */        \> > +      2 :                                         /* OK, 2 bytes */         \> > +      ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef)      && /* 3 byte char */              \> > +       ((c) + 2 <= (e))                        && /* Got 3 bytes */         \> > +       ((c)[1] >= 0x80 && (c)[1] <= 0xbf)      && /* Basic byte 2 legal */  \> > +       ((c)[2] >= 0x80 && (c)[2] <= 0xbf)      && /* Basic byte 3 legal */  \> > +       (!((c)[0] == 0xe0 && (c)[1] < 0xa0))    && /* 0xe0-0x[89]? illegal */\> > +       (!((c)[0] == 0xed && (c)[1] > 0x9f)) )   ? /* 0xed-0x[ab]? illegal */\> > +     3 :                                          /* OK, 3 bytes */         \> > +     ( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4)    && /* 4 byte char */         \> > +         ((c) + 3 <= (e))                      && /* Got 4 bytes */         \> > +         ((c)[1] >= 0x80 && (c)[1] <= 0xbf)    && /* Basic byte 2 legal */  \> > +         ((c)[2] >= 0x80 && (c)[2] <= 0xbf)    && /* Basic byte 3 legal */  \> > +         ((c)[3] >= 0x80 && (c)[3] <= 0xbf)    && /* Basic byte 4 legal */  \> > +         (!((c)[0] == 0xf0 && (c)[1] < 0x90))  && /* 0xf0-0x8? illegal */   \> > +         (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\> > +       4 :                                        /* OK, 4 bytes */         \> > +       255))))                                    /* Illegal; escape it */> > utf_validate.c already implements the UTF-8 encoding rules.  There is> obviously some duplication of the algorithm, that makes me uneasy.
True enough.  I'd done this for speed and compactness relative to thestate machine in utf_validate.c, expecting comments from the coredevelopers about whether it was a net win.  For the specific case ofescaping invalid UTF8, I could write a mapper to call out tosvn_utf__last_valid per character.
> Those big macros also make me uneasy.
Is the concern that it'll choke come compilers, or that they're hardto maintain?This may just be a stylistic issue: I tend to favor macros forrepeated tasks, on the theory that as long as one debugs it carefully,the benefit of inlining is worth the effort of coding.
> > +> > +> > +#ifdef __cplusplus> > +}> > +#endif /* __cplusplus */> > +> > +#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */> >> >> > ### Function names can be revised to fit convention, of course.> > ### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of> > ### returning the resultant stringbuf from svn_subr__escape_string both in a> > ### parameter and as the function's return value. If the sense is that> > it'll be a cause> > ### of debugging headaches, or that it's cortrary to subversion> > culture to code public> > ### functions as macros, it's easy enough to code this as a function,> > and to make> > ### svn_subr__escape_string return void (or less likely svn_error_t,> > if it got pickier> > ### about params.)> > --- subversion/libsvn_subr/utf_impl.h (revision 14986)> > +++ subversion/libsvn_subr/utf_impl.h (working copy)> > @@ -24,12 +24,33 @@> >> >  #include <apr_pools.h>> >  #include "svn_types.h"> > +#include "svn_string.h"> >> >  #ifdef __cplusplus> >  extern "C" {> >  #endif /* __cplusplus */> >> >> > +/** Replace any non-UTF-8 characters in @a len byte long string @a src with> > + *  escaped representations, placing the result in a stringbuf pointed to by> > + *  @a *dest, which will be created if necessary.  Memory is allocated from> > How does the user know what "if necessary" means?
Fair point.  s/necessary/NULL/.
> > + *  @a pool as needed. Returns a pointer to the stringbuf containing the result> > + *  (identical to @a *dest, but facilitates chaining calls).> > + */> > +svn_stringbuf_t *> > +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,> > +                                   const unsigned char *src,> > +                                   apr_size_t len,> > +                                   apr_pool_t *pool);> > +> > +/** Replace any non-UTF-8 characters in @a len byte long string @a src with> > + *  escaped representations.  Memory is allocated from @a pool as needed.> > + *  Returns a pointer to the resulting string.> > + */> > +#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \> > +  (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data> > Is there any need for this to be a macro?  A real function would be> less "surprising" and that makes it better unless you can justify the> macro.
OK.  I think this is another instance of the style point noted above.
> > +> > +> > const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,> >                                                apr_pool_t *pool,> >                                                svn_error_t *(*convert_from_utf8)> >> >> >> > ### There're other places that could be rewritten in terms of the new escaping> > ### functions, but I hope the two given here serve as an example of how it might> > ### be done.> > Complete patches are better than incomplete ones.
Sure.  My standard for 'complete' might have been a bit low: I wasaiming for a unit that compiled,  passed tests, and illustrated theissues I believed were still open.  I thought I might wait to seewhether the notion of a "common escaping routine" would carry beforeextending it to XML, other UTF8 tasks, etc.

> > ### The rename to ascii_fuzzy_escape is to distinguish it from the new functions> > ### that escape only illegal UTF-8 sequences.> > --- subversion/libsvn_subr/utf.c      (revision 14986)> > +++ subversion/libsvn_subr/utf.c      (working copy)> > @@ -30,6 +30,7 @@> >  #include "svn_pools.h"> >  #include "svn_ctype.h"> >  #include "svn_utf.h"> > +#include "escape_impl.h"> >  #include "utf_impl.h"> >  #include "svn_private_config.h"> >> > @@ -323,53 +324,19 @@> >  /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn> >     sequences, allocating the result in POOL. */> >  static const char *> > -fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)> > +ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)> >  {> > -  const char *src_orig = src, *src_end = src + len;> > -  apr_size_t new_len = 0;> > -  char *new;> > -  const char *new_orig;> > +  static unsigned char asciinonul[256];> > +  svn_stringbuf_t *result = NULL;> >> > -  /* First count how big a dest string we'll need. */> > -  while (src < src_end)> > -    {> > -      if (! svn_ctype_isascii (*src) || *src == '\0')> > -        new_len += 5;  /* 5 slots, for "?\XXX" */> > -      else> > -        new_len += 1;  /* one slot for the 7-bit char */> > +  if (!asciinonul[0]) {> > +    asciinonul[0] = 255;                   /* NUL's not allowed */> > That doesn't look threadsafe.
It is safe, though it doesn't prevent duplication.
> > +    memset(asciinonul + 1, 1, 127);        /* Other regular ASCII OK */> > +    memset(asciinonul + 128, 255, 128);    /* High half not allowed */> > +  }> >> > -      src++;> > -    }> > -> > -  /* Allocate that amount. */> > -  new = apr_palloc (pool, new_len + 1);> > -> > -  new_orig = new;> > -> > -  /* And fill it up. */> > -  while (src_orig < src_end)> > -    {> > -      if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')> > -        {> > -          /* This is the same format as svn_xml_fuzzy_escape uses, but that> > -             function escapes different characters.  Please keep in sync!> > -             ### If we add another fuzzy escape somewhere, we should abstract> > -             ### this out to a common function. */> > -          sprintf (new, "?\\%03u", (unsigned char) *src_orig);> > -          new += 5;> > -        }> > -      else> > -        {> > -          *new = *src_orig;> > -          new += 1;> > -        }> > -> > -      src_orig++;> > -    }> > -> > -  *new = '\0';> > -> > -  return new_orig;> > +  svn_subr__escape_string(&result, src, len, asciinonul, NULL, NULL, pool);> > +  return result->data;> >  }> >> >  /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result> > @@ -448,7 +415,7 @@> >          errstr = apr_psprintf> >            (pool, _("Can't convert string from '%s' to '%s':"),> >             node->frompage, node->topage);> > -      err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,> > +      err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,> >                                                             src_length, pool));> >        return svn_error_create (apr_err, err, errstr);> >      }> > @@ -564,7 +531,28 @@> >    return SVN_NO_ERROR;> >  }> >> > +static unsigned char> > +utf8_escape_mapper (unsigned char **targ, const unsigned char *start,> > +                 apr_size_t len, const svn_stringbuf_t *dest,> > +                 void *baton, apr_pool_t *pool)> > New functions need documentation.
OK.  I can add a docment above it.
> > +{> > +  const unsigned char *end = start + len;> > +  return SVN_ESCAPE_UTF8_MAPPING(*targ, end);> > +}> >> > +svn_stringbuf_t *> > +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,> > +                                   const unsigned char *src,> > +                                   apr_size_t len,> > +                                   apr_pool_t *pool)> > +{> > +  static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;> > +> > +  return svn_subr__escape_string(dest, src, len,> > +                              utf8screen, utf8_escape_mapper, NULL,> > +                              pool);> > +}> > +> >  svn_error_t *> >  svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,> >                             const svn_stringbuf_t *src,> > @@ -787,7 +775,7 @@> >    const char *escaped, *converted;> >    svn_error_t *err;> >> > -  escaped = fuzzy_escape (src, strlen (src), pool);> > +  escaped = ascii_fuzzy_escape (src, strlen (src), pool);> >> >    /* Okay, now we have a *new* UTF-8 string, one that's guaranteed to> >       contain only 7-bit bytes :-).  Recode to native... */> >> > ### With code comes testing.> > ### Note: Contains 8-bit chars, and also uses convention that cc will treat> > ### "foo" "bar" as "foobar".  Both can be avoided if useful for> > finicky compilers.> >> > --- subversion/tests/libsvn_subr/utf-test.c   (revision 14986)> > +++ subversion/tests/libsvn_subr/utf-test.c   (working copy)> > @@ -17,6 +17,7 @@> >   */> >> >  #include "../svn_test.h"> > +#include "../../include/svn_utf.h"> > Does a plain "svn_utf.h" work?
Hmm.  I don't know; I'd followed the convention of the prior line. I'll give it a try.
> >  #include "../../libsvn_subr/utf_impl.h"> >> >  /* Random number seed.  Yes, it's global, just pretend you can't see it. */> > @@ -222,6 +223,84 @@> >    return SVN_NO_ERROR;> >  }> >> > +static svn_error_t *> > +utf_escape (const char **msg,> > +         svn_boolean_t msg_only,> > +         svn_test_opts_t *opts,> > +         apr_pool_t *pool)> > +{> > +  char in[] = { 'A', 'S', 'C', 'I', 'I',     /* All printable */> > +             'R', 'E', 'T', '\n', 'N',    /* Newline */> > +             'B', 'E', 'L', 0x07, '!',    /* Control char */> > +             0xd2, 0xa6, 'O', 'K', '2',   /* 2-byte char, valid */> > +             0xc0, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 1st */> > +             0x82, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 2nd */> > +             0xe4, 0x87, 0xa0, 'O', 'K',  /* 3-byte char, valid */> > +             0xe2, 0xff, 0xba, 'N', 'O',  /*3-byte char, invalid 2nd */> > +             0xe0, 0x87, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */> > +             0xed, 0xa5, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */> > +             0xe4, 0x87, 0xc0, 'N', 'O',  /* 3-byte char, invalid 3rd */> > +             0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */> > +             0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */> > +             0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */> > +             0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */> > +             0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */> > +             0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */> > +                0x00 };> > +  const unsigned char *legalresult => > +    "ASCIIRET\nNBEL!$-1(c)�OK2?\\192?\\195NO2?\\130?\\195NO2"-A> > +    "3$-3�0䇠1OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"-A> > +    "?\\228?\\135?\\192NO3$-3�0򇠵1Y?\\242$-1(c)�?\\181Y?\\240?\\135?\\160"-A> > +    "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135�N?\\242?\\135?\\160"> > +    "?\\213N";> > I don't like the embedded control characters in the source code, could> you generate them at runtime?
Sure.  I could also just C-escape them like the high-half bytes above.
> > +  const unsigned char *asciiresult => > +    "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"> > +    "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"> > +    "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"> > +    "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"> > +    "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"> > +    "?\\242?\\135?\\160?\\213N";> > +  const unsigned char *asciified;> > +  apr_size_t legalresult_len = 213;  /* == strlen(legalresult) iff no NULs */> > +  int i = 0;> > +  svn_stringbuf_t *escaped = NULL;> > +> > +  *msg = "test utf string escaping";> > +> > +  if (msg_only)> > +    return SVN_NO_ERROR;> > +> > +  if (svn_utf__stringbuf_escape_utf8_fuzzy> > +      (&escaped, in, sizeof in - 1, pool) != escaped)> > I prefer () with sizeof.
OK.
> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +  if (escaped->len != legalresult_len)> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +  if (memcmp(escaped->data, legalresult, legalresult_len))> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +  if (memcmp(escaped->data, legalresult, legalresult_len))> > A duplicate of the one above?
Er, yes.  I must've mispasted.  Sorry.
> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +> > +  asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);> > +  if (strlen(asciified) != strlen(asciiresult))> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +  if (strcmp(asciified, asciiresult))> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +> > +  return SVN_NO_ERROR;> > +}> > +> >> >  /* The test table.  */> >> > @@ -230,5 +309,6 @@> >      SVN_TEST_NULL,> >      SVN_TEST_PASS (utf_validate),> >      SVN_TEST_PASS (utf_validate2),> > +    SVN_TEST_PASS (utf_escape),> >      SVN_TEST_NULL> >    };> >> >> >> > ### The original point of this thread.> > ### This patch will apply with an offset, since I've cut out sections which> > ### reimplement XML escaping in terms of the svn_subr__escape_string.> > --- subversion/libsvn_subr/xml.c      (revision 14986)> > +++ subversion/libsvn_subr/xml.c      (working copy)> > @@ -395,11 +413,22 @@> >    /* If expat choked internally, return its error. */> >    if (! success)> >      {> > +      svn_stringbuf_t *sanitized;> > +      unsigned char *end;> > +> > +      svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, buf,> > +                                        (len > 240 ? 240 : len),> > +                                        svn_parser->pool);> > +      end = sanitized->data +> > +         (sanitized->len > 240 ? 240 : sanitized->len);> > 240?  A magic number with no explanation.
I can add a comment.  It was discussed earlier in this thread, but Iagree it'd helpin the future to add something to the code.

> > +      while (*end > 0x80 && *end < 0xc0 &&> > +          (char *) end > sanitized->data) end--;> > I think that could generate a huge error message in pathological> cases.
Up to 240 characters, yes.  My intent was to include enough to givethe reader a fair idea where the error occurred, without subjectingthem to screensful of junk.  Setting the cutoff at 240 seemed to give reasonable chunk of XML element for the cases I checked (errors in theentries file.)
> >        err = svn_error_createf> >          (SVN_ERR_XML_MALFORMED, NULL,> > -         _("Malformed XML: %s at line %d"),> > +         _("Malformed XML: %s at line %d; XML starts:\n%.*s"),> >           XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),> > -         XML_GetCurrentLineNumber (svn_parser->parser));> > +         XML_GetCurrentLineNumber (svn_parser->parser),> > +      (char *) end - sanitized->data + 1, sanitized->data);> >> >        /* Kill all parsers and return the expat error */> >        svn_xml_free_parser (svn_parser);> >> > Overall this patch makes me uneasy, I'd prefer a patch that builds on> our existing ctype and/or utf-8 code.  However as I haven't tried to> implement such a patch I don't know whether our existing code is> a suitable starting point.
That's certainly possible.  I'd actually started out that way, andswitched to the above when it seemed I was about to add ya parallelroutine to escape a string in a slightly different way.
Thanks for the review.  If you or one of the other core developerscould answer some of the "philosophic" questions above, I can revisethe patch appropriately.
-- Regards,Charles BaileyLists: bailey _dot_ charles _at_ gmail _dot_ comOther: bailey _at_ newman _dot_ upenn _dot_ edu

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Philip Martin <ph...@codematters.co.uk>.

Charles Bailey <ba...@gmail.com> writes:

> I've interspersed my comments in the code, since there's imho zero
> chance that this version of the patch will be
> substantially/stylistically suitable for committing.

That doesn't encourage review!  Anyway, if the comments are necessary
to understand the code then they should be part of the code.

>  They're far from
> exhaustive, but this message is long enough already.
>
> Conceptual "Log message":
> [[[
> Add function that escapes illegal UTF-8 characters, along the way

"valid" rather than "legal", "invalid" rather than "illegal".  That
applies to function names, variables, comments, etc.

> refactoring core of
> string-escaping routines, and insure that illegal XML error message
> outputs legal UTF-8.
> ### Probably best applied as several patches, but collected here for review.

If you think it should be several patches then submit several
patches.  On the whole it looks like something that should be one
patch.

> * subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,
>   subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,
>   subversion/po/pt_BR.po, subversion/po/sv.po,
>   subversion/po/zh_CN.po, subversion/po/zh_TW.po:
>   Courtesy to translators, since I've changed a localized string.

I wouldn't apply patches to PO files unless they come from someone who
claims to be fluent.

> --- /dev/null	Mon Jun  6 11:06:27 2005
> +++ subversion/libsvn_subr/escape.c	Fri Jun  3 19:16:09 2005

New files need a copyright notice.

> @@ -0,0 +1,58 @@
> +/*
> + * escape.c:	common code for cleaning up unwanted bytes in strings
> + */
> +
> +#include "escape_impl.h"
> +
> +#define COPY_PREFIX \
> +  if (c > base) { \
> +    svn_stringbuf_appendbytes (out, base, c - base); \
> +    base = c; \
> +  }

I don't think I'd have used a macro, it makes the calling code harder
to understand, but I'd prefer a macro with parameters and it needs
documentation.

> +
> +svn_stringbuf_t *
> +svn_subr__escape_string (svn_stringbuf_t **outsbuf,
> +			 const unsigned char *instr,
> +			 apr_size_t len,
> +			 const unsigned char *isok,
> +			 unsigned char (*mapper) (unsigned char **,
> +						  const unsigned char *,
> +						  apr_size_t,
> +						  const svn_stringbuf_t *,
> +						  void *,
> +						  apr_pool_t *),

mapper should be a typedef.

> +			 void *mapper_baton,
> +			 apr_pool_t *pool)
> +{
> +  unsigned char *base, *c;
> +  svn_stringbuf_t *out;
> +
> +  if (outsbuf == NULL || *outsbuf == NULL) {
> +    out = svn_stringbuf_create ("", pool);

It should probably be created with at least len bytes capacity.

> +    if (outsbuf)
> +      *outsbuf = out;
> +  }
> +  else
> +    out = *outsbuf;
> +
> +  for (c = base = (unsigned char *) instr; c < instr + len; ) {

Try to avoid casting away const.

> +    apr_size_t count = isok ? isok[*c] : 0;
> +    if (count == 0) {
> +      COPY_PREFIX;
> +      count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;
> +    }
> +    if (count == 255) {
> +      char esc[6];
> +
> +      COPY_PREFIX;
> +      sprintf (esc,"?\\%03u",*c);
> +      svn_stringbuf_appendcstr (out, esc);
> +      c++;
> +      base = c;
> +    }
> +    else c += count;
> +  }
> +  COPY_PREFIX;
> +  return out;

This function doesn't follow the project's indentation/formatting
guidelines.

> +}
> +
>
>
> ### Comments are pretty self-explanatory.
> ### Docs are as doxygen; will need to be downgraded to plaintext since it's
> ### an internal header.
> ### As noted above, it makes sense to combine this with utf_impl.h.

If it makes sense to combine it then why is it separate?

> --- /dev/null	Mon Jun  6 11:35:47 2005
> +++ subversion/libsvn_subr/escape_impl.h	Thu Jun  2 18:44:05 2005

Missing copyright notice.

> @@ -0,0 +1,147 @@
> +/*
> + * escape_impl.h :  private header for string escaping function.
> + */
> +
> +
> +
> +#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H
> +#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H
> +
> +
> +#include "svn_pools.h"
> +#include "svn_string.h"
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif /* __cplusplus */
> +
> +
> +/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,
> + * escaping bytes as indicated by the lookup array @a isok and the mapping
> + * function @a mapper. Memory is allocated from @a pool.  You may provide
> + * any extra information needed by @a mapper in @a mapper_baton.
> + * Returns a pointer to the stringbuf containing the escaped string.
> + *
> + * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; its
> address is

The patch is mangled here, and in several other places.

> + * placed in @a outsbuf unless that argument is NULL.
> + * If @a isok is NULL, then @a mapper is used exclusively.
> + * If @ mapper is NULL, then a single character is escaped every time @a mapper
> + * would have been called.
> + *
> + * This is designed to be the common pathway for various string "escaping"
> + * functions across subversion.  The basic approach is to scan
> + * the input and decide whether each byte is OK as it stands, needs to be
> + * "escaped" using subversion's "?\uuu" default format, or needs to be
> + * transformed in some other way.  The decision is made using a two step
> + * process, which is designed to handle the simple cases quickly but allow
> + * for more complex mappings.  Since the typical string will (we hope)
> + * comprise mostly simple cases, this shouldn't require much code
> + * complexity or loss of efficiency.  The two steps used are:

The question is do we need such a general function?  One the one hand
it looks like it is more complicated then necessary just to handle
escaping, on the other hand it doesn't look general enough to be used,
say, as a routine to separate multibyte codepoints.

I think I'd prefer a simpler solution, although I haven't tried to
implement one.  Perhaps based on the existing validation functions?
(Although since I wrote those I might be biased.)

> + *
> + * 1. The value of a byte from the input string ("test byte") is used as an
> + *    index into a (usually 256 byte) array passed in by the caller.
> + *      - If the value of the appropriate array element is 0xff,
> + *        then the test byte is escaped as a "?\uuu" string in the output.
> + *      - If the value of the appropriate element is otherwise non-zero,
> + *        that many bytes are copied verbatim from the input to the output.
> + * 2. If the array yields a 0 value, then a mapping function provided by
> + *    the caller is used to allow for more complex evaluation.  This function
> + *    receives five arguments:

Five? I see six in the code.

> + *      - a pointer to the pointer used by svn__do_char_escape() to
> + *        mark the test byte in the input string
> + *      - a pointer to the start of the input string
> + *      - the length of the input string
> + *      - a pointer to the output stringbuf
> + *      - the ever-helpful pool.
> + * The mapping function may return a (positive) nonzero value,
> + * which is interpreted * as described in step 1 above, or zero,
> + * indicating that the test byte * should be ignored.  In the latter
> + * case, this is generally because the * mapping function has done the
> + * necessary work itself; it's free to * modify the output stringbuf and
> + * adjust the pointer to the test byte * as it sees fit (within the
> + * bounds of the input string).  At a minimum, * it should at least
> + * increment the pointer to the test byte before * returning 0, in order
> + * to avoid an infinite loop.

Are the '*' characters within the comment lines just fallout from
paragraph filling?

> + */
> +
> +svn_stringbuf_t *
> +svn_subr__escape_string (svn_stringbuf_t **outsbuf,
> +			 const unsigned char *instr,
> +			 apr_size_t len,
> +			 const unsigned char *isok,
> +			 unsigned char (*mapper) (unsigned char **,
> +						  const unsigned char *,
> +						  apr_size_t,
> +						  const svn_stringbuf_t *,
> +						  void *,
> +						  apr_pool_t *),

Should be a typedef, the typedef should document the parameters.

> +			 void *mapper_baton,
> +			 apr_pool_t *pool);
> +
> +
> +
> +/** Initializer for a basic screening matrix suitable for use with
> + *  #svn_subr__escape_string to escape non-UTF-8 bytes.
> + *  We provide this since "UTF-8-safety" is a common denominator for
> + *  most string escaping in Subversion, so this matrix makes a good
> + *  starting point for more involved schemes.
> + */  
> +#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}

That looks a bit like the svn_ctype stuff, but it's a separate
implementation.  That makes me uneasy.

> +
> +/** Given pointer @a c into a string which ends at @a e, figure out
> + *  whether (*c) starts a valid UTF-8 sequence, and if so, how many bytes
> + *  it includes.  Return 255 if it's not valid UTF-8.
> + *  For a more detailed description of the encoding rules, see the UTF-8
> + *  specification in section 3-9 of the Unicode standard 4.0 (e.g. at
> + *  http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),
> + *  with special attention to Table 3-6.
> + *  This macro is also provided as a building block for mappers used by
> + *  #svn_subr__escape_string that want to check for UTF-8-safety in
> + *  addition to other tasks.
> + */
> +#define SVN_ESCAPE_UTF8_MAPPING(c,e)					       \
> +  ( (c)[0] < 0x80                                  ? /* ASCII */	       \
> +    1 :				                     /* OK, 1 byte */	       \
> +    ( ( ((c)[0] > 0xc2 && (c)[0] < 0xdf)          && /* 2-byte char */	       \
> +	((c) + 1 <= (e))                          && /* Got 2 bytes */	       \
> +	((c)[1] >= 0x80 && (c)[1] <= 0xbf))        ? /* Byte 2 legal */        \
> +      2 :					     /* OK, 2 bytes */         \
> +      ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef)      && /* 3 byte char */	       \
> +	  ((c) + 2 <= (e))                        && /* Got 3 bytes */	       \
> +	  ((c)[1] >= 0x80 && (c)[1] <= 0xbf)      && /* Basic byte 2 legal */  \
> +	  ((c)[2] >= 0x80 && (c)[2] <= 0xbf)      && /* Basic byte 3 legal */  \
> +	  (!((c)[0] == 0xe0 && (c)[1] < 0xa0))    && /* 0xe0-0x[89]? illegal */\
> +	  (!((c)[0] == 0xed && (c)[1] > 0x9f)) )   ? /* 0xed-0x[ab]? illegal */\
> +	3 :                                          /* OK, 3 bytes */	       \
> +	( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4)    && /* 4 byte char */         \
> +	    ((c) + 3 <= (e))                      && /* Got 4 bytes */         \
> +	    ((c)[1] >= 0x80 && (c)[1] <= 0xbf)    && /* Basic byte 2 legal */  \
> +	    ((c)[2] >= 0x80 && (c)[2] <= 0xbf)    && /* Basic byte 3 legal */  \
> +	    ((c)[3] >= 0x80 && (c)[3] <= 0xbf)    && /* Basic byte 4 legal */  \
> +	    (!((c)[0] == 0xf0 && (c)[1] < 0x90))  && /* 0xf0-0x8? illegal */   \
> +	    (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\
> +	  4 :					     /* OK, 4 bytes */         \
> +	  255))))                                    /* Illegal; escape it */

utf_validate.c already implements the UTF-8 encoding rules.  There is
obviously some duplication of the algorithm, that makes me uneasy.

Those big macros also make me uneasy.

> +
> +
> +#ifdef __cplusplus
> +}
> +#endif /* __cplusplus */
> +
> +#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */
>
>
> ### Function names can be revised to fit convention, of course. 
> ### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of
> ### returning the resultant stringbuf from svn_subr__escape_string both in a
> ### parameter and as the function's return value. If the sense is that
> it'll be a cause
> ### of debugging headaches, or that it's cortrary to subversion
> culture to code public
> ### functions as macros, it's easy enough to code this as a function,
> and to make
> ### svn_subr__escape_string return void (or less likely svn_error_t,
> if it got pickier
> ### about params.)
> --- subversion/libsvn_subr/utf_impl.h	(revision 14986)
> +++ subversion/libsvn_subr/utf_impl.h	(working copy)
> @@ -24,12 +24,33 @@
>  
>  #include <apr_pools.h>
>  #include "svn_types.h"
> +#include "svn_string.h"
>  
>  #ifdef __cplusplus
>  extern "C" {
>  #endif /* __cplusplus */
>  
>  
> +/** Replace any non-UTF-8 characters in @a len byte long string @a src with
> + *  escaped representations, placing the result in a stringbuf pointed to by
> + *  @a *dest, which will be created if necessary.  Memory is allocated from

How does the user know what "if necessary" means?

> + *  @a pool as needed. Returns a pointer to the stringbuf containing the result
> + *  (identical to @a *dest, but facilitates chaining calls).
> + */
> +svn_stringbuf_t *
> +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,
> +				      const unsigned char *src,
> +				      apr_size_t len,
> +				      apr_pool_t *pool);
> +
> +/** Replace any non-UTF-8 characters in @a len byte long string @a src with
> + *  escaped representations.  Memory is allocated from @a pool as needed.
> + *  Returns a pointer to the resulting string.
> + */
> +#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \
> +  (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data

Is there any need for this to be a macro?  A real function would be
less "surprising" and that makes it better unless you can justify the
macro.

> +
> +
> const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,
>                                                apr_pool_t *pool,
>                                                svn_error_t *(*convert_from_utf8)
>
>
>
> ### There're other places that could be rewritten in terms of the new escaping
> ### functions, but I hope the two given here serve as an example of how it might
> ### be done.

Complete patches are better than incomplete ones.

> ### The rename to ascii_fuzzy_escape is to distinguish it from the new functions
> ### that escape only illegal UTF-8 sequences.
> --- subversion/libsvn_subr/utf.c	(revision 14986)
> +++ subversion/libsvn_subr/utf.c	(working copy)
> @@ -30,6 +30,7 @@
>  #include "svn_pools.h"
>  #include "svn_ctype.h"
>  #include "svn_utf.h"
> +#include "escape_impl.h"
>  #include "utf_impl.h"
>  #include "svn_private_config.h"
>  
> @@ -323,53 +324,19 @@
>  /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn
>     sequences, allocating the result in POOL. */
>  static const char *
> -fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
> +ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
>  {
> -  const char *src_orig = src, *src_end = src + len;
> -  apr_size_t new_len = 0;
> -  char *new;
> -  const char *new_orig;
> +  static unsigned char asciinonul[256];
> +  svn_stringbuf_t *result = NULL;
>  
> -  /* First count how big a dest string we'll need. */
> -  while (src < src_end)
> -    {
> -      if (! svn_ctype_isascii (*src) || *src == '\0')
> -        new_len += 5;  /* 5 slots, for "?\XXX" */
> -      else
> -        new_len += 1;  /* one slot for the 7-bit char */
> +  if (!asciinonul[0]) {
> +    asciinonul[0] = 255;                   /* NUL's not allowed */

That doesn't look threadsafe.

> +    memset(asciinonul + 1, 1, 127);        /* Other regular ASCII OK */
> +    memset(asciinonul + 128, 255, 128);    /* High half not allowed */
> +  } 
>  
> -      src++;
> -    }
> -
> -  /* Allocate that amount. */
> -  new = apr_palloc (pool, new_len + 1);
> -
> -  new_orig = new;
> -
> -  /* And fill it up. */
> -  while (src_orig < src_end)
> -    {
> -      if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')
> -        {
> -          /* This is the same format as svn_xml_fuzzy_escape uses, but that
> -             function escapes different characters.  Please keep in sync!
> -             ### If we add another fuzzy escape somewhere, we should abstract
> -             ### this out to a common function. */
> -          sprintf (new, "?\\%03u", (unsigned char) *src_orig);
> -          new += 5;
> -        }
> -      else
> -        {
> -          *new = *src_orig;
> -          new += 1;
> -        }
> -
> -      src_orig++;
> -    }
> -
> -  *new = '\0';
> -
> -  return new_orig;
> +  svn_subr__escape_string(&result, src, len, asciinonul, NULL, NULL, pool);
> +  return result->data;
>  }
>  
>  /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result
> @@ -448,7 +415,7 @@
>          errstr = apr_psprintf
>            (pool, _("Can't convert string from '%s' to '%s':"),
>             node->frompage, node->topage);
> -      err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,
> +      err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,
>                                                             src_length, pool));
>        return svn_error_create (apr_err, err, errstr);
>      }
> @@ -564,7 +531,28 @@
>    return SVN_NO_ERROR;
>  }
>  
> +static unsigned char
> +utf8_escape_mapper (unsigned char **targ, const unsigned char *start,
> +		    apr_size_t len, const svn_stringbuf_t *dest,
> +		    void *baton, apr_pool_t *pool)

New functions need documentation.

> +{
> +  const unsigned char *end = start + len;
> +  return SVN_ESCAPE_UTF8_MAPPING(*targ, end);
> +}
>  
> +svn_stringbuf_t *
> +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,
> +				      const unsigned char *src,
> +				      apr_size_t len,
> +				      apr_pool_t *pool)
> +{
> +  static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;
> +
> +  return svn_subr__escape_string(dest, src, len,
> +				 utf8screen, utf8_escape_mapper, NULL,
> +				 pool);
> +}
> +
>  svn_error_t *
>  svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,
>                             const svn_stringbuf_t *src,
> @@ -787,7 +775,7 @@
>    const char *escaped, *converted;
>    svn_error_t *err;
>  
> -  escaped = fuzzy_escape (src, strlen (src), pool);
> +  escaped = ascii_fuzzy_escape (src, strlen (src), pool);
>  
>    /* Okay, now we have a *new* UTF-8 string, one that's guaranteed to
>       contain only 7-bit bytes :-).  Recode to native... */
>
> ### With code comes testing.
> ### Note: Contains 8-bit chars, and also uses convention that cc will treat
> ### "foo" "bar" as "foobar".  Both can be avoided if useful for
> finicky compilers.
>
> --- subversion/tests/libsvn_subr/utf-test.c	(revision 14986)
> +++ subversion/tests/libsvn_subr/utf-test.c	(working copy)
> @@ -17,6 +17,7 @@
>   */
>  
>  #include "../svn_test.h"
> +#include "../../include/svn_utf.h"

Does a plain "svn_utf.h" work?

>  #include "../../libsvn_subr/utf_impl.h"
>  
>  /* Random number seed.  Yes, it's global, just pretend you can't see it. */
> @@ -222,6 +223,84 @@
>    return SVN_NO_ERROR;
>  }
>  
> +static svn_error_t *
> +utf_escape (const char **msg,
> +	    svn_boolean_t msg_only,
> +	    svn_test_opts_t *opts,
> +	    apr_pool_t *pool)
> +{
> +  char in[] = { 'A', 'S', 'C', 'I', 'I',     /* All printable */
> +		'R', 'E', 'T', '\n', 'N',    /* Newline */
> +		'B', 'E', 'L', 0x07, '!',    /* Control char */
> +		0xd2, 0xa6, 'O', 'K', '2',   /* 2-byte char, valid */
> +		0xc0, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 1st */
> +		0x82, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 2nd */
> +		0xe4, 0x87, 0xa0, 'O', 'K',  /* 3-byte char, valid */
> +		0xe2, 0xff, 0xba, 'N', 'O',  /*3-byte char, invalid 2nd */
> +		0xe0, 0x87, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */
> +		0xed, 0xa5, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */
> +		0xe4, 0x87, 0xc0, 'N', 'O',  /* 3-byte char, invalid 3rd */
> +		0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */
> +		0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */
> +		0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */
> +		0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */
> +		0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */
> +		0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */
> +                0x00 };
> +  const unsigned char *legalresult =
> +    "ASCIIRET\nNBEL!$-1(c)ÊOK2?\\192?\\195NO2?\\130?\\195NO2"-A
> +    "3$-3ıΩ0‰á†1OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"-A
> +    "?\\228?\\135?\\192NO3$-3ıΩ0Úá†µ1Y?\\242$-1(c)‡?\\181Y?\\240?\\135?\\160"-A
> +    "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135ıN?\\242?\\135?\\160"
> +    "?\\213N";

I don't like the embedded control characters in the source code, could
you generate them at runtime?

> +  const unsigned char *asciiresult =
> +    "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"
> +    "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"
> +    "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"
> +    "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"
> +    "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"
> +    "?\\242?\\135?\\160?\\213N";
> +  const unsigned char *asciified;
> +  apr_size_t legalresult_len = 213;  /* == strlen(legalresult) iff no NULs */
> +  int i = 0;
> +  svn_stringbuf_t *escaped = NULL;
> +
> +  *msg = "test utf string escaping";
> +
> +  if (msg_only)
> +    return SVN_NO_ERROR;
> +
> +  if (svn_utf__stringbuf_escape_utf8_fuzzy
> +      (&escaped, in, sizeof in - 1, pool) != escaped)

I prefer () with sizeof.

> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +  if (escaped->len != legalresult_len)
> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +  if (memcmp(escaped->data, legalresult, legalresult_len))
> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +  if (memcmp(escaped->data, legalresult, legalresult_len))

A duplicate of the one above?

> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +
> +  asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);
> +  if (strlen(asciified) != strlen(asciiresult))
> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +  if (strcmp(asciified, asciiresult))
> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +
> +  return SVN_NO_ERROR;
> +}
> +
>  
>  /* The test table.  */
>  
> @@ -230,5 +309,6 @@
>      SVN_TEST_NULL,
>      SVN_TEST_PASS (utf_validate),
>      SVN_TEST_PASS (utf_validate2),
> +    SVN_TEST_PASS (utf_escape),
>      SVN_TEST_NULL
>    };
>
>
>
> ### The original point of this thread.
> ### This patch will apply with an offset, since I've cut out sections which
> ### reimplement XML escaping in terms of the svn_subr__escape_string.
> --- subversion/libsvn_subr/xml.c	(revision 14986)
> +++ subversion/libsvn_subr/xml.c	(working copy)
> @@ -395,11 +413,22 @@
>    /* If expat choked internally, return its error. */
>    if (! success)
>      {
> +      svn_stringbuf_t *sanitized;
> +      unsigned char *end;
> +      
> +      svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, buf,
> +					   (len > 240 ? 240 : len),
> +					   svn_parser->pool);
> +      end = sanitized->data +
> +	    (sanitized->len > 240 ? 240 : sanitized->len);

240?  A magic number with no explanation.

> +      while (*end > 0x80 && *end < 0xc0 &&
> +	     (char *) end > sanitized->data) end--;

I think that could generate a huge error message in pathological
cases.

>        err = svn_error_createf
>          (SVN_ERR_XML_MALFORMED, NULL, 
> -         _("Malformed XML: %s at line %d"),
> +         _("Malformed XML: %s at line %d; XML starts:\n%.*s"),
>           XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
> -         XML_GetCurrentLineNumber (svn_parser->parser));
> +         XML_GetCurrentLineNumber (svn_parser->parser),
> +	 (char *) end - sanitized->data + 1, sanitized->data);
>        
>        /* Kill all parsers and return the expat error */
>        svn_xml_free_parser (svn_parser);
>

Overall this patch makes me uneasy, I'd prefer a patch that builds on
our existing ctype and/or utf-8 code.  However as I haven't tried to
implement such a patch I don't know whether our existing code is
a suitable starting point.

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

[PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On Wed, 23 Feb 2005, Charles Bailey wrote:> > Attached is a patchlet that, when expat fails to parse an hunk of XML,> appends at least part of the offending hunk to the error message.  It
which led to an exchange regarding the need to make stringsUTF-8-safe.  After too long a haitus, I posted for comment:
On 4/21/05, Charles Bailey <ba...@gmail.com> wrote:> > Well, after umpteen interrupts from the rest of life,I finally got a> few hours to look at this again.    In checking was was already> available, I found a handful of "string escaping" function in various> places which perform similar tasks (at least one with the comment> "this should share code with other_string_escaping_routine()").  Since> I'd have to add ya such function, I thought I'd try to abstract it a> bit, with the hope that similar routines could use a common base.> I've appended a short proposal at the bottom of this messages,> containing a common "engine" and an example implementation for> creating a UTF-8-safe version of an arbitrary string.
Julian Foad was kind enough to point out a dumb thinko, but no othercomments were forthcoming, possibly because the core developers werebusy with pre-1.2 cleanup.
So, after another too-long hiatus, here's a patch which implements a"common" string escaping function , uses it for UTF-8 escaping, and uses that to sanitize the offending XML, which is then output in theerror message that Jack built^W^Wstarted this thread.
I've interspersed my comments in the code, since there's imho zerochance that this version of the patch will besubstantially/stylistically suitable for committing.  They're far fromexhaustive, but this message is long enough already.
Conceptual "Log message":[[[Add function that escapes illegal UTF-8 characters, along the wayrefactoring core ofstring-escaping routines, and insure that illegal XML error messageoutputs legal UTF-8.### Probably best applied as several patches, but collected here for review.
* subversion/libsvn_subr/escape.c:   New file   (svn_subr__escape_string): Final-common-path function for escaping strings.
* subversion/libsvn_subr/escape_impl.h:   New file, declaring svn_subr__escape_string and convenience macros.   ### Logical candidate for consolidation with utf_impl.h, perhaps assubr_impl.h
* subversion/libsvn_subr/utf.c:   (fuzzy_escape): Renamed to ascii_fuzzy_escape, and rewritten to use    svn_subr__escape_string.   (svn_utf__stringbuf_escape_utf8_fuzzy): New function which escapes illegal    UTF-8 in a string, returning the escaped string in a stringbuf.   (utf8_escape_mapper): Helper function forsvn_utf__stringbuf_escape_utf8_fuzzy.
* subversion/libsvn_subr/utf_impl.h:   Add prototype for svn_utf__stringbuf_escape_utf8_fuzzy.   (svn_utf__cstring_escape_utf8_fuzzy):  Macro implementing variantof above that    returns NUL-terminated string.
* subversion/libsvn_subr/xml.c:   (svn_xml_parse): If parse fails, print (sanitized) (part of) offending XML    with error message.
* subversion/tests/libsvn_subr/utf-test.c:   (utf_escape): New function testing UTF-8 string-escaping functions.
* subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,  subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,  subversion/po/pt_BR.po, subversion/po/sv.po,  subversion/po/zh_CN.po, subversion/po/zh_TW.po:  Courtesy to translators, since I've changed a localized string.
]]]

### This driver was written because there are several "escaping"functions in different### places which do similar things with slightly different criteria. It seemed best to collect### the common work into one place, if not to save space, then tominimize divergence.### The goal here is to be fast on the simple cases via the screeningarray, while allowing### flexibility for more complex substitutions via the mappingfunction.  In very over-### simplified, off-the-cuff testing, eliminating the screening arraycaused a slowdiwn of### slightly less than twofold.### I've attempted to incorporate reasonable default behavior in thecase of NULL params.--- /dev/null	Mon Jun  6 11:06:27 2005+++ subversion/libsvn_subr/escape.c	Fri Jun  3 19:16:09 2005@@ -0,0 +1,58 @@+/*+ * escape.c:	common code for cleaning up unwanted bytes in strings+ */++#include "escape_impl.h"++#define COPY_PREFIX \+  if (c > base) { \+    svn_stringbuf_appendbytes (out, base, c - base); \+    base = c; \+  }++svn_stringbuf_t *+svn_subr__escape_string (svn_stringbuf_t **outsbuf,+			 const unsigned char *instr,+			 apr_size_t len,+			 const unsigned char *isok,+			 unsigned char (*mapper) (unsigned char **,+						  const unsigned char *,+						  apr_size_t,+						  const svn_stringbuf_t *,+						  void *,+						  apr_pool_t *),+			 void *mapper_baton,+			 apr_pool_t *pool)+{+  unsigned char *base, *c;+  svn_stringbuf_t *out;++  if (outsbuf == NULL || *outsbuf == NULL) {+    out = svn_stringbuf_create ("", pool);+    if (outsbuf)+      *outsbuf = out;+  }+  else+    out = *outsbuf;++  for (c = base = (unsigned char *) instr; c < instr + len; ) {+    apr_size_t count = isok ? isok[*c] : 0;+    if (count == 0) {+      COPY_PREFIX;+      count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;+    }+    if (count == 255) {+      char esc[6];++      COPY_PREFIX;+      sprintf (esc,"?\\%03u",*c);+      svn_stringbuf_appendcstr (out, esc);+      c++;+      base = c;+    }+    else c += count;+  }+  COPY_PREFIX;+  return out;+}+

### Comments are pretty self-explanatory.### Docs are as doxygen; will need to be downgraded to plaintext since it's### an internal header.### As noted above, it makes sense to combine this with utf_impl.h.--- /dev/null	Mon Jun  6 11:35:47 2005+++ subversion/libsvn_subr/escape_impl.h	Thu Jun  2 18:44:05 2005@@ -0,0 +1,147 @@+/*+ * escape_impl.h :  private header for string escaping function.+ */++++#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H+#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H+++#include "svn_pools.h"+#include "svn_string.h"++#ifdef __cplusplus+extern "C" {+#endif /* __cplusplus */+++/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,+ * escaping bytes as indicated by the lookup array @a isok and the mapping+ * function @a mapper. Memory is allocated from @a pool.  You may provide+ * any extra information needed by @a mapper in @a mapper_baton.+ * Returns a pointer to the stringbuf containing the escaped string.+ *+ * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; itsaddress is+ * placed in @a outsbuf unless that argument is NULL.+ * If @a isok is NULL, then @a mapper is used exclusively.+ * If @ mapper is NULL, then a single character is escaped every time @a mapper+ * would have been called.+ *+ * This is designed to be the common pathway for various string "escaping"+ * functions across subversion.  The basic approach is to scan+ * the input and decide whether each byte is OK as it stands, needs to be+ * "escaped" using subversion's "?\uuu" default format, or needs to be+ * transformed in some other way.  The decision is made using a two step+ * process, which is designed to handle the simple cases quickly but allow+ * for more complex mappings.  Since the typical string will (we hope)+ * comprise mostly simple cases, this shouldn't require much code+ * complexity or loss of efficiency.  The two steps used are:+ *+ * 1. The value of a byte from the input string ("test byte") is used as an+ *    index into a (usually 256 byte) array passed in by the caller.+ *      - If the value of the appropriate array element is 0xff,+ *        then the test byte is escaped as a "?\uuu" string in the output.+ *      - If the value of the appropriate element is otherwise non-zero,+ *        that many bytes are copied verbatim from the input to the output.+ * 2. If the array yields a 0 value, then a mapping function provided by+ *    the caller is used to allow for more complex evaluation.  This function+ *    receives five arguments:+ *      - a pointer to the pointer used by svn__do_char_escape() to+ *        mark the test byte in the input string+ *      - a pointer to the start of the input string+ *      - the length of the input string+ *      - a pointer to the output stringbuf+ *      - the ever-helpful pool.+ * The mapping function may return a (positive) nonzero value,+ * which is interpreted * as described in step 1 above, or zero,+ * indicating that the test byte * should be ignored.  In the latter+ * case, this is generally because the * mapping function has done the+ * necessary work itself; it's free to * modify the output stringbuf and+ * adjust the pointer to the test byte * as it sees fit (within the+ * bounds of the input string).  At a minimum, * it should at least+ * increment the pointer to the test byte before * returning 0, in order+ * to avoid an infinite loop.+ */++svn_stringbuf_t *+svn_subr__escape_string (svn_stringbuf_t **outsbuf,+			 const unsigned char *instr,+			 apr_size_t len,+			 const unsigned char *isok,+			 unsigned char (*mapper) (unsigned char **,+						  const unsigned char *,+						  apr_size_t,+						  const svn_stringbuf_t *,+						  void *,+						  apr_pool_t *),+			 void *mapper_baton,+			 apr_pool_t *pool);++++/** Initializer for a basic screening matrix suitable for use with+ *  #svn_subr__escape_string to escape non-UTF-8 bytes.+ *  We provide this since "UTF-8-safety" is a common denominator for+ *  most string escaping in Subversion, so this matrix makes a good+ *  starting point for more involved schemes.+ */  +#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}++/** Given pointer @a c into a string which ends at @a e, figure out+ *  whether (*c) starts a valid UTF-8 sequence, and if so, how many bytes+ *  it includes.  Return 255 if it's not valid UTF-8.+ *  For a more detailed description of the encoding rules, see the UTF-8+ *  specification in section 3-9 of the Unicode standard 4.0 (e.g. at+ *  http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),+ *  with special attention to Table 3-6.+ *  This macro is also provided as a building block for mappers used by+ *  #svn_subr__escape_string that want to check for UTF-8-safety in+ *  addition to other tasks.+ */+#define SVN_ESCAPE_UTF8_MAPPING(c,e)					       \+  ( (c)[0] < 0x80                                  ? /* ASCII */	       \+    1 :				                     /* OK, 1 byte */	       \+    ( ( ((c)[0] > 0xc2 && (c)[0] < 0xdf)          && /* 2-byte char */	       \+	((c) + 1 <= (e))                          && /* Got 2 bytes */	       \+	((c)[1] >= 0x80 && (c)[1] <= 0xbf))        ? /* Byte 2 legal */        \+      2 :					     /* OK, 2 bytes */         \+      ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef)      && /* 3 byte char */	       \+	  ((c) + 2 <= (e))                        && /* Got 3 bytes */	       \+	  ((c)[1] >= 0x80 && (c)[1] <= 0xbf)      && /* Basic byte 2 legal */  \+	  ((c)[2] >= 0x80 && (c)[2] <= 0xbf)      && /* Basic byte 3 legal */  \+	  (!((c)[0] == 0xe0 && (c)[1] < 0xa0))    && /* 0xe0-0x[89]? illegal */\+	  (!((c)[0] == 0xed && (c)[1] > 0x9f)) )   ? /* 0xed-0x[ab]? illegal */\+	3 :                                          /* OK, 3 bytes */	       \+	( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4)    && /* 4 byte char */         \+	    ((c) + 3 <= (e))                      && /* Got 4 bytes */         \+	    ((c)[1] >= 0x80 && (c)[1] <= 0xbf)    && /* Basic byte 2 legal */  \+	    ((c)[2] >= 0x80 && (c)[2] <= 0xbf)    && /* Basic byte 3 legal */  \+	    ((c)[3] >= 0x80 && (c)[3] <= 0xbf)    && /* Basic byte 4 legal */  \+	    (!((c)[0] == 0xf0 && (c)[1] < 0x90))  && /* 0xf0-0x8? illegal */   \+	    (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\+	  4 :					     /* OK, 4 bytes */         \+	  255))))                                    /* Illegal; escape it */+++#ifdef __cplusplus+}+#endif /* __cplusplus */++#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */

### Function names can be revised to fit convention, of course. ### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of### returning the resultant stringbuf from svn_subr__escape_string both in a### parameter and as the function's return value. If the sense is thatit'll be a cause### of debugging headaches, or that it's cortrary to subversionculture to code public### functions as macros, it's easy enough to code this as a function,and to make### svn_subr__escape_string return void (or less likely svn_error_t,if it got pickier### about params.)--- subversion/libsvn_subr/utf_impl.h	(revision 14986)+++ subversion/libsvn_subr/utf_impl.h	(working copy)@@ -24,12 +24,33 @@  #include <apr_pools.h> #include "svn_types.h"+#include "svn_string.h"  #ifdef __cplusplus extern "C" { #endif /* __cplusplus */  +/** Replace any non-UTF-8 characters in @a len byte long string @a src with+ *  escaped representations, placing the result in a stringbuf pointed to by+ *  @a *dest, which will be created if necessary.  Memory is allocated from+ *  @a pool as needed. Returns a pointer to the stringbuf containing the result+ *  (identical to @a *dest, but facilitates chaining calls).+ */+svn_stringbuf_t *+svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,+				      const unsigned char *src,+				      apr_size_t len,+				      apr_pool_t *pool);++/** Replace any non-UTF-8 characters in @a len byte long string @a src with+ *  escaped representations.  Memory is allocated from @a pool as needed.+ *  Returns a pointer to the resulting string.+ */+#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \+  (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data++const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,                                               apr_pool_t *pool,                                               svn_error_t *(*convert_from_utf8)


### There're other places that could be rewritten in terms of the new escaping### functions, but I hope the two given here serve as an example of how it might### be done.### The rename to ascii_fuzzy_escape is to distinguish it from the new functions### that escape only illegal UTF-8 sequences.--- subversion/libsvn_subr/utf.c	(revision 14986)+++ subversion/libsvn_subr/utf.c	(working copy)@@ -30,6 +30,7 @@ #include "svn_pools.h" #include "svn_ctype.h" #include "svn_utf.h"+#include "escape_impl.h" #include "utf_impl.h" #include "svn_private_config.h" @@ -323,53 +324,19 @@ /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn    sequences, allocating the result in POOL. */ static const char *-fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)+ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool) {-  const char *src_orig = src, *src_end = src + len;-  apr_size_t new_len = 0;-  char *new;-  const char *new_orig;+  static unsigned char asciinonul[256];+  svn_stringbuf_t *result = NULL; -  /* First count how big a dest string we'll need. */-  while (src < src_end)-    {-      if (! svn_ctype_isascii (*src) || *src == '\0')-        new_len += 5;  /* 5 slots, for "?\XXX" */-      else-        new_len += 1;  /* one slot for the 7-bit char */+  if (!asciinonul[0]) {+    asciinonul[0] = 255;                   /* NUL's not allowed */+    memset(asciinonul + 1, 1, 127);        /* Other regular ASCII OK */+    memset(asciinonul + 128, 255, 128);    /* High half not allowed */+  }  -      src++;-    }--  /* Allocate that amount. */-  new = apr_palloc (pool, new_len + 1);--  new_orig = new;--  /* And fill it up. */-  while (src_orig < src_end)-    {-      if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')-        {-          /* This is the same format as svn_xml_fuzzy_escape uses, but that-             function escapes different characters.  Please keep in sync!-             ### If we add another fuzzy escape somewhere, we should abstract-             ### this out to a common function. */-          sprintf (new, "?\\%03u", (unsigned char) *src_orig);-          new += 5;-        }-      else-        {-          *new = *src_orig;-          new += 1;-        }--      src_orig++;-    }--  *new = '\0';--  return new_orig;+  svn_subr__escape_string(&result, src, len, asciinonul, NULL, NULL, pool);+  return result->data; }  /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result@@ -448,7 +415,7 @@         errstr = apr_psprintf           (pool, _("Can't convert string from '%s' to '%s':"),            node->frompage, node->topage);-      err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,+      err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,                                                            src_length, pool));       return svn_error_create (apr_err, err, errstr);     }@@ -564,7 +531,28 @@   return SVN_NO_ERROR; } +static unsigned char+utf8_escape_mapper (unsigned char **targ, const unsigned char *start,+		    apr_size_t len, const svn_stringbuf_t *dest,+		    void *baton, apr_pool_t *pool)+{+  const unsigned char *end = start + len;+  return SVN_ESCAPE_UTF8_MAPPING(*targ, end);+} +svn_stringbuf_t *+svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,+				      const unsigned char *src,+				      apr_size_t len,+				      apr_pool_t *pool)+{+  static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;++  return svn_subr__escape_string(dest, src, len,+				 utf8screen, utf8_escape_mapper, NULL,+				 pool);+}+ svn_error_t * svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,                            const svn_stringbuf_t *src,@@ -787,7 +775,7 @@   const char *escaped, *converted;   svn_error_t *err; -  escaped = fuzzy_escape (src, strlen (src), pool);+  escaped = ascii_fuzzy_escape (src, strlen (src), pool);    /* Okay, now we have a *new* UTF-8 string, one that's guaranteed to      contain only 7-bit bytes :-).  Recode to native... */
### With code comes testing.### Note: Contains 8-bit chars, and also uses convention that cc will treat### "foo" "bar" as "foobar".  Both can be avoided if useful forfinicky compilers.
--- subversion/tests/libsvn_subr/utf-test.c	(revision 14986)+++ subversion/tests/libsvn_subr/utf-test.c	(working copy)@@ -17,6 +17,7 @@  */  #include "../svn_test.h"+#include "../../include/svn_utf.h" #include "../../libsvn_subr/utf_impl.h"  /* Random number seed.  Yes, it's global, just pretend you can't see it. */@@ -222,6 +223,84 @@   return SVN_NO_ERROR; } +static svn_error_t *+utf_escape (const char **msg,+	    svn_boolean_t msg_only,+	    svn_test_opts_t *opts,+	    apr_pool_t *pool)+{+  char in[] = { 'A', 'S', 'C', 'I', 'I',     /* All printable */+		'R', 'E', 'T', '\n', 'N',    /* Newline */+		'B', 'E', 'L', 0x07, '!',    /* Control char */+		0xd2, 0xa6, 'O', 'K', '2',   /* 2-byte char, valid */+		0xc0, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 1st */+		0x82, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 2nd */+		0xe4, 0x87, 0xa0, 'O', 'K',  /* 3-byte char, valid */+		0xe2, 0xff, 0xba, 'N', 'O',  /*3-byte char, invalid 2nd */+		0xe0, 0x87, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */+		0xed, 0xa5, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */+		0xe4, 0x87, 0xc0, 'N', 'O',  /* 3-byte char, invalid 3rd */+		0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */+		0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */+		0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */+		0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */+		0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */+		0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */+                0x00 };+  const unsigned char *legalresult =+    "ASCIIRET\nNBEL!$-1(c)ÊOK2?\\192?\\195NO2?\\130?\\195NO2"-A+    "3$-3ıΩ0‰á†1OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"-A+    "?\\228?\\135?\\192NO3$-3ıΩ0Úá†µ1Y?\\242$-1(c)‡?\\181Y?\\240?\\135?\\160"-A+    "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135ıN?\\242?\\135?\\160"+    "?\\213N";+  const unsigned char *asciiresult =+    "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"+    "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"+    "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"+    "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"+    "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"+    "?\\242?\\135?\\160?\\213N";+  const unsigned char *asciified;+  apr_size_t legalresult_len = 213;  /* == strlen(legalresult) iff no NULs */+  int i = 0;+  svn_stringbuf_t *escaped = NULL;++  *msg = "test utf string escaping";++  if (msg_only)+    return SVN_NO_ERROR;++  if (svn_utf__stringbuf_escape_utf8_fuzzy+      (&escaped, in, sizeof in - 1, pool) != escaped)+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;+  if (escaped->len != legalresult_len)+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;+  if (memcmp(escaped->data, legalresult, legalresult_len))+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;+  if (memcmp(escaped->data, legalresult, legalresult_len))+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;++  asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);+  if (strlen(asciified) != strlen(asciiresult))+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;+  if (strcmp(asciified, asciiresult))+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;++  return SVN_NO_ERROR;+}+  /* The test table.  */ @@ -230,5 +309,6 @@     SVN_TEST_NULL,     SVN_TEST_PASS (utf_validate),     SVN_TEST_PASS (utf_validate2),+    SVN_TEST_PASS (utf_escape),     SVN_TEST_NULL   };


### The original point of this thread.### This patch will apply with an offset, since I've cut out sections which### reimplement XML escaping in terms of the svn_subr__escape_string.--- subversion/libsvn_subr/xml.c	(revision 14986)+++ subversion/libsvn_subr/xml.c	(working copy)@@ -395,11 +413,22 @@   /* If expat choked internally, return its error. */   if (! success)     {+      svn_stringbuf_t *sanitized;+      unsigned char *end;+      +      svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, buf,+					   (len > 240 ? 240 : len),+					   svn_parser->pool);+      end = sanitized->data ++	    (sanitized->len > 240 ? 240 : sanitized->len);+      while (*end > 0x80 && *end < 0xc0 &&+	     (char *) end > sanitized->data) end--;       err = svn_error_createf         (SVN_ERR_XML_MALFORMED, NULL, -         _("Malformed XML: %s at line %d"),+         _("Malformed XML: %s at line %d; XML starts:\n%.*s"),          XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),-         XML_GetCurrentLineNumber (svn_parser->parser));+         XML_GetCurrentLineNumber (svn_parser->parser),+	 (char *) end - sanitized->data + 1, sanitized->data);              /* Kill all parsers and return the expat error */       svn_xml_free_parser (svn_parser);

### Finally, be kind to the translators.--- subversion/po/pt_BR.po	(revision 14986)+++ subversion/po/pt_BR.po	(working copy)@@ -6006,8 +6006,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "XML mal formado: %s na linha %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "XML mal formado: %s na linha %d; XML comeÁa:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/es.po	(revision 14986)+++ subversion/po/es.po	(working copy)@@ -6102,8 +6102,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "XML malformado: %s en la lÌnea %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "XML malformado: %s en la lÌnea %d; XML comienza:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/de.po	(revision 14986)+++ subversion/po/de.po	(working copy)@@ -6090,8 +6090,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "Fehlerhaftes XML: %s in Zeile %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "Fehlerhaftes XML: %s in Zeile %d; XML beginnt:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/sv.po	(revision 14986)+++ subversion/po/sv.po	(working copy)@@ -6005,8 +6005,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "Felaktig XML: %s pÂ rad %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "Felaktig XML: %s pÂ rad %d; XML starta:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/ko.po	(revision 14986)+++ subversion/po/ko.po	(working copy)@@ -5906,8 +5906,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "3$-3ıΩ0Ïûò13ıΩ0Î™ª13ıΩ0Îêú1 XML: %s (3ıΩ0Ï§Ñ13ıΩ0Î≤à13ıΩ0Ìò∏1 %d)"-A+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "3$-3ıΩ0Ïûò13ıΩ0Î™ª13ıΩ0Îêú1 XML: %s (3ıΩ0Ï§Ñ13ıΩ0Î≤à13ıΩ0Ìò∏1 %d); XML:\n%.240s"-A  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/ja.po	(revision 14986)+++ subversion/po/ja.po	(working copy)@@ -6463,8 +6463,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "3$-3ıΩ0Áï∞13ıΩ0Â∏∏1$-2æ  XML æ«æπ: %s (3$-3ıΩ0Ë°å1 %d)"-A+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "3$-3ıΩ0Áï∞13ıΩ0Â∏∏1$-2æ  XML æ«æπ: %s(3$-3ıΩ0Ë°å1 %d); XML:\n%.240s"-A  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/pl.po	(revision 14986)+++ subversion/po/pl.po	(working copy)@@ -6103,8 +6103,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "Uszkodzony XML: %s w linii %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "Uszkodzony XML: %s w linii %d; XML wersja:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/zh_TW.po	(revision 14986)+++ subversion/po/zh_TW.po	(working copy)@@ -5896,8 +5896,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "3$-3ıΩ0Êúâ13ıΩ0Áº∫13ıΩ0Èô∑1 XML: %s3ıΩ0Êñº13ıΩ0Á¨¨1 %d 3ıΩ0Âàó1"-A+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "3$-3ıΩ0Êúâ13ıΩ0Áº∫13ıΩ0Èô∑1 XML: %s3ıΩ0Êñº13ıΩ0Á¨¨1 %d 3ıΩ0Âàó1; XML:\n%.240s"-A  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/nb.po	(revision 14986)+++ subversion/po/nb.po	(working copy)@@ -5995,8 +5995,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "Misdannet XML: %s i linje %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "Misdannet XML: %s i linje %d; XML starter:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/zh_CN.po	(revision 14986)+++ subversion/po/zh_CN.po	(working copy)@@ -5955,8 +5955,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "3$-3ıΩ0Áï∏13ıΩ0ÂΩ¢13ıΩ0ÁöÑ1XMLÚ˙%s3ıΩ0Âú(r)13ıΩ0Á¨¨1 %d 3ıΩ0Ë°å1"-A+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "3$-3ıΩ0Áï∏13ıΩ0ÂΩ¢13ıΩ0ÁöÑ1XMLÚ˙%s3ıΩ0Âú(r)13ıΩ0Á¨¨1 %d 3ıΩ0Ë°å1; XML:\n%.240s"-A  #: libsvn_wc/adm_crawler.c:380 #, c-format
### End of patch ###
-- Regards,Charles BaileyLists: bailey _dot_ charles _at_ gmail _dot_ comOther: bailey _at_ newman _dot_ upenn _dot_ edu