You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Charles Bailey <ba...@gmail.com> on 2005/02/23 19:03:14 UTC

[PATCH] Include offending XML in "Malformed XML" error message

Attached is a patchlet that, when expat fails to parse an hunk of XML,
appends at least part of the offending hunk to the error message.  It
provides a better jumping-off point for further debugging than the
current message, especially since the line number often isn't all that
meaningful.  It's not intended to flag the error specifically, or even
to insure it's actually in the output fragment, since it'd be too hard
to reliably identify the problem.  (For that matter, I can envision
cases where the error is actually in a previously parsed hunk of XML,
but doesn't become apparent until a tag mismatches in the current hunk
or the like.) By way of example, the current error message:

   svn: Malformed XML: not well-formed (invalid token) at line 5810

becomes

    svn: Malformed XML: not well-formed (invalid token) at line 3; XML starts:
    >>>>>
    <mv
       dest=".svn/props/.ooo^H^H^Htestff^Hile.svn-work"
       name=".svn/tmp/props/.ooo^H^H^Htestff^Hile.svn-work"/>
    <readonly
       name=".svn/props/.ooo^H^H^Htestff^Hile.svn-work"/>
    <mv
       dest=".svn/prop-base/.ooo^H^H^Htestff^Hile.svn-base"
      name=".svn/tmp/
    <<<<<

(I've expanded the BS to "^H" for legibility in this message; the
actual error message contains the original text.)

A few comments:

 - The error message now extends over more than one line.  I understand that svn
    generally avoids this, but I think it's necessary to supply a
large enough fragment
    that the user can make a well-educated guess about the location of
the problem.
    The presence of \ns in svn-generated XML would also make it tough
to guarantee
    that any meaningful fragment stayed on one line.  Finally, trying
to flatten the XML to
    fit on a single logical line would go well past 80 cols, which I
think would be more of a
    problem than multiple lines that fit into an 80 column display.

 - I chose a maximum fragment length of 240 chars arbitrarily, as long
enough to give the
   reader some context and short enough to avoid a huge trail of text.
 It seemed like a
   reasonable balance to me.

 - I placed delimiters before and after the XML with the thought that
it might help in some
   situations where the XML itself started with odd characters; again,
this is relatively
   arbitrary.  It'd arguably be of some help for parsing the error
(the delimiters aren't
   legal XML), but as this is a fatal and (one hopes) uncommon error,
I don't think many
   people will be writing tools to anticipate it.  If the consensus is
that they just add noise,
   they're easy enough to drop.

Patched HEAD builds clean here, of course, and passes all but three of
the regression tests, none of which appear related to this patch.

'Nuff said about a minor patch for an infrequent occurrence.  Chalk it
up to it being my first foray onto this list.

Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

On Mon, 28 Feb 2005, Charles Bailey wrote:

> On Mon, 28 Feb 2005 22:39:13 +0100 (CET), Peter N. Lundblad
> <pe...@famlundblad.se> wrote:
> > On Mon, 28 Feb 2005, Charles Bailey wrote:
> >
> > YOu misunerstood me. I didn't mean we need to escape in general (we
> > already does in the cmdline output routines to be safe). User input is
> > converted from the native encoding to UTF8 rather early, so normally we
> > rely on strings being valid UTF8. This is a special case, since, if I
> > understand correctly, it is raw XML from the parser. We rely on the parser
> > doing the recoding to UTF( for us, but since this is an error situation,
> > the data might not be valid UTF8. That's why we need this ugly escaping in
> > this case (and when reporting recoding errors in utf.c).
>
> Fair enough.  This is where my brief experience with svn limits me.
> Depending on how "UTF-8-safe" svn needs to be, my point may still
> apply to any data read from a file, however.   I think that would
> include not only XML from admin files, but property names and values,
> and any fragments from base or working revisions.  This protects
> against direct edits to the files, as well as errors elsewhere in svn
> (as was the case with the offending XML here).
>
Yes, you have a good point in that we don't always make sure our input is
valid UTF8 (especially from the network). this probably needs to be fixed
someday. But it should be done as input validation. Internally, we must be
able to assume that strings are valid UTF8.

> Mind you, I'm not advocating this -- I think it's a lot of work to
> guarantee that a non-UTF-8 character is never presented to an external
> library or to the user.  I'll work on the XML parse error as a special
> case, and leave the broader policy decisions for a later time.

Good. Note that this really is input validation. that's why we need to be
careful.

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On Mon, 28 Feb 2005 22:39:13 +0100 (CET), Peter N. Lundblad
<pe...@famlundblad.se> wrote:
> On Mon, 28 Feb 2005, Charles Bailey wrote:
> 
> > As a first pass, the '%s' token appears in 410 error messages in HEAD.
> >  On rapid inspection, about half appear to take internal strings,
> > which I would think are more likely to be valid UTF-8 (though the
> > error motivating my original patch was an internal string, so that
> > should be taken with the proverbial grain of salt).  The others are
> > most often user input, so may or may not be valid UTF-8 depending on
> > the user's locale settings; I haven't traced code to see how often
> > they're already escaped.  How much of this needs coverage, though,
> > looks to me like a question for the longer term and for more
> > experienced svn hands than I.
> >
> YOu misunerstood me. I didn't mean we need to escape in general (we
> already does in the cmdline output routines to be safe). User input is
> converted from the native encoding to UTF8 rather early, so normally we
> rely on strings being valid UTF8. This is a special case, since, if I
> understand correctly, it is raw XML from the parser. We rely on the parser
> doing the recoding to UTF( for us, but since this is an error situation,
> the data might not be valid UTF8. That's why we need this ugly escaping in
> this case (and when reporting recoding errors in utf.c).

Fair enough.  This is where my brief experience with svn limits me. 
Depending on how "UTF-8-safe" svn needs to be, my point may still
apply to any data read from a file, however.   I think that would
include not only XML from admin files, but property names and values,
and any fragments from base or working revisions.  This protects
against direct edits to the files, as well as errors elsewhere in svn
(as was the case with the offending XML here).

Mind you, I'm not advocating this -- I think it's a lot of work to
guarantee that a non-UTF-8 character is never presented to an external
library or to the user.  I'll work on the XML parse error as a special
case, and leave the broader policy decisions for a later time.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

On Mon, 28 Feb 2005, Charles Bailey wrote:

> On Mon, 28 Feb 2005 08:11:33 +0100 (CET), Peter N. Lundblad
> <pe...@famlundblad.se> wrote:
> > > If it is a general policy to convert to UTF-8, should I code this as a
> > > separate function, rather than putting the logic into parse_xml?
> > >
> > You can put it in a separate function. Keep it internal to the file,
> > though, until we see another use case for it. We can export it if that
> > happens.
>
> As a first pass, the '%s' token appears in 410 error messages in HEAD.
>  On rapid inspection, about half appear to take internal strings,
> which I would think are more likely to be valid UTF-8 (though the
> error motivating my original patch was an internal string, so that
> should be taken with the proverbial grain of salt).  The others are
> most often user input, so may or may not be valid UTF-8 depending on
> the user's locale settings; I haven't traced code to see how often
> they're already escaped.  How much of this needs coverage, though,
> looks to me like a question for the longer term and for more
> experienced svn hands than I.
>
YOu misunerstood me. I didn't mean we need to escape in general (we
already does in the cmdline output routines to be safe). User input is
converted from the native encoding to UTF8 rather early, so normally we
rely on strings being valid UTF8. This is a special case, since, if I
understand correctly, it is raw XML from the parser. We rely on the parser
doing the recoding to UTF( for us, but since this is an error situation,
the data might not be valid UTF8. That's why we need this ugly escaping in
this case (and when reporting recoding errors in utf.c).

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On Mon, 28 Feb 2005 08:11:33 +0100 (CET), Peter N. Lundblad
<pe...@famlundblad.se> wrote:
> On Sun, 27 Feb 2005, Charles Bailey wrote:
> 
> > On Sun, 27 Feb 2005 22:25:11 +0100 (CET), Peter N. Lundblad
> > <pe...@famlundblad.se> wrote:
> [snip discussion about ensuring that the error stirng is valid UTF8]
> 
> > This looks like a SMOP, but I'm not sure what the real benefit is.
> > Does Subversion make the guarantee that all of its output will be
> > UTF-8?  Unless that's a major goal, I think there's something to be
> > said for inserting the offending XML "as is" into the error message.
> 
> Not its output, but the strings internally. With few exceptions our APIs
> work with UTF8 internally. The problem with adding "raw data" is that the
> error message will be invalid UTF8 and you will have recoding errors
> later. Our own error output routines will escape the whole string then,
> but that might not other libraries do. So, yes, keeping our strings valid
> UTF8 is a goal.

OK.  I'll give it a go when I get a block of free time.

> > If it is a general policy to convert to UTF-8, should I code this as a
> > separate function, rather than putting the logic into parse_xml?
> >
> You can put it in a separate function. Keep it internal to the file,
> though, until we see another use case for it. We can export it if that
> happens.

As a first pass, the '%s' token appears in 410 error messages in HEAD.
 On rapid inspection, about half appear to take internal strings,
which I would think are more likely to be valid UTF-8 (though the
error motivating my original patch was an internal string, so that
should be taken with the proverbial grain of salt).  The others are
most often user input, so may or may not be valid UTF-8 depending on
the user's locale settings; I haven't traced code to see how often
they're already escaped.  How much of this needs coverage, though,
looks to me like a question for the longer term and for more
experienced svn hands than I.

--
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Michael W Thelen <mi...@pietdepsi.com>.

kfogel@collab.net wrote:
> Charles Bailey <ba...@gmail.com> writes:
>>>So, after another too-long hiatus, here's a patch which implements a
>>>"common" string escaping function , uses it for UTF-8 escaping, and
>>>uses that to sanitize the offending XML, which is then output in the
>>>error message that Jack built^W^Wstarted this thread.
>>
>>I'm new enough that I'm not certain of list etiquette at this point:
>>the patch has sat without comment for about two weeks.  Does this
>>indicate that it should be allowed to expire quietly, saving someone
>>the chore of saying, "I wouldn't put this twisted mess into my
>>competitor's code, much less my own!"?  Should it be dumped into an
>>issue that someone might review someday when they've had a free time
>>infusion?  Should it just bide its time here?
> 
> Usually our redoubtable patch manager makes an issue for it, however,
> he may have forgotten this time, in which case your reminder is a good
> idea...

Oh ye of little faith! :-)  Actually, the patch is on my list, and if no
one responds to this within a day or so, I'll file it as a new issue.
Thank you, Charles, for following up!

-- 
Michael W Thelen
It is a mistake to think you can solve any major problems just with
potatoes.       -- Douglas Adams

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Charles Bailey <ba...@gmail.com> writes:
> > So, after another too-long hiatus, here's a patch which implements a
> > "common" string escaping function , uses it for UTF-8 escaping, and
> > uses that to sanitize the offending XML, which is then output in the
> > error message that Jack built^W^Wstarted this thread.
> 
> I'm new enough that I'm not certain of list etiquette at this point:
> the patch has sat without comment for about two weeks.  Does this
> indicate that it should be allowed to expire quietly, saving someone
> the chore of saying, "I wouldn't put this twisted mess into my
> competitor's code, much less my own!"?  Should it be dumped into an
> issue that someone might review someday when they've had a free time
> infusion?  Should it just bide its time here?

Usually our redoubtable patch manager makes an issue for it, however,
he may have forgotten this time, in which case your reminder is a good
idea...

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 6/6/05, Charles Bailey <ba...@gmail.com> wrote:
> 
> So, after another too-long hiatus, here's a patch which implements a
> "common" string escaping function , uses it for UTF-8 escaping, and
> uses that to sanitize the offending XML, which is then output in the
> error message that Jack built^W^Wstarted this thread.

I'm new enough that I'm not certain of list etiquette at this point:
the patch has sat without comment for about two weeks.  Does this
indicate that it should be allowed to expire quietly, saving someone
the chore of saying, "I wouldn't put this twisted mess into my
competitor's code, much less my own!"?  Should it be dumped into an
issue that someone might review someday when they've had a free time
infusion?  Should it just bide its time here?

No hurry in any event -- I've certainly not had all that much time to
write it, and certainly can't expect someone to have all that much
time to read it.  I just want to make sure I shouldn't be doing
something at this point to movit along/out of the way.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Michael W Thelen <mi...@pietdepsi.com>.

Charles Bailey wrote:
> On Wed, 23 Feb 2005, Charles Bailey wrote:> > Attached is a patchlet that...

It looks like something bad happened to your message, as if something
stripped out all the newlines.  Would you mind resending the patch?

-- 
Michael W Thelen
It is a mistake to think you can solve any major problems just with
potatoes.       -- Douglas Adams

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Michael W Thelen <mi...@pietdepsi.com>.

Charles Bailey tried to respond to my message, but for some reason it
hasn't shown up on the list.  He asked me to forward it, so here it is:

Forwarded message:
--On Jun 6, 2005 12:34 PM, Michael W Thelen <mi...@pietdepsi.com> wrote:

>
> It looks like something bad happened to your message, as if something
> stripped out all the newlines.  Would you mind resending the patch?

Sorry; I think I've been Gmailed again; let me try from Mulberry.
Here's the leading text:

On Wed, 23 Feb 2005, Charles Bailey wrote:

>
> Attached is a patchlet that, when expat fails to parse an hunk of XML,
> appends at least part of the offending hunk to the error message.  It

which led to an exchange regarding the need to make strings
UTF-8-safe.  After too long a haitus, I posted for comment:

On 4/21/05, Charles Bailey <ba...@gmail.com> wrote:

>
> Well, after umpteen interrupts from the rest of life,I finally got a
> few hours to look at this again.    In checking was was already
> available, I found a handful of "string escaping" function in various
> places which perform similar tasks (at least one with the comment
> "this should share code with other_string_escaping_routine()").  Since
> I'd have to add ya such function, I thought I'd try to abstract it a
> bit, with the hope that similar routines could use a common base.
> I've appended a short proposal at the bottom of this messages,
> containing a common "engine" and an example implementation for
> creating a UTF-8-safe version of an arbitrary string.

Julian Foad was kind enough to point out a dumb thinko, but no other
comments were forthcoming, possibly because the core developers were
busy with pre-1.2 cleanup.

So, after another too-long hiatus, here's a patch which implements a
"common" string escaping function , uses it for UTF-8 escaping, and
uses that to sanitize the offending XML, which is then output in the
error message that Jack built^W^Wstarted this thread.

I've interspersed my comments in the code, since there's imho zero
chance that this version of the patch will be
substantially/stylistically suitable for committing.  They're far from
exhaustive, but this message is long enough already.

Conceptual "Log message":
[[[
Add function that escapes illegal UTF-8 characters, along the way
refactoring core of
string-escaping routines, and insure that illegal XML error message
outputs legal UTF-8.
### Probably best applied as several patches, but collected here for review.

* subversion/libsvn_subr/escape.c:
  New file
  (svn_subr__escape_string): Final-common-path function for escaping
strings.

* subversion/libsvn_subr/escape_impl.h:
  New file, declaring svn_subr__escape_string and convenience macros.
  ### Logical candidate for consolidation with utf_impl.h, perhaps as
subr_impl.h

* subversion/libsvn_subr/utf.c:
  (fuzzy_escape): Renamed to ascii_fuzzy_escape, and rewritten to use
   svn_subr__escape_string.
  (svn_utf__stringbuf_escape_utf8_fuzzy): New function which escapes illegal
   UTF-8 in a string, returning the escaped string in a stringbuf.
  (utf8_escape_mapper): Helper function for
svn_utf__stringbuf_escape_utf8_fuzzy.

* subversion/libsvn_subr/utf_impl.h:
  Add prototype for svn_utf__stringbuf_escape_utf8_fuzzy.
  (svn_utf__cstring_escape_utf8_fuzzy):  Macro implementing variant
of above that
   returns NUL-terminated string.

* subversion/libsvn_subr/xml.c:
  (svn_xml_parse): If parse fails, print (sanitized) (part of) offending XML
   with error message.

* subversion/tests/libsvn_subr/utf-test.c:
  (utf_escape): New function testing UTF-8 string-escaping functions.

* subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,
 subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,
 subversion/po/pt_BR.po, subversion/po/sv.po,
 subversion/po/zh_CN.po, subversion/po/zh_TW.po:
 Courtesy to translators, since I've changed a localized string.
]]]

The patch, with interspersed comments, is appended as an attachment.

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 23 Jun 2005 09:43:08 -0500, kfogel@collab.net <kf...@collab.net> wrote:
> Daniel Berlin <db...@dberlin.org> writes:
> > You've asked two different questions.
> > The answer to the first is no, the second is yes.
> > :)
> > If you ask "has he assigned copyright to CollabNet", the answer is no.
> > Assignment of copyright is a "term of art", and has very specific meaning.
> >
> > If you ask "would this make any difference in a court of law given
> > what he has done", the answer is probably also no, because he'd
> > probably be estopped from asserting copyright infringement claims.
> >
> > However, not having assigned the code does mean he doesn't have to
> > worry about how CollabNet is going to license it, because he can still
> > do what he wants.
> 
> Thanks, perfectly clear.

Likewise, and fine with me.  What caught my attention the first time
out was that the stock statement in the source files says "Copyright
(c) 2000-2004 CollabNet.", which it wouldn't actually be without my
assigning copyright.   (I'd guess one could argue that by placing that
notice in text I submitted I was assigning copyright, but that'd be
hard to establish formally.)  I'm happy to use the boilerplate
(assuming, of course, that the patch is worth adding :-)) as current
convention, or to simply state that a perpetual license is granted to
CollabNet, or any user, to use the code with Subversion.

Even apart copyright assignment, I agree (as a non-attorney) with Mr.
Berlin that it'd be hard for someone who submitted a patch to
Subversion through the current review process to argue that there was
no intent for it to be incorporated into Subversion under its current
license for distribution, which is what I think it'd take to assert
copyright infringement based on inclusion in Subversion.  There might
be more of an argument over its inclusion in dervative works (since
the Subversion COPYING license isn't really clear about revocation and
severability), and moreso about inclusion in broadly unrelated works
by cutting-and-pasting from Subversion's sources, but I'd guess these
aren't things about which CollabNet's too worried.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Daniel Berlin <db...@dberlin.org> writes:
> You've asked two different questions.
> The answer to the first is no, the second is yes.
> :)
> If you ask "has he assigned copyright to CollabNet", the answer is no.
> Assignment of copyright is a "term of art", and has very specific meaning.
> 
> If you ask "would this make any difference in a court of law given
> what he has done", the answer is probably also no, because he'd
> probably be estopped from asserting copyright infringement claims.
> 
> However, not having assigned the code does mean he doesn't have to
> worry about how CollabNet is going to license it, because he can still
> do what he wants.

Thanks, perfectly clear.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Daniel Berlin <db...@dberlin.org>.

On Thu, 22 Jun 2005 kfogel@collab.net wrote:

> Daniel Berlin <db...@dberlin.org> writes:
>>> Does
>>> that suffice to Collabnet as assignment of copyright?
>>
>> No, at least in the US, copyright can only be assigned in writing.
>
> Wait a minute: I thought in the past you'd said that email and bona
> fide belief of author's intent was sufficient to assign copyright?  Or
> maybe you meant, sufficient for CollabNet to proceed as if copyright
> had been assigned?

You've asked two different questions.
The answer to the first is no, the second is yes.
:)
If you ask "has he assigned copyright to CollabNet", the answer is no.
Assignment of copyright is a "term of art", and has very specific meaning.

If you ask "would this make any difference in a court of law given what he 
has done", the answer is probably also no, because he'd probably be 
estopped from asserting copyright infringement claims.

However, not having assigned the code does mean he doesn't have to 
worry about how CollabNet is going to license it, because he can still do 
what he wants.

>
> -K
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Brian Behlendorf <br...@collab.net>.

On Wed, 22 Jun 2005 kfogel@collab.net wrote:
> Daniel Berlin <db...@dberlin.org> writes:
>>> Does
>>> that suffice to Collabnet as assignment of copyright?
>>
>> No, at least in the US, copyright can only be assigned in writing.
>
> Wait a minute: I thought in the past you'd said that email and bona
> fide belief of author's intent was sufficient to assign copyright?  Or
> maybe you meant, sufficient for CollabNet to proceed as if copyright
> had been assigned?

We're not worrying about assigning copyright; for the same reasons Apache 
does not.  All we legally need is the right to sublicense the code under 
the relevant license of the whole code base, in this case the Apache 1.1 
license.  We're considering submission of a patch or idea to this list or 
the bug databases, etc., as an expression of such a license.  We've known 
that we can and we should get a more formal CLA like Apache has, to 
protect against someone posting something here and then claiming later 
that we don't have the right to integrate what they posted into 
Subversion.

 	Brian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Daniel Berlin <db...@dberlin.org> writes:
> > Does
> > that suffice to Collabnet as assignment of copyright? 
> 
> No, at least in the US, copyright can only be assigned in writing.

Wait a minute: I thought in the past you'd said that email and bona
fide belief of author's intent was sufficient to assign copyright?  Or
maybe you meant, sufficient for CollabNet to proceed as if copyright
had been assigned?

-K

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Daniel Berlin <db...@dberlin.org>.

> Right.  I'd noticed that, but since I don't know Collabnet's policy
> here I didn't want
> to presume.  Is it acceptable to clone the copyright from an existing
> file?  

Yes, but the notice still needs to be made proper in terms of dates.

> Does
> that suffice to Collabnet as assignment of copyright? 

No, at least in the US, copyright can only be assigned in writing.

>  Do I have a pledge from
> Collabnet that any code whose copyright is assigned in this way will
> remain freely
> available in perpetuity?

You haven't assigned copyright, so you can still do what you want with
your code.





---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 22 Jun 2005 15:14:51 -0500, kfogel@collab.net <kf...@collab.net> wrote:
> Charles Bailey <ba...@gmail.com> writes:
> > I've no problems with that.  I would have problems with CollabNet
> > asserting next Tuesday that Subversion was (t)henceforth a proprietary
> > product, and any further use would require payment of a license fee.
> > I assume that's not the case (and it would arguably be hard to do in
> > any event), but thought it prudent to ask, since I'm new to this
> > process.
> 
> Just to make absolutely clear here, in case others are worried by
> these questions:
> 
> CollabNet cannot do that.  The "latest free copy of the code" would
> then become the starting point for a new free fork, and every
> developer here would simply follow that fork.

Thanks; that sounds great.  I'm not a lawyer (really!) nor am I trying
to create a problem where none exists.  I did want to make sure I
understood the standards of the Subversion community -- and don't want
to deal with grumpy folks at the university that employs me if it
turned out I was writing code (solely) for someone else's commercial
venture.

We now return to our regularly scheduled discussion of string escaping.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Charles Bailey <ba...@gmail.com> writes:
> I've no problems with that.  I would have problems with CollabNet
> asserting next Tuesday that Subversion was (t)henceforth a proprietary
> product, and any further use would require payment of a license fee. 
> I assume that's not the case (and it would arguably be hard to do in
> any event), but thought it prudent to ask, since I'm new to this
> process.

Just to make absolutely clear here, in case others are worried by
these questions:

CollabNet cannot do that.  The "latest free copy of the code" would
then become the starting point for a new free fork, and every
developer here would simply follow that fork.

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 6/21/05, Philip Martin <ph...@codematters.co.uk> wrote:
> Charles Bailey <ba...@gmail.com> writes:
> 
> > On 6/20/05, Philip Martin <ph...@codematters.co.uk> wrote:
> >> Charles Bailey <ba...@gmail.com> writes:
> >>
> >> > --- /dev/null Mon Jun  6 11:06:27 2005
> >> > +++ subversion/libsvn_subr/escape.c   Fri Jun  3 19:16:09 2005
> >>
> >> New files need a copyright notice.
> >
> > Right.  I'd noticed that, but since I don't know Collabnet's policy
> > here I didn't want
> > to presume.  Is it acceptable to clone the copyright from an existing
> > file?  Does
> > that suffice to Collabnet as assignment of copyright?
> 
> I think so.

OK, will do.

> >  Do I have a pledge from
> > Collabnet that any code whose copyright is assigned in this way will
> > remain freely
> > available in perpetuity?  (By "freely available", I mean terms
> > substantially identical
> > to Subversion's current licensing; I'm not trying to  start a skirmish
> > over software
> > licensing.)
> 
> I can't speak for CollabNet, but I'll point out that the Subversion
> licence is not a copyleft licence and so anyone, including CollabNet,
> can make a proprietary version.

I've no problems with that.  I would have problems with CollabNet
asserting next Tuesday that Subversion was (t)henceforth a proprietary
product, and any further use would require payment of a license fee. 
I assume that's not the case (and it would arguably be hard to do in
any event), but thought it prudent to ask, since I'm new to this
process.

> >> Those big macros also make me uneasy.
> >
> > Is the concern that it'll choke come compilers, or that they're hard
> > to maintain?
> 
> I don't think compilers will have a problem, I think it makes it
> harder for humans.

OK.  Paradoxically, I'm trying to make it easier for humans, but
letting them just use a "function call" to perform a common task,
without incurring the runtime overhead of the function call.

> > This may just be a stylistic issue: I tend to favor macros for
> > repeated tasks, on the theory that as long as one debugs it carefully,
> > the benefit of inlining is worth the effort of coding.
> 
> The macros are only used once in your patch...

Yep.  I noted in the comments that I'd included them as macros because
they seemed to be fairly common operations that might come up in other
contexts.  If that's not the case, I can just inline them and other
folks can copy-paste or reimplement them as needed.

> >> > +  if (!asciinonul[0]) {
> >> > +    asciinonul[0] = 255;                   /* NUL's not allowed */
> >>
> >> That doesn't look threadsafe.
> >
> > It is safe, though it doesn't prevent duplication.
> 
> No, it's not safe.  Consider:
> 
> - asciinonul is static initialised to zero
> - thread 1 tests asciinonul[0] and gets zero
> - thread 1 sets asciinonul[0] to 255, the rest of asciinonul is still zero
> - thread 2 tests asciinonul[0] and gets 255
> - thread 2 uses the rest of asciinonul although it is still zero, oops!

Good point; I should've spotted that.  I'd been sitting on the fence
between this approach (runtime initialization, slower but taking less
space in the source) and static initialization, but looks like the
latter is definitely the way to go here.

> >> Overall this patch makes me uneasy, I'd prefer a patch that builds on
> >> our existing ctype and/or utf-8 code.  However as I haven't tried to
> >> implement such a patch I don't know whether our existing code is
> >> a suitable starting point.
> >
> > That's certainly possible.  I'd actually started out that way, and
> > switched to the above when it seemed I was about to add ya parallel
> > routine to escape a string in a slightly different way.
> >
> > Thanks for the review.  If you or one of the other core developers
> > could answer some of the "philosophic" questions above, I can revise
> > the patch appropriately.
> 
> Well, that's a bit tricky.  Your patch looked overly complicated to
> me, but then I haven't tried to write an alternative so I don't know
> whether there is a better way.

My sense is that there is a simpler approach to the specific task of
making a string "UTF8-safe", but when I also consider related tasks
like escaping XML (single bytes escaped to entities like & => &amp;)
or potentially converting multibyte codes to alternative multibyte
codes, I can't whittle it down much further.

I think this approach is a net win because of reduction in parallel
code and simplification of maintenance, but that's an opinion rather
than verifiable fact.  It may produce some small savings in space,
though that's arguable given the use of 256-byte screening arrays. 
It's amost certainly not faster than a set of optimized custom
routines, though it's probably not much slower, either; the specifics
will likely depend on compiler and environment.  Whether, in the end,
the balance favors this "common pathway" approach or custom functions
for each task is the crucial question here, I guess.

On a practical level, how should I proceed?  Is it worth my making the
minor fixes you suggest and posting another patch for further
discussion, or would it be better to let it drop and see whether any
other opinions are forthcoming?

Thanks.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by kf...@collab.net.

Charles Bailey wrote:
>  I'd hoped to encourage review, rather than put it off.

Clearly, you succeeded, given Philip's in-depth response :-)!

Philip Martin <ph...@codematters.co.uk> writes:
> Charles Bailey <ba...@gmail.com> writes:
> > to presume.  Is it acceptable to clone the copyright from an existing
> > file?  Does
> > that suffice to Collabnet as assignment of copyright?
> 
> I think so.

Yes, it is.  That's how we've done it in the past.

> >  Do I have a pledge from
> > Collabnet that any code whose copyright is assigned in this way will
> > remain freely
> > available in perpetuity?  (By "freely available", I mean terms
> > substantially identical
> > to Subversion's current licensing; I'm not trying to  start a skirmish
> > over software
> > licensing.)
> 
> I can't speak for CollabNet, but I'll point out that the Subversion
> licence is not a copyleft licence and so anyone, including CollabNet,
> can make a proprietary version.

Neither CollabNet nor anyone else can ever remove this code from the
open-source world, once it's under this license.  Someone could take a
*copy* of the code and put it into a closed project, but that wouldn't
affect the continued freedom of the original from which the copy came.

So I think you don't have anything to worry about.

Best,
-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Philip Martin <ph...@codematters.co.uk>.

Charles Bailey <ba...@gmail.com> writes:

> On 6/20/05, Philip Martin <ph...@codematters.co.uk> wrote:
>> Charles Bailey <ba...@gmail.com> writes:
>> 
>> > --- /dev/null Mon Jun  6 11:06:27 2005
>> > +++ subversion/libsvn_subr/escape.c   Fri Jun  3 19:16:09 2005
>> 
>> New files need a copyright notice.
>
> Right.  I'd noticed that, but since I don't know Collabnet's policy
> here I didn't want
> to presume.  Is it acceptable to clone the copyright from an existing
> file?  Does
> that suffice to Collabnet as assignment of copyright?

I think so.

>  Do I have a pledge from
> Collabnet that any code whose copyright is assigned in this way will
> remain freely
> available in perpetuity?  (By "freely available", I mean terms
> substantially identical
> to Subversion's current licensing; I'm not trying to  start a skirmish
> over software
> licensing.)

I can't speak for CollabNet, but I'll point out that the Subversion
licence is not a copyleft licence and so anyone, including CollabNet,
can make a proprietary version.

[...]

>> The patch is mangled here, and in several other places.
>
> Right.  When Michael Thelen pointed that out, I resent it as an attachment;
> I hope that came through in better shape.

I had one other copy in a forwarded message, it was also mangled.

[...]

>> Those big macros also make me uneasy.
>
> Is the concern that it'll choke come compilers, or that they're hard
> to maintain?

I don't think compilers will have a problem, I think it makes it
harder for humans.

> This may just be a stylistic issue: I tend to favor macros for
> repeated tasks, on the theory that as long as one debugs it carefully,
> the benefit of inlining is worth the effort of coding.

The macros are only used once in your patch...

[...]

>> > -fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
>> > +ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
>> >  {
>> > -  const char *src_orig = src, *src_end = src + len;
>> > -  apr_size_t new_len = 0;
>> > -  char *new;
>> > -  const char *new_orig;
>> > +  static unsigned char asciinonul[256];
>> > +  svn_stringbuf_t *result = NULL;
>> >
>> > -  /* First count how big a dest string we'll need. */
>> > -  while (src < src_end)
>> > -    {
>> > -      if (! svn_ctype_isascii (*src) || *src == '\0')
>> > -        new_len += 5;  /* 5 slots, for "?\XXX" */
>> > -      else
>> > -        new_len += 1;  /* one slot for the 7-bit char */
>> > +  if (!asciinonul[0]) {
>> > +    asciinonul[0] = 255;                   /* NUL's not allowed */
>> 
>> That doesn't look threadsafe.
>
> It is safe, though it doesn't prevent duplication.

No, it's not safe.  Consider:

- asciinonul is static initialised to zero
- thread 1 tests asciinonul[0] and gets zero
- thread 1 sets asciinonul[0] to 255, the rest of asciinonul is still zero
- thread 2 tests asciinonul[0] and gets 255
- thread 2 uses the rest of asciinonul although it is still zero, oops!

You could try to fix it by changing the code to set asciinonul[0]
later, but that's still not threadsafe:

- An optimising compiler can reorder the memory writes.
- The CPU pipeline can reorder the memory writes.
- Even if the writes are not reordered, on a multi-CPU system there is
  no guarantee that writes by one CPU will become visible to another CPU
  in the same order.

[...]

>> Overall this patch makes me uneasy, I'd prefer a patch that builds on
>> our existing ctype and/or utf-8 code.  However as I haven't tried to
>> implement such a patch I don't know whether our existing code is
>> a suitable starting point.
>
> That's certainly possible.  I'd actually started out that way, and
> switched to the above when it seemed I was about to add ya parallel
> routine to escape a string in a slightly different way.
>
> Thanks for the review.  If you or one of the other core developers
> could answer some of the "philosophic" questions above, I can revise
> the patch appropriately.

Well, that's a bit tricky.  Your patch looked overly complicated to
me, but then I haven't tried to write an alternative so I don't know
whether there is a better way.

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 6/20/05, Philip Martin <ph...@codematters.co.uk> wrote:> Charles Bailey <ba...@gmail.com> writes:> > > I've interspersed my comments in the code, since there's imho zero> > chance that this version of the patch will be> > substantially/stylistically suitable for committing.> > That doesn't encourage review!  Anyway, if the comments are necessary> to understand the code then they should be part of the code.
Hmm.  Perhaps I'm phrasing poorly.  My intent was to say that the patch wasfunctionally correct, but there are enough issues not only of style (ala HACKING)but of "project philosophy", for lack of a better term -- whethermacros are preferredto repeated code, framing of comments, calling sequence conventions --that it wasunlikely to be in final form.  I'd hoped to encourage review, ratherthan put it off.
Those comments which I intended to document the code are present as C-stylecomments in the patch.  It's the "philosophical" questions I've interspersed.
I hope this helps make my intent clearer.
> >  They're far from> > exhaustive, but this message is long enough already.> >> > Conceptual "Log message":> > [[[> > Add function that escapes illegal UTF-8 characters, along the way> > "valid" rather than "legal", "invalid" rather than "illegal".  That> applies to function names, variables, comments, etc.
OK, I'm happy to follow that convention.
> > refactoring core of> > string-escaping routines, and insure that illegal XML error message> > outputs legal UTF-8.> > ### Probably best applied as several patches, but collected here for review.> > If you think it should be several patches then submit several> patches.  On the whole it looks like something that should be one> patch.
Thanks.  This is another place where I was unsure of the philosophy. If the policy is one-change-per-commit, then I can frame it as addingthe driver in one patch, using the driver to escape strings to UTF8 ina second, and using that to fix the XML error message in a third.  Ifthe policy is one-problem-per-patch, then I agree it makes sense as asingle patch.   (I'd still keep refactoring other string escapingroutines to use the driver as a separate patch, no?)
> > * subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,> >   subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,> >   subversion/po/pt_BR.po, subversion/po/sv.po,> >   subversion/po/zh_CN.po, subversion/po/zh_TW.po:> >   Courtesy to translators, since I've changed a localized string.> > I wouldn't apply patches to PO files unless they come from someone who> claims to be fluent.
OK; I can drop them.
> > --- /dev/null Mon Jun  6 11:06:27 2005> > +++ subversion/libsvn_subr/escape.c   Fri Jun  3 19:16:09 2005> > New files need a copyright notice.
Right.  I'd noticed that, but since I don't know Collabnet's policyhere I didn't wantto presume.  Is it acceptable to clone the copyright from an existingfile?  Doesthat suffice to Collabnet as assignment of copyright?  Do I have a pledge fromCollabnet that any code whose copyright is assigned in this way willremain freelyavailable in perpetuity?  (By "freely available", I mean termssubstantially identicalto Subversion's current licensing; I'm not trying to  start a skirmishover softwarelicensing.)
> > > @@ -0,0 +1,58 @@> > +/*> > + * escape.c: common code for cleaning up unwanted bytes in strings> > + */> > +> > +#include "escape_impl.h"> > +> > +#define COPY_PREFIX \> > +  if (c > base) { \> > +    svn_stringbuf_appendbytes (out, base, c - base); \> > +    base = c; \> > +  }> > I don't think I'd have used a macro, it makes the calling code harder> to understand, but I'd prefer a macro with parameters and it needs> documentation.
Hmm.  The only purpose of the macro here is to make the calling code easierto understand; if it's not helping, I can just inline the prefix checkin the threeplaces where it's required.
> > +> > +svn_stringbuf_t *> > +svn_subr__escape_string (svn_stringbuf_t **outsbuf,> > +                      const unsigned char *instr,> > +                      apr_size_t len,> > +                      const unsigned char *isok,> > +                      unsigned char (*mapper) (unsigned char **,> > +                                               const unsigned char *,> > +                                               apr_size_t,> > +                                               const svn_stringbuf_t *,> > +                                               void *,> > +                                               apr_pool_t *),> > mapper should be a typedef.
OK.
> > +                      void *mapper_baton,> > +                      apr_pool_t *pool)> > +{> > +  unsigned char *base, *c;> > +  svn_stringbuf_t *out;> > +> > +  if (outsbuf == NULL || *outsbuf == NULL) {> > +    out = svn_stringbuf_create ("", pool);> > It should probably be created with at least len bytes capacity.
Sounds fair.  Would it be better to create it with as many bytes asthe input string length?
> > +    if (outsbuf)> > +      *outsbuf = out;> > +  }> > +  else> > +    out = *outsbuf;> > +> > +  for (c = base = (unsigned char *) instr; c < instr + len; ) {> > Try to avoid casting away const.
Agreed.  I think I tripped a compiler warning trying to leave everything (instr, base, and c) const, though it should be legal --  I'll haveanother look.
> > +    apr_size_t count = isok ? isok[*c] : 0;> > +    if (count == 0) {> > +      COPY_PREFIX;> > +      count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;> > +    }> > +    if (count == 255) {> > +      char esc[6];> > +> > +      COPY_PREFIX;> > +      sprintf (esc,"?\\%03u",*c);> > +      svn_stringbuf_appendcstr (out, esc);> > +      c++;> > +      base = c;> > +    }> > +    else c += count;> > +  }> > +  COPY_PREFIX;> > +  return out;> > This function doesn't follow the project's indentation/formatting> guidelines.
Got it.  I reflexively cuddled the braces, but can push them down.
> > +}> > +> >> >> > ### Comments are pretty self-explanatory.> > ### Docs are as doxygen; will need to be downgraded to plaintext since it's> > ### an internal header.> > ### As noted above, it makes sense to combine this with utf_impl.h.> > If it makes sense to combine it then why is it separate?
I didn't want on a first outing to be rearranging existing files.  Atsome level, this isalso a question of philosophy.  It seems to me that one "private"header file foreverything local to libsvn_subr is the best way to go.  If, however,project policyis one header per topic, then I'd use a separate escape.h.

> > --- /dev/null Mon Jun  6 11:35:47 2005> > +++ subversion/libsvn_subr/escape_impl.h      Thu Jun  2 18:44:05 2005> > Missing copyright notice.> > > @@ -0,0 +1,147 @@> > +/*> > + * escape_impl.h :  private header for string escaping function.> > + */> > +> > +> > +> > +#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H> > +#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H> > +> > +> > +#include "svn_pools.h"> > +#include "svn_string.h"> > +> > +#ifdef __cplusplus> > +extern "C" {> > +#endif /* __cplusplus */> > +> > +> > +/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,> > + * escaping bytes as indicated by the lookup array @a isok and the mapping> > + * function @a mapper. Memory is allocated from @a pool.  You may provide> > + * any extra information needed by @a mapper in @a mapper_baton.> > + * Returns a pointer to the stringbuf containing the escaped string.> > + *> > + * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; its> > address is> > The patch is mangled here, and in several other places.
Right.  When Michael Thelen pointed that out, I resent it as an attachment;I hope that came through in better shape.
> > + * placed in @a outsbuf unless that argument is NULL.> > + * If @a isok is NULL, then @a mapper is used exclusively.> > + * If @ mapper is NULL, then a single character is escaped every time @a mapper> > + * would have been called.> > + *> > + * This is designed to be the common pathway for various string "escaping"> > + * functions across subversion.  The basic approach is to scan> > + * the input and decide whether each byte is OK as it stands, needs to be> > + * "escaped" using subversion's "?\uuu" default format, or needs to be> > + * transformed in some other way.  The decision is made using a two step> > + * process, which is designed to handle the simple cases quickly but allow> > + * for more complex mappings.  Since the typical string will (we hope)> > + * comprise mostly simple cases, this shouldn't require much code> > + * complexity or loss of efficiency.  The two steps used are:> > The question is do we need such a general function?  One the one hand> it looks like it is more complicated then necessary just to handle> escaping, on the other hand it doesn't look general enough to be used,> say, as a routine to separate multibyte codepoints.> > I think I'd prefer a simpler solution, although I haven't tried to> implement one.  Perhaps based on the existing validation functions?> (Although since I wrote those I might be biased.)
I think this is the crucial philosophic question.  For the specificproblem with whichI started (given a string from I-know-not-where, and hence inI-know-not-what-encoding,emit a string that is valid UTF8), it's certainly easy enough to writeanother escaping function, similar to the ones already present.  Thatapproach seemed less than ideal to me, since it would introduce yetanother function to be maintained in parallel, etc.  The conceptualsimplicity may make it most useful in the end, though.
I tried to write the driver to be general enough to handle differenttasks, but reasonably fast for common tasks.  Since most of thestrings appearing within Subversion are UTF8, I chose a byte-orientedscreen.  (I'm also trying to take advantage of the nice property ofUTF8 that one can identify multibyte codepoints from the first byte,iff the string is guaranteed to be valid.)  I'm using the mappingfunction as a "trap door" to permit more involved (includingmultibyte) testing; in this case, there's no benefit in speed, butperhaps a common point for the actual escaping of bytes.  More generalsolutions (of which I could conceive) seemed to me to entail yet morecomplexity; I chose this level as the appropriate tradeoff.  It maywell be that the core developers choose a different level.  I can workwith that.
> > + *> > + * 1. The value of a byte from the input string ("test byte") is used as an> > + *    index into a (usually 256 byte) array passed in by the caller.> > + *      - If the value of the appropriate array element is 0xff,> > + *        then the test byte is escaped as a "?\uuu" string in the output.> > + *      - If the value of the appropriate element is otherwise non-zero,> > + *        that many bytes are copied verbatim from the input to the output.> > + * 2. If the array yields a 0 value, then a mapping function provided by> > + *    the caller is used to allow for more complex evaluation.  This function> > + *    receives five arguments:> > Five? I see six in the code.
Drat.  I thought I'd added the baton to the docment.  My omission.
> > + *      - a pointer to the pointer used by svn__do_char_escape() to> > + *        mark the test byte in the input string> > + *      - a pointer to the start of the input string> > + *      - the length of the input string> > + *      - a pointer to the output stringbuf> > + *      - the ever-helpful pool.> > + * The mapping function may return a (positive) nonzero value,> > + * which is interpreted * as described in step 1 above, or zero,> > + * indicating that the test byte * should be ignored.  In the latter> > + * case, this is generally because the * mapping function has done the> > + * necessary work itself; it's free to * modify the output stringbuf and> > + * adjust the pointer to the test byte * as it sees fit (within the> > + * bounds of the input string).  At a minimum, * it should at least> > + * increment the pointer to the test byte before * returning 0, in order> > + * to avoid an infinite loop.> > Are the '*' characters within the comment lines just fallout from> paragraph filling?
Yes.
> > + */> > +> > +svn_stringbuf_t *> > +svn_subr__escape_string (svn_stringbuf_t **outsbuf,> > +                      const unsigned char *instr,> > +                      apr_size_t len,> > +                      const unsigned char *isok,> > +                      unsigned char (*mapper) (unsigned char **,> > +                                               const unsigned char *,> > +                                               apr_size_t,> > +                                               const svn_stringbuf_t *,> > +                                               void *,> > +                                               apr_pool_t *),> > Should be a typedef, the typedef should document the parameters.
OK.
> > +                      void *mapper_baton,> > +                      apr_pool_t *pool);> > +> > +> > +> > +/** Initializer for a basic screening matrix suitable for use with> > + *  #svn_subr__escape_string to escape non-UTF-8 bytes.> > + *  We provide this since "UTF-8-safety" is a common denominator for> > + *  most string escaping in Subversion, so this matrix makes a good> > + *  starting point for more involved schemes.> > + */> > +#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,> >   1,   1,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,> >   0,   0,\> > +  0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}> > That looks a bit like the svn_ctype stuff, but it's a separate> implementation.  That makes me uneasy.
Fair enough.  The underlying problem is that different situationsrequire slightly different behaviors, of course.  I'm not sure whetherthe best solution is to start with a single framework and tweak it atruntime as needed, or to set up the framework for each case separatelyin the interest of speed over space.
> > +> > +/** Given pointer @a c into a string which ends at @a e, figure out> > + *  whether (*c) starts a valid UTF-8 sequence, and if so, how many bytes> > + *  it includes.  Return 255 if it's not valid UTF-8.> > + *  For a more detailed description of the encoding rules, see the UTF-8> > + *  specification in section 3-9 of the Unicode standard 4.0 (e.g. at> > + *  http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),> > + *  with special attention to Table 3-6.> > + *  This macro is also provided as a building block for mappers used by> > + *  #svn_subr__escape_string that want to check for UTF-8-safety in> > + *  addition to other tasks.> > + */> > +#define SVN_ESCAPE_UTF8_MAPPING(c,e)                                        \> > +  ( (c)[0] < 0x80                                  ? /* ASCII */            \> > +    1 :                                                   /* OK, 1 byte */          \> > +    ( ( ((c)[0] > 0xc2 && (c)[0] < 0xdf)          && /* 2-byte char */              \> > +     ((c) + 1 <= (e))                          && /* Got 2 bytes */         \> > +     ((c)[1] >= 0x80 && (c)[1] <= 0xbf))        ? /* Byte 2 legal */        \> > +      2 :                                         /* OK, 2 bytes */         \> > +      ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef)      && /* 3 byte char */              \> > +       ((c) + 2 <= (e))                        && /* Got 3 bytes */         \> > +       ((c)[1] >= 0x80 && (c)[1] <= 0xbf)      && /* Basic byte 2 legal */  \> > +       ((c)[2] >= 0x80 && (c)[2] <= 0xbf)      && /* Basic byte 3 legal */  \> > +       (!((c)[0] == 0xe0 && (c)[1] < 0xa0))    && /* 0xe0-0x[89]? illegal */\> > +       (!((c)[0] == 0xed && (c)[1] > 0x9f)) )   ? /* 0xed-0x[ab]? illegal */\> > +     3 :                                          /* OK, 3 bytes */         \> > +     ( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4)    && /* 4 byte char */         \> > +         ((c) + 3 <= (e))                      && /* Got 4 bytes */         \> > +         ((c)[1] >= 0x80 && (c)[1] <= 0xbf)    && /* Basic byte 2 legal */  \> > +         ((c)[2] >= 0x80 && (c)[2] <= 0xbf)    && /* Basic byte 3 legal */  \> > +         ((c)[3] >= 0x80 && (c)[3] <= 0xbf)    && /* Basic byte 4 legal */  \> > +         (!((c)[0] == 0xf0 && (c)[1] < 0x90))  && /* 0xf0-0x8? illegal */   \> > +         (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\> > +       4 :                                        /* OK, 4 bytes */         \> > +       255))))                                    /* Illegal; escape it */> > utf_validate.c already implements the UTF-8 encoding rules.  There is> obviously some duplication of the algorithm, that makes me uneasy.
True enough.  I'd done this for speed and compactness relative to thestate machine in utf_validate.c, expecting comments from the coredevelopers about whether it was a net win.  For the specific case ofescaping invalid UTF8, I could write a mapper to call out tosvn_utf__last_valid per character.
> Those big macros also make me uneasy.
Is the concern that it'll choke come compilers, or that they're hardto maintain?This may just be a stylistic issue: I tend to favor macros forrepeated tasks, on the theory that as long as one debugs it carefully,the benefit of inlining is worth the effort of coding.
> > +> > +> > +#ifdef __cplusplus> > +}> > +#endif /* __cplusplus */> > +> > +#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */> >> >> > ### Function names can be revised to fit convention, of course.> > ### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of> > ### returning the resultant stringbuf from svn_subr__escape_string both in a> > ### parameter and as the function's return value. If the sense is that> > it'll be a cause> > ### of debugging headaches, or that it's cortrary to subversion> > culture to code public> > ### functions as macros, it's easy enough to code this as a function,> > and to make> > ### svn_subr__escape_string return void (or less likely svn_error_t,> > if it got pickier> > ### about params.)> > --- subversion/libsvn_subr/utf_impl.h (revision 14986)> > +++ subversion/libsvn_subr/utf_impl.h (working copy)> > @@ -24,12 +24,33 @@> >> >  #include <apr_pools.h>> >  #include "svn_types.h"> > +#include "svn_string.h"> >> >  #ifdef __cplusplus> >  extern "C" {> >  #endif /* __cplusplus */> >> >> > +/** Replace any non-UTF-8 characters in @a len byte long string @a src with> > + *  escaped representations, placing the result in a stringbuf pointed to by> > + *  @a *dest, which will be created if necessary.  Memory is allocated from> > How does the user know what "if necessary" means?
Fair point.  s/necessary/NULL/.
> > + *  @a pool as needed. Returns a pointer to the stringbuf containing the result> > + *  (identical to @a *dest, but facilitates chaining calls).> > + */> > +svn_stringbuf_t *> > +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,> > +                                   const unsigned char *src,> > +                                   apr_size_t len,> > +                                   apr_pool_t *pool);> > +> > +/** Replace any non-UTF-8 characters in @a len byte long string @a src with> > + *  escaped representations.  Memory is allocated from @a pool as needed.> > + *  Returns a pointer to the resulting string.> > + */> > +#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \> > +  (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data> > Is there any need for this to be a macro?  A real function would be> less "surprising" and that makes it better unless you can justify the> macro.
OK.  I think this is another instance of the style point noted above.
> > +> > +> > const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,> >                                                apr_pool_t *pool,> >                                                svn_error_t *(*convert_from_utf8)> >> >> >> > ### There're other places that could be rewritten in terms of the new escaping> > ### functions, but I hope the two given here serve as an example of how it might> > ### be done.> > Complete patches are better than incomplete ones.
Sure.  My standard for 'complete' might have been a bit low: I wasaiming for a unit that compiled,  passed tests, and illustrated theissues I believed were still open.  I thought I might wait to seewhether the notion of a "common escaping routine" would carry beforeextending it to XML, other UTF8 tasks, etc.

> > ### The rename to ascii_fuzzy_escape is to distinguish it from the new functions> > ### that escape only illegal UTF-8 sequences.> > --- subversion/libsvn_subr/utf.c      (revision 14986)> > +++ subversion/libsvn_subr/utf.c      (working copy)> > @@ -30,6 +30,7 @@> >  #include "svn_pools.h"> >  #include "svn_ctype.h"> >  #include "svn_utf.h"> > +#include "escape_impl.h"> >  #include "utf_impl.h"> >  #include "svn_private_config.h"> >> > @@ -323,53 +324,19 @@> >  /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn> >     sequences, allocating the result in POOL. */> >  static const char *> > -fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)> > +ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)> >  {> > -  const char *src_orig = src, *src_end = src + len;> > -  apr_size_t new_len = 0;> > -  char *new;> > -  const char *new_orig;> > +  static unsigned char asciinonul[256];> > +  svn_stringbuf_t *result = NULL;> >> > -  /* First count how big a dest string we'll need. */> > -  while (src < src_end)> > -    {> > -      if (! svn_ctype_isascii (*src) || *src == '\0')> > -        new_len += 5;  /* 5 slots, for "?\XXX" */> > -      else> > -        new_len += 1;  /* one slot for the 7-bit char */> > +  if (!asciinonul[0]) {> > +    asciinonul[0] = 255;                   /* NUL's not allowed */> > That doesn't look threadsafe.
It is safe, though it doesn't prevent duplication.
> > +    memset(asciinonul + 1, 1, 127);        /* Other regular ASCII OK */> > +    memset(asciinonul + 128, 255, 128);    /* High half not allowed */> > +  }> >> > -      src++;> > -    }> > -> > -  /* Allocate that amount. */> > -  new = apr_palloc (pool, new_len + 1);> > -> > -  new_orig = new;> > -> > -  /* And fill it up. */> > -  while (src_orig < src_end)> > -    {> > -      if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')> > -        {> > -          /* This is the same format as svn_xml_fuzzy_escape uses, but that> > -             function escapes different characters.  Please keep in sync!> > -             ### If we add another fuzzy escape somewhere, we should abstract> > -             ### this out to a common function. */> > -          sprintf (new, "?\\%03u", (unsigned char) *src_orig);> > -          new += 5;> > -        }> > -      else> > -        {> > -          *new = *src_orig;> > -          new += 1;> > -        }> > -> > -      src_orig++;> > -    }> > -> > -  *new = '\0';> > -> > -  return new_orig;> > +  svn_subr__escape_string(&result, src, len, asciinonul, NULL, NULL, pool);> > +  return result->data;> >  }> >> >  /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result> > @@ -448,7 +415,7 @@> >          errstr = apr_psprintf> >            (pool, _("Can't convert string from '%s' to '%s':"),> >             node->frompage, node->topage);> > -      err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,> > +      err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,> >                                                             src_length, pool));> >        return svn_error_create (apr_err, err, errstr);> >      }> > @@ -564,7 +531,28 @@> >    return SVN_NO_ERROR;> >  }> >> > +static unsigned char> > +utf8_escape_mapper (unsigned char **targ, const unsigned char *start,> > +                 apr_size_t len, const svn_stringbuf_t *dest,> > +                 void *baton, apr_pool_t *pool)> > New functions need documentation.
OK.  I can add a docment above it.
> > +{> > +  const unsigned char *end = start + len;> > +  return SVN_ESCAPE_UTF8_MAPPING(*targ, end);> > +}> >> > +svn_stringbuf_t *> > +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,> > +                                   const unsigned char *src,> > +                                   apr_size_t len,> > +                                   apr_pool_t *pool)> > +{> > +  static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;> > +> > +  return svn_subr__escape_string(dest, src, len,> > +                              utf8screen, utf8_escape_mapper, NULL,> > +                              pool);> > +}> > +> >  svn_error_t *> >  svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,> >                             const svn_stringbuf_t *src,> > @@ -787,7 +775,7 @@> >    const char *escaped, *converted;> >    svn_error_t *err;> >> > -  escaped = fuzzy_escape (src, strlen (src), pool);> > +  escaped = ascii_fuzzy_escape (src, strlen (src), pool);> >> >    /* Okay, now we have a *new* UTF-8 string, one that's guaranteed to> >       contain only 7-bit bytes :-).  Recode to native... */> >> > ### With code comes testing.> > ### Note: Contains 8-bit chars, and also uses convention that cc will treat> > ### "foo" "bar" as "foobar".  Both can be avoided if useful for> > finicky compilers.> >> > --- subversion/tests/libsvn_subr/utf-test.c   (revision 14986)> > +++ subversion/tests/libsvn_subr/utf-test.c   (working copy)> > @@ -17,6 +17,7 @@> >   */> >> >  #include "../svn_test.h"> > +#include "../../include/svn_utf.h"> > Does a plain "svn_utf.h" work?
Hmm.  I don't know; I'd followed the convention of the prior line. I'll give it a try.
> >  #include "../../libsvn_subr/utf_impl.h"> >> >  /* Random number seed.  Yes, it's global, just pretend you can't see it. */> > @@ -222,6 +223,84 @@> >    return SVN_NO_ERROR;> >  }> >> > +static svn_error_t *> > +utf_escape (const char **msg,> > +         svn_boolean_t msg_only,> > +         svn_test_opts_t *opts,> > +         apr_pool_t *pool)> > +{> > +  char in[] = { 'A', 'S', 'C', 'I', 'I',     /* All printable */> > +             'R', 'E', 'T', '\n', 'N',    /* Newline */> > +             'B', 'E', 'L', 0x07, '!',    /* Control char */> > +             0xd2, 0xa6, 'O', 'K', '2',   /* 2-byte char, valid */> > +             0xc0, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 1st */> > +             0x82, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 2nd */> > +             0xe4, 0x87, 0xa0, 'O', 'K',  /* 3-byte char, valid */> > +             0xe2, 0xff, 0xba, 'N', 'O',  /*3-byte char, invalid 2nd */> > +             0xe0, 0x87, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */> > +             0xed, 0xa5, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */> > +             0xe4, 0x87, 0xc0, 'N', 'O',  /* 3-byte char, invalid 3rd */> > +             0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */> > +             0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */> > +             0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */> > +             0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */> > +             0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */> > +             0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */> > +                0x00 };> > +  const unsigned char *legalresult => > +    "ASCIIRET\nNBEL!$-1(c)�OK2?\\192?\\195NO2?\\130?\\195NO2"-A> > +    "3$-3�0䇠1OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"-A> > +    "?\\228?\\135?\\192NO3$-3�0򇠵1Y?\\242$-1(c)�?\\181Y?\\240?\\135?\\160"-A> > +    "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135�N?\\242?\\135?\\160"> > +    "?\\213N";> > I don't like the embedded control characters in the source code, could> you generate them at runtime?
Sure.  I could also just C-escape them like the high-half bytes above.
> > +  const unsigned char *asciiresult => > +    "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"> > +    "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"> > +    "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"> > +    "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"> > +    "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"> > +    "?\\242?\\135?\\160?\\213N";> > +  const unsigned char *asciified;> > +  apr_size_t legalresult_len = 213;  /* == strlen(legalresult) iff no NULs */> > +  int i = 0;> > +  svn_stringbuf_t *escaped = NULL;> > +> > +  *msg = "test utf string escaping";> > +> > +  if (msg_only)> > +    return SVN_NO_ERROR;> > +> > +  if (svn_utf__stringbuf_escape_utf8_fuzzy> > +      (&escaped, in, sizeof in - 1, pool) != escaped)> > I prefer () with sizeof.
OK.
> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +  if (escaped->len != legalresult_len)> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +  if (memcmp(escaped->data, legalresult, legalresult_len))> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +  if (memcmp(escaped->data, legalresult, legalresult_len))> > A duplicate of the one above?
Er, yes.  I must've mispasted.  Sorry.
> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +> > +  asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);> > +  if (strlen(asciified) != strlen(asciiresult))> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +  if (strcmp(asciified, asciiresult))> > +    return svn_error_createf> > +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > +  i++;> > +> > +  return SVN_NO_ERROR;> > +}> > +> >> >  /* The test table.  */> >> > @@ -230,5 +309,6 @@> >      SVN_TEST_NULL,> >      SVN_TEST_PASS (utf_validate),> >      SVN_TEST_PASS (utf_validate2),> > +    SVN_TEST_PASS (utf_escape),> >      SVN_TEST_NULL> >    };> >> >> >> > ### The original point of this thread.> > ### This patch will apply with an offset, since I've cut out sections which> > ### reimplement XML escaping in terms of the svn_subr__escape_string.> > --- subversion/libsvn_subr/xml.c      (revision 14986)> > +++ subversion/libsvn_subr/xml.c      (working copy)> > @@ -395,11 +413,22 @@> >    /* If expat choked internally, return its error. */> >    if (! success)> >      {> > +      svn_stringbuf_t *sanitized;> > +      unsigned char *end;> > +> > +      svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, buf,> > +                                        (len > 240 ? 240 : len),> > +                                        svn_parser->pool);> > +      end = sanitized->data +> > +         (sanitized->len > 240 ? 240 : sanitized->len);> > 240?  A magic number with no explanation.
I can add a comment.  It was discussed earlier in this thread, but Iagree it'd helpin the future to add something to the code.

> > +      while (*end > 0x80 && *end < 0xc0 &&> > +          (char *) end > sanitized->data) end--;> > I think that could generate a huge error message in pathological> cases.
Up to 240 characters, yes.  My intent was to include enough to givethe reader a fair idea where the error occurred, without subjectingthem to screensful of junk.  Setting the cutoff at 240 seemed to give reasonable chunk of XML element for the cases I checked (errors in theentries file.)
> >        err = svn_error_createf> >          (SVN_ERR_XML_MALFORMED, NULL,> > -         _("Malformed XML: %s at line %d"),> > +         _("Malformed XML: %s at line %d; XML starts:\n%.*s"),> >           XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),> > -         XML_GetCurrentLineNumber (svn_parser->parser));> > +         XML_GetCurrentLineNumber (svn_parser->parser),> > +      (char *) end - sanitized->data + 1, sanitized->data);> >> >        /* Kill all parsers and return the expat error */> >        svn_xml_free_parser (svn_parser);> >> > Overall this patch makes me uneasy, I'd prefer a patch that builds on> our existing ctype and/or utf-8 code.  However as I haven't tried to> implement such a patch I don't know whether our existing code is> a suitable starting point.
That's certainly possible.  I'd actually started out that way, andswitched to the above when it seemed I was about to add ya parallelroutine to escape a string in a slightly different way.
Thanks for the review.  If you or one of the other core developerscould answer some of the "philosophic" questions above, I can revisethe patch appropriately.
-- Regards,Charles BaileyLists: bailey _dot_ charles _at_ gmail _dot_ comOther: bailey _at_ newman _dot_ upenn _dot_ edu

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Philip Martin <ph...@codematters.co.uk>.

Charles Bailey <ba...@gmail.com> writes:

> I've interspersed my comments in the code, since there's imho zero
> chance that this version of the patch will be
> substantially/stylistically suitable for committing.

That doesn't encourage review!  Anyway, if the comments are necessary
to understand the code then they should be part of the code.

>  They're far from
> exhaustive, but this message is long enough already.
>
> Conceptual "Log message":
> [[[
> Add function that escapes illegal UTF-8 characters, along the way

"valid" rather than "legal", "invalid" rather than "illegal".  That
applies to function names, variables, comments, etc.

> refactoring core of
> string-escaping routines, and insure that illegal XML error message
> outputs legal UTF-8.
> ### Probably best applied as several patches, but collected here for review.

If you think it should be several patches then submit several
patches.  On the whole it looks like something that should be one
patch.

> * subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,
>   subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,
>   subversion/po/pt_BR.po, subversion/po/sv.po,
>   subversion/po/zh_CN.po, subversion/po/zh_TW.po:
>   Courtesy to translators, since I've changed a localized string.

I wouldn't apply patches to PO files unless they come from someone who
claims to be fluent.

> --- /dev/null	Mon Jun  6 11:06:27 2005
> +++ subversion/libsvn_subr/escape.c	Fri Jun  3 19:16:09 2005

New files need a copyright notice.

> @@ -0,0 +1,58 @@
> +/*
> + * escape.c:	common code for cleaning up unwanted bytes in strings
> + */
> +
> +#include "escape_impl.h"
> +
> +#define COPY_PREFIX \
> +  if (c > base) { \
> +    svn_stringbuf_appendbytes (out, base, c - base); \
> +    base = c; \
> +  }

I don't think I'd have used a macro, it makes the calling code harder
to understand, but I'd prefer a macro with parameters and it needs
documentation.

> +
> +svn_stringbuf_t *
> +svn_subr__escape_string (svn_stringbuf_t **outsbuf,
> +			 const unsigned char *instr,
> +			 apr_size_t len,
> +			 const unsigned char *isok,
> +			 unsigned char (*mapper) (unsigned char **,
> +						  const unsigned char *,
> +						  apr_size_t,
> +						  const svn_stringbuf_t *,
> +						  void *,
> +						  apr_pool_t *),

mapper should be a typedef.

> +			 void *mapper_baton,
> +			 apr_pool_t *pool)
> +{
> +  unsigned char *base, *c;
> +  svn_stringbuf_t *out;
> +
> +  if (outsbuf == NULL || *outsbuf == NULL) {
> +    out = svn_stringbuf_create ("", pool);

It should probably be created with at least len bytes capacity.

> +    if (outsbuf)
> +      *outsbuf = out;
> +  }
> +  else
> +    out = *outsbuf;
> +
> +  for (c = base = (unsigned char *) instr; c < instr + len; ) {

Try to avoid casting away const.

> +    apr_size_t count = isok ? isok[*c] : 0;
> +    if (count == 0) {
> +      COPY_PREFIX;
> +      count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;
> +    }
> +    if (count == 255) {
> +      char esc[6];
> +
> +      COPY_PREFIX;
> +      sprintf (esc,"?\\%03u",*c);
> +      svn_stringbuf_appendcstr (out, esc);
> +      c++;
> +      base = c;
> +    }
> +    else c += count;
> +  }
> +  COPY_PREFIX;
> +  return out;

This function doesn't follow the project's indentation/formatting
guidelines.

> +}
> +
>
>
> ### Comments are pretty self-explanatory.
> ### Docs are as doxygen; will need to be downgraded to plaintext since it's
> ### an internal header.
> ### As noted above, it makes sense to combine this with utf_impl.h.

If it makes sense to combine it then why is it separate?

> --- /dev/null	Mon Jun  6 11:35:47 2005
> +++ subversion/libsvn_subr/escape_impl.h	Thu Jun  2 18:44:05 2005

Missing copyright notice.

> @@ -0,0 +1,147 @@
> +/*
> + * escape_impl.h :  private header for string escaping function.
> + */
> +
> +
> +
> +#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H
> +#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H
> +
> +
> +#include "svn_pools.h"
> +#include "svn_string.h"
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif /* __cplusplus */
> +
> +
> +/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,
> + * escaping bytes as indicated by the lookup array @a isok and the mapping
> + * function @a mapper. Memory is allocated from @a pool.  You may provide
> + * any extra information needed by @a mapper in @a mapper_baton.
> + * Returns a pointer to the stringbuf containing the escaped string.
> + *
> + * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; its
> address is

The patch is mangled here, and in several other places.

> + * placed in @a outsbuf unless that argument is NULL.
> + * If @a isok is NULL, then @a mapper is used exclusively.
> + * If @ mapper is NULL, then a single character is escaped every time @a mapper
> + * would have been called.
> + *
> + * This is designed to be the common pathway for various string "escaping"
> + * functions across subversion.  The basic approach is to scan
> + * the input and decide whether each byte is OK as it stands, needs to be
> + * "escaped" using subversion's "?\uuu" default format, or needs to be
> + * transformed in some other way.  The decision is made using a two step
> + * process, which is designed to handle the simple cases quickly but allow
> + * for more complex mappings.  Since the typical string will (we hope)
> + * comprise mostly simple cases, this shouldn't require much code
> + * complexity or loss of efficiency.  The two steps used are:

The question is do we need such a general function?  One the one hand
it looks like it is more complicated then necessary just to handle
escaping, on the other hand it doesn't look general enough to be used,
say, as a routine to separate multibyte codepoints.

I think I'd prefer a simpler solution, although I haven't tried to
implement one.  Perhaps based on the existing validation functions?
(Although since I wrote those I might be biased.)

> + *
> + * 1. The value of a byte from the input string ("test byte") is used as an
> + *    index into a (usually 256 byte) array passed in by the caller.
> + *      - If the value of the appropriate array element is 0xff,
> + *        then the test byte is escaped as a "?\uuu" string in the output.
> + *      - If the value of the appropriate element is otherwise non-zero,
> + *        that many bytes are copied verbatim from the input to the output.
> + * 2. If the array yields a 0 value, then a mapping function provided by
> + *    the caller is used to allow for more complex evaluation.  This function
> + *    receives five arguments:

Five? I see six in the code.

> + *      - a pointer to the pointer used by svn__do_char_escape() to
> + *        mark the test byte in the input string
> + *      - a pointer to the start of the input string
> + *      - the length of the input string
> + *      - a pointer to the output stringbuf
> + *      - the ever-helpful pool.
> + * The mapping function may return a (positive) nonzero value,
> + * which is interpreted * as described in step 1 above, or zero,
> + * indicating that the test byte * should be ignored.  In the latter
> + * case, this is generally because the * mapping function has done the
> + * necessary work itself; it's free to * modify the output stringbuf and
> + * adjust the pointer to the test byte * as it sees fit (within the
> + * bounds of the input string).  At a minimum, * it should at least
> + * increment the pointer to the test byte before * returning 0, in order
> + * to avoid an infinite loop.

Are the '*' characters within the comment lines just fallout from
paragraph filling?

> + */
> +
> +svn_stringbuf_t *
> +svn_subr__escape_string (svn_stringbuf_t **outsbuf,
> +			 const unsigned char *instr,
> +			 apr_size_t len,
> +			 const unsigned char *isok,
> +			 unsigned char (*mapper) (unsigned char **,
> +						  const unsigned char *,
> +						  apr_size_t,
> +						  const svn_stringbuf_t *,
> +						  void *,
> +						  apr_pool_t *),

Should be a typedef, the typedef should document the parameters.

> +			 void *mapper_baton,
> +			 apr_pool_t *pool);
> +
> +
> +
> +/** Initializer for a basic screening matrix suitable for use with
> + *  #svn_subr__escape_string to escape non-UTF-8 bytes.
> + *  We provide this since "UTF-8-safety" is a common denominator for
> + *  most string escaping in Subversion, so this matrix makes a good
> + *  starting point for more involved schemes.
> + */  
> +#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
>   1,   1,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>   0,   0,\
> +  0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}

That looks a bit like the svn_ctype stuff, but it's a separate
implementation.  That makes me uneasy.

> +
> +/** Given pointer @a c into a string which ends at @a e, figure out
> + *  whether (*c) starts a valid UTF-8 sequence, and if so, how many bytes
> + *  it includes.  Return 255 if it's not valid UTF-8.
> + *  For a more detailed description of the encoding rules, see the UTF-8
> + *  specification in section 3-9 of the Unicode standard 4.0 (e.g. at
> + *  http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),
> + *  with special attention to Table 3-6.
> + *  This macro is also provided as a building block for mappers used by
> + *  #svn_subr__escape_string that want to check for UTF-8-safety in
> + *  addition to other tasks.
> + */
> +#define SVN_ESCAPE_UTF8_MAPPING(c,e)					       \
> +  ( (c)[0] < 0x80                                  ? /* ASCII */	       \
> +    1 :				                     /* OK, 1 byte */	       \
> +    ( ( ((c)[0] > 0xc2 && (c)[0] < 0xdf)          && /* 2-byte char */	       \
> +	((c) + 1 <= (e))                          && /* Got 2 bytes */	       \
> +	((c)[1] >= 0x80 && (c)[1] <= 0xbf))        ? /* Byte 2 legal */        \
> +      2 :					     /* OK, 2 bytes */         \
> +      ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef)      && /* 3 byte char */	       \
> +	  ((c) + 2 <= (e))                        && /* Got 3 bytes */	       \
> +	  ((c)[1] >= 0x80 && (c)[1] <= 0xbf)      && /* Basic byte 2 legal */  \
> +	  ((c)[2] >= 0x80 && (c)[2] <= 0xbf)      && /* Basic byte 3 legal */  \
> +	  (!((c)[0] == 0xe0 && (c)[1] < 0xa0))    && /* 0xe0-0x[89]? illegal */\
> +	  (!((c)[0] == 0xed && (c)[1] > 0x9f)) )   ? /* 0xed-0x[ab]? illegal */\
> +	3 :                                          /* OK, 3 bytes */	       \
> +	( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4)    && /* 4 byte char */         \
> +	    ((c) + 3 <= (e))                      && /* Got 4 bytes */         \
> +	    ((c)[1] >= 0x80 && (c)[1] <= 0xbf)    && /* Basic byte 2 legal */  \
> +	    ((c)[2] >= 0x80 && (c)[2] <= 0xbf)    && /* Basic byte 3 legal */  \
> +	    ((c)[3] >= 0x80 && (c)[3] <= 0xbf)    && /* Basic byte 4 legal */  \
> +	    (!((c)[0] == 0xf0 && (c)[1] < 0x90))  && /* 0xf0-0x8? illegal */   \
> +	    (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\
> +	  4 :					     /* OK, 4 bytes */         \
> +	  255))))                                    /* Illegal; escape it */

utf_validate.c already implements the UTF-8 encoding rules.  There is
obviously some duplication of the algorithm, that makes me uneasy.

Those big macros also make me uneasy.

> +
> +
> +#ifdef __cplusplus
> +}
> +#endif /* __cplusplus */
> +
> +#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */
>
>
> ### Function names can be revised to fit convention, of course. 
> ### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of
> ### returning the resultant stringbuf from svn_subr__escape_string both in a
> ### parameter and as the function's return value. If the sense is that
> it'll be a cause
> ### of debugging headaches, or that it's cortrary to subversion
> culture to code public
> ### functions as macros, it's easy enough to code this as a function,
> and to make
> ### svn_subr__escape_string return void (or less likely svn_error_t,
> if it got pickier
> ### about params.)
> --- subversion/libsvn_subr/utf_impl.h	(revision 14986)
> +++ subversion/libsvn_subr/utf_impl.h	(working copy)
> @@ -24,12 +24,33 @@
>  
>  #include <apr_pools.h>
>  #include "svn_types.h"
> +#include "svn_string.h"
>  
>  #ifdef __cplusplus
>  extern "C" {
>  #endif /* __cplusplus */
>  
>  
> +/** Replace any non-UTF-8 characters in @a len byte long string @a src with
> + *  escaped representations, placing the result in a stringbuf pointed to by
> + *  @a *dest, which will be created if necessary.  Memory is allocated from

How does the user know what "if necessary" means?

> + *  @a pool as needed. Returns a pointer to the stringbuf containing the result
> + *  (identical to @a *dest, but facilitates chaining calls).
> + */
> +svn_stringbuf_t *
> +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,
> +				      const unsigned char *src,
> +				      apr_size_t len,
> +				      apr_pool_t *pool);
> +
> +/** Replace any non-UTF-8 characters in @a len byte long string @a src with
> + *  escaped representations.  Memory is allocated from @a pool as needed.
> + *  Returns a pointer to the resulting string.
> + */
> +#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \
> +  (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data

Is there any need for this to be a macro?  A real function would be
less "surprising" and that makes it better unless you can justify the
macro.

> +
> +
> const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,
>                                                apr_pool_t *pool,
>                                                svn_error_t *(*convert_from_utf8)
>
>
>
> ### There're other places that could be rewritten in terms of the new escaping
> ### functions, but I hope the two given here serve as an example of how it might
> ### be done.

Complete patches are better than incomplete ones.

> ### The rename to ascii_fuzzy_escape is to distinguish it from the new functions
> ### that escape only illegal UTF-8 sequences.
> --- subversion/libsvn_subr/utf.c	(revision 14986)
> +++ subversion/libsvn_subr/utf.c	(working copy)
> @@ -30,6 +30,7 @@
>  #include "svn_pools.h"
>  #include "svn_ctype.h"
>  #include "svn_utf.h"
> +#include "escape_impl.h"
>  #include "utf_impl.h"
>  #include "svn_private_config.h"
>  
> @@ -323,53 +324,19 @@
>  /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn
>     sequences, allocating the result in POOL. */
>  static const char *
> -fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
> +ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
>  {
> -  const char *src_orig = src, *src_end = src + len;
> -  apr_size_t new_len = 0;
> -  char *new;
> -  const char *new_orig;
> +  static unsigned char asciinonul[256];
> +  svn_stringbuf_t *result = NULL;
>  
> -  /* First count how big a dest string we'll need. */
> -  while (src < src_end)
> -    {
> -      if (! svn_ctype_isascii (*src) || *src == '\0')
> -        new_len += 5;  /* 5 slots, for "?\XXX" */
> -      else
> -        new_len += 1;  /* one slot for the 7-bit char */
> +  if (!asciinonul[0]) {
> +    asciinonul[0] = 255;                   /* NUL's not allowed */

That doesn't look threadsafe.

> +    memset(asciinonul + 1, 1, 127);        /* Other regular ASCII OK */
> +    memset(asciinonul + 128, 255, 128);    /* High half not allowed */
> +  } 
>  
> -      src++;
> -    }
> -
> -  /* Allocate that amount. */
> -  new = apr_palloc (pool, new_len + 1);
> -
> -  new_orig = new;
> -
> -  /* And fill it up. */
> -  while (src_orig < src_end)
> -    {
> -      if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')
> -        {
> -          /* This is the same format as svn_xml_fuzzy_escape uses, but that
> -             function escapes different characters.  Please keep in sync!
> -             ### If we add another fuzzy escape somewhere, we should abstract
> -             ### this out to a common function. */
> -          sprintf (new, "?\\%03u", (unsigned char) *src_orig);
> -          new += 5;
> -        }
> -      else
> -        {
> -          *new = *src_orig;
> -          new += 1;
> -        }
> -
> -      src_orig++;
> -    }
> -
> -  *new = '\0';
> -
> -  return new_orig;
> +  svn_subr__escape_string(&result, src, len, asciinonul, NULL, NULL, pool);
> +  return result->data;
>  }
>  
>  /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result
> @@ -448,7 +415,7 @@
>          errstr = apr_psprintf
>            (pool, _("Can't convert string from '%s' to '%s':"),
>             node->frompage, node->topage);
> -      err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,
> +      err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,
>                                                             src_length, pool));
>        return svn_error_create (apr_err, err, errstr);
>      }
> @@ -564,7 +531,28 @@
>    return SVN_NO_ERROR;
>  }
>  
> +static unsigned char
> +utf8_escape_mapper (unsigned char **targ, const unsigned char *start,
> +		    apr_size_t len, const svn_stringbuf_t *dest,
> +		    void *baton, apr_pool_t *pool)

New functions need documentation.

> +{
> +  const unsigned char *end = start + len;
> +  return SVN_ESCAPE_UTF8_MAPPING(*targ, end);
> +}
>  
> +svn_stringbuf_t *
> +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,
> +				      const unsigned char *src,
> +				      apr_size_t len,
> +				      apr_pool_t *pool)
> +{
> +  static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;
> +
> +  return svn_subr__escape_string(dest, src, len,
> +				 utf8screen, utf8_escape_mapper, NULL,
> +				 pool);
> +}
> +
>  svn_error_t *
>  svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,
>                             const svn_stringbuf_t *src,
> @@ -787,7 +775,7 @@
>    const char *escaped, *converted;
>    svn_error_t *err;
>  
> -  escaped = fuzzy_escape (src, strlen (src), pool);
> +  escaped = ascii_fuzzy_escape (src, strlen (src), pool);
>  
>    /* Okay, now we have a *new* UTF-8 string, one that's guaranteed to
>       contain only 7-bit bytes :-).  Recode to native... */
>
> ### With code comes testing.
> ### Note: Contains 8-bit chars, and also uses convention that cc will treat
> ### "foo" "bar" as "foobar".  Both can be avoided if useful for
> finicky compilers.
>
> --- subversion/tests/libsvn_subr/utf-test.c	(revision 14986)
> +++ subversion/tests/libsvn_subr/utf-test.c	(working copy)
> @@ -17,6 +17,7 @@
>   */
>  
>  #include "../svn_test.h"
> +#include "../../include/svn_utf.h"

Does a plain "svn_utf.h" work?

>  #include "../../libsvn_subr/utf_impl.h"
>  
>  /* Random number seed.  Yes, it's global, just pretend you can't see it. */
> @@ -222,6 +223,84 @@
>    return SVN_NO_ERROR;
>  }
>  
> +static svn_error_t *
> +utf_escape (const char **msg,
> +	    svn_boolean_t msg_only,
> +	    svn_test_opts_t *opts,
> +	    apr_pool_t *pool)
> +{
> +  char in[] = { 'A', 'S', 'C', 'I', 'I',     /* All printable */
> +		'R', 'E', 'T', '\n', 'N',    /* Newline */
> +		'B', 'E', 'L', 0x07, '!',    /* Control char */
> +		0xd2, 0xa6, 'O', 'K', '2',   /* 2-byte char, valid */
> +		0xc0, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 1st */
> +		0x82, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 2nd */
> +		0xe4, 0x87, 0xa0, 'O', 'K',  /* 3-byte char, valid */
> +		0xe2, 0xff, 0xba, 'N', 'O',  /*3-byte char, invalid 2nd */
> +		0xe0, 0x87, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */
> +		0xed, 0xa5, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */
> +		0xe4, 0x87, 0xc0, 'N', 'O',  /* 3-byte char, invalid 3rd */
> +		0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */
> +		0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */
> +		0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */
> +		0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */
> +		0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */
> +		0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */
> +                0x00 };
> +  const unsigned char *legalresult =
> +    "ASCIIRET\nNBEL!$-1(c)ÊOK2?\\192?\\195NO2?\\130?\\195NO2"-A
> +    "3$-3ıΩ0‰á†1OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"-A
> +    "?\\228?\\135?\\192NO3$-3ıΩ0Úá†µ1Y?\\242$-1(c)‡?\\181Y?\\240?\\135?\\160"-A
> +    "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135ıN?\\242?\\135?\\160"
> +    "?\\213N";

I don't like the embedded control characters in the source code, could
you generate them at runtime?

> +  const unsigned char *asciiresult =
> +    "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"
> +    "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"
> +    "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"
> +    "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"
> +    "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"
> +    "?\\242?\\135?\\160?\\213N";
> +  const unsigned char *asciified;
> +  apr_size_t legalresult_len = 213;  /* == strlen(legalresult) iff no NULs */
> +  int i = 0;
> +  svn_stringbuf_t *escaped = NULL;
> +
> +  *msg = "test utf string escaping";
> +
> +  if (msg_only)
> +    return SVN_NO_ERROR;
> +
> +  if (svn_utf__stringbuf_escape_utf8_fuzzy
> +      (&escaped, in, sizeof in - 1, pool) != escaped)

I prefer () with sizeof.

> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +  if (escaped->len != legalresult_len)
> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +  if (memcmp(escaped->data, legalresult, legalresult_len))
> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +  if (memcmp(escaped->data, legalresult, legalresult_len))

A duplicate of the one above?

> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +
> +  asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);
> +  if (strlen(asciified) != strlen(asciiresult))
> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +  if (strcmp(asciified, asciiresult))
> +    return svn_error_createf
> +      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> +  i++;
> +
> +  return SVN_NO_ERROR;
> +}
> +
>  
>  /* The test table.  */
>  
> @@ -230,5 +309,6 @@
>      SVN_TEST_NULL,
>      SVN_TEST_PASS (utf_validate),
>      SVN_TEST_PASS (utf_validate2),
> +    SVN_TEST_PASS (utf_escape),
>      SVN_TEST_NULL
>    };
>
>
>
> ### The original point of this thread.
> ### This patch will apply with an offset, since I've cut out sections which
> ### reimplement XML escaping in terms of the svn_subr__escape_string.
> --- subversion/libsvn_subr/xml.c	(revision 14986)
> +++ subversion/libsvn_subr/xml.c	(working copy)
> @@ -395,11 +413,22 @@
>    /* If expat choked internally, return its error. */
>    if (! success)
>      {
> +      svn_stringbuf_t *sanitized;
> +      unsigned char *end;
> +      
> +      svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, buf,
> +					   (len > 240 ? 240 : len),
> +					   svn_parser->pool);
> +      end = sanitized->data +
> +	    (sanitized->len > 240 ? 240 : sanitized->len);

240?  A magic number with no explanation.

> +      while (*end > 0x80 && *end < 0xc0 &&
> +	     (char *) end > sanitized->data) end--;

I think that could generate a huge error message in pathological
cases.

>        err = svn_error_createf
>          (SVN_ERR_XML_MALFORMED, NULL, 
> -         _("Malformed XML: %s at line %d"),
> +         _("Malformed XML: %s at line %d; XML starts:\n%.*s"),
>           XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
> -         XML_GetCurrentLineNumber (svn_parser->parser));
> +         XML_GetCurrentLineNumber (svn_parser->parser),
> +	 (char *) end - sanitized->data + 1, sanitized->data);
>        
>        /* Kill all parsers and return the expat error */
>        svn_xml_free_parser (svn_parser);
>

Overall this patch makes me uneasy, I'd prefer a patch that builds on
our existing ctype and/or utf-8 code.  However as I haven't tried to
implement such a patch I don't know whether our existing code is
a suitable starting point.

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

[PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On Wed, 23 Feb 2005, Charles Bailey wrote:> > Attached is a patchlet that, when expat fails to parse an hunk of XML,> appends at least part of the offending hunk to the error message.  It
which led to an exchange regarding the need to make stringsUTF-8-safe.  After too long a haitus, I posted for comment:
On 4/21/05, Charles Bailey <ba...@gmail.com> wrote:> > Well, after umpteen interrupts from the rest of life,I finally got a> few hours to look at this again.    In checking was was already> available, I found a handful of "string escaping" function in various> places which perform similar tasks (at least one with the comment> "this should share code with other_string_escaping_routine()").  Since> I'd have to add ya such function, I thought I'd try to abstract it a> bit, with the hope that similar routines could use a common base.> I've appended a short proposal at the bottom of this messages,> containing a common "engine" and an example implementation for> creating a UTF-8-safe version of an arbitrary string.
Julian Foad was kind enough to point out a dumb thinko, but no othercomments were forthcoming, possibly because the core developers werebusy with pre-1.2 cleanup.
So, after another too-long hiatus, here's a patch which implements a"common" string escaping function , uses it for UTF-8 escaping, and uses that to sanitize the offending XML, which is then output in theerror message that Jack built^W^Wstarted this thread.
I've interspersed my comments in the code, since there's imho zerochance that this version of the patch will besubstantially/stylistically suitable for committing.  They're far fromexhaustive, but this message is long enough already.
Conceptual "Log message":[[[Add function that escapes illegal UTF-8 characters, along the wayrefactoring core ofstring-escaping routines, and insure that illegal XML error messageoutputs legal UTF-8.### Probably best applied as several patches, but collected here for review.
* subversion/libsvn_subr/escape.c:   New file   (svn_subr__escape_string): Final-common-path function for escaping strings.
* subversion/libsvn_subr/escape_impl.h:   New file, declaring svn_subr__escape_string and convenience macros.   ### Logical candidate for consolidation with utf_impl.h, perhaps assubr_impl.h
* subversion/libsvn_subr/utf.c:   (fuzzy_escape): Renamed to ascii_fuzzy_escape, and rewritten to use    svn_subr__escape_string.   (svn_utf__stringbuf_escape_utf8_fuzzy): New function which escapes illegal    UTF-8 in a string, returning the escaped string in a stringbuf.   (utf8_escape_mapper): Helper function forsvn_utf__stringbuf_escape_utf8_fuzzy.
* subversion/libsvn_subr/utf_impl.h:   Add prototype for svn_utf__stringbuf_escape_utf8_fuzzy.   (svn_utf__cstring_escape_utf8_fuzzy):  Macro implementing variantof above that    returns NUL-terminated string.
* subversion/libsvn_subr/xml.c:   (svn_xml_parse): If parse fails, print (sanitized) (part of) offending XML    with error message.
* subversion/tests/libsvn_subr/utf-test.c:   (utf_escape): New function testing UTF-8 string-escaping functions.
* subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,  subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,  subversion/po/pt_BR.po, subversion/po/sv.po,  subversion/po/zh_CN.po, subversion/po/zh_TW.po:  Courtesy to translators, since I've changed a localized string.
]]]

### This driver was written because there are several "escaping"functions in different### places which do similar things with slightly different criteria. It seemed best to collect### the common work into one place, if not to save space, then tominimize divergence.### The goal here is to be fast on the simple cases via the screeningarray, while allowing### flexibility for more complex substitutions via the mappingfunction.  In very over-### simplified, off-the-cuff testing, eliminating the screening arraycaused a slowdiwn of### slightly less than twofold.### I've attempted to incorporate reasonable default behavior in thecase of NULL params.--- /dev/null	Mon Jun  6 11:06:27 2005+++ subversion/libsvn_subr/escape.c	Fri Jun  3 19:16:09 2005@@ -0,0 +1,58 @@+/*+ * escape.c:	common code for cleaning up unwanted bytes in strings+ */++#include "escape_impl.h"++#define COPY_PREFIX \+  if (c > base) { \+    svn_stringbuf_appendbytes (out, base, c - base); \+    base = c; \+  }++svn_stringbuf_t *+svn_subr__escape_string (svn_stringbuf_t **outsbuf,+			 const unsigned char *instr,+			 apr_size_t len,+			 const unsigned char *isok,+			 unsigned char (*mapper) (unsigned char **,+						  const unsigned char *,+						  apr_size_t,+						  const svn_stringbuf_t *,+						  void *,+						  apr_pool_t *),+			 void *mapper_baton,+			 apr_pool_t *pool)+{+  unsigned char *base, *c;+  svn_stringbuf_t *out;++  if (outsbuf == NULL || *outsbuf == NULL) {+    out = svn_stringbuf_create ("", pool);+    if (outsbuf)+      *outsbuf = out;+  }+  else+    out = *outsbuf;++  for (c = base = (unsigned char *) instr; c < instr + len; ) {+    apr_size_t count = isok ? isok[*c] : 0;+    if (count == 0) {+      COPY_PREFIX;+      count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;+    }+    if (count == 255) {+      char esc[6];++      COPY_PREFIX;+      sprintf (esc,"?\\%03u",*c);+      svn_stringbuf_appendcstr (out, esc);+      c++;+      base = c;+    }+    else c += count;+  }+  COPY_PREFIX;+  return out;+}+

### Comments are pretty self-explanatory.### Docs are as doxygen; will need to be downgraded to plaintext since it's### an internal header.### As noted above, it makes sense to combine this with utf_impl.h.--- /dev/null	Mon Jun  6 11:35:47 2005+++ subversion/libsvn_subr/escape_impl.h	Thu Jun  2 18:44:05 2005@@ -0,0 +1,147 @@+/*+ * escape_impl.h :  private header for string escaping function.+ */++++#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H+#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H+++#include "svn_pools.h"+#include "svn_string.h"++#ifdef __cplusplus+extern "C" {+#endif /* __cplusplus */+++/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,+ * escaping bytes as indicated by the lookup array @a isok and the mapping+ * function @a mapper. Memory is allocated from @a pool.  You may provide+ * any extra information needed by @a mapper in @a mapper_baton.+ * Returns a pointer to the stringbuf containing the escaped string.+ *+ * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; itsaddress is+ * placed in @a outsbuf unless that argument is NULL.+ * If @a isok is NULL, then @a mapper is used exclusively.+ * If @ mapper is NULL, then a single character is escaped every time @a mapper+ * would have been called.+ *+ * This is designed to be the common pathway for various string "escaping"+ * functions across subversion.  The basic approach is to scan+ * the input and decide whether each byte is OK as it stands, needs to be+ * "escaped" using subversion's "?\uuu" default format, or needs to be+ * transformed in some other way.  The decision is made using a two step+ * process, which is designed to handle the simple cases quickly but allow+ * for more complex mappings.  Since the typical string will (we hope)+ * comprise mostly simple cases, this shouldn't require much code+ * complexity or loss of efficiency.  The two steps used are:+ *+ * 1. The value of a byte from the input string ("test byte") is used as an+ *    index into a (usually 256 byte) array passed in by the caller.+ *      - If the value of the appropriate array element is 0xff,+ *        then the test byte is escaped as a "?\uuu" string in the output.+ *      - If the value of the appropriate element is otherwise non-zero,+ *        that many bytes are copied verbatim from the input to the output.+ * 2. If the array yields a 0 value, then a mapping function provided by+ *    the caller is used to allow for more complex evaluation.  This function+ *    receives five arguments:+ *      - a pointer to the pointer used by svn__do_char_escape() to+ *        mark the test byte in the input string+ *      - a pointer to the start of the input string+ *      - the length of the input string+ *      - a pointer to the output stringbuf+ *      - the ever-helpful pool.+ * The mapping function may return a (positive) nonzero value,+ * which is interpreted * as described in step 1 above, or zero,+ * indicating that the test byte * should be ignored.  In the latter+ * case, this is generally because the * mapping function has done the+ * necessary work itself; it's free to * modify the output stringbuf and+ * adjust the pointer to the test byte * as it sees fit (within the+ * bounds of the input string).  At a minimum, * it should at least+ * increment the pointer to the test byte before * returning 0, in order+ * to avoid an infinite loop.+ */++svn_stringbuf_t *+svn_subr__escape_string (svn_stringbuf_t **outsbuf,+			 const unsigned char *instr,+			 apr_size_t len,+			 const unsigned char *isok,+			 unsigned char (*mapper) (unsigned char **,+						  const unsigned char *,+						  apr_size_t,+						  const svn_stringbuf_t *,+						  void *,+						  apr_pool_t *),+			 void *mapper_baton,+			 apr_pool_t *pool);++++/** Initializer for a basic screening matrix suitable for use with+ *  #svn_subr__escape_string to escape non-UTF-8 bytes.+ *  We provide this since "UTF-8-safety" is a common denominator for+ *  most string escaping in Subversion, so this matrix makes a good+ *  starting point for more involved schemes.+ */  +#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  1,   1,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0,\+  0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}++/** Given pointer @a c into a string which ends at @a e, figure out+ *  whether (*c) starts a valid UTF-8 sequence, and if so, how many bytes+ *  it includes.  Return 255 if it's not valid UTF-8.+ *  For a more detailed description of the encoding rules, see the UTF-8+ *  specification in section 3-9 of the Unicode standard 4.0 (e.g. at+ *  http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),+ *  with special attention to Table 3-6.+ *  This macro is also provided as a building block for mappers used by+ *  #svn_subr__escape_string that want to check for UTF-8-safety in+ *  addition to other tasks.+ */+#define SVN_ESCAPE_UTF8_MAPPING(c,e)					       \+  ( (c)[0] < 0x80                                  ? /* ASCII */	       \+    1 :				                     /* OK, 1 byte */	       \+    ( ( ((c)[0] > 0xc2 && (c)[0] < 0xdf)          && /* 2-byte char */	       \+	((c) + 1 <= (e))                          && /* Got 2 bytes */	       \+	((c)[1] >= 0x80 && (c)[1] <= 0xbf))        ? /* Byte 2 legal */        \+      2 :					     /* OK, 2 bytes */         \+      ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef)      && /* 3 byte char */	       \+	  ((c) + 2 <= (e))                        && /* Got 3 bytes */	       \+	  ((c)[1] >= 0x80 && (c)[1] <= 0xbf)      && /* Basic byte 2 legal */  \+	  ((c)[2] >= 0x80 && (c)[2] <= 0xbf)      && /* Basic byte 3 legal */  \+	  (!((c)[0] == 0xe0 && (c)[1] < 0xa0))    && /* 0xe0-0x[89]? illegal */\+	  (!((c)[0] == 0xed && (c)[1] > 0x9f)) )   ? /* 0xed-0x[ab]? illegal */\+	3 :                                          /* OK, 3 bytes */	       \+	( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4)    && /* 4 byte char */         \+	    ((c) + 3 <= (e))                      && /* Got 4 bytes */         \+	    ((c)[1] >= 0x80 && (c)[1] <= 0xbf)    && /* Basic byte 2 legal */  \+	    ((c)[2] >= 0x80 && (c)[2] <= 0xbf)    && /* Basic byte 3 legal */  \+	    ((c)[3] >= 0x80 && (c)[3] <= 0xbf)    && /* Basic byte 4 legal */  \+	    (!((c)[0] == 0xf0 && (c)[1] < 0x90))  && /* 0xf0-0x8? illegal */   \+	    (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\+	  4 :					     /* OK, 4 bytes */         \+	  255))))                                    /* Illegal; escape it */+++#ifdef __cplusplus+}+#endif /* __cplusplus */++#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */

### Function names can be revised to fit convention, of course. ### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of### returning the resultant stringbuf from svn_subr__escape_string both in a### parameter and as the function's return value. If the sense is thatit'll be a cause### of debugging headaches, or that it's cortrary to subversionculture to code public### functions as macros, it's easy enough to code this as a function,and to make### svn_subr__escape_string return void (or less likely svn_error_t,if it got pickier### about params.)--- subversion/libsvn_subr/utf_impl.h	(revision 14986)+++ subversion/libsvn_subr/utf_impl.h	(working copy)@@ -24,12 +24,33 @@  #include <apr_pools.h> #include "svn_types.h"+#include "svn_string.h"  #ifdef __cplusplus extern "C" { #endif /* __cplusplus */  +/** Replace any non-UTF-8 characters in @a len byte long string @a src with+ *  escaped representations, placing the result in a stringbuf pointed to by+ *  @a *dest, which will be created if necessary.  Memory is allocated from+ *  @a pool as needed. Returns a pointer to the stringbuf containing the result+ *  (identical to @a *dest, but facilitates chaining calls).+ */+svn_stringbuf_t *+svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,+				      const unsigned char *src,+				      apr_size_t len,+				      apr_pool_t *pool);++/** Replace any non-UTF-8 characters in @a len byte long string @a src with+ *  escaped representations.  Memory is allocated from @a pool as needed.+ *  Returns a pointer to the resulting string.+ */+#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \+  (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data++const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,                                               apr_pool_t *pool,                                               svn_error_t *(*convert_from_utf8)


### There're other places that could be rewritten in terms of the new escaping### functions, but I hope the two given here serve as an example of how it might### be done.### The rename to ascii_fuzzy_escape is to distinguish it from the new functions### that escape only illegal UTF-8 sequences.--- subversion/libsvn_subr/utf.c	(revision 14986)+++ subversion/libsvn_subr/utf.c	(working copy)@@ -30,6 +30,7 @@ #include "svn_pools.h" #include "svn_ctype.h" #include "svn_utf.h"+#include "escape_impl.h" #include "utf_impl.h" #include "svn_private_config.h" @@ -323,53 +324,19 @@ /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn    sequences, allocating the result in POOL. */ static const char *-fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)+ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool) {-  const char *src_orig = src, *src_end = src + len;-  apr_size_t new_len = 0;-  char *new;-  const char *new_orig;+  static unsigned char asciinonul[256];+  svn_stringbuf_t *result = NULL; -  /* First count how big a dest string we'll need. */-  while (src < src_end)-    {-      if (! svn_ctype_isascii (*src) || *src == '\0')-        new_len += 5;  /* 5 slots, for "?\XXX" */-      else-        new_len += 1;  /* one slot for the 7-bit char */+  if (!asciinonul[0]) {+    asciinonul[0] = 255;                   /* NUL's not allowed */+    memset(asciinonul + 1, 1, 127);        /* Other regular ASCII OK */+    memset(asciinonul + 128, 255, 128);    /* High half not allowed */+  }  -      src++;-    }--  /* Allocate that amount. */-  new = apr_palloc (pool, new_len + 1);--  new_orig = new;--  /* And fill it up. */-  while (src_orig < src_end)-    {-      if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')-        {-          /* This is the same format as svn_xml_fuzzy_escape uses, but that-             function escapes different characters.  Please keep in sync!-             ### If we add another fuzzy escape somewhere, we should abstract-             ### this out to a common function. */-          sprintf (new, "?\\%03u", (unsigned char) *src_orig);-          new += 5;-        }-      else-        {-          *new = *src_orig;-          new += 1;-        }--      src_orig++;-    }--  *new = '\0';--  return new_orig;+  svn_subr__escape_string(&result, src, len, asciinonul, NULL, NULL, pool);+  return result->data; }  /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result@@ -448,7 +415,7 @@         errstr = apr_psprintf           (pool, _("Can't convert string from '%s' to '%s':"),            node->frompage, node->topage);-      err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,+      err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,                                                            src_length, pool));       return svn_error_create (apr_err, err, errstr);     }@@ -564,7 +531,28 @@   return SVN_NO_ERROR; } +static unsigned char+utf8_escape_mapper (unsigned char **targ, const unsigned char *start,+		    apr_size_t len, const svn_stringbuf_t *dest,+		    void *baton, apr_pool_t *pool)+{+  const unsigned char *end = start + len;+  return SVN_ESCAPE_UTF8_MAPPING(*targ, end);+} +svn_stringbuf_t *+svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,+				      const unsigned char *src,+				      apr_size_t len,+				      apr_pool_t *pool)+{+  static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;++  return svn_subr__escape_string(dest, src, len,+				 utf8screen, utf8_escape_mapper, NULL,+				 pool);+}+ svn_error_t * svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,                            const svn_stringbuf_t *src,@@ -787,7 +775,7 @@   const char *escaped, *converted;   svn_error_t *err; -  escaped = fuzzy_escape (src, strlen (src), pool);+  escaped = ascii_fuzzy_escape (src, strlen (src), pool);    /* Okay, now we have a *new* UTF-8 string, one that's guaranteed to      contain only 7-bit bytes :-).  Recode to native... */
### With code comes testing.### Note: Contains 8-bit chars, and also uses convention that cc will treat### "foo" "bar" as "foobar".  Both can be avoided if useful forfinicky compilers.
--- subversion/tests/libsvn_subr/utf-test.c	(revision 14986)+++ subversion/tests/libsvn_subr/utf-test.c	(working copy)@@ -17,6 +17,7 @@  */  #include "../svn_test.h"+#include "../../include/svn_utf.h" #include "../../libsvn_subr/utf_impl.h"  /* Random number seed.  Yes, it's global, just pretend you can't see it. */@@ -222,6 +223,84 @@   return SVN_NO_ERROR; } +static svn_error_t *+utf_escape (const char **msg,+	    svn_boolean_t msg_only,+	    svn_test_opts_t *opts,+	    apr_pool_t *pool)+{+  char in[] = { 'A', 'S', 'C', 'I', 'I',     /* All printable */+		'R', 'E', 'T', '\n', 'N',    /* Newline */+		'B', 'E', 'L', 0x07, '!',    /* Control char */+		0xd2, 0xa6, 'O', 'K', '2',   /* 2-byte char, valid */+		0xc0, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 1st */+		0x82, 0xc3, 'N', 'O', '2',   /* 2-byte char, invalid 2nd */+		0xe4, 0x87, 0xa0, 'O', 'K',  /* 3-byte char, valid */+		0xe2, 0xff, 0xba, 'N', 'O',  /*3-byte char, invalid 2nd */+		0xe0, 0x87, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */+		0xed, 0xa5, 0xa0, 'N', 'O',  /*3-byte char, invalid 2nd */+		0xe4, 0x87, 0xc0, 'N', 'O',  /* 3-byte char, invalid 3rd */+		0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */+		0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */+		0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */+		0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */+		0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */+		0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */+                0x00 };+  const unsigned char *legalresult =+    "ASCIIRET\nNBEL!$-1(c)ÊOK2?\\192?\\195NO2?\\130?\\195NO2"-A+    "3$-3ıΩ0‰á†1OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"-A+    "?\\228?\\135?\\192NO3$-3ıΩ0Úá†µ1Y?\\242$-1(c)‡?\\181Y?\\240?\\135?\\160"-A+    "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135ıN?\\242?\\135?\\160"+    "?\\213N";+  const unsigned char *asciiresult =+    "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"+    "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"+    "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"+    "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"+    "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"+    "?\\242?\\135?\\160?\\213N";+  const unsigned char *asciified;+  apr_size_t legalresult_len = 213;  /* == strlen(legalresult) iff no NULs */+  int i = 0;+  svn_stringbuf_t *escaped = NULL;++  *msg = "test utf string escaping";++  if (msg_only)+    return SVN_NO_ERROR;++  if (svn_utf__stringbuf_escape_utf8_fuzzy+      (&escaped, in, sizeof in - 1, pool) != escaped)+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;+  if (escaped->len != legalresult_len)+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;+  if (memcmp(escaped->data, legalresult, legalresult_len))+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;+  if (memcmp(escaped->data, legalresult, legalresult_len))+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;++  asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);+  if (strlen(asciified) != strlen(asciiresult))+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;+  if (strcmp(asciified, asciiresult))+    return svn_error_createf+      (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);+  i++;++  return SVN_NO_ERROR;+}+  /* The test table.  */ @@ -230,5 +309,6 @@     SVN_TEST_NULL,     SVN_TEST_PASS (utf_validate),     SVN_TEST_PASS (utf_validate2),+    SVN_TEST_PASS (utf_escape),     SVN_TEST_NULL   };


### The original point of this thread.### This patch will apply with an offset, since I've cut out sections which### reimplement XML escaping in terms of the svn_subr__escape_string.--- subversion/libsvn_subr/xml.c	(revision 14986)+++ subversion/libsvn_subr/xml.c	(working copy)@@ -395,11 +413,22 @@   /* If expat choked internally, return its error. */   if (! success)     {+      svn_stringbuf_t *sanitized;+      unsigned char *end;+      +      svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, buf,+					   (len > 240 ? 240 : len),+					   svn_parser->pool);+      end = sanitized->data ++	    (sanitized->len > 240 ? 240 : sanitized->len);+      while (*end > 0x80 && *end < 0xc0 &&+	     (char *) end > sanitized->data) end--;       err = svn_error_createf         (SVN_ERR_XML_MALFORMED, NULL, -         _("Malformed XML: %s at line %d"),+         _("Malformed XML: %s at line %d; XML starts:\n%.*s"),          XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),-         XML_GetCurrentLineNumber (svn_parser->parser));+         XML_GetCurrentLineNumber (svn_parser->parser),+	 (char *) end - sanitized->data + 1, sanitized->data);              /* Kill all parsers and return the expat error */       svn_xml_free_parser (svn_parser);

### Finally, be kind to the translators.--- subversion/po/pt_BR.po	(revision 14986)+++ subversion/po/pt_BR.po	(working copy)@@ -6006,8 +6006,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "XML mal formado: %s na linha %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "XML mal formado: %s na linha %d; XML comeÁa:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/es.po	(revision 14986)+++ subversion/po/es.po	(working copy)@@ -6102,8 +6102,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "XML malformado: %s en la lÌnea %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "XML malformado: %s en la lÌnea %d; XML comienza:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/de.po	(revision 14986)+++ subversion/po/de.po	(working copy)@@ -6090,8 +6090,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "Fehlerhaftes XML: %s in Zeile %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "Fehlerhaftes XML: %s in Zeile %d; XML beginnt:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/sv.po	(revision 14986)+++ subversion/po/sv.po	(working copy)@@ -6005,8 +6005,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "Felaktig XML: %s pÂ rad %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "Felaktig XML: %s pÂ rad %d; XML starta:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/ko.po	(revision 14986)+++ subversion/po/ko.po	(working copy)@@ -5906,8 +5906,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "3$-3ıΩ0Ïûò13ıΩ0Î™ª13ıΩ0Îêú1 XML: %s (3ıΩ0Ï§Ñ13ıΩ0Î≤à13ıΩ0Ìò∏1 %d)"-A+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "3$-3ıΩ0Ïûò13ıΩ0Î™ª13ıΩ0Îêú1 XML: %s (3ıΩ0Ï§Ñ13ıΩ0Î≤à13ıΩ0Ìò∏1 %d); XML:\n%.240s"-A  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/ja.po	(revision 14986)+++ subversion/po/ja.po	(working copy)@@ -6463,8 +6463,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "3$-3ıΩ0Áï∞13ıΩ0Â∏∏1$-2æ  XML æ«æπ: %s (3$-3ıΩ0Ë°å1 %d)"-A+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "3$-3ıΩ0Áï∞13ıΩ0Â∏∏1$-2æ  XML æ«æπ: %s(3$-3ıΩ0Ë°å1 %d); XML:\n%.240s"-A  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/pl.po	(revision 14986)+++ subversion/po/pl.po	(working copy)@@ -6103,8 +6103,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "Uszkodzony XML: %s w linii %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "Uszkodzony XML: %s w linii %d; XML wersja:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/zh_TW.po	(revision 14986)+++ subversion/po/zh_TW.po	(working copy)@@ -5896,8 +5896,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "3$-3ıΩ0Êúâ13ıΩ0Áº∫13ıΩ0Èô∑1 XML: %s3ıΩ0Êñº13ıΩ0Á¨¨1 %d 3ıΩ0Âàó1"-A+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "3$-3ıΩ0Êúâ13ıΩ0Áº∫13ıΩ0Èô∑1 XML: %s3ıΩ0Êñº13ıΩ0Á¨¨1 %d 3ıΩ0Âàó1; XML:\n%.240s"-A  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/nb.po	(revision 14986)+++ subversion/po/nb.po	(working copy)@@ -5995,8 +5995,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "Misdannet XML: %s i linje %d"+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "Misdannet XML: %s i linje %d; XML starter:\n%.240s"  #: libsvn_wc/adm_crawler.c:380 #, c-format--- subversion/po/zh_CN.po	(revision 14986)+++ subversion/po/zh_CN.po	(working copy)@@ -5955,8 +5955,8 @@  #: libsvn_subr/xml.c:400 #, c-format-msgid "Malformed XML: %s at line %d"-msgstr "3$-3ıΩ0Áï∏13ıΩ0ÂΩ¢13ıΩ0ÁöÑ1XMLÚ˙%s3ıΩ0Âú(r)13ıΩ0Á¨¨1 %d 3ıΩ0Ë°å1"-A+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"+msgstr "3$-3ıΩ0Áï∏13ıΩ0ÂΩ¢13ıΩ0ÁöÑ1XMLÚ˙%s3ıΩ0Âú(r)13ıΩ0Á¨¨1 %d 3ıΩ0Ë°å1; XML:\n%.240s"-A  #: libsvn_wc/adm_crawler.c:380 #, c-format
### End of patch ###
-- Regards,Charles BaileyLists: bailey _dot_ charles _at_ gmail _dot_ comOther: bailey _at_ newman _dot_ upenn _dot_ edu

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 4/27/05, Julian Foad <ju...@btopenworld.com> wrote:
> Charles Bailey wrote:
> > Comments welcomed, before I invest any more time along this path.
> 
> Uh, no really useful comments from me, just a simple bug: your example program
> says "255" a lot where you mean "256" at least in some places like here:
> 
> > svn_escape_utf8 (svn_stringbuf_t **outsbuf,
> [...]
> >   static unsigned char utf8safe[255];
> [...]
> >     memset (utf8safe + 0xf5, 255, 11);  /* oxf5 - 0xff always illegal */
> 
> This "memset" overruns the array.

Urk.  Dumb been-up-too-long thinko.  Thanks.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Julian Foad <ju...@btopenworld.com>.

Charles Bailey wrote:
> Comments welcomed, before I invest any more time along this path.

Uh, no really useful comments from me, just a simple bug: your example program 
says "255" a lot where you mean "256" at least in some places like here:

> svn_escape_utf8 (svn_stringbuf_t **outsbuf,
[...]
>   static unsigned char utf8safe[255];
[...]
>     memset (utf8safe + 0xf5, 255, 11);  /* oxf5 - 0xff always illegal */

This "memset" overruns the array.

- Julian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On 2/28/05, Peter N. Lundblad <pe...@famlundblad.se> wrote:
> 
> Not its output, but the strings internally. With few exceptions our APIs
> work with UTF8 internally. The problem with adding "raw data" is that the
> error message will be invalid UTF8 and you will have recoding errors
> later. Our own error output routines will escape the whole string then,
> but that might not other libraries do. So, yes, keeping our strings valid
> UTF8 is a goal.
> 
> > If it is a general policy to convert to UTF-8, should I code this as a
> > separate function, rather than putting the logic into parse_xml?
> >
> You can put it in a separate function. Keep it internal to the file,
> though, until we see another use case for it. We can export it if that
> happens.

Well, after umpteen interrupts from the rest of life,I finally got a
few hours to look at this again.    In checking was was already
available, I found a handful of "string escaping" function in various
places which perform similar tasks (at least one with the comment
"this should share code with other_string_escaping_routine()").  Since
I'd have to add ya such function, I thought I'd try to abstract it a
bit, with the hope that similar routines could use a common base. 
I've appended a short proposal at the bottom of this messages,
containing a common "engine" and an example implementation for
creating a UTF-8-safe version of an arbitrary string.

While it's certainly more complex than just coding another
fuzzy_escape_foo(), I think it'd be a win in the long run to minimize
the number of near-parallel routines in subversion.  Of course,
there'll still have to be multiple glue routines for C strings,
counted strings, etc.

BTW, on the topic of common behaviors, what led to the use of decimal
"?\uuu" as the common representation of "illegal" bytes?  I don't see
a technical problem; I'm just asking because my brain thinks aout bit
patterns more easily in "?\xx".

Comments welcomed, before I invest any more time along this path.

-- 
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

/*
 * Quick prototype of "string escaping" common routine and UTF-8 task.
 * Builds with
 *    gcc -I/usr/local/include/subversion-1 -I/usr/local/apr/include/apr-0 \
 *           -L/usr/local/apr/lib -lsvn_subr-1 -lapr-0 -o esctest  esctest.c
 */

#include <stdio.h>
#include <string.h>
#include <assert.h>

#include "svn_pools.h"
#include "svn_string.h"

/** Scan @a instr of length @a len bytes, copying to stringbuf @a outsbuf,
 * escaping bytes as indicated by the lookup array @isok and the mapping
 * function @mapper.  Memory is allocated from @a pool.
 * 
 * This is designed to be the common pathway for various string "escaping"
 * functions scattered through subversion.  The basic approach is to scan
 * an input and decide whether each byte is OK as it stands, needs to be
 * "escaped" using subversion's "?\uuu" default format, or needs to be
 * transformed in some other way.  The decision is made using a two step
 * process, which is designed to handle the simple cases quickly but allow
 * for more complex mappings.  Since the typical string will (we hope)
 * comprise mostly simple cases, this shouldn't require much code
 * complexity or loss of efficiency.  The two steps used are:
 *
 * 1. The value of a byte from the input string ("test byte") is used as an
 *    index into a (usually 255 byte) array passed in by the caller.
 *      - If the value of the appropriate array element is 0xff, 
 *        then the test byte is escaped as a "?\uuu" string in the output.
 *      - If the value of the appropriate element is otherwise non-zero,
 *        that many bytes are copied verbatim from the input to the output.
 * 2. If the array yields a 0 value, then a mapping function provided by
 *    the caller is used to allow for more complex evaluation.  This function
 *    receives five arguments:
 *      - a pointer to the pointer used by svn_do_char_escape to
 *        mark the test byte in the input string
 *      - a pointer to the start of the input string
 *      - the length of the input string
 *      - a pointer to the output stringbuf
 *      - the ever-helpful pool.
 *    The mapping function may return an nonzero value, which is interpreted
 *    as described in step 1 above, or zero, indicating that the test byte
 *    should be ignored.  In the latter case, this is generally because the
 *    mapping function has done the necessary work itself; it's free to
 *    modify the output stringbuf and adjust the pointer to the test byte
 *    as it sees fit (within the bounds of the input string).  At a minimum,
 *    it should at least increment the pointer to the test byte before
 *    returning 0, in order to avoid an infinite loop.
 */
static void
svn_do_char_escape (svn_stringbuf_t **outsbuf,
		    unsigned char *instr,
		    apr_size_t len,
		    unsigned char *isok,
		    int (*mapper)(),
		    apr_pool_t *pool)
{
  unsigned char *base, *c;

  if (*outsbuf == NULL)
    *outsbuf = svn_stringbuf_create ("", pool);
 
  for (c = base = instr; c < instr + len; ) {
    apr_size_t count = isok[*c];
    if (count == 0) {
      if (c > base)
	svn_stringbuf_appendbytes (*outsbuf, base, c - base);
      count = mapper (&c,instr,len,*outsbuf,pool);
    }
    if (count == 255) {
      char esc[6];

      if (c > base)
	svn_stringbuf_appendbytes (*outsbuf, base, c - base);
      sprintf (esc,"?\\%03u",*c);
      svn_stringbuf_appendcstr (*outsbuf, esc);
      c++;
      base = c;
    }
    else c += count;
  }
  if (c > base)
    svn_stringbuf_appendbytes (*outsbuf, base, c - base);
}

/** Determine whether the (presumably high-half) byte pointed to by
 * @a *cur is the start of a legal UTF-8 sequence in @a str, and tell
 * the caller to either copy the legal sequence or escape the current
 * byte as illegal.
 */
static int
svn_utf8_mapper (char **cur,
		 unsigned char *str,
		 apr_size_t len,
		 svn_stringbuf_t *target,
		 apr_pool_t *pool)
{
  unsigned char *c, *end = str + len;

  if (!cur || !*cur) return 255;  /* Can't help you; sorry */
  c = *cur;
  if (c[0] < 0x80) return 1; /* Shouldn't happen */
  if ( (c[0] >= 0xc2 && c[0] < 0xdf)   &&
       (c + 1 <= end)                  &&
       (c[1] >= 0x80 && c[1] <= 0xbf) )
    return 2;
  if (c[0] >= 0xe0 && c[0] <= 0xef) {
    if ( (c + 2 > end)                 ||
	 (c[1] <  0x80 || c[1] > 0xbf) ||
	 (c[2] <  0x80 || c[2] > 0xbf) ||
	 (c[0] == 0xe0 && c[1] < 0xa0) ||
	 (c[0] == 0xed && c[1] > 0x9f) )
      return 255;
    return 3;
  }
  if (c[0] >= 0xf0 && c[0] <= 0xf4) {
    if ( (c + 3 > end)                 ||
	 (c[1] <  0x80 || c[1] > 0xbf) ||
	 (c[2] <  0x80 || c[2] > 0xbf) ||
	 (c[3] <  0x80 || c[3] > 0xbf) ||
	 (c[0] == 0xf0 && c[1] < 0x90) ||
	 (c[0] == 0xf4 && c[1] > 0x8f) )
      return 255;
    return 4;
  }
  return 255;  /* Shouldn't happen */
}

/** Copy @a str of length @a len to @a *outsbuf, escaping bytes which are
 * not legal UTF-8 sequences.  Memory is allocated from @a pool as needed.
 * If @a *outsbuf is NULL, a new stringbuf is created.
 *
 * This is the "public" entry point to the common escaping engine for
 * building UTF-8-safe strings.
 */
static void
svn_escape_utf8 (svn_stringbuf_t **outsbuf,
		 unsigned char *str,
		 apr_size_t len,
		 apr_pool_t *pool)
{
  static unsigned char utf8safe[255];

  if (!utf8safe[0]) {
    memset (utf8safe, 1, 0x7f);         /* 0x00 - 0x7f always legal */
    memset (utf8safe + 0x80, 0, 0x7f);  /* 0x80 - 0xff need check, except */
    memset (utf8safe + 0xc0, 255, 2);   /* 0xc0 - 0xc1 always illegal */
    memset (utf8safe + 0xf5, 255, 11);  /* oxf5 - 0xff always illegal */
  }

  svn_do_char_escape (outsbuf, str, len, utf8safe, svn_utf8_mapper, pool);
}


/* Quick hack to run some simple tests */
int main() {
  char test[4][6] = { { 'A', 'S', 'C', 'I', 'I', '\0'  },
		      { 'B', 'E', 'L', 0x07, '!', '\0' },
		      { 0xe2, 0x98, 0xba, 'O', 'K', '\0' },
		      { 0xe2, 0xff, 0xba, 'N', 'O', '\0' }, };

  int i;
  svn_stringbuf_t *outsbuf;
  apr_pool_t *pool;

  apr_initialize();
  apr_pool_create(&pool,NULL);

  for (i = 0; i < 4; i++) {
    outsbuf = NULL;
    printf("Original: \"%s\"\n", test[i]);
    svn_escape_utf8(&outsbuf, test[i], 5, pool);
    printf("Escaped: \"%s\"\n", outsbuf->data);
  }
  return 0;
}

/*  End of message */

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

On Sun, 27 Feb 2005, Charles Bailey wrote:

> On Sun, 27 Feb 2005 22:25:11 +0100 (CET), Peter N. Lundblad
> <pe...@famlundblad.se> wrote:
[snip discussion about ensuring that the error stirng is valid UTF8]

> This looks like a SMOP, but I'm not sure what the real benefit is.
> Does Subversion make the guarantee that all of its output will be
> UTF-8?  Unless that's a major goal, I think there's something to be
> said for inserting the offending XML "as is" into the error message.

Not its output, but the strings internally. With few exceptions our APIs
work with UTF8 internally. The problem with adding "raw data" is that the
error message will be invalid UTF8 and you will have recoding errors
later. Our own error output routines will escape the whole string then,
but that might not other libraries do. So, yes, keeping our strings valid
UTF8 is a goal.

> If it is a general policy to convert to UTF-8, should I code this as a
> separate function, rather than putting the logic into parse_xml?
>
You can put it in a separate function. Keep it internal to the file,
though, until we see another use case for it. We can export it if that
happens.

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On Sun, 27 Feb 2005 22:25:11 +0100 (CET), Peter N. Lundblad
<pe...@famlundblad.se> wrote:
> On Sat, 26 Feb 2005, Charles Bailey wrote:
> 
> > On Thu, 24 Feb 2005 14:17:20 +0100 (CET), Peter N. Lundblad
> > <pe...@famlundblad.se> wrote:
> > > On Wed, 23 Feb 2005, Charles Bailey wrote:
> > >
> > A revised patch is appended below my sig, with appropriate attention
> > to l10n.   I've taken idiom from other messages where I can (z.B.
> > 'beginnen' v. 'anfingen').  More fluent readers may want to pitch in
> > on the middle European and Scandinavian languages, where my morphology
> > may be laughably in error, and I've punted almost entirely on the east
> > Asian languages, but I hope it'll serve as a decent start nonetheless.
> >
> You needn't bother with the .po files in a code patch. The translators
> will take care of that. (We generally don't expect patch submitters to
> provide translations of all 10 languages:-)

I'd assumed that to be the case generally, but if I can save them some
work, so much the better.

> > Index: subversion/libsvn_subr/xml.c
> > ===================================================================
> > --- subversion/libsvn_subr/xml.c      (revision 13160)
> > +++ subversion/libsvn_subr/xml.c      (working copy)
> > @@ -417,9 +417,9 @@
> >      {
> >        err = svn_error_createf
> >          (SVN_ERR_XML_MALFORMED, NULL,
> > -         _("Malformed XML: %s at line %d"),
> > +         _("Malformed XML: %s at line %d; XML starts:\n%.240s"),
> >           XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
> > -         XML_GetCurrentLineNumber (svn_parser->parser));
> > +         XML_GetCurrentLineNumber (svn_parser->parser), buf);
> >
> I discussed this with Erik Heulsmann on IRC and our conclusion is that we
> need to be mroe careful with encodings here. As I stated earlier, we have
> no guarantee that this string will be valid UTF8 (and we can't know from
> the error either, since the encoding error might be after the error being
> reported). We have code in utf_valid.c to decide if a string is valid
> UTF8. You need to use this. We have code in utf. (fuzzy_escape) to fuzzily
> escape non-ASCII characters in a string. Export this function as an
> internal "API" (svn_utf__fuzzy_escape) and use it if the part of the
> string is invalid UTF8. To make it even more complicated, you should
> ensure that you cut the string on an UTF8 boundary. There is code for this
> in svn_ctype.h. Make sure you break the string before a LEAD character or
> ASCII character. This will avoid unnecessary escaping.

This looks like a SMOP, but I'm not sure what the real benefit is. 
Does Subversion make the guarantee that all of its output will be
UTF-8?  Unless that's a major goal, I think there's something to be
said for inserting the offending XML "as is" into the error message. 
If it is a general policy to convert to UTF-8, should I code this as a
separate function, rather than putting the logic into parse_xml?

> As so often, a patch turns out to be a little more work when anticipated
> at the first glance... But, please continue. It will help us much in the
> future.

Sure, if it's helpful.  I just want to make sure I'm not
overengineering the solution to a small problem.

--
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

On Sat, 26 Feb 2005, Charles Bailey wrote:

> On Thu, 24 Feb 2005 14:17:20 +0100 (CET), Peter N. Lundblad
> <pe...@famlundblad.se> wrote:
> > On Wed, 23 Feb 2005, Charles Bailey wrote:
> >
> > Good start:-) A few commets below. (Also, could you please try to get the
> > MIME type of your attachments as some text mime type Makes it easier to
> > review.)
>
> Definitely.  I put a little too much faith in the mail client's
> ability to recognize text; for now, I'll just inline the patch, since
> it's small.
>
That's fine.

> A revised patch is appended below my sig, with appropriate attention
> to l10n.   I've taken idiom from other messages where I can (z.B.
> 'beginnen' v. 'anfingen').  More fluent readers may want to pitch in
> on the middle European and Scandinavian languages, where my morphology
> may be laughably in error, and I've punted almost entirely on the east
> Asian languages, but I hope it'll serve as a decent start nonetheless.
>
You needn't bother with the .po files in a code patch. The translators
will take care of that. (We generally don't expect patch submitters to
provide translations of all 10 languages:-)

> Index: subversion/libsvn_subr/xml.c
> ===================================================================
> --- subversion/libsvn_subr/xml.c	(revision 13160)
> +++ subversion/libsvn_subr/xml.c	(working copy)
> @@ -417,9 +417,9 @@
>      {
>        err = svn_error_createf
>          (SVN_ERR_XML_MALFORMED, NULL,
> -         _("Malformed XML: %s at line %d"),
> +         _("Malformed XML: %s at line %d; XML starts:\n%.240s"),
>           XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
> -         XML_GetCurrentLineNumber (svn_parser->parser));
> +         XML_GetCurrentLineNumber (svn_parser->parser), buf);
>
I discussed this with Erik Heulsmann on IRC and our conclusion is that we
need to be mroe careful with encodings here. As I stated earlier, we have
no guarantee that this string will be valid UTF8 (and we can't know from
the error either, since the encoding error might be after the error being
reported). We have code in utf_valid.c to decide if a string is valid
UTF8. You need to use this. We have code in utf. (fuzzy_escape) to fuzzily
escape non-ASCII characters in a string. Export this function as an
internal "API" (svn_utf__fuzzy_escape) and use it if the part of the
string is invalid UTF8. To make it even more complicated, you should
ensure that you cut the string on an UTF8 boundary. There is code for this
in svn_ctype.h. Make sure you break the string before a LEAD character or
ASCII character. This will avoid unnecessary escaping.

As so often, a patch turns out to be a little more work when anticipated
at the first glance... But, please continue. It will help us much in the
future.

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by Charles Bailey <ba...@gmail.com>.

On Thu, 24 Feb 2005 14:17:20 +0100 (CET), Peter N. Lundblad
<pe...@famlundblad.se> wrote:
> On Wed, 23 Feb 2005, Charles Bailey wrote:
> 
> > Attached is a patchlet that, when expat fails to parse an hunk of XML,
> > appends at least part of the offending hunk to the error message.  It
> > provides a better jumping-off point for further debugging than the
> > current message, especially since the line number often isn't all that
> > meaningful.  It's not intended to flag the error specifically, or even
> > to insure it's actually in the output fragment, since it'd be too hard
> > to reliably identify the problem.  (For that matter, I can envision
> > cases where the error is actually in a previously parsed hunk of XML,
> > but doesn't become apparent until a tag mismatches in the current hunk
> > or the like.) By way of example, the current error message:
> >
> I think your approach is good enough in this respect.
> 
> > A few comments:

> We generally try to avoid line breaks in error messages. That's because of
> GUI issues. But with l10n and long filenames, we can't possibly keep
> messages within 80 chars. IN this case, I think it is a good idea to keep
> the newline befre the XML.

> >  - I placed delimiters before and after the XML with the thought that
> > it might help in some
> >    situations where the XML itself started with odd characters; again,
> > this is relatively
> >    arbitrary.  It'd arguably be of some help for parsing the error
> > (the delimiters aren't
> >    legal XML), but as this is a fatal and (one hopes) uncommon error,
> > I don't think many
> >    people will be writing tools to anticipate it.  If the consensus is
> > that they just add noise,
> >    they're easy enough to drop.
> >
> IMO, just drop them.

OK, as 1 of 1 respondents recommends dropping them, they're gone.

> Good start:-) A few commets below. (Also, could you please try to get the
> MIME type of your attachments as some text mime type Makes it easier to
> review.)

Definitely.  I put a little too much faith in the mail client's
ability to recognize text; for now, I'll just inline the patch, since
it's small.

>> Display fragment of offending XML on "Malformed XML" error
> 
> Missing period at end of sentence and blank line after the summary.
> 
>   * subversion/libsvn_subr/xml.c
>     (svn_xml_parse): Append reasonable hunk of XML to error message
> 
> We usually put the asterisk in the first column.

OK -- I overtrained on incidental details in HACKING.


> +         _("Malformed XML: %s at line %d; XML %s:\n>>>>>\n%.240s\n<<<<<\n"),
> Drop the trailing newline.
>           XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
> -         XML_GetCurrentLineNumber (svn_parser->parser));
> +         XML_GetCurrentLineNumber (svn_parser->parser),
> +        len > 240 ? "starts" : "is", buf);
> You can't use message fragment like this in internationalized messages.
> Think of what happens when the _() marked string gets translated, but no
> the "is" or "starts":-) You have tomove the condition outside the message,
> creating two complete messages instead.

OK -- my oversight.  Sorry.  The problem's really not worth two
messages, so I'll just use "starts"; some small fraction of the time
the offending XML may coincidentally also end with the displayerd
fragment.

A revised patch is appended below my sig, with appropriate attention
to l10n.   I've taken idiom from other messages where I can (z.B.
'beginnen' v. 'anfingen').  More fluent readers may want to pitch in
on the middle European and Scandinavian languages, where my morphology
may be laughably in error, and I've punted almost entirely on the east
Asian languages, but I hope it'll serve as a decent start nonetheless.

--
Regards,
Charles Bailey
Lists: bailey _dot_ charles _at_ gmail _dot_ com
Other: bailey _at_ newman _dot_ upenn _dot_ edu

Display XML fragment in "Malformed XML" error message to aid debugging.
* subversion/libsvn_subr/xml.c
  (svn_xml_parse): Append reasonable hunk of XML to error message
* subversion/po/de.po,
  subversion/po/es.po,
  subversion/po/ja.po,
  subversion/po/ko.po,
  subversion/po/nb.po,
  subversion/po/pl.po,
  subversion/po/pt_BR.po,
  subversion/po/sv.po,
  subversion/po/zh_CN.po,
  subversion/po/zh_CN.po:
  Update error message.

Index: subversion/libsvn_subr/xml.c
===================================================================
--- subversion/libsvn_subr/xml.c	(revision 13160)
+++ subversion/libsvn_subr/xml.c	(working copy)
@@ -417,9 +417,9 @@
     {
       err = svn_error_createf
         (SVN_ERR_XML_MALFORMED, NULL, 
-         _("Malformed XML: %s at line %d"),
+         _("Malformed XML: %s at line %d; XML starts:\n%.240s"),
          XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
-         XML_GetCurrentLineNumber (svn_parser->parser));
+         XML_GetCurrentLineNumber (svn_parser->parser), buf);
       
       /* Kill all parsers and return the expat error */
       svn_xml_free_parser (svn_parser);
Index: subversion/po/pt_BR.po
===================================================================
--- subversion/po/pt_BR.po	(revision 13160)
+++ subversion/po/pt_BR.po	(working copy)
@@ -4943,8 +4943,8 @@
 
 #: libsvn_subr/xml.c:420
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "XML mal formado: %s na linha %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "XML mal formado: %s na linha %d; XML começa:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/es.po
===================================================================
--- subversion/po/es.po	(revision 13160)
+++ subversion/po/es.po	(working copy)
@@ -5011,8 +5011,8 @@
 
 #: libsvn_subr/xml.c:420
 #, fuzzy, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Datos de flujo malformados"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "Datos de XML malformados: %s en línea %d; XML comienza:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/sv.po
===================================================================
--- subversion/po/sv.po	(revision 13160)
+++ subversion/po/sv.po	(working copy)
@@ -4930,8 +4930,8 @@
 
 #: libsvn_subr/xml.c:367
 #, fuzzy, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Felaktig data i strÃ¶m"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "Felaktig data i strÃ¶m: %s vid rad %d; XML starta:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/de.po
===================================================================
--- subversion/po/de.po	(revision 13160)
+++ subversion/po/de.po	(working copy)
@@ -5006,8 +5006,8 @@
 
 #: libsvn_subr/xml.c:420
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Fehlerhaftes XML: %s in Zeile %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "Fehlerhaftes XML: %s in Zeile %d; XML beginnt:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/ko.po
===================================================================
--- subversion/po/ko.po	(revision 13160)
+++ subversion/po/ko.po	(working copy)
@@ -4882,8 +4882,8 @@
 
 #: libsvn_subr/xml.c:420
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "잘못된 XML: %s (줄 번호 %d)"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "잘못된 XML: %s (줄 번호 %d); XML:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/ja.po
===================================================================
--- subversion/po/ja.po	(revision 13160)
+++ subversion/po/ja.po	(working copy)
@@ -5262,8 +5262,8 @@
 
 #: libsvn_subr/xml.c:420
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "異常な XML です: %s (行 %d)"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "異常な XML です: %s (行 %d); XML:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/zh_TW.po
===================================================================
--- subversion/po/zh_TW.po	(revision 13160)
+++ subversion/po/zh_TW.po	(working copy)
@@ -4890,8 +4890,8 @@
 
 #: libsvn_subr/xml.c:420
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "有缺陷 XML: %s 於第 %d 列"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "有缺陷 XML: %s 於第 %d 列; XML:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/pl.po
===================================================================
--- subversion/po/pl.po	(revision 13160)
+++ subversion/po/pl.po	(working copy)
@@ -5054,8 +5054,8 @@
 
 #: libsvn_subr/xml.c:420
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Uszkodzony XML: %s w linii %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "Uszkodzony XML: %s w linii %d; XML wersja:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/nb.po
===================================================================
--- subversion/po/nb.po	(revision 13160)
+++ subversion/po/nb.po	(working copy)
@@ -4935,8 +4935,8 @@
 
 #: libsvn_subr/xml.c:420
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Misdannet XML: %s i linje %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "Misdannet XML: %s i linje %d; XML starter:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
Index: subversion/po/zh_CN.po
===================================================================
--- subversion/po/zh_CN.po	(revision 13160)
+++ subversion/po/zh_CN.po	(working copy)
@@ -4805,8 +4805,8 @@
 
 #: libsvn_subr/xml.c:367
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "畸形的XML：%s 在第 %d 行"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.240s"
+msgstr "畸形的XML：%s 在第 %d 行; XML:\n%.240s"
 
 #: libsvn_wc/adm_crawler.c:374
 #, c-format
  
## End of patch ##

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Include offending XML in "Malformed XML" error message

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

On Wed, 23 Feb 2005, Charles Bailey wrote:

> Attached is a patchlet that, when expat fails to parse an hunk of XML,
> appends at least part of the offending hunk to the error message.  It
> provides a better jumping-off point for further debugging than the
> current message, especially since the line number often isn't all that
> meaningful.  It's not intended to flag the error specifically, or even
> to insure it's actually in the output fragment, since it'd be too hard
> to reliably identify the problem.  (For that matter, I can envision
> cases where the error is actually in a previously parsed hunk of XML,
> but doesn't become apparent until a tag mismatches in the current hunk
> or the like.) By way of example, the current error message:
>
I think your approach is good enough in this respect.

> A few comments:
>
>  - The error message now extends over more than one line.  I understand that svn
>     generally avoids this, but I think it's necessary to supply a
> large enough fragment
>     that the user can make a well-educated guess about the location of
> the problem.
>     The presence of \ns in svn-generated XML would also make it tough
> to guarantee
>     that any meaningful fragment stayed on one line.  Finally, trying
> to flatten the XML to
>     fit on a single logical line would go well past 80 cols, which I
> think would be more of a
>     problem than multiple lines that fit into an 80 column display.
>
We generally try to avoid line breaks in error messages. That's because of
GUI issues. But with l10n and long filenames, we can't possibly keep
messages within 80 chars. IN this case, I think it is a good idea to keep
the newline befre the XML.

>  - I chose a maximum fragment length of 240 chars
arbitrarily, as long > enough to give the
>    reader some context and short enough to avoid a huge trail of text.
>  It seemed like a
>    reasonable balance to me.
>
Agree.

>  - I placed delimiters before and after the XML with the thought that
> it might help in some
>    situations where the XML itself started with odd characters; again,
> this is relatively
>    arbitrary.  It'd arguably be of some help for parsing the error
> (the delimiters aren't
>    legal XML), but as this is a fatal and (one hopes) uncommon error,
> I don't think many
>    people will be writing tools to anticipate it.  If the consensus is
> that they just add noise,
>    they're easy enough to drop.
>
IMO, just drop them.

> 'Nuff said about a minor patch for an infrequent occurrence.  Chalk it
> up to it being my first foray onto this list.
>
Good start:-) A few commets below. (Also, could you please try to get the
MIME type of your attachments as some text mime type Makes it easier to
review.)


Display fragment of offending XML on "Malformed XML" error

Missing period at end of sentence and blank line after the summary.

  * subversion/libsvn_subr/xml.c
    (svn_xml_parse): Append reasonable hunk of XML to error message

We usually put the asterisk in the first column.

Index: subversion/libsvn_subr/xml.c
===================================================================
--- subversion/libsvn_subr/xml.c	(revision 12919)
+++ subversion/libsvn_subr/xml.c	(working copy)
@@ -417,9 +417,10 @@
     {
       err = svn_error_createf
         (SVN_ERR_XML_MALFORMED, NULL,
-         _("Malformed XML: %s at line %d"),
+         _("Malformed XML: %s at line %d; XML %s:\n>>>>>\n%.240s\n<<<<<\n"),
Drop the trailing newline.
          XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
-         XML_GetCurrentLineNumber (svn_parser->parser));
+         XML_GetCurrentLineNumber (svn_parser->parser),
+	 len > 240 ? "starts" : "is", buf);
You can't use message fragment like this in internationalized messages.
Think of what happens when the _() marked string gets translated, but no
the "is" or "starts":-) You have tomove the condition outside the message,
creating two complete messages instead.

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org