Posted to dev@subversion.apache.org by Karl Fogel <kf...@newton.ch.collab.net> on 2002/07/21 06:05:47 UTC

converting unconvertible UTF-8 data

kfogel@tigris.org writes:
> Log:
> In revisions 2600 and 2598, the Subversion repository has UTF-8 data
> that cannot be converted to ISO-8859-1, among others.  (The data is
> Branko Čibej's full name, which I include here just to make this
> revision self-proving).
> 
> * subversion/clients/cmdline/log-cmd.c
>   (log_message_receiver): Don't error if encounter unconvertible data,
>   just print a placeholder and move on.

Whew!

Okay, now that the immediate problem is fixed, we need to decide how
to deal with this better.

   Problem: A log message may have data with characters that cannot be
            converted from UTF-8 to the local encoding.

The current solution makes "svn log" work again, but loses more
information than it has to.  Some revisions may print out with a log
message that says simply:

   "[unconvertible log msg]"

This could be... frustrating... for users :-).

I can think of three improvements, not mutually exclusive:

   1) A --raw option (or whatever it would be called) that tells log
      to print the raw bytes of the data, instead of trying to
      convert.  I don't know what other commands this flag might
      affect; only log comes to mind right now.  It's your own fault
      if it screws up your tty :-).

   2) A --allow-raw option, meaning: convert if possible, otherwise
      emit the raw data.

   3) Have a fuzzy conversion function that tries to convert all the
      data, but if that fails, converts every character it can and
      replaces the others with ?\XXX (or some standard sequence) to
      indicate the Unicode value of the failed character.

   4) My brain is puny and weak.  There are surely other ways to
      address this problem that I'm not thinking of.  Suggestions?

Right now I like (3) the best, since it doesn't force the user to do
something different.  Of course, we'd have to choose wisely where we
use the fuzzy function -- again, only "log" comes to mind so far.
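
To make option (3) concrete, here is a minimal sketch under stated
assumptions: the target encoding is ISO-8859-1, and the escape is "?\"
followed by the Unicode code point in hex.  The names decode_utf8 and
fuzzy_utf8_to_latin1 and the exact escape format are hypothetical; they
are not existing Subversion or APR functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with n
   bytes available.  Stores the code point in *cp and returns the number
   of bytes consumed, or 0 if the bytes are not well-formed UTF-8.
   (Overlong forms and surrogates are not rejected in this sketch.) */
static size_t decode_utf8(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;

    if (len > n) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Fuzzy UTF-8 -> ISO-8859-1 for a NUL-terminated string: copy every
   character that fits, escape the rest as "?\XXX" (hex code point).
   Malformed bytes are escaped by their byte value.
   Returns the number of escapes written, or -1 if dst is too small. */
int fuzzy_utf8_to_latin1(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t srclen, in = 0, out = 0;
    int escaped = 0;

    assert(src != NULL && dst != NULL);
    srclen = strlen(src);

    while (in < srclen) {
        unsigned long cp = 0;
        size_t used = decode_utf8(s + in, srclen - in, &cp);
        int w;

        if (used == 0) {              /* malformed: escape the raw byte value */
            cp = s[in];
            used = 1;
        } else if (cp <= 0xFF) {      /* fits in ISO-8859-1: copy it across */
            if (out + 1 >= dstsize)
                return -1;
            dst[out++] = (char)cp;
            in += used;
            continue;
        }

        if (out >= dstsize)
            return -1;
        w = snprintf(dst + out, dstsize - out, "?\\%03lX", cp);
        if (w < 0 || (size_t)w >= dstsize - out)
            return -1;                /* escape did not fit */
        out += (size_t)w;
        escaped++;
        in += used;
    }

    if (out >= dstsize)
        return -1;
    dst[out] = '\0';
    return escaped;
}
```

With this sketch, the UTF-8 form of "Branko Čibej" comes out as
"Branko ?\10Cibej" with a return value of 1.  A variable-width hex
escape is ambiguous when a hex digit follows it, so a real format would
probably want a fixed width or a terminator.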

Thoughts?
-K

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: converting unconvertible UTF-8 data

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Karl Fogel <kf...@newton.ch.collab.net> writes:

>    3) Have a fuzzy conversion function that tries to convert all the
>       data, but if that fails, converts every character it can and
>       replaces the others with ?\XXX (or some standard sequence) to
>       indicate the Unicode value of the failed character.
>
>[...]
> 
> Right now I like (3) the best, since it doesn't force the user to do
> something different.  Of course, we'd have to choose wisely where we
> use the fuzzy function -- again, only "log" comes to mind so far.

iconv already does fuzzy conversion on some systems.  For example, on
Solaris 8, I get '?' characters instead of the "offending" character in
Branko's name when I do svn log in an ISO-8859-1 locale.  Right now,
apr_xlate doesn't inform us that this has happened, since it doesn't
check the return code of iconv properly (unless this has been fixed
recently).  But it _should_ tell us, so we need to decide where to
allow it anyway.
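
For reference, POSIX specifies that iconv() returns the number of
non-identical (lossy) conversions it performed, or (size_t)-1 on error.
A minimal sketch of a caller that passes this information along might
look like the following.  The wrapper name convert_checked is
hypothetical, and this is not how apr_xlate is actually written.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: convert the NUL-terminated string `in` from
   charset `from` to charset `to`.
   Returns  0     if every character converted exactly,
            n > 0 if iconv reported n lossy conversions,
           -1     on failure (unknown charset, unconvertible input, or
                  output buffer too small); errno is left as iconv set it.
   Stateful encodings would also need a final flush call, omitted here. */
long convert_checked(const char *to, const char *from,
                     const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;   /* some platforms declare this const char ** */
    char *outp = out;
    size_t inleft, outleft, lossy;
    int saved_errno;

    assert(to && from && in && out && outsize > 0);

    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    inleft = strlen(in);
    outleft = outsize - 1;    /* keep room for the terminating NUL */
    lossy = iconv(cd, &inp, &inleft, &outp, &outleft);

    saved_errno = errno;      /* iconv_close() may overwrite errno */
    iconv_close(cd);
    errno = saved_errno;

    if (lossy == (size_t)-1)
        return -1;
    *outp = '\0';
    return (long)lossy;
}
```

A caller such as "svn log" could then treat a positive return value as
"converted, but fuzzily" and decide per command whether to accept it.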


  // Marcus




Re: converting unconvertible UTF-8 data

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Karl Fogel <kf...@newton.ch.collab.net> writes:
>    3) Have a fuzzy conversion function that tries to convert all the
>       data, but if that fails, converts every character it can and
>       replaces the others with ?\XXX (or some standard sequence) to
>       indicate the Unicode value of the failed character.

Btw, my intention is that this would go in APR, of course.

-K


Re: converting unconvertible UTF-8 data

Posted by Justin Erenkrantz <je...@apache.org>.
On Sun, Jul 21, 2002 at 01:05:47AM -0500, Karl Fogel wrote:
>    3) Have a fuzzy conversion function that tries to convert all the
>       data, but if that fails, converts every character it can and
>       replaces the others with ?\XXX (or some standard sequence) to
>       indicate the Unicode value of the failed character.

+1.  It should degrade gracefully if it can.  (Much like what
mutt does for poor Branko's name.)  -- justin


Re: converting unconvertible UTF-8 data

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Ulrich Drepper <dr...@redhat.com> writes:
> Simple or not is not the problem.  Time is.  Unfortunately.  I'll try it
> and I'll try to come up with a patch but it might take until the weekend.
> My weeks are pretty well filled up.

Understood; do what you can, you've been very helpful even just posting.

-K


Re: converting unconvertible UTF-8 data

Posted by Ulrich Drepper <dr...@redhat.com>.
On Mon, 2002-07-22 at 08:14, Karl Fogel wrote:

> You need to be able to build Subversion, so you can contribute code as
> well as ideas.  This *should* be simply a matter of

Simple or not is not the problem.  Time is.  Unfortunately.  I'll try it
and I'll try to come up with a patch but it might take until the weekend.
My weeks are pretty well filled up.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Re: converting unconvertible UTF-8 data

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Ulrich Drepper <dr...@redhat.com> writes:
> When it comes to these things it's really a quality-of-implementation
> thing.  You don't want to drag down the quality on one system just
> because there are some others which don't have implementations of the
> needed functionality.
>
> Also, consistency in the output of information which is meant for the
> user (e.g., log files, changelogs etc) is not that important.  A human
> reader is able to figure out a lot despite noise on the channel.

Hmm, yeah, that seems reasonable to me.

> All you need is a configure test and a bit of #ifdef'ed code to append
> "//TRANSLIT" to the codeset name.

Yup.

Okay, Ulrich, it's time for me to make the Big Request:

You need to be able to build Subversion, so you can contribute code as
well as ideas.  This *should* be simply a matter of

   1) Download the latest tarball from
      http://subversion.tigris.org/servlets/ProjectDocumentList?folderID=74

   2) Build it, then use it to check out the sources
      "svn co http://svn.collab.net/repos/svn"

   3) Then build your new working copy.  First run autogen.sh, it will
      tell you exactly what dependencies you need and where to get
      them.

This is a slightly more complex recipe than users need, since they
aren't trying to get a working tree for development, but it's not
really _that_ hard -- dozens of people here have done it.

Can you write this configuration patch?  Sure, someone else could do
it, but probably you are the most qualified to get it done quickly, so
why settle for less? :-)

-Karl


Re: converting unconvertible UTF-8 data

Posted by Ulrich Drepper <dr...@redhat.com>.
On Sun, 2002-07-21 at 20:21, Karl Fogel wrote:

> That's not the only problem -- the portability issue in your previous
> paragraph is the real showstopper.

I wouldn't say so.  At compile time you can find out whether the
implementation supports //TRANSLIT or not.  Just don't enable it on
platforms where it's not available.

And no, //TRANSLIT is not ignored where it is not recognized.  The use
of slashes to separate fields in the name is something which I
introduced, and only GNU libiconv uses it as well (at least to the
best of my knowledge).


> Either way, we may still eventually want our own fuzzy function to
> supply whatever cannot be depended on from iconv.  It's good if
> Subversion behaves as close to the same everywhere as possible.

When it comes to these things it's really a quality-of-implementation
thing.  You don't want to drag down the quality on one system just
because there are some others which don't have implementations of the
needed functionality.

Also, consistency in the output of information which is meant for the
user (e.g., log files, changelogs etc) is not that important.  A human
reader is able to figure out a lot despite noise on the channel.


> And we can eventually give our "fuzzy" function the option of doing
> transliteration.  But I think the initial implementation would better
> output ?\XXX for each unconverted byte, since that's simple to get
> right initially.

If you do this (and it's not easy if you use iconv for the conversion),
why not enable the transliteration for the platforms which support it? 
All you need is a configure test and a bit of #ifdef'ed code to append
"//TRANSLIT" to the codeset name.

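As a rough illustration of the "configure test plus #ifdef" idea, here
is a minimal sketch.  The macro HAVE_ICONV_TRANSLIT and the functions
iconv_accepts_translit and target_codeset_name are hypothetical names
chosen for this example.  The probe relies on the point made above that
an iconv which does not recognize the suffix rejects the name rather
than ignoring it.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* HAVE_ICONV_TRANSLIT is a hypothetical macro that a configure test
   would define when iconv_open() accepts a "//TRANSLIT" suffix. */
#ifdef HAVE_ICONV_TRANSLIT
#define TRANSLIT_SUFFIX "//TRANSLIT"
#else
#define TRANSLIT_SUFFIX ""
#endif

/* Core of the configure-time probe: returns 1 if this iconv accepts
   the suffix, 0 if it rejects the charset name. */
int iconv_accepts_translit(void)
{
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");

    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}

/* Build the to-charset name to hand to iconv_open(), appending the
   suffix only where configure found it to be supported.
   Returns 0 on success, -1 if buf is too small. */
int target_codeset_name(const char *codeset, char *buf, size_t bufsize)
{
    size_t clen, slen;

    assert(codeset != NULL && buf != NULL);

    clen = strlen(codeset);
    slen = strlen(TRANSLIT_SUFFIX);
    if (clen + slen + 1 > bufsize)
        return -1;

    memcpy(buf, codeset, clen);
    memcpy(buf + clen, TRANSLIT_SUFFIX, slen + 1);   /* copies the NUL too */
    return 0;
}
```

On a platform where the macro is not defined, the codeset name passes
through unchanged, so the behavior there stays exactly as it is today.
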

Re: converting unconvertible UTF-8 data

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Ulrich Drepper <dr...@redhat.com> writes:
> >    3) Have a fuzzy conversion function that tries to convert all the
> >       data, but if that fails, converts every character it can and
> >       replaces the others with ?\XXX (or some standard sequence) to
> >       indicate the Unicode value of the failed character.
> 
> Preferable to this is the use of transliteration.  You are talking
> about a transformation which can lose information anyway.  Some iconv()
> implementations (glibc's and GNU libiconv's) support transliteration.
> Just add //TRANSLIT to the to-charset option string of the iconv_open
> call.
> 
> The problem with transliteration is, though, that it is locale
> dependent.  So the result may differ depending on the selected locale.

That's not the only problem -- the portability issue in your previous
paragraph is the real showstopper.

What happens if we add "//TRANSLIT" to a charset with an iconv
implementation that doesn't know anything about transliteration?  Is
it guaranteed to ignore unknown appends of the form "//FOO", or can it
bomb because it can't find the charset named "ISO-8859-1//TRANSLIT"?

If we at least know that adding "//TRANSLIT" will do no harm, then we
could add it right away (where it's not present already).  But if it
could cause a problem, then it doesn't help us.

Either way, we may still eventually want our own fuzzy function to
supply whatever cannot be depended on from iconv.  It's good if
Subversion behaves as close to the same everywhere as possible.

And we can eventually give our "fuzzy" function the option of doing
transliteration.  But I think the initial implementation would better
output ?\XXX for each unconverted byte, since that's simple to get
right initially.  Incremental improvements (perhaps with additional
run-time options) are possible from there.

-K


Re: converting unconvertible UTF-8 data

Posted by Ulrich Drepper <dr...@redhat.com>.
On Sat, 2002-07-20 at 23:05, Karl Fogel wrote:


>    3) Have a fuzzy conversion function that tries to convert all the
>       data, but if that fails, converts every character it can and
>       replaces the others with ?\XXX (or some standard sequence) to
>       indicate the Unicode value of the failed character.

Preferable to this is the use of transliteration.  You are talking
about a transformation which can lose information anyway.  Some iconv()
implementations (glibc's and GNU libiconv's) support transliteration.
Just add //TRANSLIT to the to-charset option string of the iconv_open
call.

The problem with transliteration is, though, that it is locale
dependent.  So the result may differ depending on the selected locale.

Just to make myself clear: transliteration here means replacement of
some unconvertible input with something readable in the output format.
I.e., when converting 'ä' (a-umlaut) to ASCII it would be replaced in a
German locale with 'ae'.  In a Danish locale it would be only 'a',
though.
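
To try the locale dependence out, something like the following sketch
could be used.  It assumes an iconv that accepts "//TRANSLIT" and that
the named locale is installed; the helper name translit_to_ascii is
hypothetical.  What comes out for a given character depends on the
locale data on the system, so no particular output is promised here.

```c
#include <assert.h>
#include <iconv.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: transliterate NUL-terminated UTF-8 text to ASCII
   under the given LC_CTYPE locale.  Returns 0 on success, -1 if the
   locale is not installed, the iconv in use rejects "//TRANSLIT", or
   the conversion fails.  The process locale is left changed; a real
   caller would save and restore it. */
int translit_to_ascii(const char *locale, const char *utf8,
                      char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)utf8;  /* some platforms declare this const char ** */
    char *outp = out;
    size_t inleft, outleft, rc;

    assert(locale && utf8 && out && outsize > 0);

    if (setlocale(LC_CTYPE, locale) == NULL)
        return -1;

    cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    inleft = strlen(utf8);
    outleft = outsize - 1;     /* keep room for the terminating NUL */
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}
```

Calling it on "\xC3\xA4" (a-umlaut in UTF-8) once with a German locale
name such as "de_DE.UTF-8" and once with a Danish one would show
whether the two results differ on a given system.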
