You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Bert Huijben <be...@qqmail.nl> on 2012/07/09 14:47:25 UTC

RE: svn commit: r1358022 - in /subversion/trunk: LICENSE NOTICE subversion/include/svn_utf.h subversion/libsvn_subr/utf_width.c subversion/svn/file-merge.c


> -----Original Message-----
> From: stsp@apache.org [mailto:stsp@apache.org]
> Sent: vrijdag 6 juli 2012 04:21
> To: commits@subversion.apache.org
> Subject: svn commit: r1358022 - in /subversion/trunk: LICENSE NOTICE
> subversion/include/svn_utf.h subversion/libsvn_subr/utf_width.c
> subversion/svn/file-merge.c
> 
> Author: stsp
> Date: Fri Jul  6 02:20:40 2012
> New Revision: 1358022
> 
> URL: http://svn.apache.org/viewvc?rev=1358022&view=rev
> Log:
> Add support for determining the display width of a unicode string, and make
> use of this when trimming lines for display by the internal file merge tool.
> 
> Based on suitably licensed code by Markus Kuhn:
> http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
> 
> * LICENSE, NOTICE: Add Markus Kuhn copyright notices.
> 
> * subversion/include/svn_utf.h
>   (svn_utf_cstring_utf8_width): Declare.
> 
> * subversion/libsvn_subr/utf_width.c: New file, implementation of the
> above
>    newly declared function. This file is based on Markus Kuhn's code, and has
>    been adapted to use APR types. The svn_utf_cstring_utf8_width() function
>    was written from scratch and replaces Markus' mk_wcswidth() function.
> 
> * subversion/svn/file-merge.c
>   (MAX_LINE_DISPLAY_LEN): Rename to ...
>   (LINE_DISPLAY_WIDTH): ... this, because it is not a maximum but a fixed
>    length since all lines are either trimmed or padded to this length.
>   (prepare_line_for_display): Use the new character display width support
>    to properly detect the width of unicode characters (tested with a bunch
>    of asian texts from wikipedia). Track rename of MAX_LINE_DISPLAY_LEN.

How do you check if the file you are merging is valid utf-8?

I assumed that we currently just passed files to the console mostly unmodified to allow the terminal to do the hard work.

I'm pretty sure that we can assume at least many (if not most) text files stored in Subversion are *not* utf-8 and will fail when tested for utf-8 validness.

How does this library handle non-utf8 strings?
(Just assuming width 1 would probably be safe for our usage, but it should never crash)

	Bert


Re: svn commit: r1358022 - in /subversion/trunk: LICENSE NOTICE subversion/include/svn_utf.h subversion/libsvn_subr/utf_width.c subversion/svn/file-merge.c

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Jul 09, 2012 at 04:04:42PM +0200, Johan Corveleyn wrote:
> On Mon, Jul 9, 2012 at 3:30 PM, Stefan Sperling <st...@apache.org> wrote:
> > On Mon, Jul 09, 2012 at 02:47:25PM +0200, Bert Huijben wrote:
> >> How do you check if the file you are merging is valid utf-8?
> >
> > See the merge_chunks() function.
> >
> > We convert data to UTF-8 from the native (locale) encoding.
> > This cannot fail (every encoding can be represented in UTF-8)
> > but the result might look funny in case the file uses some other encoding
> > than the native one. But that's OK -- this conversion happens only for
> > display purposes, data in the actual file is never changed, so you can
> > still edit individual chunks in their original form.
> 
> I'm a bit confused (encoding issues always confuse me). If we only
> care about the width of the string for display purposes, doesn't this
> (also) depend on the encoding used by the console / terminal? How does
> that actually work: if you have a UTF-8 encoded file, and you 'cat' it
> to a terminal with LC_ALL=iso_8859_1 ... ?

Our cmdline output routines accept UTF-8 and try convert back to the
locale's native encoding before printing. If this conversion fails,
it falls back to svn_cmdline_cstring_from_utf8_fuzzy() which will
create some ASCII-representation of the data.

So what will happen in that case is that you'll see whatever unicode
character latin1 can represent as-is, while others are converted
in a fuzzy way. This might lead to mis-aligned side-by-side diff output.

However if you're trying to display unicode data on a terminal
that isn't unicode capable then such issues are the norm rather
then the exception.

In general, if your terminal can display your files, then the
side-by-side diff will also be shown properly. Else, the side-by-side
diff might look OK, or it might not, depending on how much longer
the "fuzzy" representation of the string really is.

Configure your locale properly and you want have an issue.

Re: svn commit: r1358022 - in /subversion/trunk: LICENSE NOTICE subversion/include/svn_utf.h subversion/libsvn_subr/utf_width.c subversion/svn/file-merge.c

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Jul 9, 2012 at 3:30 PM, Stefan Sperling <st...@apache.org> wrote:
> On Mon, Jul 09, 2012 at 02:47:25PM +0200, Bert Huijben wrote:
>> How do you check if the file you are merging is valid utf-8?
>
> See the merge_chunks() function.
>
> We convert data to UTF-8 from the native (locale) encoding.
> This cannot fail (every encoding can be represented in UTF-8)
> but the result might look funny in case the file uses some other encoding
> than the native one. But that's OK -- this conversion happens only for
> display purposes, data in the actual file is never changed, so you can
> still edit individual chunks in their original form.

I'm a bit confused (encoding issues always confuse me). If we only
care about the width of the string for display purposes, doesn't this
(also) depend on the encoding used by the console / terminal? How does
that actually work: if you have a UTF-8 encoded file, and you 'cat' it
to a terminal with LC_ALL=iso_8859_1 ... ?

-- 
Johan

Re: svn commit: r1358022 - in /subversion/trunk: LICENSE NOTICE subversion/include/svn_utf.h subversion/libsvn_subr/utf_width.c subversion/svn/file-merge.c

Posted by Stefan Sperling <st...@apache.org>.
On Mon, Jul 09, 2012 at 02:47:25PM +0200, Bert Huijben wrote:
> How do you check if the file you are merging is valid utf-8?

See the merge_chunks() function.

We convert data to UTF-8 from the native (locale) encoding.
This cannot fail (every encoding can be represented in UTF-8)
but the result might look funny in case the file uses some other encoding
than the native one. But that's OK -- this conversion happens only for
display purposes, data in the actual file is never changed, so you can
still edit individual chunks in their original form.

> I assumed that we currently just passed files to the console mostly unmodified to allow the terminal to do the hard work.

That works fine as long as you don't care about the width of the
line you're printing. 

For the side-by-side display we make an effort to make it look nice.
If that doesn't work, the side-by-side display might look strange
because lines appear with varying lengths. That is the fallback mode
which assumes width=1 and one-byte-per-character for all characters.
 
> I'm pretty sure that we can assume at least many (if not most) text files stored in Subversion are *not* utf-8 and will fail when tested for utf-8 validness.
>
> How does this library handle non-utf8 strings?

You mean the svn_utf_cstring_utf8_width() function? It will return
an error for invalid UTF-8.

In our usage of this API, the UTF-8 validness check in is performed
on data that the merge tool has converted to UTF-8. The API must fail for
invalid UTF-8 input since it cannot convert such input to UTF-32 in
order to run mk_wcwidth() on it.

Again, this is in-memory data which we're going to display to the user
in a formatted way to so we need to know its width.
None of this has anything to do with any versioned data in files.