Posted to dev@subversion.apache.org by Hiroaki Nakamura <hn...@gmail.com> on 2012/01/29 11:38:44 UTC

Let's discuss about unicode compositions for filenames!

Hi folks!

I read the note about unicode compositions for filenames
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
and would like to drive the discussion.

First, the short-term solution (4) seems too difficult to me to implement.
It is very complex and error-prone, so here I focus on the long-term
solution (2).

It is simple. We convert all input paths into the 'normal' normal form (NFC),
using utf8proc.
http://www.public-software-group.org/utf8proc

I made a quick-and-dirty proof-of-concept patch for the further discussion.

If you run apache + mod_dav_svn with this patch,
NFD filenames in commits made by an svn client without this patch will be
converted to NFC.

This patch has the following limitations right now, but we can fix them.
- It does not handle all input paths, only two:
  one for mod_dav_svn open_stream, one for svn_path_cstring_to_utf8.
- Error handling is not yet implemented.
- The configure script should be modified to link against the utf8proc
  library. Currently it needs EXTRA_LDFLAGS=-lutf8proc when running make.


To test this patch, please follow the steps below.

(1) Build and install utf8proc.
The example below is for Scientific Linux 6.1 x86_64.
Currently I install utf8proc to the system library locations (/usr/include
and /usr/lib64), not places like /usr/local/include and /usr/local/lib64,
just because I don't want to bother with modifying the configure script
right now.

wget http://www.public-software-group.org/pub/projects/utf8proc/v1.1.5/utf8proc-v1.1.5.tar.gz
tar xf utf8proc-v1.1.5.tar.gz
cd utf8proc-v1.1.5
make c-library
sudo install -m 644 libutf8proc.so /usr/lib64/libutf8proc.so.1.1.5
sudo ln -s libutf8proc.so.1.1.5 /usr/lib64/libutf8proc.so.1
sudo ln -s libutf8proc.so.1 /usr/lib64/libutf8proc.so
sudo install -m 644 utf8proc.h /usr/include

(2) Build Subversion 1.7.2 with this patch.
cd subversion-1.7.2
patch -p1 < ../subversion-1.7.2-NFC.diff
./configure
EXTRA_LDFLAGS=-lutf8proc make
sudo make install

One thing I'd like to discuss is how we link to utf8proc.
There are two options.
(1) Install utf8proc as a shared library and modify the configure script to
     have a --with-utf8proc option.
(2) Copy the utf8proc source files into the Subversion source tree and
     link statically (like sqlite-amalgamation).

Option (1) requires a utf8proc package to be created for each OS
distribution, and the dependencies of the subversion package to be modified.
I think this is the ideal way, but it is a lot of work. I think option (2)
is easier: put the utf8proc source files in the Subversion source tarballs.

Am I on the right track?
Let's discuss and fix this problem and we will be happy ever after!

-- 
)Hiroaki Nakamura) hnakamur@gmail.com


==== subversion-1.7.2-NFC.diff
diff -ruN subversion-1.7.2.orig/subversion/include/svn_utf.h subversion-1.7.2/subversion/include/svn_utf.h
--- subversion-1.7.2.orig/subversion/include/svn_utf.h	2009-11-17 04:07:17.000000000 +0900
+++ subversion-1.7.2/subversion/include/svn_utf.h	2012-01-29 11:54:20.150665621 +0900
@@ -220,6 +220,14 @@
                                  const svn_string_t *src,
                                  apr_pool_t *pool);

+/** Set @a *dest to a NFC canonicalized C string from string @a src;
+ * allocate @a *dest in @a pool.
+ */
+svn_error_t *
+svn_utf_cstring_NFC(const char **dest,
+                    const char *src,
+                    apr_pool_t *pool);
+
 #ifdef __cplusplus
 }
 #endif /* __cplusplus */
diff -ruN subversion-1.7.2.orig/subversion/libsvn_subr/path.c subversion-1.7.2/subversion/libsvn_subr/path.c
--- subversion-1.7.2.orig/subversion/libsvn_subr/path.c	2011-01-18 06:45:39.000000000 +0900
+++ subversion-1.7.2/subversion/libsvn_subr/path.c	2012-01-29 18:01:06.900398904 +0900
@@ -1119,15 +1119,17 @@
                          const char *path_apr,
                          apr_pool_t *pool)
 {
+  const char *path_nfc;
   svn_boolean_t path_is_utf8;
+  SVN_ERR(svn_utf_cstring_NFC(&path_nfc, path_apr, pool));
   SVN_ERR(get_path_encoding(&path_is_utf8, pool));
   if (path_is_utf8)
     {
-      *path_utf8 = apr_pstrdup(pool, path_apr);
+      *path_utf8 = apr_pstrdup(pool, path_nfc);
       return SVN_NO_ERROR;
     }
   else
-    return svn_utf_cstring_to_utf8(path_utf8, path_apr, pool);
+    return svn_utf_cstring_to_utf8(path_utf8, path_nfc, pool);
 }


diff -ruN subversion-1.7.2.orig/subversion/libsvn_subr/utf.c subversion-1.7.2/subversion/libsvn_subr/utf.c
--- subversion-1.7.2.orig/subversion/libsvn_subr/utf.c	2011-08-24 00:04:38.000000000 +0900
+++ subversion-1.7.2/subversion/libsvn_subr/utf.c	2012-01-29 17:55:33.643895922 +0900
@@ -42,6 +42,7 @@
 #include "private/svn_utf_private.h"
 #include "private/svn_dep_compat.h"
 #include "private/svn_string_private.h"
+#include "utf8proc.h"

 

@@ -1029,3 +1030,58 @@

   return err;
 }
+
+static ssize_t svn_utf_map(
+  const uint8_t *str, ssize_t len, uint8_t **dstptr, int options,
+  apr_pool_t *pool
+) {
+  int32_t *buffer;
+  ssize_t result;
+  *dstptr = NULL;
+  /* We use svn_utf_map only for conversion from NFC/NFD to NFC.
+   * NFC is the most compact form of the two (NFC and NFD).
+   * So the result buffer length never exceeds the source string length.
+   * Therefore we first allocate the result buffer and run decompose only once.
+   */
+  if (options & UTF8PROC_NULLTERM)
+    len = strlen(str);
+  buffer = apr_palloc(pool, len * sizeof(int32_t) + 1);
+  result = utf8proc_decompose(str, len, buffer, len, options);
+  if (result < 0)
+    return result;
+  if (result > len)
+    {
+      /* We never reach here when converting to NFC.  */
+      buffer = apr_palloc(pool, result * sizeof(int32_t) + 1);
+      result = utf8proc_decompose(str, len, buffer, result, options);
+      if (result < 0)
+        return result;
+    }
+  result = utf8proc_reencode(buffer, result, options);
+  if (result < 0)
+    return result;
+  /* We don't shrink the result buffer because:
+   * - the buffer will be short lived and freed at the end of the transaction.
+   * - APR does not have realloc and free API, so if we ever allocate another
+   *   buffer, we use more memory.
+   */
+  *dstptr = (uint8_t *)buffer;
+  return result;
+}
+
+svn_error_t *
+svn_utf_cstring_NFC(const char **dest,
+                    const char *src,
+                    apr_pool_t *pool)
+{
+  ssize_t ret = svn_utf_map(src, 0, dest,
+                            UTF8PROC_NULLTERM | UTF8PROC_STABLE |
+                            UTF8PROC_COMPOSE,
+                            pool);
+  if (ret < 0)
+    {
+      /* TODO: implement error handling. */
+      return SVN_NO_ERROR;
+    }
+  return SVN_NO_ERROR;
+}
diff -ruN subversion-1.7.2.orig/subversion/mod_dav_svn/repos.c subversion-1.7.2/subversion/mod_dav_svn/repos.c
--- subversion-1.7.2.orig/subversion/mod_dav_svn/repos.c	2011-11-29 01:12:28.000000000 +0900
+++ subversion-1.7.2/subversion/mod_dav_svn/repos.c	2012-01-29 18:00:50.166648215 +0900
@@ -48,6 +48,7 @@
 #include "mod_dav_svn.h"
 #include "svn_ra.h"  /* for SVN_RA_CAPABILITY_* */
 #include "svn_dirent_uri.h"
+#include "svn_utf.h"
 #include "private/svn_log.h"
 #include "private/svn_fspath.h"

@@ -2672,6 +2673,7 @@
             dav_stream **stream)
 {
   svn_node_kind_t kind;
+  const char *repos_path;
   dav_error *derr;
   svn_error_t *serr;

@@ -2694,19 +2696,28 @@
     }
 #endif

+  serr = svn_utf_cstring_NFC(&repos_path, resource->info->repos_path,
+                             resource->pool);
+  if (serr != NULL)
+    {
+      return dav_svn__convert_err(serr, HTTP_INTERNAL_SERVER_ERROR,
+                                  "Could not canonicalize filename to NFC.",
+                                  resource->pool);
+    }
+
   /* start building the stream structure */
   *stream = apr_pcalloc(resource->pool, sizeof(**stream));
   (*stream)->res = resource;

   derr = fs_check_path(&kind, resource->info->root.root,
-                       resource->info->repos_path, resource->pool);
+                       repos_path, resource->pool);
   if (derr != NULL)
     return derr;

   if (kind == svn_node_none) /* No existing file. */
     {
       serr = svn_fs_make_file(resource->info->root.root,
-                              resource->info->repos_path,
+                              repos_path,
                               resource->pool);

       if (serr != NULL)
@@ -2730,7 +2741,7 @@

       serr = svn_fs_node_prop(&mime_type,
                               resource->info->root.root,
-                              resource->info->repos_path,
+                              repos_path,
                               SVN_PROP_MIME_TYPE,
                               resource->pool);

@@ -2744,7 +2755,7 @@
       if (!mime_type)
         {
           serr = svn_fs_change_node_prop(resource->info->root.root,
-                                         resource->info->repos_path,
+                                         repos_path,
                                          SVN_PROP_MIME_TYPE,
                                          svn_string_create
                                              (resource->info->r->content_type,
@@ -2762,7 +2773,7 @@
   serr = svn_fs_apply_textdelta(&(*stream)->delta_handler,
                                 &(*stream)->delta_baton,
                                 resource->info->root.root,
-                                resource->info->repos_path,
+                                repos_path,
                                 resource->info->base_checksum,
                                 resource->info->result_checksum,
                                 resource->pool);

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 30.01.2012 13:30, Stefan Sperling wrote:
> On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
>> Hi folks!
>>
>> I read the note about unicode compositions for filenames
>> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
>> and would like to drive the discussion.
> Hi,
>
> I am very happy to hear that you want to work towards getting this
> problem fixed. Thank you for your help!
>
> I've just re-read the unicode-composition-for-filenames notes.
> I think they are a bit outdated. For instance, they still talk about
> the 1.6 working copy format. They also don't clearly explain the problems
> with backwards compatibility we're facing here.

[...]

We have to track two distinct normalizations, the internal (wc.db,
repos) form, most likely NFC, and the working copy, on-disk form. This
last will depend on the host system; most likely NFD on Mac OS and NFC
everywhere else. The on-disk normalization needs to happen before
conversion to the system encoding, of course.
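
To make the two forms concrete, here is a minimal C sketch (an illustration only, not part of any patch in this thread). It hard-codes the UTF-8 bytes of the name "föö" in its precomposed (NFC) and decomposed (NFD) spellings and shows that a plain byte comparison treats them as different names. The identifiers foo_nfc, foo_nfd and same_bytes are made up for this example.

```c
#include <string.h>

/* "föö" in NFC: 'f', then U+00F6 (UTF-8: C3 B6) twice. */
const char foo_nfc[] = "f\xC3\xB6\xC3\xB6";

/* "föö" in NFD: 'f', then 'o' + U+0308 COMBINING DIAERESIS
 * (UTF-8: CC 88) twice. */
const char foo_nfd[] = "fo\xCC\x88o\xCC\x88";

/* Byte-wise comparison, which is what a lookup that does not
 * normalize effectively performs.  Returns 1 if equal, else 0. */
int same_bytes(const char *a, const char *b)
{
  return strcmp(a, b) == 0;
}
```

Here same_bytes(foo_nfc, foo_nfd) returns 0 even though both strings display as "föö"; the NFC spelling is 5 bytes long and the NFD spelling is 7.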

libsvn_repos should do its own normalization to NFC because we can't
trust old clients to do it right.
Doing a dump/reload cycle should then be sufficient to upgrade the
repository, and is probably the only viable way, too.

For working copies, we may want to teach "svn upgrade" to do the on-disk
and wc.db normalization dance. Clearly, client-side normalization
requires a WC format bump, but it need not be automatic.

We should probably give serious thought to using the compatibility
normalisation forms (NFKC and NFKD) and tell people who want proper
Unicode Roman numerals in their file names to think again. :)

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Neels J Hofmeyr <ne...@elego.de>.

On 01/30/2012 02:00 PM, Markus Schaber wrote:
> Maybe the best solution to this issue is a client-only solution, in a similar way the case sensitivity problem is tackled.

Spinning the client-only thought a bit: Imagine a repos with a un*x user
adding a file called "föö". Now an OSX user checks it out and gets the path
normalized to "fo:o:".

1. wc.db on OSX's HFS+ file systems has to be aware that the file "föö" is
stored locally as "fo:o:".

2. Whenever the OSX user types in "fo:o:", the client must remember that the
repos expects the path for this node to be sent as "föö", or the repos will
reply that the node does not exist. It could be solved with a translation
table between the repos and the client, but it remains quite a messy
endeavor, because:

3. New files may be added remotely at any given moment. For example, a path
'föö/bar' is checked out to OSX's fs and becomes 'fo:o:/bar'. Then someone
else adds 'fo:o:/bar' to the repos as well -- we now have two distinct 'bar'
files in the repos that share the same normalized path. Now OSX potentially
mistakes its checked-out 'föö/bar' for the later added 'fo:o:/bar', as that
matches the local path without any de-normalisations... The OSX client
basically has no chance to show "conflicting" files to its user
simultaneously. Data is "hidden".

Thus, OSX admins will want the repository to be able to disallow having
multiple representations of the same normalized path -- not that easy to
achieve, in fact: before accepting a path name from the client, the repos
needs to either cycle through all possible unicode representations or needs
to normalize and compare all existing paths. Normalizing a client's path
before storing in the repos is a no-go, as the client won't be able to find its
nodes later. Probably the best option is to define a given normalization per
repos and then refuse commits that add non-normalized paths, like a
pre-commit hook.
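
The "normalize and compare all existing paths" check described above could be sketched roughly as follows. This is a toy in C under loud assumptions: toy_compose_diaeresis is a hypothetical stand-in that only composes a, o, u (and their capitals) with U+0308, where real code would call a full normalization library such as utf8proc; would_collide, TOY_MAX_NAME and the fixed-size buffers are likewise invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_NAME 256

/* Hypothetical stand-in for real NFC normalization: it composes only
 * a/o/u/A/O/U followed by U+0308 (UTF-8: CC 88) into the precomposed
 * letter.  dst needs room for strlen(src) + 1 bytes; the output is
 * never longer than the input. */
void toy_compose_diaeresis(char *dst, const char *src)
{
  static const char bases[] = "aouAOU";
  /* Second UTF-8 byte of ä ö ü Ä Ö Ü; the first byte is always C3. */
  static const unsigned char second[] =
    { 0xA4, 0xB6, 0xBC, 0x84, 0x96, 0x9C };

  while (*src != '\0')
    {
      const char *hit = strchr(bases, *src);
      if (hit != NULL
          && (unsigned char)src[1] == 0xCC
          && (unsigned char)src[2] == 0x88)
        {
          *dst++ = (char)0xC3;
          *dst++ = (char)second[hit - bases];
          src += 3;
        }
      else
        *dst++ = *src++;
    }
  *dst = '\0';
}

/* Return 1 if `candidate` normalizes to the same bytes as any of the
 * `n` names in `existing` (an exact duplicate counts too), else 0.
 * Names of TOY_MAX_NAME bytes or more are skipped (sketch only). */
int would_collide(const char *candidate,
                  const char *const *existing, size_t n)
{
  char cand_norm[TOY_MAX_NAME], have_norm[TOY_MAX_NAME];
  size_t i;

  if (strlen(candidate) >= TOY_MAX_NAME)
    return 0;
  toy_compose_diaeresis(cand_norm, candidate);

  for (i = 0; i < n; i++)
    {
      if (strlen(existing[i]) >= TOY_MAX_NAME)
        continue;
      toy_compose_diaeresis(have_norm, existing[i]);
      if (strcmp(cand_norm, have_norm) == 0)
        return 1;
    }
  return 0;
}
```

With an existing entry stored as precomposed "föö", would_collide reports a collision for the fully decomposed spelling and for a mixed one, while an unrelated name passes. The cost is one normalization per existing sibling, which is the overhead the paragraph above worries about.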

On the other hand, an all-un*x shop must be allowed to operate the way they
always did. Their OSs only see byte sequences and don't mess around with
normalization. Say they want to have a folder of differently normalized
representations of the same file for testing *their own* code for unicode
robustness. They should be able to do that. (They obviously can't use OSX's
HFS+ for that, though.)

So, on top of client-only fixes, it would be good to have ways to enforce
certain repository behavior, based on self-imposed policy -- I mean, we
won't have "The Subversion Normalization", each admin decides alone.

On 01/30/2012 01:30 PM, Stefan Sperling wrote:
> I am not convinced that it is impossible to fix.

Nicely put :)

~Neels

[[[
fred@mac $ svn co http://svn/repos
A foo
A bar
*** Warning:
You are checking out to an HFS+ file system. Your WC may not accurately
represent this revision. Consider using a different file system!
Continue? (Y/n) Y
A föö
*** File name collision detected. Skipping 'föo:'
*** File name collision detected. Skipping 'fo:ö'
*** File name collision detected. Skipping 'fo:o:'
A baz
fred@mac $
]]]
:P

AW: Let's discuss about unicode compositions for filenames!

Posted by Markus Schaber <m....@3s-software.com>.

Hi, Peter,

From: Peter Samuelson [mailto:peter@p12n.org]
> [Stefan Sperling]
>> > We could also open the parent directory, read all the filenames
>> > within it, normalise them all, and then search the resulting list.
>> > This works, except if a name exists twice, once in NFC form and once
>> > in NFD form. We'd somehow have to solve the name collision in the
>> > filesystem.

> [Markus Schaber]
>> This sounds astonishingly similar to the lower/upper case problem of
>> UN*X vs. Mac/Win.

> There are similarities, but there are some important differences:

> - We have to support Mac OS X, which stores all files in NFD.  In the
>   upper/lowercase analogy, think of OS X as MS-DOS, which does not
>   preserve mixed case at all but always represents files in uppercase.
>   Subversion doesn't support MS-DOS and I hope we never need to.  MS
>   Windows, OTOH, at least preserves the upper/lowercase distinction
>   presented to it when you create a file.  Big difference.

Preserving case does not help that much. A simple "map everything to lower case when accessing the working copy, and search case-insensitively in the database" could solve that problem, but the repository can still contain files whose names differ only in case, and then preserving the original case does not help much either.

> - Also, the Subversion platform has chosen to support files like README
>   and Readme that conflict on Windows.  Our reasoning is "if you have
>   users on Windows, don't do that."  Most solutions to the NFC/NFD
>   problem will affect all platforms, not just one, and we probably
>   can't just say "well, don't do that" - we'll need to actually prevent
>   it - and somehow deal with existing clients, WCs, and repositories.

> Because of those differences, my gut feeling is that we can't treat the two issues in the same way.

There seem to be clients which allow files whose names differ only by encoding. So the position on "unicode encoding collisions" could be the same as on "case insensitivity collisions" (allow in the repository what the most capable clients allow). My guess is that the fixes for that scenario are rather similar (mainly client-based, specific to the capabilities of the platform, and "if you have users on Mac, don't do that"). Of course, Mac clients internally need to map to their normalized encoding in a similar way as is done for case sensitivity now, and in case of encoding collisions, they've lost (similar to case collisions on Mac and Windows).

If the position is to disallow files whose name only differs by encoding in the repositories, things are a little bit different.

But I think that even this can be solved purely on the client, by only sending normalized names to the server for all new objects (imports, additions, copy targets, ...), and using the existing encodings for all existing objects.

For existing collisions, which harm work on MacOS, the usual workarounds apply: rename the colliding files via repo-browser or in a more capable client. Additionally, we could develop a dump filter tool for name normalization, maybe with a switch to choose whether to error out or silently rename on collisions.

With proper documentation, this will cause the problem to fade out in the future, and - in theory - it can be implemented on top of the first one at a later time. I don't see any need to change anything on the server (both implicit conversion and rejection of invalid encodings would break existing clients and working copies). My personal guess is that actual encoding collisions are rather rare, and workarounds exist, so servers can start to reject invalid encodings with version 2.0, or whatever future version is allowed to break compatibility with old clients.


Best regards

Markus Schaber
-- 
___________________________
We software Automation.

3S-Smart Software Solutions GmbH
Markus Schaber | Developer
Memminger Str. 151 | 87439 Kempten | Germany | Tel. +49-831-54031-0 | Fax +49-831-54031-50

Email: m.schaber@3s-software.com | Web: http://www.3s-software.com 
CoDeSys internet forum: http://forum.3s-software.com
Download CoDeSys sample projects: http://www.3s-software.com/index.shtml?sample_projects

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

Re: Let's discuss about unicode compositions for filenames!

Posted by Peter Samuelson <pe...@p12n.org>.

  [Stefan Sperling]
> > We could also open the parent directory, read all the filenames
> > within it, normalise them all, and then search the resulting
> > list. This works, except if a name exists twice, once in NFC form
> > and once in NFD form. We'd somehow have to solve the name collision
> > in the filesystem.

[Markus Schaber]
> This sounds astonishingly similar to the lower/upper case problem of
> UN*X vs. Mac/Win.

There are similarities, but there are some important differences:

- We have to support Mac OS X, which stores all files in NFD.  In the
  upper/lowercase analogy, think of OS X as MS-DOS, which does not
  preserve mixed case at all but always represents files in uppercase.
  Subversion doesn't support MS-DOS and I hope we never need to.  MS
  Windows, OTOH, at least preserves the upper/lowercase distinction
  presented to it when you create a file.  Big difference.

  (I'm not saying OS X is like MS-DOS in other respects.  Just for the
  purpose of the NFC/NFD vs. upper/lower analogy.)

- Also, the Subversion platform has chosen to support files like README
  and Readme that conflict on Windows.  Our reasoning is "if you have
  users on Windows, don't do that."  Most solutions to the NFC/NFD
  problem will affect all platforms, not just one, and we probably
  can't just say "well, don't do that" - we'll need to actually prevent
  it - and somehow deal with existing clients, WCs, and repositories.

Because of those differences, my gut feeling is that we can't treat the
two issues in the same way.

Peter

AW: Let's discuss about unicode compositions for filenames!

Posted by Markus Schaber <m....@3s-software.com>.

Hi,

From: Stefan Sperling [mailto:stsp@elego.de]
> On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
>> I read the note about unicode compositions for filenames 
>> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames and would like to drive the discussion.
[...]
> We could also open the parent directory, read all the filenames within it, normalise them all, and then search the resulting list. This works, except if a name exists twice, once in NFC form and once in NFD form. We'd somehow have to solve the name collision in the filesystem.

This sounds astonishingly similar to the lower/upper case problem of UN*X vs. Mac/Win.

> But it gets worse. Recall the filesystem name collision problem mentioned above. This problem can also happen in the repository filesystem! For instance, assume that in the repository there already exist two filenames, one NFD, the other NFC, but they both are actually the same name.

The same here. So whatever solution is found for one of those problems could also help to solve (or mitigate) the other problem.

> These are the questions which we'll need to answer to solve this issue.
> I honestly do not have good answers. I hope that you will find ways of solving these problems.

Maybe the best solution to this issue is a client-only solution, in a similar way the case sensitivity problem is tackled.


Best regards

Markus Schaber

Re: Let's discuss about unicode compositions for filenames!

Posted by Johan Corveleyn <jc...@gmail.com>.

On Mon, Jan 30, 2012 at 9:09 PM, Branko Čibej <br...@xbc.nu> wrote:
> On 30.01.2012 21:00, Johan Corveleyn wrote:
>> On Mon, Jan 30, 2012 at 8:10 PM, Stefan Sperling <st...@elego.de> wrote:
>>> On Tue, Jan 31, 2012 at 01:42:21AM +0900, Hiroaki Nakamura wrote:
>>>> 2012/1/30 Stefan Sperling <st...@elego.de>:
>> [ ... ]
>>
>>> And mixing various unicode forms works fine today if the filesystem
>>> used by the client supports this. The use case Neels contrived, where
>>> developers want to test their code with unicode filenames in various
>>> NFD/NFC forms, and check those test files into Subversion, should still
>>> be supported.
>> Indeed.
>>
>> Though this means that unconditional NFC (or whatever) normalization
>> in the working copy database is not an option, since it precludes
>> representing multiple forms at the same time in the wc. Except maybe
>> dependent on the (filesystem of the) client platform.
>
> Are you seriously proposing that we /support/ such broken, hackish
> nonsense? How do you expect users to tell the difference between file
> names that look identical on the character level, but are not on the
> code point level?

Huh? I'm not proposing anything. Hiroaki suggested (with his patch and
followup discussion) to do normalization to NFC in wc.db (or something
like that, so that all paths that enter wc.db are in NFC form). All
I'm saying is that this conflicts with the "use case
Neels contrived", to represent multiple forms in the working copy.
Except if you allow some clients to do it, and others not (either by a
client-side option, or by platform-specific behavior).

> Supporting such hacks would only be a source of bug reports. I don't see
> this as a desirable feature.

No problem, I don't either. I'm not really participating in this
discussion (got enough discussions going on already :-)). Just wanted
to point out the conflict ...

> And as for doing the server-side checks in pre-commit hooks ... I guess
> you could write a whole libsvn_repos implementation merely as a set of
> pre-commit hooks, but who would want to? Hooks aren't intended for
> implementing core functionality.

Ok, then I also propose that case-insensitive.py should be folded into
core functionality (server-side option). That would be vastly better
of course, more performant etc ...

So I totally agree.

-- 
Johan

Re: Let's discuss about unicode compositions for filenames!

Posted by Daniel Shahaf <da...@elego.de>.

Hiroaki Nakamura wrote on Fri, Feb 03, 2012 at 05:33:02 +0900:
> 2012/2/3 Daniel Shahaf <da...@elego.de>:
> > Branko Čibej wrote on Thu, Feb 02, 2012 at 21:03:47 +0100:
> >> On 02.02.2012 20:22, Peter Samuelson wrote:
> >> > [Hiroaki Nakamura]
> >> >> In option (2), we do n12n on all clients on all platforms, and we
> >> >> include web_dav_svn in "clients". So we convert all input paths to
> >> >> the "server encoding", which is NFC.
> >> > Indeed.  But the very concept of a "server encoding" means we are
> >> > involving the server side.  Which invokes a lot of difficult questions
> >> > like "what about existing 1.x clients", "what about existing checkouts"
> >> > and "what about existing repositories".
> >> >
> >> > By proposing a client-only solution, I hope to avoid _all_ those
> >> > questions.
> >>
> >> Can't see how that works, unless you either make the client-side
> >> solution optional, create a mapping table, or make name lookup on the
> >> server agnostic to character representation. I can't envision how any of
> >> those solutions would work all the time.
> >>
> >> It would be nice if we could normalize paths in the repository without
> >> having to perform a dump/reload cycle, but I don't know how that would
> >> work in FSFS
> >
> > It won't.  Changing the encoding increases the length (in bytes) of the
> > string (in the dirents hash, for example), and thus changes the offsets
> > of the node-revs that are later in the file --- to which subsequent
> > revisions, and the id's of those node-revs, refer.
> 
> Changing from NFD to NFC does not increase the length.
> The length will be the same or smaller, not larger.
> 

If the conversion is guaranteed to be monotone non-increasing (in
length) then I believe it could be made to work "in place".

As to keeping concurrent readers and preexisting working copies sane ---
for now I'm LAAEFTR'ing that.

> Here I quote from
> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
>   > The proposed internal 'normal form' should be NFC, if only if
>   > it were because it's the most compact form of the two:  when
>   > allocating memory to store a conversion result, it won't be
>   > necessary (ever) to allocate more than the size of the input buffer.
> 
> 
> -- 
> )Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/3 Julian Foad <ju...@btopenworld.com>:
> You may well be correct that NFC is never longer than NFD, but that's not the question.  The question is whether NFC may be longer than the current paths (which are not normalized to normalization form C or to form D).  And the answer is yes it may be longer.  See <http://unicode.org/faq/normalization.html#11>.

Oh, I didn't know that. Thanks for letting me know.
I also read all other items in <http://unicode.org/faq/normalization.html#11>
and all of <http://www.unicode.org/reports/tr15/> and learned more about
normalization.

Maybe we should revise the note.
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

>
>
>> Here I quote from
>> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
>>   > The proposed internal 'normal form' should be NFC, if only if
>>   > it were because it's the most compact form of the two:  when
>>   > allocating memory to store a conversion result, it won't be
>>   > necessary (ever) to allocate more than the size of the input buffer.
>
> That statement seems to be talking about converting between NFC and NFD, not from un-normalized to normalized.

Yes, indeed.

So, we need to normalize input paths before processing.
We choose NFC as the normalization form.

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Julian Foad <ju...@btopenworld.com>.

Hiroaki Nakamura wrote:

>>>  It would be nice if we could normalize paths in the repository without
>>>  having to perform a dump/reload cycle, but I don't know how that would
>>>  work in FSFS.
>> 
>>  It won't.  Changing the encoding increases the length (in bytes) of the
>>  string (in the dirents hash, for example), and thus changes the offsets
>>  of the node-revs that are later in the file --- to which subsequent
>>  revisions, and the id's of those node-revs, refer.
> 
> Changing from NFD to NFC does not increase the length.
> The length will be the same or smaller, not larger.

You may well be correct that NFC is never longer than NFD, but that's not the question.  The question is whether NFC may be longer than the current paths (which are not normalized to normalization form C or to form D).  And the answer is yes it may be longer.  See <http://unicode.org/faq/normalization.html#11>.
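
One concrete case of the point above, as a small C sketch with hard-coded bytes (nothing here computes a normalization): U+0958 DEVANAGARI LETTER QA is a single code point, but it is excluded from composition, so its NFC form is the two-code-point sequence U+0915 U+093C. The names qa_single, qa_nfc and byte_growth are invented for this illustration.

```c
#include <string.h>

/* U+0958 DEVANAGARI LETTER QA as it might sit in an un-normalized
 * path: one code point, three UTF-8 bytes. */
const char qa_single[] = "\xE0\xA5\x98";

/* Its NFC form, U+0915 KA followed by U+093C NUKTA: six UTF-8 bytes. */
const char qa_nfc[] = "\xE0\xA4\x95\xE0\xA4\xBC";

/* Signed difference in byte length between two spellings. */
long byte_growth(const char *before, const char *after)
{
  return (long)strlen(after) - (long)strlen(before);
}
```

byte_growth(qa_single, qa_nfc) is 3, so a rewrite that assumes the normalized name fits in the space of the original one would not work for this path.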

> Here I quote from
> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
>   > The proposed internal 'normal form' should be NFC, if only if
>   > it were because it's the most compact form of the two:  when
>   > allocating memory to store a conversion result, it won't be
>   > necessary (ever) to allocate more than the size of the input buffer.

That statement seems to be talking about converting between NFC and NFD, not from un-normalized to normalized.

- Julian

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/3 Daniel Shahaf <da...@elego.de>:
> Branko Čibej wrote on Thu, Feb 02, 2012 at 21:03:47 +0100:
>> On 02.02.2012 20:22, Peter Samuelson wrote:
>> > [Hiroaki Nakamura]
>> >> In option (2), we do n12n on all clients on all platforms, and we
>> >> include web_dav_svn in "clients". So we convert all input paths to
>> >> the "server encoding", which is NFC.
>> > Indeed.  But the very concept of a "server encoding" means we are
>> > involving the server side.  Which invokes a lot of difficult questions
>> > like "what about existing 1.x clients", "what about existing checkouts"
>> > and "what about existing repositories".
>> >
>> > By proposing a client-only solution, I hope to avoid _all_ those
>> > questions.
>>
>> Can't see how that works, unless you either make the client-side
>> solution optional, create a mapping table, or make name lookup on the
>> server agnostic to character representation. I can't envision how any of
>> those solutions would work all the time.
>>
>> It would be nice if we could normalize paths in the repository without
>> having to perform a dump/reload cycle, but I don't know how that would
>> work in FSFS
>
> It won't.  Changing the encoding increases the length (in bytes) of the
> string (in the dirents hash, for example), and thus changes the offsets
> of the node-revs that are later in the file --- to which subsequent
> revisions, and the id's of those node-revs, refer.

Changing from NFD to NFC does not increase the length.
The length will be the same or smaller, not larger.

Here I quote from
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
  > The proposed internal 'normal form' should be NFC, if only if
  > it were because it's the most compact form of the two:  when
  > allocating memory to store a conversion result, it won't be
  > necessary (ever) to allocate more than the size of the input buffer.


-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Daniel Shahaf <da...@elego.de>.

Branko Čibej wrote on Thu, Feb 02, 2012 at 21:03:47 +0100:
> On 02.02.2012 20:22, Peter Samuelson wrote:
> > [Hiroaki Nakamura]
> >> In option (2), we do n11n on all clients on all platforms, and we
> >> include mod_dav_svn in "clients". So we convert all input paths to
> >> the "server encoding", which is NFC.
> > Indeed.  But the very concept of a "server encoding" means we are
> > involving the server side.  Which invokes a lot of difficult questions
> > like "what about existing 1.x clients", "what about existing checkouts"
> > and "what about existing repositories".
> >
> > By proposing a client-only solution, I hope to avoid _all_ those
> > questions.
> 
> Can't see how that works, unless you either make the client-side
> solution optional, create a mapping table, or make name lookup on the
> server agnostic to character representation. I can't envision how any of
> those solutions would work all the time.
> 
> It would be nice if we could normalize paths in the repository without
> having to perform a dump/reload cycle, but I don't know how that would
> work in FSFS

It won't.  Changing the encoding increases the length (in bytes) of the
string (in the dirents hash, for example), and thus changes the offsets
of the node-revs that are later in the file --- to which subsequent
revisions, and the id's of those node-revs, refer.

> (BDB would be fairly easy, modulo collisions, but I don't
> think those are very likely).
> 
> -- Brane
>

Re: Let's discuss about unicode compositions for filenames!

Posted by Erik Huelsmann <eh...@gmail.com>.

On Thu, Feb 2, 2012 at 10:59 PM, Hiroaki Nakamura <hn...@gmail.com> wrote:
> 2012/2/3 Peter Samuelson <pe...@p12n.org>:
>>
>>> On 02.02.2012 20:22, Peter Samuelson wrote:
>>> > By proposing a client-only solution, I hope to avoid _all_ those
>>> > questions.
>>
>> [Branko Cibej]
>>> Can't see how that works, unless you either make the client-side
>>> solution optional, create a mapping table, or make name lookup on the
>>> server agnostic to character representation.
>>
>> Yes, I did propose a mapping table in wc.db.
>>
>> Old clients on OS X would continue to be confused; the solution is to
>> upgrade.
>
> Until all clients are upgraded, NFD filenames may still be
> checked in to repositories. So I proposed that servers convert filenames
> to NFC before checking them in to repositories.

How about checking existence of a path to be added using NFC encoding?
If it does not exist when both the repository paths and the new
path(s) are converted to NFC, go ahead and add it using the encoding
that you were handed off the network?
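
A minimal sketch of this check, in Python for illustration only. The function names and the plain list standing in for a directory's entries are hypothetical, not Subversion APIs. The comparison is done on NFC forms, but the name is stored exactly as received.

```python
import unicodedata

def nfc(name):
    """Return the NFC form of a name (used for comparison only)."""
    return unicodedata.normalize("NFC", name)

def can_add(existing_names, new_name):
    """True if no existing entry is canonically equivalent to new_name."""
    wanted = nfc(new_name)
    return all(nfc(existing) != wanted for existing in existing_names)

def add_entry(existing_names, new_name):
    """Add new_name unchanged unless an equivalent name already exists."""
    if not can_add(existing_names, new_name):
        raise ValueError("canonically equivalent name already exists")
    existing_names.append(new_name)  # stored as handed off the network
```

With this approach nothing already in the repository is rewritten; only new collisions are refused.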

Bye,


Erik

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/3 Peter Samuelson <pe...@p12n.org>:
>
>> On 02.02.2012 20:22, Peter Samuelson wrote:
>> > By proposing a client-only solution, I hope to avoid _all_ those
>> > questions.
>
> [Branko Cibej]
>> Can't see how that works, unless you either make the client-side
>> solution optional, create a mapping table, or make name lookup on the
>> server agnostic to character representation.
>
> Yes, I did propose a mapping table in wc.db.
>
> Old clients on OS X would continue to be confused; the solution is to
> upgrade.

Until all clients are upgraded, NFD filenames may still be
checked in to repositories. So I proposed that servers convert filenames
to NFC before checking them in to repositories.

But you advised me that this is not good as a short term solution,
so I withdraw this part for now. Please see my other post
(message id is
CAN-DUMS6=ymxMYBjzDjSyQFvSQ+n4MG=PG6dihwEiyW_6agF7Q@mail.gmail.com )

>
> New clients on OS X (and elsewhere) would maintain a mapping table
> between 'repository path' and 'local filesystem representation'.  We
> already have these two concepts, given we support non-UTF-8 client
> encodings.
>
>> It would be nice if we could normalize paths in the repository
>> without having to perform a dump/reload cycle, but I don't know how
>> that would work in FSFS
>
> Indeed, it's a problem similar to obliterate, and carries the same risk
> of invalidating every wc.  This is why I don't think it's a reasonable
> path to take in the short term (1.x).

Well, I don't understand this. Upgrading the Subversion client from 1.6 to 1.7
invalidated every wc, didn't it?

When I run 'svn info' with a Subversion 1.7 client on a wc created by 1.6,
I see these messages and cannot use it with 1.7.

svn: E155036: Please see the 'svn upgrade' command
svn: E155036: Working copy '/your/wc/path/here' is too old (format 10,
created by Subversion 1.6)

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Peter Samuelson <pe...@p12n.org>.

> On 02.02.2012 20:22, Peter Samuelson wrote:
> > By proposing a client-only solution, I hope to avoid _all_ those
> > questions.

[Branko Cibej]
> Can't see how that works, unless you either make the client-side
> solution optional, create a mapping table, or make name lookup on the
> server agnostic to character representation.

Yes, I did propose a mapping table in wc.db.

Old clients on OS X would continue to be confused; the solution is to
upgrade.

New clients on OS X (and elsewhere) would maintain a mapping table
between 'repository path' and 'local filesystem representation'.  We
already have these two concepts, given we support non-UTF-8 client
encodings.
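
As a rough illustration of the mapping-table idea, here is a Python sketch. The PathMap class and its in-memory dictionaries are hypothetical stand-ins for a table in wc.db, and NFD is assumed as the local form, as on OS X. The repository path is kept byte-for-byte; only the local name is normalized.

```python
import unicodedata

class PathMap:
    """Hypothetical stand-in for a wc.db table: repository path <-> local name."""

    def __init__(self, local_form="NFD"):
        self.local_form = local_form
        self.repo_to_local = {}
        self.local_to_repo = {}

    def checkout(self, repo_path):
        """Record and return the local name used for repo_path."""
        local = unicodedata.normalize(self.local_form, repo_path)
        owner = self.local_to_repo.get(local)
        if owner is not None and owner != repo_path:
            raise ValueError("two repository paths map to one local name")
        self.repo_to_local[repo_path] = local
        self.local_to_repo[local] = repo_path
        return local

    def repo_path_for(self, local_path):
        """Return the original repository path, or local_path if unmapped."""
        key = unicodedata.normalize(self.local_form, local_path)
        return self.local_to_repo.get(key, local_path)
```

The ValueError branch is where an existing NFC/NFD collision in the repository would surface on such a client.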

> It would be nice if we could normalize paths in the repository
> without having to perform a dump/reload cycle, but I don't know how
> that would work in FSFS

Indeed, it's a problem similar to obliterate, and carries the same risk
of invalidating every wc.  This is why I don't think it's a reasonable
path to take in the short term (1.x).

Peter

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/3 Peter Samuelson <pe...@p12n.org>:
>
> [Hiroaki Nakamura]
>> For existing repositories, I think it would be better to convert them too, using
>> svnadmin dump/load, with svnadmin load changed to convert filenames to NFC.
>> However, in reality we cannot force users to convert every existing repository.
>
> Also note that if you convert a repository (via dump/load or whatever),
> all working copies based on the repository are invalidated and need to
> be re-checked-out.  Avoiding _that_ problem would be really hairy, I
> think, very similar to the sort of work that would be needed to support
> obliterate without losing working copies.
>
>> We also need to change servers in order to deal with existing 1.x
>> clients.  When mod_dav_svn and svnserve receive filenames from
>> clients, they must first convert them to NFC.
>
> You keep saying what we "must" do on the server side.  I propose
> something that is purely on the client side.  It will solve the OS X /
> non-OS X interoperability problem.  It will not solve every problem
> ever faced by a Subversion user.  That's a job for 2.0.

OK. When I started this thread, I assumed we would focus on the
long term solution for 2.x. That's because the short term solution, option (4),
described in
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
seems too difficult and complex to me.

But if a modified version of my proposal would fit the short term (1.x),
I will gladly modify it.

>
>> Yes, like I said above, "clients" actually includes components that
>> run on servers, like mod_dav_svn, and it should be read as any component
>> that accesses repositories and working copies.
>
> No.  By "clients" I mean components that run on the client side.  If my
> proposal had required changes to mod_dav_svn, I would not have said
> "strictly client-side".  I do not propose any change to mod_dav_svn,
> svnserve, svnadmin, libsvn_repos, libsvn_fs, the repository data, or
> anything else on the server side.
>
>> If you think in analogy to ASCII uppercase and lowercase examples,
>> you miss the point. Please reread the Unicode Standard Annex #15
>> UAX #15: Unicode Normalization Forms
>> http://unicode.org/reports/tr15/
>
> Thanks, I've read it.  The analogy stands.  We could prevent NFC/NFD
> collisions as an additional service to users, something we have not
> done for the past 10 years.  This would be along the lines of
> preventing users from shooting themselves in the foot.
>
> The actual _software_ problem that is solved by preventing collisions
> is the same as the software problem solved by preventing upper/lower
> case collisions: certain clients are unable to check out a folder that
> has such collisions.  (Windows clients, in the case of upper/lower
> collisions; OS X clients, in the case of NFC/NFD collisions.)

Yes, I agree with that.

>
> I think we are talking past each other.  You are trying to solve two
> distinct but related problems: 1. OS X client-side confusion when faced
> with a non-NFD repository path; 2. NFC/NFD collisions.  I am only
> trying to solve problem 1.  I'm ignoring problem 2 for two reasons:
>
>    (a) Problem 2 requires server-side work and complex compatibility /
>    upgrade scenarios (dump/load, re-check-out all wcs, etc).
>
>    (b) Problem 2 can be worked around, for new repositories (or
>    repositories with no existing collisions), with a pre-commit hook.
>
> ...neither of which are true for my proposal to solve problem 1.
>
> So long as you continue to insist that, to solve problem 1, we must
> also solve problem 2, I'm pretty sure we will never come to any
> agreement.

OK. So how about changing my proposal like:
(1) No server modification. Modify svn_path_cstring_to_utf8 only.
(2) Let users install a pre-commit hook which rejects any non-NFC filenames.
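
The core of such a pre-commit hook could look like the following Python sketch. The function names are hypothetical, and the step that collects the changed paths from the transaction (for example with svnlook changed) is assumed and left out; only the NFC check itself is shown.

```python
import unicodedata

def non_nfc_paths(paths):
    """Return the paths that are not already in NFC."""
    return [p for p in paths if unicodedata.normalize("NFC", p) != p]

def check_commit(paths):
    """Return (exit_code, message) in the style of a hook script."""
    bad = non_nfc_paths(paths)
    if bad:
        names = ", ".join(ascii(p) for p in bad)
        return 1, "Non-NFC filenames rejected: " + names
    return 0, ""
```

A real hook would print the message to stderr and exit with the returned code, so that a non-zero code makes the commit fail.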

In this way, we only need to change one function. The modification is just like
the original OS X unicode path patch:
utf8precompose_macosx_2.patch
http://subversion.tigris.org/nonav/issues/showattachment.cgi/813/utf8precompose_macosx_2.patch
in
http://subversion.tigris.org/issues/show_bug.cgi?id=2464

The only difference from the original patch is that mine will use
utf8proc, so that we can use it on all platforms: Mac OS X, Windows
and Linux.

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Peter Samuelson <pe...@p12n.org>.

[Hiroaki Nakamura]
> For existing repositories, I think it would be better to convert them too, using
> svnadmin dump/load, with svnadmin load changed to convert filenames to NFC.
> However, in reality we cannot force users to convert every existing repository.

Also note that if you convert a repository (via dump/load or whatever),
all working copies based on the repository are invalidated and need to
be re-checked-out.  Avoiding _that_ problem would be really hairy, I
think, very similar to the sort of work that would be needed to support
obliterate without losing working copies.

> We also need to change servers in order to deal with existing 1.x
> clients.  When mod_dav_svn and svnserve receive filenames from
> clients, they must first convert them to NFC.

You keep saying what we "must" do on the server side.  I propose
something that is purely on the client side.  It will solve the OS X /
non-OS X interoperability problem.  It will not solve every problem
ever faced by a Subversion user.  That's a job for 2.0.

> Yes, like I said above, "clients" actually includes components that
> run on servers, like mod_dav_svn, and it should be read as any component
> that accesses repositories and working copies.

No.  By "clients" I mean components that run on the client side.  If my
proposal had required changes to mod_dav_svn, I would not have said
"strictly client-side".  I do not propose any change to mod_dav_svn,
svnserve, svnadmin, libsvn_repos, libsvn_fs, the repository data, or
anything else on the server side.

> If you think in analogy to ASCII uppercase and lowercase examples,
> you miss the point. Please reread the Unicode Standard Annex #15
> UAX #15: Unicode Normalization Forms
> http://unicode.org/reports/tr15/

Thanks, I've read it.  The analogy stands.  We could prevent NFC/NFD
collisions as an additional service to users, something we have not
done for the past 10 years.  This would be along the lines of
preventing users from shooting themselves in the foot.

The actual _software_ problem that is solved by preventing collisions
is the same as the software problem solved by preventing upper/lower
case collisions: certain clients are unable to check out a folder that
has such collisions.  (Windows clients, in the case of upper/lower
collisions; OS X clients, in the case of NFC/NFD collisions.)

I think we are talking past each other.  You are trying to solve two
distinct but related problems: 1. OS X client-side confusion when faced
with a non-NFD repository path; 2. NFC/NFD collisions.  I am only
trying to solve problem 1.  I'm ignoring problem 2 for two reasons:

    (a) Problem 2 requires server-side work and complex compatibility /
    upgrade scenarios (dump/load, re-check-out all wcs, etc).

    (b) Problem 2 can be worked around, for new repositories (or
    repositories with no existing collisions), with a pre-commit hook.

...neither of which are true for my proposal to solve problem 1.

So long as you continue to insist that, to solve problem 1, we must
also solve problem 2, I'm pretty sure we will never come to any
agreement.

Peter

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 02.02.2012 20:22, Peter Samuelson wrote:
> [Hiroaki Nakamura]
>> In option (2), we do n11n on all clients on all platforms, and we
>> include mod_dav_svn in "clients". So we convert all input paths to
>> the "server encoding", which is NFC.
> Indeed.  But the very concept of a "server encoding" means we are
> involving the server side.  Which invokes a lot of difficult questions
> like "what about existing 1.x clients", "what about existing checkouts"
> and "what about existing repositories".
>
> By proposing a client-only solution, I hope to avoid _all_ those
> questions.

Can't see how that works, unless you either make the client-side
solution optional, create a mapping table, or make name lookup on the
server agnostic to character representation. I can't envision how any of
those solutions would work all the time.

It would be nice if we could normalize paths in the repository without
having to perform a dump/reload cycle, but I don't know how that would
work in FSFS (BDB would be fairly easy, modulo collisions, but I don't
think those are very likely).

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 02.02.2012 21:28, Hiroaki Nakamura wrote:
> 2012/2/3 Branko Čibej <br...@xbc.nu>:
>> On 02.02.2012 20:59, Hiroaki Nakamura wrote:
>>> So we need to change servers too. When servers read filenames from
>>> repositories, they first convert to NFC and then process commands.
>> That won't work. You have to do the initial lookup in a
>> normalization-agnostic way, and neither BDB nor FSFS makes that possible
>> without scanning whole directories.
> OK, then do scan whole directories.

But we can't make old clients do that. So ... by normalizing paths that
come from the server, we're effectively killing off all old clients that
would otherwise work with said servers.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/3 Branko Čibej <br...@xbc.nu>:
> On 02.02.2012 20:59, Hiroaki Nakamura wrote:
>> So we need to change servers too. When servers read filenames from
>> repositories, they first convert to NFC and then process commands.
>
> That won't work. You have to do the initial lookup in a
> normalization-agnostic way, and neither BDB nor FSFS makes that possible
> without scanning whole directories.

OK, then scan whole directories. If you do not want that,
we must force users to convert existing repositories. I think we must
choose one of the two. These are tough choices, but I cannot think of a
better way, at least right now.

>
>> We also need to change servers in order to deal with existing 1.x
>> clients. When mod_dav_svn and svnserve receive filenames from
>> clients, they must first convert them to NFC.
>
> Actually, libsvn_repos; this has to work with ra_local as well. And it
> would have to maintain a table for converting results back to how the
> client knows them. This is the hard part to get right; imagine:
>
>    $ svn up
>    U čombe
>
> How will the server know if the client represents the "č" in the same
> encoding that the now-normalizing server sends? Will the client scan the
> directory and normalize the names to find the local file that needs
> updating?

Yes, without upgrading working copies, we must do that.

If there is a better way, I would like to know.
Please give us a better solution if you have any idea at all.

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@xbc.nu>.

On 02.02.2012 20:59, Hiroaki Nakamura wrote:
> So we need to change servers too. When servers read filenames from
> repositories, they first convert to NFC and then process commands.

That won't work. You have to do the initial lookup in a
normalization-agnostic way, and neither BDB nor FSFS makes that possible
without scanning whole directories.

> We also need to change servers in order to deal with existing 1.x
> clients. When mod_dav_svn and svnserve receive filenames from
> clients, they must first convert them to NFC.

Actually, libsvn_repos; this has to work with ra_local as well. And it
would have to maintain a table for converting results back to how the
client knows them. This is the hard part to get right; imagine:

    $ svn up
    U čombe

How will the server know if the client represents the "č" in the same
encoding that the now-normalizing server sends? Will the client scan the
directory and normalize the names to find the local file that needs
updating?
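
To show what "scanning whole directories" amounts to, here is a small Python sketch. The function names and the list of directory entries are hypothetical, and this is not how BDB or FSFS store directories. One pass over the directory builds an index keyed by the NFC form, after which lookups no longer depend on the form the caller used.

```python
import unicodedata
from collections import defaultdict

def build_index(dir_entries):
    """Map the NFC form of each name to the names actually stored."""
    index = defaultdict(list)
    for name in dir_entries:
        index[unicodedata.normalize("NFC", name)].append(name)
    return index

def lookup(index, wanted):
    """Return the stored names canonically equivalent to wanted."""
    return index.get(unicodedata.normalize("NFC", wanted), [])
```

The cost is one normalization per entry for every directory touched, and a result list with more than one element signals an NFC/NFD collision.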

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/3 Peter Samuelson <pe...@p12n.org>:
>
> [Hiroaki Nakamura]
>> In option (2), we do n11n on all clients on all platforms, and we
>> include mod_dav_svn in "clients". So we convert all input paths to
>> the "server encoding", which is NFC.
>
> Indeed.  But the very concept of a "server encoding" means we are
> involving the server side.  Which invokes a lot of difficult questions
> like "what about existing 1.x clients", "what about existing checkouts"
> and "what about existing repositories".

Svn 1.7 forces me to upgrade existing 1.6 working copies.
So we can ask users to upgrade working copies.

For existing repositories, I think it would be better to convert them too, using
svnadmin dump/load, with svnadmin load changed to convert filenames to NFC.
However, in reality we cannot force users to convert every existing repository.
So we need to change servers too. When servers read filenames
from repositories, they first convert to NFC and then process commands.

We also need to change servers in order to deal with existing 1.x clients.
When mod_dav_svn and svnserve receive filenames from clients,
they must first convert them to NFC.

>
> By proposing a client-only solution, I hope to avoid _all_ those
> questions.  (Except "what about existing checkouts" - there would be a
> wc upgrade of some sort.)  No recoding of existing repository paths is
> necessary.  In my proposal, the only recoding that is done is on the
> client side, on a platform that does not support the original pathname
> (e.g., OS X HFS+ with a NFC path).
>
>> "All problems in computer science can be solved by another level of
>> indirection."
>
> Mostly true.  I can't tell if you quoted that as a point of support for
> my proposal, or as a point against it.
>
>> Yes, with the mapping table, you can mangle filenames. However I
>> think it is too complex for novice users. Users must care about the
>> original filenames and the mangled filenames all the time.
>
> Well, there is no need to use this same proposal to also work around
> other filesystem limitations like avoiding ":" on Windows.  It is just
> something that becomes _possible_.
>
>> Also you must adapt all clients to use the mapping table. That is
>> whole lot of work! Maybe you would create another version control
>> system.
>
> By "all clients" I guess you mean "all Subversion client libraries".
> Yes, that is the proposal.  It would touch libsvn_wc and probably
> libsvn_client and libsvn_subr.

Yes, like I said above, "clients" actually includes components that
run on servers, like mod_dav_svn, and it should be read as any component
that accesses repositories and working copies.

We also need to change svnserve. So we'd better say "all servers and clients".

>
>> So even if Windows NTFS can have the same abstract filenames in both
>> NFC and NFD simultaneously, we should avoid that, and we should only
>> allow NFC filenames.
>
> This could be done, if we wanted to go to the trouble.  Or we could
> just say "use a pre-commit hook," like we tell people who want to
> prevent REAMDE and Reamde in a single dir.  It is not the same level of
> interoperability problem as the one this thread is about.

If you think in analogy to ASCII uppercase and lowercase examples,
you miss the point. Please reread the Unicode Standard Annex #15
UAX #15: Unicode Normalization Forms
http://unicode.org/reports/tr15/

 > Canonical equivalence is a fundamental equivalency between
 > characters or sequences of characters that represent the same
 > abstract character, and when correctly displayed should always
 > have the same visual appearance and behavior. Figure 1 illustrates
 > this equivalence.

So, filenames in NFC and NFD are equivalent: they are the same name.
README and readme are different.
NFC/NFD and uppercase/lowercase are two different stories.
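
The distinction can be checked with a few lines of Python. This is only an illustration using the standard unicodedata module; canonically_equivalent is a hypothetical helper name. Two strings are canonically equivalent when their normalized forms are identical.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b are the same text under canonical equivalence."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\u010dombe"      # precomposed U+010D (NFC)
decomposed = "c\u030combe"   # "c" followed by U+030C COMBINING CARON (NFD)

print(composed == decomposed)                        # False: code points differ
print(canonically_equivalent(composed, decomposed))  # True: same abstract name
print(canonically_equivalent("README", "readme"))    # False: case is not covered
```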

Should we allow the same filename twice in one directory?
Of course not! If we allow that, we get into real trouble and
confusion.

And OS X HFS+ does not allow that. So to support interoperability
with OS X, we should not allow it in Subversion either.

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Peter Samuelson <pe...@p12n.org>.

[Hiroaki Nakamura]
> In option (2), we do n11n on all clients on all platforms, and we
> include mod_dav_svn in "clients". So we convert all input paths to
> the "server encoding", which is NFC.

Indeed.  But the very concept of a "server encoding" means we are
involving the server side.  Which invokes a lot of difficult questions
like "what about existing 1.x clients", "what about existing checkouts"
and "what about existing repositories".

By proposing a client-only solution, I hope to avoid _all_ those
questions.  (Except "what about existing checkouts" - there would be a
wc upgrade of some sort.)  No recoding of existing repository paths is
necessary.  In my proposal, the only recoding that is done is on the
client side, on a platform that does not support the original pathname
(e.g., OS X HFS+ with a NFC path).

> "All problems in computer science can be solved by another level of
> indirection."

Mostly true.  I can't tell if you quoted that as a point of support for
my proposal, or as a point against it.

> Yes, with the mapping table, you can mangle filenames. However I
> think it is too complex for novice users. Users must care about the
> original filenames and the mangled filenames all the time.

Well, there is no need to use this same proposal to also work around
other filesystem limitations like avoiding ":" on Windows.  It is just
something that becomes _possible_.

> Also you must adapt all clients to use the mapping table. That is
> whole lot of work! Maybe you would create another version control
> system.

By "all clients" I guess you mean "all Subversion client libraries".
Yes, that is the proposal.  It would touch libsvn_wc and probably
libsvn_client and libsvn_subr.

> So even if Windows NTFS can have the same abstract filenames in both
> NFC and NFD simultaneously, we should avoid that, and we should only
> allow NFC filenames.

This could be done, if we wanted to go to the trouble.  Or we could
just say "use a pre-commit hook," like we tell people who want to
prevent REAMDE and Reamde in a single dir.  It is not the same level of
interoperability problem as the one this thread is about.
-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

Everyone, please read
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
again carefully before commenting on this thread!

2012/2/1 Peter Samuelson <pe...@p12n.org>
>
>
> [reordering the conversation flow slightly]
>
>  [Peter Samuelson]
> > > That's the implementation I would like to see, to be honest.  Start
> > > with the observation that we can treat Mac OS X NFD paths as a
> > > client character encoding.  Now observe that it is lossy.  But
> > > ... almost all non-Unicode client charsets are equally lossy, for
> > > exactly the same reason!
>
> [Branko Cibej]
> > I don't see what you mean by "lossy" though. NFD and NFC can
> > represent exactly the same set of characters, it's just that the
> > representations of some of them are different.
>
> By "lossy" I just mean that if you convert to UTF-8 NFD, you can't
> reliably convert _back_ to the original bytes.  I'm assuming here that
> we continue to do _no_ n11n on the server side - pathnames from
> libsvn_(ra|repos|fs) are just UTF-8 with unspecified n11n.  Thus, if
> the "client encoding" is UTF-8 NFD, you can't reliably convert that to
> the "server encoding".

In option (2), we do n11n on all clients on all platforms, and we include
mod_dav_svn in "clients". So we convert all input paths to the
"server encoding", which is NFC.

>
>
> And this is also true of most legacy (non-Unicode) encodings: they know
> nothing about Unicode's n11n forms, so they are "lossy" in the same
> way: you can't reliably take a pathname in, e.g., ISO-8859-1, and
> convert to the encoding found in the repository, because you don't know
> the n11n form used by the original committer.
>
> This is why I suggested the mapping table in wc.db.
>
> Actually, the fact that the mapping table works around the inherent
> lossiness of character encoding conversion suggests that it _could_, in
> the future, also account for lossiness for other reasons.  If we
> wished, we could have libsvn_wc mangle checked-out filenames on
> platforms with arbitrary limitations - escaping "<" and ":" characters
> on Windows, e.g. - using this same mechanism.  Even if the conversion
> is lossy, the mapping table in wc.db knows the original filename.  Of
> course you couldn't _create_ filenames with platform limitations on the
> same platform, but being able to check out the file at all is an
> improvement over today.  Probably 'svn status' would show some
> indication that a name has been mangled in a way users would actually
> care about (i.e., not just NFC/NFD).

"All problems in computer science can be solved by another level of
indirection."
Yes, with the mapping table, you can mangle filenames. However, I think
it is too complex for novice users. Users must care about both the original
filenames and the mangled filenames all the time. Also, you must adapt all
clients to use the mapping table. That is a whole lot of work! Maybe you would
create another version control system.

>
>
> > > The implementation on OS X might be a bit hairy, if there isn't an
> > > easy way to retrieve the real pathname of the file you just
> > > created.  Anywhere else, we just store the pathname we just
> > > calculated.
>
> > Afaik the OSX API normalizes everything to NFD automagically. So at
> > least on that platform there's no chance of having more than one form
> > for the same filename at the same time. Likewise on Windows, which
> > normalizes to NFC.
>
> Right.  The question is, if libsvn_wc tells OS X to store a given path,
> with unknown n11n, is there an easy way to retrieve the pathname that
> was _actually_ stored on disk?  That's what I mean by "might be a bit
> hairy".  It sounds like the thing to do on OS X is for libsvn_wc to
> pre-normalize to NFD before writing the file, and just assume the OS
> will (re-)normalize to the same byte array.

Wrong. Windows NTFS does not normalize filenames to NFC. See
the attached screenshots. We can have two "same abstract" filenames,
one NFC, one NFD in the same directory.

Explorer displays the two filenames identically. Here is a quote
from Unicode Standard Annex #15; Explorer conforms to its
last sentence.

  http://unicode.org/reports/tr15/
  > The Unicode Standard defines two equivalences between characters:
  > canonical equivalence and compatibility equivalence. Canonical
  > equivalence is a fundamental equivalency between characters or
  > sequences of characters that represent the same abstract character,
  > and when correctly displayed should always have the same visual
  > appearance and behavior.

By contrast, Command Prompt displays NFC filenames and NFD filenames
differently. However, according to the Unicode annex above, it should
display them in the same way.

In the attached screenshot named
"NFC and NFD coexists on Command Prompt.png",
the first filename is NFD and the second filename is NFC.
NFC filenames look familiar to me, and this is the first time I have seen
NFD filenames. They look unnatural because each character is decomposed
and its parts are laid out separately, with too much space between them.

If novice users see a situation like the one in the attached screenshots,
they will be very confused and upset. They would think "Oh, why do we have
two identical filenames in one directory? It must be a bug or something!"

So even if Windows NTFS can have the same abstract filenames in both
NFC and NFD simultaneously, we should avoid that, and we should only
allow NFC filenames.

So in a way OS X HFS+ is good, because it avoids the coexistence of
NFC and NFD filenames. We would be very happy if it had chosen NFC
instead of NFD!

--
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Peter Samuelson <pe...@p12n.org>.

[reordering the conversation flow slightly]

  [Peter Samuelson]
> > That's the implementation I would like to see, to be honest.  Start
> > with the observation that we can treat Mac OS X NFD paths as a
> > client character encoding.  Now observe that it is lossy.  But
> > ... almost all non-Unicode client charsets are equally lossy, for
> > exactly the same reason!

[Branko Cibej]
> I don't see what you mean by "lossy" though. NFD and NFC can
> represent exactly the same set of characters, it's just that the
> representations of some of them are different.

By "lossy" I just mean that if you convert to UTF-8 NFD, you can't
reliably convert _back_ to the original bytes.  I'm assuming here that
we continue to do _no_ n11n on the server side - pathnames from
libsvn_(ra|repos|fs) are just UTF-8 with unspecified n11n.  Thus, if
the "client encoding" is UTF-8 NFD, you can't reliably convert that to
the "server encoding".

And this is also true of most legacy (non-Unicode) encodings: they know
nothing about Unicode's n11n forms, so they are "lossy" in the same
way: you can't reliably take a pathname in, e.g., ISO-8859-1, and
convert to the encoding found in the repository, because you don't know
the n11n form used by the original committer.

This is why I suggested the mapping table in wc.db.
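
A rough Python sketch of what "lossy" means here, assuming two sample
repository paths of my own invention and two hypothetical helper functions
(to_client_nfd and to_client_latin1); none of this is Subversion code. Two
paths that are distinct on the server collapse to the same bytes on the
client, so the client bytes alone cannot tell you which one was committed.

```python
import unicodedata

def to_client_nfd(server_path: str) -> bytes:
    """Convert a server path to the bytes an NFD-only client would store."""
    return unicodedata.normalize("NFD", server_path).encode("utf-8")

def to_client_latin1(server_path: str) -> bytes:
    """Convert a server path to ISO-8859-1, composing first so it is encodable."""
    return unicodedata.normalize("NFC", server_path).encode("iso-8859-1")

# Two distinct repository paths: the same visible name, committed in
# different normalization forms.
server_composed = "caf\u00e9.txt"      # e-acute as one code point
server_decomposed = "cafe\u0301.txt"   # 'e' followed by COMBINING ACUTE ACCENT

distinct_on_server = (server_composed.encode("utf-8")
                      != server_decomposed.encode("utf-8"))
same_in_nfd = to_client_nfd(server_composed) == to_client_nfd(server_decomposed)
same_in_latin1 = (to_client_latin1(server_composed)
                  == to_client_latin1(server_decomposed))

print(distinct_on_server, same_in_nfd, same_in_latin1)
```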

Actually, the fact that the mapping table works around the inherent
lossiness of character encoding conversion suggests that it _could_, in
the future, also account for lossiness for other reasons.  If we
wished, we could have libsvn_wc mangle checked-out filenames on
platforms with arbitrary limitations - escaping "<" and ":" characters
on Windows, e.g. - using this same mechanism.  Even if the conversion
is lossy, the mapping table in wc.db knows the original filename.  Of
course you couldn't _create_ filenames with platform limitations on the
same platform, but being able to check out the file at all is an
improvement over today.  Probably 'svn status' would show some
indication that a name has been mangled in a way users would actually
care about (i.e., not just NFC/NFD).

> > The implementation on OS X might be a bit hairy, if there isn't an
> > easy way to retrieve the real pathname of the file you just
> > created.  Anywhere else, we just store the pathname we just
> > calculated.

> Afaik the OSX API normalizes everything to NFD automagically. So at
> least on that platform there's no chance of having more than one form
> for the same filename at the same time. Likewise on Windows, which
> normalizes to NFC.

Right.  The question is, if libsvn_wc tells OS X to store a given path,
with unknown n11n, is there an easy way to retrieve the pathname that
was _actually_ stored on disk?  That's what I mean by "might be a bit
hairy".  It sounds like the thing to do on OS X is for libsvn_wc to
pre-normalize to NFD before writing the file, and just assume the OS
will (re-)normalize to the same byte array.
-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 31.01.2012 02:47, Bert Huijben wrote:
> Last time we discussed this in depth (a few years ago), Windows didn't perform the normalization you describe here.
> Was this added later? (Any documentation pointers?)

Ouch, you're right ... Windows API doesn't normalize the paths.

-- Brane

RE: Let's discuss about unicode compositions for filenames!

Posted by Bert Huijben <be...@qqmail.nl>.


> -----Original Message-----
> From: Branko Čibej [mailto:brane@xbc.nu]
> Sent: maandag 30 januari 2012 16:11
> To: dev@subversion.apache.org
> Subject: Re: Let's discuss about unicode compositions for filenames!
> 
> On 31.01.2012 00:14, Peter Samuelson wrote:
> > [Stefan Sperling]
> >> It is indeed harder because we are passing paths verbatim to sqlite.
> >> I doubt having more than one form of a given path in wc.db is fun...
> > That's the implementation I would like to see, to be honest.  Start
> > with the observation that we can treat Mac OS X NFD paths as a client
> > character encoding.  Now observe that it is lossy.  But ... almost all
> > non-Unicode client charsets are equally lossy, for exactly the same
> > reason!
> >
> > This suggests maintaining a mapping table in wc.db between server paths
> > (UTF-8, unspecified NF) and wc paths (local charset, which is
> > occasionally UTF-8 with NFD).
> >
> > This mapping table would be maintained any time we write to the wc.
> > It would be consulted any time we search for files in the wc.
> >
> > It's not really extra work - we have to do those UTF-8 <-> local
> > charset conversions all the time anyway.  This would in fact cache
> > those conversions.
> >
> > The implementation on OS X might be a bit hairy, if there isn't an easy
> > way to retrieve the real pathname of the file you just created.
> > Anywhere else, we just store the pathname we just calculated.
> >
> 
> Afaik the OSX API normalizes everything to NFD automagically. So at
> least on that platform there's no chance of having more than one form
> for the same filename at the same time. Likewise on Windows, which
> normalizes to NFC.
> 
> I don't see what you mean by "lossy" though. NFD and NFC can represent
> exactly the same set of characters, it's just that the representations
> of some of them are different. Thus, this does not preclude normalizing
> the paths in wc.db, and that's even easily automated. If such a
> conversion finds a name collision ... the user is in serious trouble
> already. :)
> 
> It's more likely to find such a collision on Unix than either Mac OS or
> Windows (both of which normalize on the FS API level). But this case is
> probably so rare that I wouldn't worry about it.

Last time we discussed this in depth (a few years ago), Windows didn't perform the normalization you describe here.
Was this added later? (Any documentation pointers?)

I think the keyboard/editor support performs some normalization so users are unlikely to create non-normalized sequences, but our old documents say that it just stores whatever it gets passed.
(Probably for the same reason as Subversion does it: compatibility with the time when we didn't know about these problems)

	Bert
> 
> -- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@xbc.nu>.

On 31.01.2012 00:14, Peter Samuelson wrote:
> [Stefan Sperling]
>> It is indeed harder because we are passing paths verbatim to sqlite.
>> I doubt having more than one form of a given path in wc.db is fun...
> That's the implementation I would like to see, to be honest.  Start
> with the observation that we can treat Mac OS X NFD paths as a client
> character encoding.  Now observe that it is lossy.  But ... almost all
> non-Unicode client charsets are equally lossy, for exactly the same
> reason!
>
> This suggests maintaining a mapping table in wc.db between server paths
> (UTF-8, unspecified NF) and wc paths (local charset, which is
> occasionally UTF-8 with NFD).
>
> This mapping table would be maintained any time we write to the wc.
> It would be consulted any time we search for files in the wc.
>
> It's not really extra work - we have to do those UTF-8 <-> local
> charset conversions all the time anyway.  This would in fact cache
> those conversions.
>
> The implementation on OS X might be a bit hairy, if there isn't an easy
> way to retrieve the real pathname of the file you just created.
> Anywhere else, we just store the pathname we just calculated.
>

Afaik the OSX API normalizes everything to NFD automagically. So at
least on that platform there's no chance of having more than one form
for the same filename at the same time. Likewise on Windows, which
normalizes to NFC.

I don't see what you mean by "lossy" though. NFD and NFC can represent
exactly the same set of characters, it's just that the representations
of some of them are different. Thus, this does not preclude normalizing
the paths in wc.db, and that's even easily automated. If such a
conversion finds a name collision ... the user is in serious trouble
already. :)

It's more likely to find such a collision on Unix than either Mac OS or
Windows (both of which normalize on the FS API level). But this case is
probably so rare that I wouldn't worry about it.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Peter Samuelson <pe...@p12n.org>.

[Stefan Sperling]
> It is indeed harder because we are passing paths verbatim to sqlite.
> I doubt having more than one form of a given path in wc.db is fun...

That's the implementation I would like to see, to be honest.  Start
with the observation that we can treat Mac OS X NFD paths as a client
character encoding.  Now observe that it is lossy.  But ... almost all
non-Unicode client charsets are equally lossy, for exactly the same
reason!

This suggests maintaining a mapping table in wc.db between server paths
(UTF-8, unspecified NF) and wc paths (local charset, which is
occasionally UTF-8 with NFD).

This mapping table would be maintained any time we write to the wc.
It would be consulted any time we search for files in the wc.

It's not really extra work - we have to do those UTF-8 <-> local
charset conversions all the time anyway.  This would in fact cache
those conversions.

The implementation on OS X might be a bit hairy, if there isn't an easy
way to retrieve the real pathname of the file you just created.
Anywhere else, we just store the pathname we just calculated.
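
A minimal sketch of such a mapping table, written in Python with an
in-memory SQLite database. Everything named here is hypothetical: the table
path_map, its columns, and the helpers local_form, record_write and
server_path_for do not exist in wc.db or libsvn_wc. The local form is
approximated by plain NFD, which is an assumption; the real HFS+ rules
differ in some details.

```python
import sqlite3
import unicodedata

def local_form(server_path: str) -> str:
    # Assumption: the client filesystem stores names in (plain) NFD.
    return unicodedata.normalize("NFD", server_path)

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE path_map ("
    " local_path TEXT PRIMARY KEY,"   # name as found on disk
    " server_path TEXT NOT NULL)"     # name exactly as the repository has it
)

def record_write(server_path: str) -> str:
    """Called whenever a file is written to the wc; returns the on-disk name."""
    local = local_form(server_path)
    db.execute("INSERT OR REPLACE INTO path_map VALUES (?, ?)",
               (local, server_path))
    return local

def server_path_for(local_path: str):
    """Called whenever a file is looked up in the wc; None if unknown."""
    row = db.execute("SELECT server_path FROM path_map WHERE local_path = ?",
                     (local_path,)).fetchone()
    return row[0] if row else None

on_disk = record_write("caf\u00e9.txt")   # server sent the composed form
print(on_disk == "cafe\u0301.txt", server_path_for(on_disk) == "caf\u00e9.txt")
```

Note that INSERT OR REPLACE silently overwrites an earlier row when two
server paths map to the same local name, so a real implementation would
have to detect and report that collision instead.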

Peter

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Mon, Jan 30, 2012 at 09:34:03PM +0100, Branko Čibej wrote:
> Sure, if you want to turn on such normalization, you pretty much have to
> dump and reload the repository as well as upgrading all working copies
> (again). Either that, or use form-independent comparison on the server,
> which isn't such a bad idea anyway. Doing that in wc.db is probably harder.

It is indeed harder because we are passing paths verbatim to sqlite.
I doubt having more than one form of a given path in wc.db is fun...

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 30.01.2012 21:29, Stefan Sperling wrote:
> On Mon, Jan 30, 2012 at 09:09:22PM +0100, Branko Čibej wrote:
>> Are you seriously proposing that we /support/ such broken, hackish
>> nonsense? How do you expect users to tell the difference between file
>> names that look identical on the character level, but are not on the
>> code point level?
>>
>> Supporting such hacks would only be a source of bug reports. I don't see
>> this as a desirable feature.
> The question is why you would want to break it now that it works.
> Because of HFS+? Isn't what HFS+ does just as broken if you think
> about it? Why normalise paths in the filesystem if nobody else does it?

You're aware that MacPorts subversion already has a hack to normalize
the other way, at least over the wire. :)

Sure, if you want to turn on such normalization, you pretty much have to
dump and reload the repository as well as upgrading all working copies
(again). Either that, or use form-independent comparison on the server,
which isn't such a bad idea anyway. Doing that in wc.db is probably harder.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/17 Vincent Lefevre <vi...@vinc17.net>:
> On 2012-02-17 13:54:35 +0900, Hiroaki Nakamura wrote:
>> Actually, whether filename is in NFC or NFD depends on the way of
>> inputting filenames.
>> If you type all characters, it is in NFC.
>
> No, or actually, perhaps this depends on the user configuration
> (e.g. keyboard configuration / input method). Here's from an old
> mail I wrote:
>
>  I don't use Terminal (except sometimes for testing) because I don't
>  like its UI. It is possible to type characters with dead keys, but
>  they are entered in the NFD form. That's another reason why I don't
>  use Terminal, because many applications don't support the combining
>  characters very well.
>
> And it seems that other users had similar problems. I've seen a
> report from someone seeing a difference between Apple's Terminal
> and iTerm when typing his password, which had a non-ASCII character.

Thanks for pointing that out. I used Terminal and Google Japanese
Input on Lion. So, it depends on the user configuration.

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Vincent Lefevre <vi...@vinc17.net>.

On 2012-02-17 13:54:35 +0900, Hiroaki Nakamura wrote:
> Actually, whether filename is in NFC or NFD depends on the way of
> inputting filenames.
> If you type all characters, it is in NFC.

No, or actually, perhaps this depends on the user configuration
(e.g. keyboard configuration / input method). Here's from an old
mail I wrote:

  I don't use Terminal (except sometimes for testing) because I don't
  like its UI. It is possible to type characters with dead keys, but
  they are entered in the NFD form. That's another reason why I don't
  use Terminal, because many applications don't support the combining
  characters very well.

And it seems that other users had similar problems. I've seen a
report from someone seeing a difference between Apple's Terminal
and iTerm when typing his password, which had a non-ASCII character.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/17 Vincent Lefevre <vi...@vinc17.net>:
> On 2012-01-30 21:29:41 +0100, Stefan Sperling wrote:
>> On Mon, Jan 30, 2012 at 09:09:22PM +0100, Branko Čibej wrote:
>> > Are you seriously proposing that we /support/ such broken, hackish
>> > nonsense? How do you expect users to tell the difference between file
>> > names that look identical on the character level, but are not on the
>> > code point level?
>> >
>> > Supporting such hacks would only be a source of bug reports. I don't see
>> > this as a desirable feature.
>>
>> The question is why you would want to break it now that it works.
>> Because of HFS+? [...]
>
> I think you mean because of Mac OS X. Indeed, unless this has changed,
> with the Mac OS X Terminal, when a user types an accented character,
> it is in NFD at the command line level. So, even if the user uses a
> conventional file system that can store both NFC and NFD, the filename
> will be in NFD, which will annoy Linux users.

Actually, whether a filename is in NFC or NFD depends on how the
filename is entered.
If you type all characters, it is in NFC.
If you use shell filename completion by hitting the tab key, it is in NFD.
I tried this with Japanese filenames and confirmed it.

So, it is HFS+ which returns the filenames in NFD.
-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Vincent Lefevre <vi...@vinc17.net>.

On 2012-01-30 21:29:41 +0100, Stefan Sperling wrote:
> On Mon, Jan 30, 2012 at 09:09:22PM +0100, Branko Čibej wrote:
> > Are you seriously proposing that we /support/ such broken, hackish
> > nonsense? How do you expect users to tell the difference between file
> > names that look identical on the character level, but are not on the
> > code point level?
> >
> > Supporting such hacks would only be a source of bug reports. I don't see
> > this as a desirable feature.
> 
> The question is why you would want to break it now that it works.
> Because of HFS+? [...]

I think you mean because of Mac OS X. Indeed, unless this has changed,
with the Mac OS X Terminal, when a user types an accented character,
it is in NFD at the command line level. So, even if the user uses a
conventional file system that can store both NFC and NFD, the filename
will be in NFD, which will annoy Linux users.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Mon, Jan 30, 2012 at 09:09:22PM +0100, Branko Čibej wrote:
> Are you seriously proposing that we /support/ such broken, hackish
> nonsense? How do you expect users to tell the difference between file
> names that look identical on the character level, but are not on the
> code point level?
>
> Supporting such hacks would only be a source of bug reports. I don't see
> this as a desirable feature.

The question is why you would want to break it now that it works.
Because of HFS+? Isn't what HFS+ does just as broken if you think
about it? Why normalise paths in the filesystem if nobody else does it?

I'd prefer a universe where svn normalises anything to NFC from the
1.0 release onwards. Alas, we're in the wrong one.
Compare http://www.qwantz.com/index.php?comic=34 and following.
. o O (Where's my goatee?)

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@xbc.nu>.

On 30.01.2012 21:00, Johan Corveleyn wrote:
> On Mon, Jan 30, 2012 at 8:10 PM, Stefan Sperling <st...@elego.de> wrote:
>> On Tue, Jan 31, 2012 at 01:42:21AM +0900, Hiroaki Nakamura wrote:
>>> 2012/1/30 Stefan Sperling <st...@elego.de>:
> [ ... ]
>
>> And mixing various unicode forms works fine today if the filesystem
>> used by the client supports this. The use case Neels contrived, where
>> developers want to test their code with unicode filenames in various
>> NFD/NFC forms, and check those test files into Subversion, should still
>> be supported.
> Indeed.
>
> Though this means that unconditional NFC (or whatever) normalization
> in the working copy database is not an option, since it precludes
> representing multiple forms at the same time in the wc. Except maybe
> dependent on the (filesystem of the) client platform.

Are you seriously proposing that we /support/ such broken, hackish
nonsense? How do you expect users to tell the difference between file
names that look identical on the character level, but are not on the
code point level?

Supporting such hacks would only be a source of bug reports. I don't see
this as a desirable feature.

And as for doing the server-side checks in pre-commit hooks ... I guess
you could write a whole libsvn_repos implementation merely as a set of
pre-commit hooks, but who would want to? Hooks aren't intended for
implementing core functionality.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Johan Corveleyn <jc...@gmail.com>.

On Mon, Jan 30, 2012 at 8:10 PM, Stefan Sperling <st...@elego.de> wrote:
> On Tue, Jan 31, 2012 at 01:42:21AM +0900, Hiroaki Nakamura wrote:
>> 2012/1/30 Stefan Sperling <st...@elego.de>:

[ ... ]

> And mixing various unicode forms works fine today if the filesystem
> used by the client supports this. The use case Neels contrived, where
> developers want to test their code with unicode filenames in various
> NFD/NFC forms, and check those test files into Subversion, should still
> be supported.

Indeed.

Though this means that unconditional NFC (or whatever) normalization
in the working copy database is not an option, since it precludes
representing multiple forms at the same time in the wc. Except maybe
dependent on the (filesystem of the) client platform.

Of course, if a repository needs to support also checkouts to OSX/HFS+
clients, it should be configured to disallow multiple (conflicting)
forms to enter the repository. This can be done with a pre-commit
hook, similar to case-insensitive.py [1], which does the same for
case-clashing files.

(BTW, case-insensitive.py works by comparing incoming adds with the
list of directory entries of the corresponding directory within the
txn (comparing their normalized forms))
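
The same comparison, applied to composition instead of case, could look
roughly like the Python sketch below. It is not taken from
case-insensitive.py and calls no Subversion API; the function name
find_composition_clashes and its two list arguments are assumptions, and a
real hook would have to obtain those lists from the txn itself.

```python
import unicodedata

def find_composition_clashes(existing_entries, added_names):
    """Return (added, conflicting) pairs whose NFC forms are identical.

    existing_entries: names already in the directory within the txn.
    added_names: names being added to that directory by the commit.
    """
    seen = {}
    for name in existing_entries:
        seen.setdefault(unicodedata.normalize("NFC", name), name)

    clashes = []
    for name in added_names:
        key = unicodedata.normalize("NFC", name)
        other = seen.get(key)
        if other is not None and other != name:
            # Same abstract name, different code point sequence: reject.
            clashes.append((name, other))
        else:
            seen.setdefault(key, name)
    return clashes

clashes = find_composition_clashes(["caf\u00e9.txt"],
                                   ["cafe\u0301.txt", "new.txt"])
print(clashes)
```

A hook built on this would exit non-zero when the returned list is not
empty, which makes the server refuse the commit.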

-- 
Johan

[1] http://svn.apache.org/repos/asf/subversion/trunk/contrib/hook-scripts/case-insensitive.py

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Tue, Jan 31, 2012 at 01:42:21AM +0900, Hiroaki Nakamura wrote:
> 2012/1/30 Stefan Sperling <st...@elego.de>:
> > My friend is not willing to upgrade to a new client version yet, which
> > is fine because all 1.x releases of Subversion clients are supposed
> > to be compatible with all 1.y releases of Subversion servers. He should
> > not have to upgrade his client just because the server was upgraded.
> >
> > In his working copy, the file name is also in NFD form. When he
> > talks to the server, the server provides the name in NFC. Because he
> > is using the old client the client has no way of knowing how to map
> > the NFC name to its local NFD file. So we've broken backwards
> > compatibility for my friend.
> 
> I think we cannot avoid this. So this patch is for 2.x, which may
> break backward compatibility.

If we are ever going to break compatibility, this issue will
certainly be addressed by normalising all paths as you suggest.
It was an unfortunate oversight that no NFD/NFC normalisation
was implemented in the first place :(

However, we really do not want to break compatibility at this time.
A solution that does not require us to break compatibility would
be much better. Nobody knows yet when the time for 2.x will come.

As far as I know, HFS+ is the only filesystem that has this problem.
It is possible to use other filesystems on Mac OS X as a workaround.
For example UFS, ext2, or NTFS (via FUSE).

I think Subversion's backwards compatibility is very important and
should not be jeopardised because of the behaviour of one filesystem.

> If we have two files with the same filename, one in NFC and the other in NFD,
> normalizing all paths to NFC is a real headache. The only thing
> we can do is keep one of the two files and throw the other away.
> 
> In reality, I think this is a rare case. If we find such a collision when upgrading
> repositories, we should stop and provide a way for users to choose which
> one to keep.

I agree that this is probably a rare case in practice. However, we must
be prepared to handle it. Users who run into this problem can lose the
ability to use newer versions of Subversion to read their data.
This cannot be allowed to happen because we want to stay compatible.

> > As you can see, there is a lot of complexity involved in fixing this
> > issue. I hope you aren't discouraged by this. Someone will need to
> > explore the details of these problems to fix this issue. I am not convinced
> > that it is impossible to fix. We'll need to be very careful about backwards
> > compatibility when making decisions. But there might be ways to achieve a
> > satisfying solution nonetheless.
> 
> As other people have said, we should prohibit NFC/NFD collisions of the same
> filename, not in the Subversion system itself but through operational rules:
> just don't do that.

So far, "don't do that" has been the answer to this entire problem.
We've been telling people if they want to use non-ASCII characters
with both Windows/Linux and Mac OS X clients they should not be using HFS+.

And mixing various unicode forms works fine today if the filesystem
used by the client supports this. The use case Neels contrived, where
developers want to test their code with unicode filenames in various
NFD/NFC forms, and check those test files into Subversion, should still
be supported.

> Then the remaining problem seems rather simple. Convert *all* input paths to NFC
> first, then do the work. All input means paths passed to servers from clients,
> paths obtained by servers from repositories, paths obtained by clients from
> working copies. Is that correct?

Yes, that is correct. Also, paths obtained by clients from the local
filesystem, and paths sent by servers to clients.

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

Hi,

2012/1/30 Stefan Sperling <st...@elego.de>:
> On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:

> Let's say I have a working copy which contains filenames normalised
> to NFD, as is the case on Mac OS X. The server gets upgraded to a new
> release of Subversion which contains your patch. This means the server
> will now send all paths as NFC. Let's say there are changes made to a
> file which has 3 "a umlaut" characters in its name. When I run 'svn update'
> my client will try to find the NFC form of the name in its meta-data,
> and fail to locate it because the file was stored as NFD.

Well, my patch is supposed to be applied to both servers and clients.
Clients with the patched svn_path_cstring_to_utf8 in libsvn_subr/path.c will
convert NFD paths obtained from the local filesystem to NFC on the client side.

>
> So this means your patch will break compatibility with the working copy.
> Therefore, we would need to provide an upgrade path for those working
> copies. E.g. 'svn upgrade' could be modified to normalise all filenames
> stored in the DB to NFC. Problem solved.
>
> But now comes the next problem. Given a filename in NFC which we read from
> meta data, how can we locate the corresponding on-disk file if its form
> is not NFC? We could of course rename the on-disk file. Except this
> won't work on Mac OS X unless we decide to use NFD encoding. So we could
> decide to also use NFD everywhere -- but this would break as soon as
> some other operating system decides to normalise to NFC, so it's not a
> good solution. We could also open the parent directory, read all the
> filenames within it, normalise them all, and then search the resulting
> list. This works, except if a name exists twice, once in NFC form and once
> in NFD form. We'd somehow have to solve the name collision in the
> filesystem.

In my experiments, NFC filenames in meta-data are automatically converted
by the filesystem and saved as NFD filenames on Mac OS X. I committed NFC
filenames on Windows to my Linux server, then checked them out on Mac OS X
and found that the filenames were NFD. So we will just use NFC everywhere in
Subversion.

On the client side, we must first convert NFD filenames obtained from Mac OS X
filesystems to NFC, and after that we just compare them to NFC filenames
in meta-data.
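
As a rough sketch of that comparison (in Python rather than the C of the
patch, and with a hypothetical helper name find_on_disk), the client could
normalize every name in the directory listing and look for the NFC name
from meta-data. The directory listing is passed in as a plain list here, so
the example does not depend on how any particular filesystem behaves.

```python
import unicodedata

def find_on_disk(metadata_name, directory_listing):
    """Return the on-disk name matching an NFC name from meta-data.

    Returns None if nothing matches, and raises ValueError if more than one
    on-disk name normalizes to the same NFC form (the collision case).
    """
    wanted = unicodedata.normalize("NFC", metadata_name)
    matches = [name for name in directory_listing
               if unicodedata.normalize("NFC", name) == wanted]
    if len(matches) > 1:
        raise ValueError("NFC/NFD name collision: %r" % (matches,))
    return matches[0] if matches else None

# Meta-data holds three precomposed "a umlaut" characters; the disk holds NFD.
listing = ["README", "a\u0308a\u0308a\u0308.txt"]
found = find_on_disk("\u00e4\u00e4\u00e4.txt", listing)
print(found == "a\u0308a\u0308a\u0308.txt")
```

On a filesystem that can hold both forms at once, the ValueError branch is
where the user would have to be asked which file to keep.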

>
> But well, let's assume we had a way of storing NFC in meta-data and not
> caring about the on-disk form. Now things get even more complicated.
>
> My friend is not willing to upgrade to a new client version yet, which
> is fine because all 1.x releases of Subversion clients are supposed
> to be compatible with all 1.y releases of Subversion servers. He should
> not have to upgrade his client just because the server was upgraded.
>
> In his working copy, the file name is also in NFD form. When he
> talks to the server, the server provides the name in NFC. Because he
> is using the old client the client has no way of knowing how to map
> the NFC name to its local NFD file. So we've broken backwards
> compatibility for my friend.

I think we cannot avoid this. So this patch is for 2.x, which may
break backward compatibility.

>
> But it gets worse. Recall the filesystem name collision problem
> mentioned above. This problem can also happen in the repository
> filesystem! For instance, assume that in the repository there already
> exist two filenames, one NFD, the other NFC, but they both are actually
> the same name. This currently works fine, except on Mac OS X.
> What should be done now when the server is upgraded to normalise all paths
> to NFC? How can we still access content of the file which has the name
> in NFD form? Should one of the files be renamed in the HEAD revision?
> Or all historic revisions? Or removed from history? How do we help users
> carrying out such upgrades, without breaking existing working copies used
> by older clients which do not know anything about the NFC/NFD problem?

If we have two files with the same filename, one in NFC and the other in NFD,
normalizing all paths to NFC is a real headache. The only thing
we can do is keep one of the two files and throw the other away.

In reality, I think this is a rare case. If we find such a collision when upgrading
repositories, we should stop and provide a way for users to choose which
one to keep.

>
> These are the questions which we'll need to answer to solve this issue.
> I honestly do not have good answers. I hope that you will find ways of
> solving these problems.
>
> There may even be more problems hidden here which I haven't thought of yet.
> It will be quite hard to thoroughly make sure that no unforeseen problems
> will arise when this issue gets fixed one way or another. A good solution
> needs to be carefully planned, implemented, and thoroughly tested.
>
> I think the following caveats would be acceptable if they help
> with fixing the issue:
>
>  - An upgrade path which optionally requires people to check all
>   working copies out again, when either the server or the client is upgraded.
>   Note again, this must be _optional_. Only people affected by the issue
>   should have to make this choice, e.g. by changing configuration
>   parameters from the default settings. By default, existing working
>   copies should keep working after upgrading the client or server.
>   Because imagine what would happen if an upgrade of the server broke
>   many working copies checked out from a hosting service such as
>   sourceforge.net -- not good.

Existing working copies may have NFD filenames, so if the upgrade is optional,
we must take care of them. However, that is easy. We just always convert
filenames obtained from working copy meta-data to NFC before any
comparisons.


>
>  - An upgrade path which requires everyone to run 'svn upgrade' on their
>   working copies in order to use the new client version, but not the
>   new server version.
>
>  - An upgrade path which requires people to dump/load their existing
>   repositories in order to get rid of the problem. Existing
>   repositories which are left alone should keep working as they do
>   today, with problems on Mac OS X clients but no problems on other
>   clients (anything else would cause too much breakage and confusion).
>   E.g. this step could normalise all paths in all revisions. But keep in
>   mind the problem of name collisions which can happen when the same name
>   exists as both NFC and NFD. Something needs to happen in this case to
>   resolve the problem, ideally giving users a choice about how to proceed.

I agree.

>
> As you can see, there is a lot of complexity involved in fixing this
> issue. I hope you aren't discouraged by this. Someone will need to
> explore the details of these problems to fix this issue. I am not convinced
> that it is impossible to fix. We'll need to be very careful about backwards
> compatibility when making decisions. But there might be ways to achieve a
> satisfying solution nonetheless.

As other people have said, we should prohibit NFC/NFD collisions of the same
filename, not in the Subversion system itself but through operational rules:
just don't do that.

Then the remaining problem seems rather simple. Convert *all* input paths to NFC
first, then do the work. All input means paths passed to servers from clients,
paths obtained by servers from repositories, paths obtained by clients from
working copies. Is that correct?


-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Julian Foad <ju...@btopenworld.com>.

Let me just note some of the main similarities and differences between this issue of Unicode compositions and the issue of case-sensitivity in file names.

Differences:

  * NFC and NFD look the same when displayed, and most users haven't heard of them and don't expect that a computer might treat two identical-looking filenames as different.  With letter case, most users are aware that some systems treat upper and lower case letters as the same while other systems treat them as different, and they learn to behave according to the system's rules.

  * The main case-insensitive file systems are case-preserving with no "normal form", whereas the main system that treats NFC and NFD as equivalent (MacOS) chooses one form as the "normal form" and always normalizes the given file name to that form.

Similarities:

  * If two Unicode strings differ only by letter case, on some computer systems they refer to the same file, while on other systems they refer to different files.  The rules are created by the designers of the systems, sometimes explicitly and sometimes implicitly.  Different parts of a system can have different rules.  The same applies if two Unicode strings differ only by composition.

  * Subversion interoperates with different systems.  When two file names that differ only by letter case are transferred from a case-sensitive system to a case-insensitive system, they will collide and Subversion should handle this in some friendly way.  The same applies if two file names differ only by composition.

The differences are important, but the similarities are enough that we should be looking for some commonality in the implementation.
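
One way to picture that commonality is a single "collision key" function that
can fold case, composition, or both. This is only a sketch in Python; the names
collision_key and find_collisions are hypothetical and nothing like them exists
in Subversion today:

```python
import unicodedata
from collections import defaultdict


def collision_key(name, fold_case=False, fold_composition=False):
    """Map a file name to the key a given target system would compare on."""
    key = name
    if fold_composition:
        key = unicodedata.normalize("NFC", key)
    if fold_case:
        key = key.casefold()
    return key


def find_collisions(names, **folding):
    """Group names that the target system would treat as the same file."""
    groups = defaultdict(list)
    for name in names:
        groups[collision_key(name, **folding)].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

With the same detection code, only the folding rules change per platform; how
to resolve a detected collision in a friendly way is a separate question.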

- Julian

Re: Let's discuss about unicode compositions for filenames!

Posted by Thomas Åkesson <th...@akesson.cc>.

On 12 feb 2012, at 16:59, Stefan Sperling wrote:

> On Sun, Feb 12, 2012 at 04:47:45PM +0100, Thomas Åkesson wrote:
>> Would it make sense to formalize the different approaches into a
>> couple of RFCs attempting to summarize the respective implications of
>> each approach? I could try to write one up for the "Non-normalizing
>> approach". 
> 
> Detailed design specs and proposals are always welcome and useful.
> Of course, we cannot tell in advance whether they will be implemented.
> But the more concrete and detailed information we have about different
> solutions and trade-offs, the more likely it is that this problem will
> be resolved at some point.

I have almost completed a first iteration of a more detailed spec for the proposal. I have difficulty with the details of wc-ng since I have no "under the hood" knowledge. I hope someone is willing to help out with wc-ng insight. I will happily iterate the proposal to capture that.

Will post the proposal tomorrow... ok today (Monday).

/Thomas Å.

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Sun, Feb 12, 2012 at 04:47:45PM +0100, Thomas Åkesson wrote:
> Would it make sense to formalize the different approaches into a
> couple of RFCs attempting to summarize the respective implications of
> each approach? I could try to write one up for the "Non-normalizing
> approach". 

Detailed design specs and proposals are always welcome and useful.
Of course, we cannot tell in advance whether they will be implemented.
But the more concrete and detailed information we have about different
solutions and trade-offs, the more likely it is that this problem will
be resolved at some point.

Re: Let's discuss about unicode compositions for filenames!

Posted by Thomas Åkesson <th...@akesson.cc>.

On 11 feb 2012, at 13:10, Hiroaki Nakamura wrote:

> Hi,
> 
> 2012/2/9 Thomas Åkesson <th...@akesson.cc>:
>> Hi,
>> I have been interested in this issue for a couple of years and I remember it was discussed briefly at Subconf in Germany a couple of years ago.
>> 
>> Branching the thread here because I'd like to propose a different approach than Hiroaki. This proposition is not very different from the note "unicode-composition-for-filenames" or what Peter S, Neels and others suggested, perhaps just combining 2 changes slightly differently.
>> 
>> This is based on my limited understanding of WC-NG, please correct me if I make incorrect assumptions.
>> 
>> - Server will still accept both NFC and NFD, however, it will no longer accept collisions. Enforced by normalising to NFD before uniqueness checks during add operations (yes, might be more expensive). There will be no unified normalisation, but the subversion server will work like most filesystems; return what was given to it.
> 
> For compatibility, we cannot ignore existing repositories and working
> copies which have filename collisions. So we cannot force Subversion
> servers and clients to normalize filenames.
> We must let users choose, per repository, whether filenames are
> normalized or not.
> 

Perhaps I did not describe this well enough, but I am _not_ suggesting a normalized repository storage, just a normalized uniqueness check during add operations. I believe that a normalized repository storage would cause too many compatibility issues with historical data (as well as other negative effects noted below).

The proposition I outlined has _no_ issues whatsoever with existing repositories or working copies, even if they do have name collisions (which we all agree is rare). What would change is the ability to create _new_ name collisions (normalized), while old name collisions could be resolved with 'svn mv'.
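
To make the distinction concrete, here is a rough Python sketch of a
uniqueness check that normalizes only the comparison key and stores the name
exactly as given. The Directory class is hypothetical and stands in for
whatever the server does during an add; it does not model pre-existing
collisions:

```python
import unicodedata


class Directory:
    """Hypothetical directory that stores names as given but compares on NFD."""

    def __init__(self):
        self._by_key = {}  # NFD comparison key -> name exactly as it was added

    def add(self, name):
        key = unicodedata.normalize("NFD", name)
        existing = self._by_key.get(key)
        if existing is not None:
            # Refuse up-front instead of silently translating the name.
            raise ValueError(f"{name!r} collides with existing {existing!r}")
        self._by_key[key] = name

    def names(self):
        # Names come back byte-for-byte as they were given.
        return list(self._by_key.values())
```

Adding "\u00e4.txt" and then "a\u0308.txt" to the same Directory would fail on
the second add, while names() would still return the first name unchanged.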

I am not sure anyone has yet voiced the opinion that Subversion must continue to accept the creation of new name collisions. Anyone? I think Neels was closest to that opinion, but my interpretation is that he suggested that a Subversion server should not normalize. The more times I read Neels' post (2012-01-30), the more obvious it is that what I proposed is very similar.

There is consensus that a high priority for Subversion is compatibility. Introducing a normalization/translation/etc is risky business for compatibility. The HFS+ file system has been chastised (both here and other dev-lists) for its behaviour. A file system is expected to return exactly what was stored, or refuse up-front. 

Would it make sense to formalize the different approaches into a couple of RFCs attempting to summarize the respective implications of each approach? I could try to write one up for the "Non-normalizing approach". 

/Thomas Å.

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

Hi,

2012/2/9 Thomas Åkesson <th...@akesson.cc>:
> Hi,
> I have been interested in this issue for a couple of years and I remember it was discussed briefly at Subconf in Germany a couple of years ago.
>
> Branching the thread here because I'd like to propose a different approach than Hiroaki. This proposition is not very different from the note "unicode-composition-for-filenames" or what Peter S, Neels and others suggested, perhaps just combining 2 changes slightly differently.
>
> This is based on my limited understanding of WC-NG, please correct me if I make incorrect assumptions.
>
> - Server will still accept both NFC and NFD, however, it will no longer accept collisions. Enforced by normalising to NFD before uniqueness checks during add operations (yes, might be more expensive). There will be no unified normalisation, but the subversion server will work like most filesystems; return what was given to it.

For compatibility, we cannot ignore existing repositories and working
copies which have filename collisions. So we cannot force Subversion
servers and clients to normalize filenames.
We must let users choose, per repository, whether filenames are
normalized or not.

-- 
(Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Thomas Åkesson <th...@akesson.cc>.

Hi,
I have been interested in this issue for a couple of years and I remember it was discussed briefly at Subconf in Germany a couple of years ago. 

Branching the thread here because I'd like to propose a different approach than Hiroaki. This proposition is not very different from the note "unicode-composition-for-filenames" or what Peter S, Neels and others suggested, perhaps just combining 2 changes slightly differently.

This is based on my limited understanding of WC-NG, please correct me if I make incorrect assumptions. 

- Server will still accept both NFC and NFD, however, it will no longer accept collisions. Enforced by normalising to NFD before uniqueness checks during add operations (yes, might be more expensive). There will be no unified normalisation, but the subversion server will work like most filesystems; return what was given to it.

- WC currently has a column containing the name as stored on server, I assume. This column will be kept, and an additional column will be added that contains the name in normalised form. This form will be NFD for all platforms, unless one is found that normalises to NFC. This column will be used on Mac OS X to identify files and on all platforms to ensure normalised uniqueness.
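
A rough sketch of the two-column idea, using Python's sqlite3 module. The
table and column names (nodes_sketch, repos_name, normalized_name) are made up
for illustration and are not the real wc.db schema, which I do not know in
detail; the UNIQUE constraint also ignores working copies that already contain
collisions:

```python
import sqlite3
import unicodedata


def nfd(name):
    return unicodedata.normalize("NFD", name)


def open_sketch_db():
    """Create an in-memory stand-in for the working copy database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE nodes_sketch ("
        " repos_name TEXT PRIMARY KEY,"            # name exactly as on the server
        " normalized_name TEXT NOT NULL UNIQUE)"   # NFD form, enforces uniqueness
    )
    return db


def record(db, repos_name):
    db.execute(
        "INSERT INTO nodes_sketch VALUES (?, ?)", (repos_name, nfd(repos_name))
    )


def lookup_from_disk(db, disk_name):
    """Find the server-side name for a name read back from the filesystem."""
    row = db.execute(
        "SELECT repos_name FROM nodes_sketch WHERE normalized_name = ?",
        (nfd(disk_name),),
    ).fetchone()
    return row[0] if row else None
```

On Mac OS X the client would call lookup_from_disk with the decomposed name
returned by the filesystem and get back the name as stored on the server.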


Preliminary analysis of side-effects below. Regarding still supporting developers that want to test both NFC and NFD, this will still work, but not in the same directory.


On 30 jan 2012, at 13:30, Stefan Sperling <st...@elego.de> wrote:

> On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
>> Hi folks!
>> 
>> I read the note about unicode compositions for filenames
>> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
>> and would like to drive the discussion.
> 
> Hi,
> 
> I am very happy to hear that you want to work towards getting this
> problem fixed. Thank you for your help!
> 
> I've just re-read the unicode-composition-for-filenames notes.
> I think they are a bit outdated. For instance, they still talk about
> the 1.6 working copy format. They also don't clearly explain the problems
> with backwards compatibility we're facing here.
> 
> We won't be able to apply your patch as it is. The problem is that
> it can break operation for some existing repositories and working
> copies.
> 
> Generally, I think that writing code that implements a solution for
> this problem is not hard, no matter what the solution is.
> The real challenge lies in finding a solution that is backwards
> compatible with existing repositories and working copies.
> 
> I will explain what I mean by giving examples below.
> But first, let's recap the basic problem, if only so others can more
> easily follow this discussion.
> 
> As you know, in Unicode, some characters can be represented in two distinct
> ways: pre-composed form (NFC) and de-composed form (NFD).
> For instance, the letter ä (a umlaut) can be represented by Unicode
> code point 0x00E4 ( ä ), which is the pre-composed form, or by code
> point 0x0061 ( a ) followed by code point 0x0308 ( ̈  ), which is the
> de-composed form.
> 
> This is a basic property of Unicode. It simply contains both ways of
> representing these characters in its character tables.
> I.e. any byte-string representation of Unicode, be it UTF-8, UTF-16,
> must also be able to represent both ways of encoding such characters.
> So when filenames are given in Unicode, a filename may contain any
> combination of NFC and NFD characters.
> 
> Because Subversion never normalises filenames to one form or the other,
> the space of all possible filenames in a Subversion repository or working
> copy contains a large amount of redundancy. There are many filenames which
> look the same to the user but differ in terms of the Unicode code points
> used to represent them.
> 
> For instance, imagine a filename containing 3 "a umlaut" characters
> and otherwise only characters from the ASCII set.
> There are 8 (2^3) different ways of representing this filename in Unicode,
> and hence 8 different UTF-8 byte strings which can be used in the repository
> or working copy to represent what is, from the user's point of view,
> the same filename.
> 
> The problem we have on Mac OS X is that when we write any of these
> 8 different byte strings to the filesystem to name the file, and later
> read the filename back from the filesystem (e.g. by opening the parent
> directory and asking for a list of files it contains), we will always
> receive the name with all "a umlaut" characters expanded to de-composed
> form.
> 
> Now, in the working copy meta data (.svn/wc.db) we can use any of 8 forms
> of the filename. If we don't use NFD for all characters in the filename,
> the filename read from disk may fail to match any name stored in meta data.
> 
> Let's simplify the discussion a bit by assuming only two possible ways
> of encoding a filename: One with all characters normalised to NFC, and
> one with all characters normalised to NFD. We don't really need to
> consider the mixed forms for the purpose of this discussion (though it
> helps to keep in mind that they exist).
> 
> So let's talk about what would happen if we applied your patch.
> 
> Let's say I have a working copy which contains filenames normalised
> to NFD, as is the case on Mac OS X. The server gets upgraded to a new
> release of Subversion which contains your patch. This means the server
> will now send all paths as NFC. Let's say there are changes made to a
> file which has 3 "a umlaut" characters in its name. When I run 'svn update'
> my client will try to find the NFC form of the name in its meta-data,
> and fail to locate it because the file was stored as NFD.

Ok. Server will not change in this regard. 

> 
> So this means your patch will break compatibility with the working copy.
> Therefore, we would need to provide an upgrade path for those working
> copies. E.g. 'svn upgrade' could be modified to normalise all filenames
> stored in the DB to NFC. Problem solved.

Upgrade would create and populate new column. 

> 
> But now comes the next problem. Given a filename in NFC which we read from
> meta data, how can we locate the corresponding on-disk file if its form
> is not NFC?

Platforms known not to normalise would use current name column. Mac and any other normaliser would use the new column. 

> We could of course rename the on-disk file. Except this
> won't work on Mac OS X unless we decide to use NFD encoding. So we could
> decide to also use NFD everywhere -- but this would break as soon as
> some other operating system decides to normalise to NFC, so it's not a
> good solution. We could also open the parent directory, read all the
> filenames within it, normalise them all, and then search the resulting
> list. This works, except if a name exists twice, once in NFC form and once
> in NFD form. We'd somehow have to solve the name collision in the
> filesystem.

This way, there will be no new issues with collisions, just the same old issues on Mac but it will no longer be possible to create new such situations. 

> 
> But well, let's assume we had a way of storing NFC in meta-data and not
> caring about the on-disk form. Now things get even more complicated.
> 
> My friend is not willing to upgrade to a new client version yet, which
> is fine because all 1.x releases of Subversion clients are supposed
> to be compatible with all 1.y releases of Subversion servers. He should
> not have to upgrade his client just because the server was upgraded.

Fine. 

> 
> In his working copy, the file name is also in NFD form. When he
> talks to the server, the server provides the name in NFC. Because he
> is using the old client the client has no way of knowing how to map
> the NFC name to its local NFD file. So we've broken backwards
> compatibility for my friend.

No problem. 

> 
> But it gets worse. Recall the filesystem name collision problem
> mentioned above. This problem can also happen in the repository
> filesystem! For instance, assume that in the repository there already
> exist two filenames, one NFD, the other NFC, but they both are actually
> the same name. This currently works fine, except on Mac OS X.
> What should be done now when the server is upgraded to normalise all paths
> to NFC? How can we still access content of the file which has the name
> in NFD form? Should one of the files be renamed in the HEAD revision?
> Or all historic revisions? Or removed from history? How do we help users
> carrying out such upgrades, without breaking existing working copies used
> by older clients which do not know anything about the NFC/NFD problem?

This solution avoids this whole mess. 

> 
> These are the questions which we'll need to answer to solve this issue.
> I honestly do not have good answers. I hope that you will find ways of
> solving these problems.
> 
> There may even be more problems hidden here which I haven't thought of yet.
> It will be quite hard to thoroughly make sure that no unforeseen problems
> will arise when this issue gets fixed one way or another. A good solution
> needs to be carefully planned, implemented, and thoroughly tested.
> 
> I think the following caveats would be acceptable if they help
> with fixing the issue:
> 
> - An upgrade path which optionally requires people to check all
>  working copies out again, when either the server or the client is upgraded.
>  Note again, this must be _optional_. Only people affected by the issue
>  should have to make this choice, e.g. by changing configuration
>  parameters from the default settings. By default, existing working
>  copies should keep working after upgrading the client or server.
>  Because imagine what would happen if an upgrade of the server broke
>  many working copies checked out from a hosting service such as
>  sourceforge.net -- not good.

No problem

> 
> - An upgrade path which requires everyone to run 'svn upgrade' on their
>  working copies in order to use the new client version, but not the
>  new server version.

Yes, will be required. 

> 
> - An upgrade path which requires people to dump/load their existing
>  repositories in order to get rid of the problem. Existing
>  repositories which are left alone should keep working as they do
>  today, with problems on Mac OS X clients but no problems on other
>  clients (anything else would cause too much breakage and confusion).
>  E.g. this step could normalise all paths in all revisions. But keep in
>  mind the problem of name collisions which can happen when the same name
>  exists as both NFC and NFD. Something needs to happen in this case to
>  resolve the problem, ideally giving users a choice about how to proceed.

No need to dump/load. Just need to rename collisions in HEAD in order to get Mac clients back into the game. 

> 
> As you can see, there is a lot of complexity involved in fixing this
> issue. I hope you aren't discouraged by this. Someone will need to
> explore the details of these problems to fix this issue. I am not convinced
> that it is impossible to fix. We'll need to be very careful about backwards
> compatibility when making decisions. But there might be ways to achieve a
> satisfying solution nonetheless.


/Thomas Å.

Re: Let's discuss about unicode compositions for filenames!

Posted by Daniel Shahaf <da...@elego.de>.

Hiroaki Nakamura wrote on Thu, Feb 09, 2012 at 07:16:57 +0900:
> 2012/2/9 Stefan Sperling <st...@elego.de>:
> >  - What happens if NFC/NFD is enabled in repository config, but the
> >   repository contains non-normalised paths (i.e. did not go through
> >   a dump/load cycle to normalise all paths)?
> 
> I think we will provide the check command for finding out:
> - whether a repository contains the same filenames of different unicode
>   normalized/unnormalized forms.
> - all filenames in a repository are NFC.
> - all filenames in a repository are NFD.
> 
> I think of an idea that we can change this config during loading cycle only,
> that is, we can specify this config as an option to load command.
> When load command finishes, the option value is saved in config.
> 
> However, administrators can cheat to change config file without loading,
> as the config file is a plain text file. So we cannot enforce this config must
> be set only by load command.
> 

How about:

- We add an svn_tristate_t bit[1] to the 'db/format' file

- The tristate is 'unknown' for 1.7 repositories

- For repositories created by 1.8, the tristate is set to either 'NFC'
  or 'NFD' at 'svnadmin create' time (before any 'svnadmin load')

- We provide an svn_fs_* API that checks the history and upgrades the
  bit from 'unknown' to either 'NFC' or 'NFD' if possible, or reports
  an error otherwise

Note that administrators aren't allowed to change the 'format' file other
than through svnadmin.
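
Roughly, the upgrade check could look like the Python sketch below. The
Normalization enum and both function names are hypothetical stand-ins for the
tristate and the svn_fs_* API, and "paths" stands for every path found in the
repository history:

```python
import unicodedata
from enum import Enum


class Normalization(Enum):
    UNKNOWN = "unknown"
    NFC = "NFC"
    NFD = "NFD"


def detect_form(paths):
    """Return NFC or NFD if every path is in that form, else raise."""
    paths = list(paths)
    for form in (Normalization.NFC, Normalization.NFD):
        if all(unicodedata.normalize(form.value, p) == p for p in paths):
            # Pure-ASCII histories satisfy both forms; NFC is tried first.
            return form
    raise ValueError("history mixes forms; cannot upgrade from 'unknown'")


def upgrade(current, paths):
    """Upgrade an 'unknown' tristate; leave an already-set value alone."""
    if current is not Normalization.UNKNOWN:
        return current
    return detect_form(paths)
```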

> Therefore I think It should be administrators' responsibility to ensure this
> config match a repository.

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

Hi, thanks for your review.

2012/2/9 Stefan Sperling <st...@elego.de>:
> Open questions:

Here I try to answer these. Of course, I welcome everyone to answer.

>
>  - How can the client retrieve the configuration from the server?
>   This is related to server-dictated configuration, see
>   http://wiki.apache.org/subversion/ServerDictatedConfiguration
>   and http://subversion.tigris.org/issues/show_bug.cgi?id=1974
>   This issue would need to be solved first.

I read those two pages and I think it can be done with server-dictated
configuration.

>
>  - What happens if NFC/NFD is enabled in repository config, but the
>   repository contains non-normalised paths (i.e. did not go through
>   a dump/load cycle to normalise all paths)?

I think we would provide a check command for finding out:
- whether a repository contains the same filename in different Unicode
  normalized/unnormalized forms.
- whether all filenames in a repository are NFC.
- whether all filenames in a repository are NFD.

One idea is to allow changing this config only during a load cycle,
that is, to specify it as an option to the load command.
When the load command finishes, the option value is saved in the config.

However, administrators can cheat and change the config file without loading,
as the config file is a plain text file. So we cannot enforce that this config
is set only by the load command.

Therefore I think it should be the administrators' responsibility to ensure
that this config matches the repository.

>
>  - How do we handle name collisions if both NFC and NFD forms exist
>   in a repository that sets the configuration to NFC or NFD?
>
>   Is an upgrade not supported in this case?

No, I think we should not support changing this config to NFC/NFD in this case.
Only unicode-normalization 'none' would be allowed.

>
>   Or will duplicate paths need to be discarded from history?
>    How can the user filter the paths, and how can the user decide
>    which path is kept?

I think we should not support these. Perhaps repository administrators
could remove one of the duplicate filenames from the repository history
and try to load again?

>
>    Or will duplicate paths be renamed throughout history?
>    How can the user rename the paths?

I think users can only normalize filenames during the load command.
Users cannot rename filenames arbitrarily.

>
> Anything else? I cannot think of more questions but there might
> be more things to consider here.



-- 
(Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/11 Branko Čibej <br...@apache.org>:
> On 11.02.2012 13:05, Hiroaki Nakamura wrote:
>> 2012/2/9 Markus Schaber <m....@3s-software.com>:
>>> Von: Stefan Sperling [mailto:stsp@elego.de]
>>> On Thu, Feb 09, 2012 at 12:20:14AM +0900, Hiroaki Nakamura wrote:
>>>>> [Upgrade options / backwards compatibility for proposed unicode normalization fix]
>>>> - Need to re-checkout existing working copies of the repository?
>>>>   => Yes, but only if config is changed from the default.
>>> Maybe this could even be avoided if newer clients (or an utility script) can "upgrade" the working copy to the normalized format.
>> Yes, if the working copy does not have filename collisions. However,
>> for compatibility,
>> we cannot let newer clients upgrade working copies automatically
>> because existing
>> working copies may have filename collisions.
>
> That's not entirely true, since we can detect the collisions in advance,
> and a partially upgraded working copy would still work
>
> From a practical point of view, it's very, very unlikely that there
> would be any such collisions in a valid working copy. People would tend
> to notice. :)

Yes, I agree wholeheartedly!
At work, I have noticed a few repositories which contain both NFC filenames
and NFD filenames. However, as far as I know, no repository has collisions.

-- 
(Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 11.02.2012 13:05, Hiroaki Nakamura wrote:
> 2012/2/9 Markus Schaber <m....@3s-software.com>:
>> Hi,
>>
>> Von: Stefan Sperling [mailto:stsp@elego.de]
>>
>> On Thu, Feb 09, 2012 at 12:20:14AM +0900, Hiroaki Nakamura wrote:
>>>> [Upgrade options / backwards compatibility for proposed unicode normalization fix]
>>> - Need to re-checkout existing working copies of the repository?
>>>   => Yes, but only if config is changed from the default.
>> Maybe this could even be avoided if newer clients (or an utility script) can "upgrade" the working copy to the normalized format.
> Yes, if the working copy does not have filename collisions. However,
> for compatibility,
> we cannot let newer clients upgrade working copies automatically
> because existing
> working copies may have filename collisions.

That's not entirely true, since we can detect the collisions in advance,
and a partially upgraded working copy would still work.

From a practical point of view, it's very, very unlikely that there
would be any such collisions in a valid working copy. People would tend
to notice. :)

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/9 Markus Schaber <m....@3s-software.com>:
> Hi,
>
> Von: Stefan Sperling [mailto:stsp@elego.de]
>
> On Thu, Feb 09, 2012 at 12:20:14AM +0900, Hiroaki Nakamura wrote:
>> > [Upgrade options / backwards compatibility for proposed unicode normalization fix]
>
>> - Need to re-checkout existing working copies of the repository?
>>   => Yes, but only if config is changed from the default.
>
> Maybe this could even be avoided if newer clients (or an utility script) can "upgrade" the working copy to the normalized format.

Yes, if the working copy does not have filename collisions. However,
for compatibility, we cannot let newer clients upgrade working copies
automatically, because existing working copies may have filename collisions.

>
> Best regards
>
> Markus Schaber



-- 
中村 弘輝 (Hiroaki Nakamura) hnakamur@gmail.com

AW: Let's discuss about unicode compositions for filenames!

Posted by Markus Schaber <m....@3s-software.com>.

Hi,

Von: Stefan Sperling [mailto:stsp@elego.de] 

On Thu, Feb 09, 2012 at 12:20:14AM +0900, Hiroaki Nakamura wrote:
> > [Upgrade options / backwards compatibility for proposed unicode normalization fix]

> - Need to re-checkout existing working copies of the repository?
>   => Yes, but only if config is changed from the default.

Maybe this could even be avoided if newer clients (or an utility script) can "upgrade" the working copy to the normalized format.

Best regards

Markus Schaber
-- 
___________________________
We software Automation.

3S-Smart Software Solutions GmbH
Markus Schaber | Developer
Memminger Str. 151 | 87439 Kempten | Germany | Tel. +49-831-54031-0 | Fax +49-831-54031-50

Email: m.schaber@3s-software.com | Web: http://www.3s-software.com 
CoDeSys internet forum: http://forum.3s-software.com
Download CoDeSys sample projects: http://www.3s-software.com/index.shtml?sample_projects

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Thu, Feb 09, 2012 at 12:20:14AM +0900, Hiroaki Nakamura wrote:
> 2012/1/30 Stefan Sperling <st...@elego.de>:
> > I think the following caveats would be acceptable if they help
> > with fixing the issue:
> >
> >  - An upgrade path which optionally requires people to check all
> >   working copies out again, when either the server or the client is upgraded.
> >   Note again, this must be _optional_. Only people affected by the issue
> >   should have to make this choice, e.g. by changing configuration
> >   parameters from the default settings. By default, existing working
> >   copies should keep working after upgrading the client or server.
> >   Because imagine what would happen if an upgrade of the server broke
> >   many working copies checked out from a hosting service such as
> >   sourceforge.net -- not good.
> >
> >  - An upgrade path which requires everyone to run 'svn upgrade' on their
> >   working copies in order to use the new client version, but not the
> >   new server version.
> >
> >  - An upgrade path which requires people to dump/load their existing
> >   repositories in order to get rid of the problem. Existing
> >   repositories which are left alone should keep working as they do
> >   today, with problems on Mac OS X clients but no problems on other
> >   clients (anything else would cause too much breakage and confusion).
> >   E.g. this step could normalise all paths in all revisions. But keep in
> >   mind the problem of name collisions which can happen when the same name
> >   exists as both NFC and NFD. Something needs to happen in this case to
> >   resolve the problem, ideally giving users a choice about how to proceed.
> 
> How about adding a config per repository for unicode-normalization?
> Possible config values are
> - none: input paths are not normalized.
> - NFC: input paths are normalized to NFC.
> - NFD: input paths are normalized to NFD.
> 
> For compatibility, repositories which don't have this config are treated as
> 'none' specified.
> 
> Clients have to look this config and will normalize paths appropriately.

Let's see how this fits the above constraints.

 - Backwards compatible by default?
   => Yes, no dump/load or re-checkout required during a normal upgrade

 - Repository can be used by old clients after changing config to
   NFC/NFD?
   => Yes, but the server will reject commits using non-normalised
      paths. This is no different to installing a pre-commit hook
      that performs this check, so that should be OK. Users of old
      clients must manually make sure that names are normalised.

 - Need to dump/load repository to fix the problem for all revisions?
   => Yes

 - Need to re-checkout existing working copies of the repository?
   => Yes, but only if config is changed from the default.

So I think this looks fine from the backwards compatibility standpoint.
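
For comparison, the check a pre-commit hook would perform is small. A sketch
in Python; rejected_paths is a hypothetical helper, and the list of changed
paths is assumed to come from something like 'svnlook changed' on the
transaction:

```python
import unicodedata


def rejected_paths(changed_paths, form):
    """Return the paths that are not already in the given form ('NFC' or 'NFD')."""
    return [p for p in changed_paths if unicodedata.normalize(form, p) != p]


# Example: a repository configured for NFC would refuse the second path.
bad = rejected_paths(["trunk/\u00e4.txt", "trunk/a\u0308.txt"], "NFC")
assert bad == ["trunk/a\u0308.txt"]
```

A hook would abort the commit whenever the returned list is non-empty, which
is what the server-side check would do for old clients as well.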

Open questions:

 - How can the client retrieve the configuration from the server?
   This is related to server-dictated configuration, see
   http://wiki.apache.org/subversion/ServerDictatedConfiguration
   and http://subversion.tigris.org/issues/show_bug.cgi?id=1974
   This issue would need to be solved first.

 - What happens if NFC/NFD is enabled in repository config, but the
   repository contains non-normalised paths (i.e. did not go through
   a dump/load cycle to normalise all paths)?

 - How do we handle name collisions if both NFC and NFD forms exist
   in a repository that sets the configuration to NFC or NFD?

   Is an upgrade not supported in this case?

   Or will duplicate paths need to be discarded from history?
    How can the user filter the paths, and how can the user decide
    which path is kept?

    Or will duplicate paths be renamed throughout history?
    How can the user rename the paths?

Anything else? I cannot think of more questions but there might
be more things to consider here.

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/1/30 Stefan Sperling <st...@elego.de>:
> I think the following caveats would be acceptable if they help
> with fixing the issue:
>
>  - An upgrade path which optionally requires people to check all
>   working copies out again, when either the server or the client is upgraded.
>   Note again, this must be _optional_. Only people affected by the issue
>   should have to make this choice, e.g. by changing configuration
>   parameters from the default settings. By default, existing working
>   copies should keep working after upgrading the client or server.
>   Because imagine what would happen if an upgrade of the server broke
>   many working copies checked out from a hosting service such as
>   sourceforge.net -- not good.
>
>  - An upgrade path which requires everyone to run 'svn upgrade' on their
>   working copies in order to use the new client version, but not the
>   new server version.
>
>  - An upgrade path which requires people to dump/load their existing
>   repositories in order to get rid of the problem. Existing
>   repositories which are left alone should keep working as they do
>   today, with problems on Mac OS X clients but no problems on other
>   clients (anything else would cause too much breakage and confusion).
>   E.g. this step could normalise all paths in all revisions. But keep in
>   mind the problem of name collisions which can happen when the same name
>   exists as both NFC and NFD. Something needs to happen in this case to
>   resolve the problem, ideally giving users a choice about how to proceed.

How about adding a config per repository for unicode-normalization?
Possible config values are
- none: input paths are not normalized.
- NFC: input paths are normalized to NFC.
- NFD: input paths are normalized to NFD.

For compatibility, repositories which don't have this config are treated as
'none' specified.

Clients have to look up this config and normalize paths accordingly.
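
A minimal sketch of what the client-side part could look like, written in
Python only for illustration. The function name normalize_path is made up,
and how the client obtains the setting from the server is left open here.

```python
import unicodedata

def normalize_path(path, setting="none"):
    """Normalise `path` according to the per-repository setting.

    A missing setting is treated as "none", so paths pass through
    unchanged for repositories that do not have the config.
    """
    if setting == "none":
        return path
    if setting in ("NFC", "NFD"):
        return unicodedata.normalize(setting, path)
    raise ValueError("unknown unicode-normalization setting: %r" % setting)

print(normalize_path("a\u0308.txt", "NFC") == "\u00e4.txt")
```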

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
> Hi folks!
>
> I read the note about unicode compositions for filenames
> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
> and would like to drive the discussion.

Hi,

I am very happy to hear that you want to work towards getting this
problem fixed. Thank you for your help!

I've just re-read the unicode-composition-for-filenames notes.
I think they are a bit outdated. For instance, they still talk about
the 1.6 working copy format. They also don't clearly explain the problems
with backwards compatibility we're facing here.

We won't be able to apply your patch as it is. The problem is that
it can break operation for some existing repositories and working
copies.

Generally, I think that writing code that implements a solution for
this problem is not hard, no matter what the solution is.
The real challenge lies in finding a solution that is backwards
compatible with existing repositories and working copies.

I will explain what I mean by giving examples below.
But first, let's recap the basic problem, if only so others can more
easily follow this discussion.

As you know, in Unicode, some characters can be represented in two distinct
ways: pre-composed form (NFC) and de-composed form (NFD).
For instance, the letter ä (a umlaut) can be represented by Unicode
code point 0x00E4 ( ä ), which is the pre-composed form, or by code
point 0x0061 ( a ) followed by code point 0x0308 ( ̈ ), which is the
de-composed form.
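
The two representations can be inspected with a few lines of Python
(used here only as a convenient illustration; the helper name utf8_hex
is made up for this example):

```python
import unicodedata

nfc = "\u00e4"    # pre-composed: LATIN SMALL LETTER A WITH DIAERESIS
nfd = "a\u0308"   # de-composed: 'a' followed by COMBINING DIAERESIS

def utf8_hex(s):
    """Return the UTF-8 encoding of s as a hex string."""
    return s.encode("utf-8").hex()

print(utf8_hex(nfc))                              # c3a4
print(utf8_hex(nfd))                              # 61cc88
print(nfc == nfd)                                 # False
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
print(unicodedata.normalize("NFD", nfc) == nfd)   # True
```

The strings differ byte for byte, yet normalising either one yields the other.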

This is a basic property of Unicode. It simply contains both ways of
representing these characters in its character tables.
I.e. any byte-string representation of Unicode, be it UTF-8 or UTF-16,
must also be able to represent both ways of encoding such characters.
So when filenames are given in Unicode, a filename may contain any
combination of NFC and NFD characters.

Because Subversion never normalises filenames to one form or the other,
the space of all possible filenames in a Subversion repository or working
copy contains a large amount of redundancy. There are many filenames which
look the same to the user but differ in terms of the Unicode code points
used to represent them.

For instance, imagine a filename containing 3 "a umlaut" characters
and otherwise only characters from the ASCII set.
There are 8 (2^3) different ways of representing this filename in Unicode,
and hence 8 different UTF-8 byte strings which can be used in the repository
or working copy to represent what is, from the user's point of view,
the same filename.
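
The count can be checked with a small Python sketch. The filename template
is an arbitrary example, and all_spellings is a helper invented for this
illustration:

```python
import itertools
import unicodedata

NFC_A = "\u00e4"    # pre-composed a umlaut
NFD_A = "a\u0308"   # de-composed a umlaut

def all_spellings(template, count):
    """Yield every spelling of `template`, where each '{}' is an a umlaut."""
    for choice in itertools.product((NFC_A, NFD_A), repeat=count):
        yield template.format(*choice)

spellings = set(all_spellings("{}b{}c{}.txt", 3))
print(len(spellings))                                             # 8
print(len({s.encode("utf-8") for s in spellings}))                # 8
print(len({unicodedata.normalize("NFC", s) for s in spellings}))  # 1
```

Eight distinct byte strings collapse to a single name once normalised.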

The problem we have on Mac OS X is that when we write any of these
8 different byte strings to the filesystem to name the file, and later
read the filename back from the filesystem (e.g. by opening the parent
directory and asking for a list of files it contains), we will always
receive the name with all "a umlaut" characters expanded to de-composed
form.

Now, in the working copy meta data (.svn/wc.db) we can use any of the 8 forms
of the filename. If we don't use NFD for all characters in the filename,
the filename read from disk may fail to match any name stored in meta data.

Let's simplify the discussion a bit by assuming only two possible ways
of encoding a filename: One with all characters normalised to NFC, and
one with all characters normalised to NFD. We don't really need to
consider the mixed forms for the purpose of this discussion (though it
helps to keep in mind that they exist).

So let's talk about what would happen if we applied your patch.

Let's say I have a working copy which contains filenames normalised
to NFD, as is the case on Mac OS X. The server gets upgraded to a new
release of Subversion which contains your patch. This means the server
will now send all paths as NFC. Let's say there are changes made to a
file which has 3 "a umlaut" characters in its name. When I run 'svn update'
my client will try to find the NFC form of the name in its meta-data,
and fail to locate it because the file was stored as NFD.

So this means your patch will break compatibility with the working copy.
Therefore, we would need to provide an upgrade path for those working
copies. E.g. 'svn upgrade' could be modified to normalise all filenames
stored in the DB to NFC. Problem solved.

But now comes the next problem. Given a filename in NFC which we read from
meta data, how can we locate the corresponding on-disk file if its form
is not NFC? We could of course rename the on-disk file. Except this
won't work on Mac OS X unless we decide to use NFD encoding. So we could
decide to also use NFD everywhere -- but this would break as soon as
some other operating system decides to normalise to NFC, so it's not a
good solution. We could also open the parent directory, read all the
filenames within it, normalise them all, and then search the resulting
list. This works, except if a name exists twice, once in NFC form and once
in NFD form. We'd somehow have to solve the name collision in the
filesystem.
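
The "read the directory, normalise, then search" idea could look roughly
like the following Python sketch. The names find_on_disk_name and
NameCollision are invented for this illustration; in real use the listing
would come from reading the parent directory (e.g. os.listdir).

```python
import unicodedata

class NameCollision(Exception):
    """Several on-disk names are canonically equivalent to the wanted name."""

def find_on_disk_name(wanted, listing):
    """Return the entry of `listing` that matches `wanted` after normalisation.

    Returns None if nothing matches, and raises NameCollision if more
    than one entry matches, which is the unsolved case described above.
    """
    key = unicodedata.normalize("NFC", wanted)
    matches = [n for n in listing if unicodedata.normalize("NFC", n) == key]
    if len(matches) > 1:
        raise NameCollision(matches)
    return matches[0] if matches else None

# Meta data holds the NFC name, the disk returned the NFD name.
print(find_on_disk_name("\u00e4.txt", ["a\u0308.txt", "readme"]) == "a\u0308.txt")
```

Note that this turns a single lookup into a scan of the whole directory.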

But well, let's assume we had a way of storing NFC in meta-data and not
caring about the on-disk form. Now things get even more complicated.

My friend is not willing to upgrade to a new client version yet, which
is fine because all 1.x releases of Subversion clients are supposed
to be compatible with all 1.y releases of Subversion servers. He should
not have to upgrade his client just because the server was upgraded.

In his working copy, the file name is also in NFD form. When he
talks to the server, the server provides the name in NFC. Because he
is using the old client the client has no way of knowing how to map
the NFC name to its local NFD file. So we've broken backwards
compatibility for my friend.

But it gets worse. Recall the filesystem name collision problem
mentioned above. This problem can also happen in the repository
filesystem! For instance, assume that in the repository there already
exist two filenames, one NFD, the other NFC, but they both are actually
the same name. This currently works fine, except on Mac OS X.
What should be done now when the server is upgraded to normalise all paths
to NFC? How can we still access content of the file which has the name
in NFD form? Should one of the files be renamed in the HEAD revision?
Or all historic revisions? Or removed from history? How do we help users
carrying out such upgrades, without breaking existing working copies used
by older clients which do not know anything about the NFC/NFD problem?

These are the questions which we'll need to answer to solve this issue.
I honestly do not have good answers. I hope that you will find ways of
solving these problems.

There may even be more problems hidden here which I haven't thought of yet.
It will be quite hard to thoroughly make sure that no unforeseen problems
will arise when this issue gets fixed one way or another. A good solution
needs to be carefully planned, implemented, and thoroughly tested.

I think the following caveats would be acceptable if they help
with fixing the issue:

- An upgrade path which optionally requires people to check all
working copies out again, when either the server or the client is upgraded.
Note again, this must be _optional_. Only people affected by the issue
should have to make this choice, e.g. by changing configuration
parameters from the default settings. By default, existing working
copies should keep working after upgrading the client or server.
Because imagine what would happen if an upgrade of the server broke
many working copies checked out from a hosting service such as
sourceforge.net -- not good.

- An upgrade path which requires everyone to run 'svn upgrade' on their
working copies in order to use the new client version, but not the
new server version.

- An upgrade path which requires people to dump/load their existing
repositories in order to get rid of the problem. Existing
repositories which are left alone should keep working as they do
today, with problems on Mac OS X clients but no problems on other
clients (anything else would cause too much breakage and confusion).
E.g. this step could normalise all paths in all revisions. But keep in
mind the problem of name collisions which can happen when the same name
exists as both NFC and NFD. Something needs to happen in this case to
resolve the problem, ideally giving users a choice about how to proceed.

As you can see, there is a lot of complexity involved in fixing this
issue. I hope you aren't discouraged by this. Someone will need to
explore the details of these problems to fix this issue. I am not convinced
that it is impossible to fix. We'll need to be very careful about backwards
compatibility when making decisions. But there might be ways to achieve a
satisfying solution nonetheless.

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 07.02.2012 15:00, Stefan Sperling wrote:
> On Tue, Feb 07, 2012 at 02:43:19PM +0100, Branko Čibej wrote:
>> The client-side mapping table is a more general solution, if a
>> lot harder to implement.
>>
>> But it brings additional benefits in that we could use it to, e.g.,
>> transliterate characters that are allowed by some file systems, but not
>> by others; for example, on Unix, file names may contain colons, but they
>> can't on Windows. We could even use the mapping table to decorate local
>> files that differ only in case on case-insensitive file systems.
> These additional benefits are great. But to avoid misunderstandings
> I'd like to point out that they are of course not required to get
> the unicode NFD/NFC problem fixed. In the context of the unicode
> NFD/NFC issue, the mapping table exists only to provide backwards
> compatibility. 

Yes, of course.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Tue, Feb 07, 2012 at 02:43:19PM +0100, Branko Čibej wrote:
> The client-side mapping table is a more general solution, if a
> lot harder to implement.
> 
> But it brings additional benefits in that we could use it to, e.g.,
> transliterate characters that are allowed by some file systems, but not
> by others; for example, on Unix, file names may contain colons, but they
> can't on Windows. We could even use the mapping table to decorate local
> files that differ only in case on case-insensitive file systems.

These additional benefits are great. But to avoid misunderstandings
I'd like to point out that they are of course not required to get
the unicode NFD/NFC problem fixed. In the context of the unicode
NFD/NFC issue, the mapping table exists only to provide backwards
compatibility.

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 07.02.2012 14:30, Hiroaki Nakamura wrote:
> 2012/2/7 Branko Čibej <br...@apache.org>:
>> On 06.02.2012 22:26, Hiroaki Nakamura wrote:
>>> The Unicode Standard says canonical equivalent sequences should be
>>> interpreted the same way.
>>> * 1.1 Canonical and Compatibility Equivalence
>>>   http://unicode.org/reports/tr15/#Canonical_Equivalence
>>> * 2.12 Equivalent Sequences and Normalization
>>>   http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf
>>>
>>> So we should not have the same name multiple times in repositories
>>> and working copies. Therefore subversion servers and clients do
>>> not need to handle them.
>> *sigh*
>>
>> I don't give a gnat's whisker what the Unicode Standard says. I'm only
>> interested in real-world situations. Or are you implying that, e.g., the
>> Unix VFS layer will magically detect file name equality of different
>> (de)normalized forms? Because it won't.
>>
>> -- Brane
>>
> I'm interested in real-world situations, too. It is the reality that
> we need to avoid the same filenames in different forms because
> they confuse users so much.
>
> I don't think we expect file systems to detect filename equality of
> different forms. Mac OS X HFS+ can have only NFD filenames
> and we must cope with it. And as you say, standard file systems
> on Linux and Windows do not magically detect file name equality
> of different forms. It is also the reality that we cannot force users to
> reformat their hard disks and change file systems.
>
> So the communication layer must take care of this problem to provide
> interoperability among Windows, Linux and Mac.
> Subversion to the rescue!

I agree with all of that. The point I was trying to make, and which
Stefan spelled out a lot better, is that the existing MacPorts/Homebrew
patch is not a real solution (that's despite the fact that I use it
myself). The client-side mapping table is a more general solution, if a
lot harder to implement.

But it brings additional benefits in that we could use it to, e.g.,
transliterate characters that are allowed by some file systems, but not
by others; for example, on Unix, file names may contain colons, but they
can't on Windows. We could even use the mapping table to decorate local
files that differ only in case on case-insensitive file systems.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/7 Branko Čibej <br...@apache.org>:
> On 06.02.2012 22:26, Hiroaki Nakamura wrote:
>> The Unicode Standard says canonical equivalent sequences should be
>> interpreted the same way.
>> * 1.1 Canonical and Compatibility Equivalence
>>   http://unicode.org/reports/tr15/#Canonical_Equivalence
>> * 2.12 Equivalent Sequences and Normalization
>>   http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf
>>
>> So we should not have the same name multiple times in repositories
>> and working copies. Therefore subversion servers and clients do
>> not need to handle them.
>
> *sigh*
>
> I don't give a gnat's whisker what the Unicode Standard says. I'm only
> interested in real-world situations. Or are you implying that, e.g., the
> Unix VFS layer will magically detect file name equality of different
> (de)normalized forms? Because it won't.
>
> -- Brane
>

I'm interested in real-world situations, too. It is the reality that
we need to avoid the same filenames in different forms because
they confuse users so much.

I don't think we expect file systems to detect filename equality of
different forms. Mac OS X HFS+ can have only NFD filenames
and we must cope with it. And as you say, standard file systems
on Linux and Windows do not magically detect file name equality
of different forms. It is also the reality that we cannot force users to
reformat their hard disks and change file systems.

So the communication layer must take care of this problem to provide
interoperability among Windows, Linux and Mac.
Subversion to the rescue!

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 06.02.2012 22:26, Hiroaki Nakamura wrote:
> The Unicode Standard says canonical equivalent sequences should be
> interpreted the same way.
> * 1.1 Canonical and Compatibility Equivalence
>   http://unicode.org/reports/tr15/#Canonical_Equivalence
> * 2.12 Equivalent Sequences and Normalization
>   http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf
>
> So we should not have the same name multiple times in repositories
> and working copies. Therefore subversion servers and clients do
> not need to handle them.

*sigh*

I don't give a gnat's whisker what the Unicode Standard says. I'm only
interested in real-world situations. Or are you implying that, e.g., the
Unix VFS layer will magically detect file name equality of different
(de)normalized forms? Because it won't.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Tue, Feb 07, 2012 at 06:26:54AM +0900, Hiroaki Nakamura wrote:
> 2012/2/6 Stefan Sperling <st...@elego.de>:
> >  2) Do something else that affects repositories, too, and provide
> >    a clean upgrade path for everyone (servers and clients).
> >    AFAIK nobody has made a suggestion as to what could be done here.
> 
> What do you mean by a clean upgrade?
> Is it clean if we do dump and load for repositories and re-checkout for
> working copies?

Yes, this is what I meant. I listed earlier in this thread what
the acceptable upgrade paths are in terms of our compatibility
guidelines: http://svn.haxx.se/dev/archive-2012-01/0427.shtml

So, the bottom line is that there is more work that needs to be done
than you've done so far to get this problem fixed. I realise that this
may be frustrating for you, and I hope that you don't give up but try
to work towards a solution that does not harm compatibility.

Please understand that we're trying to find the right balance for our
user base. Some users have the problem you want to fix, and others would
run into compatibility problems if we fixed it by applying your patch.
This is not an easy tradeoff to make for us. I would prefer if Subversion
didn't have this stupid problem on Mac OS X. But so far we have never
compromised compatibility and this is a very important property of Subversion
that we do not want to lose. Compatibility is a high priority for us.

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

2012/2/6 Stefan Sperling <st...@elego.de>:
> On Mon, Feb 06, 2012 at 02:28:40PM +0100, Branko Čibej wrote:
>> On 06.02.2012 14:10, Hiroaki Nakamura wrote:
>> > Hi, all.
>> >
>> > It seems there is no further discussion.
>> >
>> > I think the conclusion for the short term solution is:
>> > We convert unnormalized paths to NFC normalized paths on clients only,
>> > that is, svn_path_cstring_to_utf8.
>> >
>> > It is the same approach as utf8precompose_macosx_2.patch in
>> > http://subversion.tigris.org/issues/show_bug.cgi?id=2464
>> >
>> > It is proven to work as it is included in MacPorts unicode_path variant
>> > and Homebrew --unicode-path option.
>>
>> You'll note that MacPorts also warns you that using this option may
>> cause interoperability issues with other clients that aren't using it,
>> right? So this is hardly a universal solution that will not affect
>> existing users and repositories.
>
> Exactly. This is what I meant when I said that we cannot apply the
> submitted patch as it is, at the very beginning of this thread.
> The submitted patch simply copies the MacPorts solution and has
> the same compatibility problems.
>
> I think the discussion made clear that there are two ways
> to move forward:
>
>  1) Implement a client-side mapping table which maps server-provided
>    paths to local filesystem paths. It translates between one or more
>    server-side and local representations of the same path. This could
>    be done only on Mac OS X (or, preferably, only on HFS+ filesystems)
>    because only Mac OS X has problems.
>    The idea here is to not change existing paths in repositories at all,
>    no matter which way they are encoded, and to teach Mac OS X clients
>    to cope with the problem locally. This way, other existing clients
>    won't notice a difference. The only thing that won't work is to create
>    a working copy on Mac OS X which contains the same name multiple times,
>    in NFD and in some other normalised or non-normalised form.
>    This approach was suggested by Peter.

The Unicode Standard says canonical equivalent sequences should be
interpreted the same way.
* 1.1 Canonical and Compatibility Equivalence
  http://unicode.org/reports/tr15/#Canonical_Equivalence
* 2.12 Equivalent Sequences and Normalization
  http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf

So we should not have the same name multiple times in repositories
and working copies. Therefore subversion servers and clients do
not need to handle them. Rather I think we should fix subversion to
reject the same name in a different form.

To handle existing repositories and working copies, maybe we should
create a tool which checks whether repositories and working copies contain
the same name multiple times.

If they do, users must rename the files manually. In practice, I think this
is extremely rare.
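
As a rough idea of what such a checking tool would do, here is a Python
sketch that groups paths by their NFC form and reports groups with more
than one member. The function name find_collisions is made up, and
collecting the paths from a repository or working copy is not shown.

```python
import collections
import unicodedata

def find_collisions(paths):
    """Map each NFC-normalised path to the distinct spellings found for it.

    Only paths that occur in more than one spelling are returned.
    """
    groups = collections.defaultdict(list)
    for p in sorted(set(paths)):   # identical byte strings are not collisions
        groups[unicodedata.normalize("NFC", p)].append(p)
    return {key: names for key, names in groups.items() if len(names) > 1}

paths = ["\u00e4.txt", "a\u0308.txt", "readme"]
for key, names in find_collisions(paths).items():
    print(key.encode("utf-8"), [n.encode("utf-8") for n in names])
```

Printing the UTF-8 bytes lets the user see which spelling is which.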

>    We'd need either a working patch or a more detailed implementation
>    design document to move forward here.

OK. Peter, or somebody else, please give us either one of them.

>
>  2) Do something else that affects repositories, too, and provide
>    a clean upgrade path for everyone (servers and clients).
>    AFAIK nobody has made a suggestion as to what could be done here.

What do you mean by a clean upgrade?
Is it clean if we do dump and load for repositories and re-checkout for
working copies?

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Stefan Sperling <st...@elego.de>.

On Mon, Feb 06, 2012 at 02:28:40PM +0100, Branko Čibej wrote:
> On 06.02.2012 14:10, Hiroaki Nakamura wrote:
> > Hi, all.
> >
> > It seems there is no further discussion.
> >
> > I think the conclusion for the short term solution is:
> > We convert unnormalized paths to NFC normalized paths on clients only,
> > that is, svn_path_cstring_to_utf8.
> >
> > It is the same approach as utf8precompose_macosx_2.patch in
> > http://subversion.tigris.org/issues/show_bug.cgi?id=2464
> >
> > It is proven to work as it is included in MacPorts unicode_path variant
> > and Homebrew --unicode-path option.
> 
> You'll note that MacPorts also warns you that using this option may
> cause interoperability issues with other clients that aren't using it,
> right? So this is hardly a universal solution that will not affect
> existing users and repositories.

Exactly. This is what I meant when I said that we cannot apply the
submitted patch as it is, at the very beginning of this thread.
The submitted patch simply copies the MacPorts solution and has
the same compatibility problems.

I think the discussion made clear that there are two ways
to move forward:

 1) Implement a client-side mapping table which maps server-provided
    paths to local filesystem paths. It translates between one or more
    server-side and local representations of the same path. This could
    be done only on Mac OS X (or, preferably, only on HFS+ filesystems)
    because only Mac OS X has problems.
    The idea here is to not change existing paths in repositories at all,
    no matter which way they are encoded, and to teach Mac OS X clients
    to cope with the problem locally. This way, other existing clients
    won't notice a difference. The only thing that won't work is to create
    a working copy on Mac OS X which contains the same name multiple times,
    in NFD and in some other normalised or non-normalised form.
    This approach was suggested by Peter.
    We'd need either a working patch or a more detailed implementation
    design document to move forward here.

 2) Do something else that affects repositories, too, and provide
    a clean upgrade path for everyone (servers and clients).
    AFAIK nobody has made a suggestion as to what could be done here.
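
To make option 1 a bit more concrete, the mapping table could be sketched
like this in Python. This is not a design proposal. The class name PathMap
is invented, and the local filesystem form is assumed to be plain NFD,
which simplifies what HFS+ really does.

```python
import unicodedata

class PathMap:
    """Hypothetical table between repository names and local on-disk names."""

    def __init__(self, local_form="NFD"):
        self.local_form = local_form   # assumed form returned by the filesystem
        self._to_local = {}
        self._to_repos = {}

    def add(self, repos_name):
        """Record a server-provided name and return its local name."""
        local = unicodedata.normalize(self.local_form, repos_name)
        other = self._to_repos.get(local)
        if other is not None and other != repos_name:
            # The same name in two forms cannot coexist in this working copy.
            raise ValueError("collision: %r and %r" % (other, repos_name))
        self._to_local[repos_name] = local
        self._to_repos[local] = repos_name
        return local

    def to_local(self, repos_name):
        return self._to_local[repos_name]

    def to_repos(self, local_name):
        return self._to_repos[local_name]

table = PathMap()
table.add("\u00e4.txt")                             # server sent NFC
print(table.to_repos("a\u0308.txt") == "\u00e4.txt")
```

The repository name is never rewritten; only the local lookup changes, which
is why other clients would not notice a difference.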

Re: Let's discuss about unicode compositions for filenames!

Posted by Branko Čibej <br...@apache.org>.

On 06.02.2012 14:10, Hiroaki Nakamura wrote:
> Hi, all.
>
> It seems there is no further discussion.
>
> I think the conclusion for the short term solution is:
> We convert unnormalized paths to NFC normalized paths on clients only,
> that is, svn_path_cstring_to_utf8.
>
> It is the same approach as utf8precompose_macosx_2.patch in
> http://subversion.tigris.org/issues/show_bug.cgi?id=2464
>
> It is proven to work as it is included in MacPorts unicode_path variant
> and Homebrew --unicode-path option.

You'll note that MacPorts also warns you that using this option may
cause interoperability issues with other clients that aren't using it,
right? So this is hardly a universal solution that will not affect
existing users and repositories.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

Hi, all.

It seems there is no further discussion.

I think the conclusion for the short term solution is:
We convert unnormalized paths to NFC normalized paths on clients only,
that is, svn_path_cstring_to_utf8.

It is the same approach as utf8precompose_macosx_2.patch in
http://subversion.tigris.org/issues/show_bug.cgi?id=2464

It is proven to work as it is included in MacPorts unicode_path variant
and Homebrew --unicode-path option.

The difference is that this time we use utf8proc instead of Mac OS X APIs,
and we do the conversion not only on Mac but on all platforms.

Do you agree? If so, I will update my patch and post it to
http://subversion.tigris.org/issues/show_bug.cgi?id=2464

Best regards,

-- 
)Hiroaki Nakamura) hnakamur@gmail.com

Re: Let's discuss about unicode compositions for filenames!

Posted by Hiroaki Nakamura <hn...@gmail.com>.

* HFS+ is the default file system on Mac OS X, so we must support it.
  Forcing users to reformat their HDD and use another file system is not an
  option. It is much worse than merely upgrading subversion working copies.

* As for similarity to case sensitivity, there is a critical difference:
  case is preserved on all of Linux, Mac, and Windows.

  According to http://en.wikipedia.org/wiki/Filename
  case sensitivity on Windows NTFS and Mac OS X HFS+ is optional, but
  disabled by default. So we cannot have the same filename in different cases
  (again, it is not realistic for users to reformat their HDD and turn on
  the option).

  So, it is very natural. If we have "readme" on Windows, then we have
  "readme" on Mac too. We cannot have "readme" and "README" at the same
  time, but that sounds normal to users on both camps.

* "a" and "A": different characters, different looks.
               Both are easy to type in. Both are widely used.
  NFC and NFD: the same abstract characters, almost the same looks (*1)
               NFC is easy to type in. NFD is hard to type in (*2)
               On Windows, NFC is widely used, NFD almost never.
               On Mac, NFD is used only as the internal code of HFS+. The rest is NFC.
  (*1) They look the same in Explorer, but different in Command Prompt.
       Actually a Japanese NFD filename looks very weird in Command Prompt:
       there is too much space between the base character and the combining
       character. See the screenshot attached.
  (*2) I don't know of a way to type in NFD with a Japanese IME.

  http://unicode.org/reports/tr15/
  > The Unicode Standard defines two equivalences between characters:
  > canonical equivalence and compatibility equivalence. Canonical
  > equivalence is a fundamental equivalency between characters or
  > sequences of characters that represent the same abstract character,
  > and when correctly displayed should always have the same visual
  > appearance and behavior.

* As for NFC/NFD, Windows NTFS can hold the same filename in both NFC and
  NFD form. However, we don't actually do that, because it leads to confusion.
  Different cases look different to our eyes, but NFC/NFD differences
  are hard to detect. They look the same to casual users.
  Since it is very rarely necessary to have the same filename in both NFC
  and NFD, we can just treat it as an error and let users rename manually
  and try again.

* Mac OS X HFS+ can store only NFD filenames. So if we use fictitious
  examples in analogy to case differences, it goes something like this:

  Here we suppose NFC is lower case, and NFD is upper case.
  Windows and Linux can have both forms, like "readme" and "CHANGES".
  However, usually we use only lower case like "readme" and "changes",
  because lower case is simply easy to type in. We can type in upper
  case, but we need a very special skill to do that. Maybe casual users
  cannot use the SHIFT or CAPS key :-)

  Also, casual users don't bother to type in upper case, because
  "readme" and "README" look exactly the same to us. (Of course they do
  not in reality, but they do in this fictitious example.)

  If we create and check in "readme" on Windows and then check out "readme"
  on Mac, it becomes "README". That is OK for us, because it is normal.
  We always create filenames like "README", and we see "readme" and
  "README" as the same thing, so it doesn't matter.

  If we create and check in "CHANGES" on Mac and then check it out
  on Windows, we get "CHANGES". It looks almost the same as "changes",
  but it looks somewhat weird and feels unusual.

--
)Hiroaki Nakamura) hnakamur@gmail.com