Posted to commits@subversion.apache.org by Apache subversion Wiki <co...@subversion.apache.org> on 2013/01/21 23:19:33 UTC

[Subversion Wiki] Update of "NonNormalizingUnicodeCompositionAwareness" by Thomas Åkesson

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Subversion Wiki" for change notification.

The "NonNormalizingUnicodeCompositionAwareness" page has been changed by Thomas Åkesson:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness?action=diff&rev1=10&rev2=11

  
  There could be a performance impact. [Need more data] However, the 'add' operation is not one of the most frequent ones, in a typical installation.
  {{{#!wiki note
- The major impact would not stem from collision avoidance on `add` but normalization during directory search, which affects most other operations. For the server, it is probably better to store names twice (original for display and normalized for indexing) rather than normalize on every lookup.}}}
+ The major impact would not stem from collision avoidance on `add` but normalization during directory search, which affects most other operations. For the server, it is probably better to store names twice (original for display and normalized for indexing) rather than normalize on every lookup.
+ 
+ ThomasAkesson: It might be better to store names twice, but I don't see why the server needs to do normalization during directory search? That would be a client side task in this proposal. 
+ }}}
  
  It is not possible to rely on client behavior. A Subversion server can be accessed via mod_dav_svn and by older Subversion clients.
  
@@ -100, +103 @@

  
  It might be more feasible to implement such an abstraction now in wc-ng than it was in Subversion <=1.6. 
  
- TODO: This section needs input from someone more familiar with wc-ng database design.
  
- === WC Database Columns ===
+ === Alternative Approaches ===
  
- Columns of interest in wc.db:
+ There are different approaches to implementing this abstraction of paths. The following have been identified so far, each with its Wiki page:
  
-  * The repository path as stored on server: repos_path (e.g. "project/dir/file.txt")
+  * WC Database columns: UnicodeClientColumns
+  * SQLite collation: UnicodeCollation
  
-  * The local path from WC root to node: local_relpath (e.g. "dir/file.txt")
+ The following sections apply to all of the approaches above.
  
-  * The local path from WC root to node parent: parent_relpath (e.g. "dir")
- 
- All three paths are in UTF-8 but NFC/NFD is not currently specified. local_relpath/parent_relpath get converted from UTF-8 to whatever locale encoding is in use whenever they are used to access the filesystem.
- 
- Takesson: Is this conversion done on the fly every time? I am guessing this works because locale encoding is a reversible process , otherwise lookups in the database would fail?
- 
- An abstraction between the repository path and the file system path can be achieved by ensuring that there is a column in wc.db that contains the file system path in exactly the same form that the file system gives back. APIs in wc needs to be extended to ensure that all interaction with the file system is performed with the file system path.
- 
- 
- ==== Alternative 1: Redefine local_relpath ====
- 
- Redefine the existing column local_relpath to contain the path as stored in the file system. Code that currently relies on local_relpath being a substring of repos_path needs to be adjusted. E.g. a node might be considered switched when this condition is not met.
- 
- It would generally be desirable to use repos_path when referring to entries rather than local_relpath.
- 
- This alternative can be simulated using the attached script localrelpath2nfd.sh. This provides a Working Copy equivalent to what a checkout should produce if this alternative was implemented in Subversion itself:
-  * svn co ...
-  * svn stat #Shows any problematic items
-  * localrelpath2nfd.sh
-  * svn stat #Should be clean apart from misperception that some items are switched
- 
- TODO: provide a dump file with suitable test data. 
- 
- ==== Alternative 2: Introduce local_relpath_disk ====
- 
- A new column, local_relpath_disk, is added that contains the path as stored in the file system. This column will be used on all systems to interact with the file system. Currently, the content of columns local_relpath and  local_relpath_disk will be identical on all file systems except HFS+.
- 
- I guess this would require parent_relpath_disk as well?  Or would you plan to use the local_relpath==parent_relpath row to get local_relpath_disk for parent_relpath?
- 
- Takesson: thanks for pointing that out. I will update both alternatives, alt 1 redefining both and alt 2 "duplicating" both. 
  
  
  === Normalized uniqueness ===
  
- Repository path uniqueness should be checked in normalized form during add operations, in order to prevent new "normalized-name collisions" as early as possible. It might be acceptable to identify this later during commit, since it is a quite rare condition.
+ Repository path uniqueness should be checked in normalized form during add operations, in order to prevent new "normalized-name collisions" as early as possible. It might be acceptable to identify this later during commit, since very few users will encounter this condition. At the latest, it will be identified by the server (with the above change).
  
- When an existing "normalized-name collision" arrives to a Working Copy on HFS+ via checkout or update, there will be a uniqueness issue in the column local_relpath/local_relpath_disk and a situation somewhat similar to an obstruction. This should be communicated in some friendly way, similar to conflicts on case-insensitive file systems.
+ When an existing "normalized-name collision" arrives in a Working Copy on HFS+ via checkout or update, there will be a uniqueness issue in the column local_relpath (queried with collation) or in local_relpath_disk and a situation somewhat similar to an obstruction. This should be communicated in some friendly way, similar to conflicts on case-insensitive file systems.
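As an illustration only, the normalized uniqueness check described above could be sketched as follows. This is a minimal Python sketch, not Subversion code: the function name `find_normalized_collisions` and the sample file names are hypothetical, and NFC is assumed as the comparison form purely for the example.

```python
import unicodedata

def find_normalized_collisions(names, form="NFC"):
    """Return groups of names that are distinct strings but equal after normalization.

    `form` is the Unicode normalization form used only for comparison;
    the names themselves are never rewritten.
    """
    groups = {}
    for name in names:
        key = unicodedata.normalize(form, name)
        groups.setdefault(key, []).append(name)
    # A collision is a group holding more than one distinct original spelling.
    return [sorted(group) for group in groups.values() if len(set(group)) > 1]

# Hypothetical sample data: the same visible name in composed and decomposed form.
composed = "\u00c5kesson.txt"      # "Å" as a single code point (NFC)
decomposed = "A\u030akesson.txt"   # "A" followed by COMBINING RING ABOVE (NFD)
collisions = find_normalized_collisions([composed, decomposed, "readme.txt"])
```

Here `collisions` contains a single group holding both spellings, which is the condition an `add` (or, at the latest, the server) would need to reject or report.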
- 
  
  === Pristine Storage ===
  
@@ -155, +127 @@

  
  === Command Line ===
  
- When referring to WC entries using the command line on Mac OSX, the tab-completion works unreliably because the keyboard typically produces composed characters while files are NFD. The tab completion is a general Mac OSX issue which should be addressed by Apple. However, Subversion could be helpful when attempting to identify entries referred to via the command line. 
+ When referring to WC entries using the command line on Mac OSX, the tab-completion works unreliably because the keyboard typically produces composed characters while files are NFD. The tab completion is a general Mac OSX issue which should be addressed by Apple, specifically the case where the user types the beginning of a name that includes a composed character (which currently matches nothing on disk). However, Subversion could be helpful when attempting to identify entries referred to via the command line.
  
-  * Subversion must recognize paths that match the file system Unicode path (even if it does not match the repository path). Failure to do so makes tab-completion unusable.
+ * Subversion must recognize paths that match the file system Unicode path (even if it does not match the repository path). Failure to do so makes tab-completion unusable, especially on Mac OS X. 
-   * Paths on the command line should be matched against local_relpath/local_relpath_disk. 
  
-  * Subversion should as a fallback (optional) recognize paths that match the repository Unicode path. Failure to do so might make scripts less portable and might require the use of tab-completion in order to reference entries.
+ * Subversion must recognize paths that match the repository path in NFC. Failure to do so might make scripts less portable and might require the use of tab-completion in order to reference non-NFC entries (since keyboard input is typically NFC). E.g. the name of a file added on Mac OS X currently cannot be typed on other OSes (or on any OS, actually).
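The two matching rules above can be sketched roughly as follows. This is a hedged Python illustration, not an existing Subversion API: `resolve_entry` and the sample names are hypothetical, and returning None on an ambiguous match is an assumption made for the sketch.

```python
import unicodedata

def resolve_entry(typed, disk_names):
    """Return the on-disk name that `typed` refers to, or None.

    First try an exact match (what tab-completion would produce), then fall
    back to a composition-aware comparison so that NFC keyboard input can
    still find an NFD name on disk.
    """
    if typed in disk_names:
        return typed
    wanted = unicodedata.normalize("NFC", typed)
    matches = [n for n in disk_names
               if unicodedata.normalize("NFC", n) == wanted]
    # More than one match means an existing normalized-name collision.
    return matches[0] if len(matches) == 1 else None

# Hypothetical directory listing as HFS+ would return it (decomposed).
disk = ["A\u030akesson.txt", "readme.txt"]
found = resolve_entry("\u00c5kesson.txt", disk)  # typed with a composed "Å"
```

In this sketch the typed NFC name resolves to the decomposed name stored on disk, without either side being rewritten.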
  
+ 
+ === Hashtables in WC-NG ===
+ 
+ Bert has mentioned expected issues related to hashtables. 
+ 
+ TODO: Please elaborate on when they are used and approximately where in the codebase. 
+ 
+ 
- === Subcommand Changes ===
+ === Subcommand Status ===
  
- Specific changes to svn subcommands are outlined below. 
+ Current issues with svn subcommands related to Unicode composition are outlined below.
  
- All commands that access files in the Working Copy must do so by getting the path from the column local_relpath/local_relpath_disk. 
+ The investigations below were made on svn 1.7.x.
  
- TODO: Investigate which subcommands currently use local_relpath for other purposes than accessing the file. With alternative 1 (above), it will NOT be acceptable to use local_relpath for comparison/substring operations with other paths, e.g. repos_path.
- 
- 
- ==== Checkout/Update ====
+ ==== Checkout ====
  
+ Completes, but creates a "broken" WC, see Status below. 
- When adding paths to the WC, determine the actual filesystem path and store that in local_relpath/local_relpath_disk. This is actually only required on OSX. How can this be done? 
-  * Do we get a handle back from the filesystem after creating a file/dir that can be queried for the path?
-  * Use platform dependent APIs to establish the expected path.
-  * Alternatively, first look for the exact same path (will find the one on most filesystems) then fall back to globbing with Unicode composition aware comparison.
  
- TODO: Do we need to process paths that are not actually checked out due to the depth setting?
+ ==== Update ====
  
+ Issues are related to the status issues when reporting the WC. Other issues?
  
  ==== Status ====
  
- The status subcommand incorrectly reports externals when manually adjusting local_relpath to match the filesystem.
+ The status subcommand reports one unversioned and one missing entry for each non-NFD entry on Mac OS X. This reflects the general WC issues with HFS+.
  
- TODO: Clarify if status performs string comparisons between local_relpath and some other path.
  
- TODO: how does status show a file whose name changed to a value that canonicalizes to the same value as the original name? (is that possible?)
+ ==== Add ====
  
- ==== Add and mkdir ====
+ Works and creates an entry with the same composition as on disk. 
  
  Since this approach does not dictate a Normalized repository storage, the add subcommand should not perform any normalization.
  
- The uniqueness test should be Unicode aware to avoid a "normalized-name collision". This is not vital but desirable for better usability (has no effect on Mac OSX since it is not possible to create such collisions).
  
- TODO: Anything else?
+ ==== mkdir ====
+ 
+ TODO: Test. Suspect this might fail.
  
  
  ==== Commit ====
  
+ Seems to work. 
- No specific changes expected.
- 
- TODO: Confirm.
- 
- ==== Changelist ====
- 
- Changelists should use repos_path to refer to entries, unless already the case.
- 
  
  ==== ... ====
  
@@ -224, +191 @@

  
  {{{#!wiki note
  In a URL there are several different parts: the hostname, the <Location> (httpd only), the repository relpath(ra_svn) or basename(ra_dav with SVNParentPath), and the fspath.  Some of them might also be subject to canonicalization issues (eg: repos basename as handled by Mac mod_dav_svn).
+ 
+ ThomasAkesson: Can we accept the limitation to not have decomposable characters in these parts? They are defined by administrators while paths inside repositories are defined by users. 
  }}}
  
  == Use Cases ==