Non-normalizing Unicode Composition Awareness
Context
In Unicode, some characters can be represented in more than one way (composed/decomposed, canonical ordering, etc.) while rendering identically on screen or in print. A Unicode string (e.g. a file name) can be in a normalized form (NFC/NFD) or mixed (not normalized).
The majority of file systems (e.g. NTFS, Ext3) accept a Unicode filename in any form, store it, and return it in the form in which it was input. These file systems will typically even accept multiple files whose paths look identical on screen but whose Unicode strings differ due to character composition.
A minority of file systems (currently Mac OS X HFS+ only) will normalize the paths. In the case of HFS+, the path will be normalized into NFD and it will be given back that way when listing the filesystem.
Most significant differences from the majority of filesystems:
- A file that is stored with an NFC or mixed name will not be returned with an identical name. This is generally considered a negative effect of the HFS+ Unicode implementation.
- Multiple files whose names render identically cannot be stored in the same directory. This is often considered an advantage.
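The difference between the two forms can be made concrete with a short Python sketch. It uses only the standard `unicodedata` module; the file name and the helper `same_name` are illustrative assumptions, and standard NFD is used as an approximation of what HFS+ returns.

```python
import unicodedata

# The same visible name: "é" as one precomposed code point (NFC form)
# and as "e" followed by a combining acute accent (NFD form).
composed = "caf\u00e9.txt"
decomposed = "cafe\u0301.txt"


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # False: different code point sequences
print(len(composed), len(decomposed))   # 8 9
print(same_name(composed, decomposed))  # True: identical once normalized
# A normalizing file system would hand back the decomposed spelling:
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

A non-normalizing file system treats `composed` and `decomposed` as two distinct names, which is why both can exist side by side in one directory.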
The topic has been described here: http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
- This RFC is not as complete in all areas, and depends on this note for additional context and issue description.
- This RFC proposes a solution very similar to the note's solution 4, "Client and server-side path comparison routines". However, here it is proposed as a long term solution.
- This RFC is essentially identical to what Erik H. proposes in this thread:
http://svn.haxx.se/dev/archive-2010-09/0319.shtml
Issue Description
- Subversion and most file systems currently allow creation of multiple paths, which in normalized form are identical. Hereafter referred to as "normalized-name collisions". This could cause significant upgrade issues for repositories containing such collisions, depending on which solution is implemented. See section "Legacy Data".
- Users have difficulty understanding and managing "normalized-name collisions". It is difficult to know which file is which and one of the paths is typically not possible to type on a keyboard.
- Mac OS X clients can not interoperate with non-OSX clients when paths contain composed Unicode characters (added by a non-OSX client). The working copies report status issues directly after checkout/update on OSX. Tracked by: Bug 2464
Differences to case-sensitivity
- NFC/NFD look the same when rendered on screen.
- Different case can be controlled with the keyboard, while Unicode composition is more difficult.
- Most modern case-insensitive file systems are case-preserving, i.e. they do not normalize to a preferred form and always return the same form that was stored. Normalizing file systems do not preserve the paths.
Similarities to case-sensitivity
- If two Unicode strings differ only by letter case, on some computer systems they refer to the same file, while on other systems they refer to different files. The same applies if two Unicode strings differ only by composition. The rules are set by each file system.
- Subversion inter-operates with different systems. When two file names that differ only by letter case are transferred from a case-sensitive system to a case-insensitive system, they will collide and Subversion should handle this in some friendly way. The same applies if two file names differ only by composition.
To Normalize or Not to Normalize
Whether or not to normalize within a Subversion repository (server-side) has been debated. The note (unicode-composition-for-filenames) considers normalization to NFC to be the long term (2.x) solution. That approach is hereafter referred to as "repository normalization".
There are implementation advantages with normalized paths which can simplify comparisons and storage.
There are also reasons not to normalize:
- A file system is generally expected to give back exactly what was stored, or refuse up-front. HFS+ has been criticized for not living up to this expectation, which is also the reason the Svn WC has issues on HFS+. Subversion can be considered a sort of file system, and could therefore be expected to live up to this expectation.
- Compatibility is a high priority for Subversion. Introducing normalization, translation or similar transformations is likely to introduce compatibility issues, now or later. There is a principle that Subversion should not be a limiting factor or impose undue limitations on allowed characters, file names etc.
- Introducing normalization tends to complicate the upgrade process, especially for repositories that contain "normalized-name collisions". This is one of the reasons this very issue has not been addressed.
However, there is very little reason to allow the creation of new "normalized-name collisions". There are no known use-cases for creating multiple files in the same directory that would have identical normalized paths. Subversion should preferably refuse such add operations as early as possible, at the latest during commit. This feature is hereafter referred to as "uniqueness normalization".
Solution Overview
There are 2 components of this solution, one server side and one client side. These can be addressed individually, which is an important requirement for Subversion 1.x interoperability between client and server versions.
This solution does not normalize paths in the repository. Paths are only normalized for the purpose of comparisons.
Server Changes
The Subversion server should no longer accept adding paths that cause "normalized-name collisions". The comparison with existing paths (and other paths in the same txn) should be performed in normalized form. However, the paths created in the repository will keep the form input by the client.
There could be a performance impact. [Need more data] However, in a typical installation the 'add' operation is not one of the most frequent ones.
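The add check described above could look roughly like the following Python sketch. It is not Subversion code: `find_add_collision` and `apply_adds` are hypothetical names, a directory is modelled as a plain list of entry names, and NFC is assumed as the comparison form.

```python
import unicodedata


def normalized_key(name: str) -> str:
    """Form used for comparison only; it is never stored."""
    return unicodedata.normalize("NFC", name)


def find_add_collision(existing_names, new_name):
    """Return the existing name that collides with new_name, or None."""
    key = normalized_key(new_name)
    for name in existing_names:
        if normalized_key(name) == key:
            return name
    return None


def apply_adds(existing_names, adds):
    """Check each add against the directory and earlier adds in the same txn."""
    accepted = list(existing_names)
    rejected = []
    for name in adds:
        clash = find_add_collision(accepted, name)
        if clash is None:
            accepted.append(name)  # kept exactly as the client sent it
        else:
            rejected.append((name, clash))
    return accepted, rejected


accepted, rejected = apply_adds(["cafe\u0301.txt"], ["caf\u00e9.txt", "readme.txt"])
print(accepted)   # the decomposed original plus readme.txt
print(rejected)   # the composed spelling, paired with the name it collides with
```

Only the comparison uses the normalized key. The accepted names keep whatever composition the client supplied, which is the difference from repository normalization.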
The major impact would not stem from collision avoidance on add but from normalization during directory search, which affects most other operations. For the server, it is probably better to store names twice (original for display and normalized for indexing) rather than normalize on every lookup.
ThomasAkesson: It might be better to store names twice, but I don't see why the server needs to do normalization during directory search? That would be a client side task in this proposal.
It is not possible to rely on client behavior. A Subversion server can be accessed via mod_dav_svn and by older Subversion clients.
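If names were stored twice, as suggested in the discussion above, a directory could be indexed by the normalized name while keeping the original spelling as the value. The Python sketch below only illustrates that idea. `DirIndex` is a hypothetical class, not an existing Subversion structure, and NFC is assumed as the index form.

```python
import unicodedata


class DirIndex:
    """Directory entries indexed by NFC name, keeping the original spelling."""

    def __init__(self):
        self._by_key = {}  # normalized name -> name as originally stored

    @staticmethod
    def _key(name: str) -> str:
        return unicodedata.normalize("NFC", name)

    def add(self, name: str) -> None:
        key = self._key(name)
        if key in self._by_key:
            raise ValueError("normalized-name collision with %r" % self._by_key[key])
        self._by_key[key] = name

    def lookup(self, name: str):
        """Return the stored spelling for any composition of name, or None."""
        return self._by_key.get(self._key(name))


index = DirIndex()
index.add("cafe\u0301.txt")                       # stored decomposed
print(index.lookup("caf\u00e9.txt") == "cafe\u0301.txt")  # True
```

The normalized key is computed once per add and once per lookup argument instead of once per stored entry. A one-to-one mapping like this cannot represent legacy directories that already contain collisions; those would need a list of names per key.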
The desired server behavior can be accomplished with Subversion 1.7 or earlier using a pre-commit hook, but it is desirable to have "uniqueness normalization" as the future default behavior.
If enforced during load, this check would make it impossible to load a dump of a repository with "normalized-name collisions". An important advantage of this proposal compared to normalizing approaches is that there is no requirement to process legacy data (see below for a discussion on 'svn mv' as cleanup tool). During loading of dump files, the normalized comparison should therefore be disabled, either by default or via a switch, e.g. --ignore-utf-normalize.
Client Changes
The Working Copy needs an abstraction between the repository path provided by the server and the actual file system path. This is required for normalizing file systems (HFS+) regardless of whether the Subversion server performs normalization to NFC (repository normalization) or just enforces "uniqueness normalization".
It might be more feasible to implement such an abstraction now in wc-ng than it was in Subversion <=1.6.
Alternative Approaches
There are different approaches to implementing this abstraction of paths. The following have been identified so far, each with its Wiki page:
- WC Database columns: UnicodeClientColumns
- SQLite collation: UnicodeCollation
The following sections are applicable to all above approaches.
Normalized uniqueness
Repository path uniqueness should be checked in normalized form during add operations, in order to prevent new "normalized-name collisions" as early as possible. It might be acceptable to identify this later during commit, since very few users will encounter this condition. At the latest, it will be identified by the server (with the above change).
When an existing "normalized-name collision" arrives in a Working Copy on HFS+ via checkout or update, there will be a uniqueness issue in the column local_relpath (queried with collation) or in local_relpath_disk, and a situation somewhat similar to an obstruction. This should be communicated in some friendly way, similar to conflicts on case-insensitive file systems.
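The uniqueness issue with a collation-based column can be illustrated with Python's `sqlite3` module. This is a minimal sketch under stated assumptions: the table `nodes_demo` is a hypothetical single-column stand-in for the real wc.db schema, and the collation name `unicode_nfc` is invented for the example.

```python
import sqlite3
import unicodedata


def nfc_collation(a: str, b: str) -> int:
    """Order two strings by their NFC forms; 0 means 'same name'."""
    ka = unicodedata.normalize("NFC", a)
    kb = unicodedata.normalize("NFC", b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("unicode_nfc", nfc_collation)
# Hypothetical, heavily simplified stand-in for the working copy table.
conn.execute(
    "CREATE TABLE nodes_demo (local_relpath TEXT COLLATE unicode_nfc UNIQUE)"
)

conn.execute("INSERT INTO nodes_demo VALUES (?)", ("caf\u00e9.txt",))
try:
    # Same name in decomposed form: rejected by the UNIQUE constraint.
    conn.execute("INSERT INTO nodes_demo VALUES (?)", ("cafe\u0301.txt",))
    collided = False
except sqlite3.IntegrityError:
    collided = True

# A lookup with the decomposed spelling still finds the stored composed row.
row = conn.execute(
    "SELECT local_relpath FROM nodes_demo WHERE local_relpath = ?",
    ("cafe\u0301.txt",),
).fetchone()
print(collided, row[0] == "caf\u00e9.txt")
```

The stored value keeps its original composition; only comparisons and the uniqueness check go through the collation. The `IntegrityError` is the point at which a friendlier, obstruction-like message would have to be produced.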
Pristine Storage
Since svn 1.7, pristines are stored based on the SHA1 checksum of their contents, independently of their name. There should be very little impact.
Command Line
When referring to WC entries on the command line on Mac OS X, tab completion works unreliably because the keyboard typically produces composed characters while file names on disk are NFD. This is a general Mac OS X issue which should be addressed by Apple, specifically the case where the user types the beginning of a name including a composed character (which currently matches nothing on disk). However, Subversion could be helpful when attempting to identify entries referred to via the command line.
- Subversion must recognize paths that match the file system Unicode path (even if it does not match the repository path). Failure to do so makes tab-completion unusable, especially on Mac OS X.
- Subversion must recognize paths that match the repository path in NFC. Failure to do so might make scripts less portable and might require the use of tab-completion in order to reference non-NFC entries (since keyboard input is typically NFC). For example, the name of a file added on Mac OS X currently cannot be typed on other OSes (or, in practice, on any OS).
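Both requirements above amount to resolving a command-line argument against the versioned names by normalized comparison. The following Python sketch shows one way this could work. `resolve_entry` is a hypothetical helper, and preferring an exact match before falling back to NFC comparison is an assumption of this example, not something the proposal specifies.

```python
import unicodedata


def resolve_entry(argument, repo_names):
    """Map a command-line name to the versioned name it refers to.

    An exact match wins; otherwise names are compared in NFC. Returns None
    when nothing matches and raises ValueError when the argument is
    ambiguous (an existing normalized-name collision).
    """
    if argument in repo_names:
        return argument
    key = unicodedata.normalize("NFC", argument)
    matches = [n for n in repo_names if unicodedata.normalize("NFC", n) == key]
    if not matches:
        return None
    if len(matches) > 1:
        raise ValueError("ambiguous name, candidates: %r" % (matches,))
    return matches[0]


# Repository name in decomposed form, e.g. as added from Mac OS X.
names = ["Ma\u0308rz.txt", "readme.txt"]
# Keyboard input is typically composed (NFC):
print(resolve_entry("M\u00e4rz.txt", names) == "Ma\u0308rz.txt")  # True
```

The same function covers the opposite direction: a decomposed argument produced by tab completion resolves to a composed repository name, because both sides are normalized before comparison.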
Hashtables in WC-NG
Bert has mentioned expected issues related to hashtables.
TODO: Please elaborate on when they are used and approximately where in the codebase.
Subcommand Status
Current issues with svn subcommands related to Unicode composition are outlined below.
The investigations below were made on svn 1.7.x.
Checkout
Completes, but creates a "broken" WC, see Status below.
Update
Issues are related to the status issues when reporting the WC. Other issues?
Status
The status subcommand reports one unversioned and one missing entry for each non-NFD path on Mac OS X. This reflects the general WC issues with HFS+.
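The following Python sketch is a simplified model of why this happens; it is not Subversion's actual status code. The function names are hypothetical, and the on-disk listing is simulated by applying standard NFD to the versioned names as an approximation of HFS+ behavior.

```python
import unicodedata


def nfc(name: str) -> str:
    return unicodedata.normalize("NFC", name)


def naive_status(versioned, on_disk):
    """Byte-wise comparison, as done without composition awareness."""
    missing = sorted(set(versioned) - set(on_disk))
    unversioned = sorted(set(on_disk) - set(versioned))
    return missing, unversioned


def aware_status(versioned, on_disk):
    """Same comparison, but names are matched in NFC."""
    disk_keys = {nfc(n) for n in on_disk}
    versioned_keys = {nfc(n) for n in versioned}
    missing = sorted(n for n in versioned if nfc(n) not in disk_keys)
    unversioned = sorted(n for n in on_disk if nfc(n) not in versioned_keys)
    return missing, unversioned


versioned = ["caf\u00e9.txt", "readme.txt"]  # names as stored in the repository
# Simulated directory listing from a normalizing file system:
on_disk = [unicodedata.normalize("NFD", n) for n in versioned]

print(naive_status(versioned, on_disk))  # one missing and one unversioned entry
print(aware_status(versioned, on_disk))  # ([], [])
```

In the naive comparison the composed repository name is reported missing and its decomposed on-disk twin is reported unversioned, which matches the pair of entries seen in `svn status`.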
Add
Works and creates an entry with the same composition as on disk.
Since this approach does not dictate normalized repository storage, the add subcommand should not perform any normalization.
mkdir
TODO: Test. Suspect this might fail.
Commit
Seems to work.
...
TODO: More subcommands requiring attention?
Externals
External definitions are required to exactly match the Unicode URL. This is a currently existing requirement which is easily worked around using copy-paste. It would be difficult to look up the repository URL in a Unicode composition aware manner.
On Mac OS X it will be necessary to determine the actual filesystem path for the target, much like during update/checkout.
On Mac OS X it will not be possible to define externals that cause a "normalized-name collision".
In a URL there are several different parts: the hostname, the <Location> (httpd only), the repository relpath(ra_svn) or basename(ra_dav with SVNParentPath), and the fspath. Some of them might also be subject to canonicalization issues (eg: repos basename as handled by Mac mod_dav_svn).
ThomasAkesson: Can we accept the limitation to not have decomposable characters in these parts? They are defined by administrators while paths inside repositories are defined by users.
Use Cases
- Interoperability between Mac OS X and non-OSX Subversion clients: after checkout/update from a repository containing NFC/mixed Unicode paths (added by a non-OSX client), an OS X user will receive a fully functional WC where all normal operations can be performed. Tracked by: Bug 2464
- It will no longer be possible to add paths that look like duplicates but use different Unicode composition. It is highly unlikely anyone is relying on this.
Legacy Data
- This change will cause no problems when upgrading existing repositories even if they contain "normalized-name collisions".
- If "normalized-name collisions" exist in HEAD, a checkout on Mac OS X will still fail after an upgrade, but potentially with a better error message. This issue is very similar to case-collisions on case-insensitive file systems. The detection code is similar and the same friendly error message can potentially be used.
- These "normalized-name collisions" can be resolved in HEAD via "svn mv SRC_URL DST_URL". Historical revisions will still be difficult to check out from Mac OS X.
- Working Copies will be upgraded in the same way as any other wc-ng upgrade with SQL schema changes.
- Working Copies on Mac OS X that are broken before upgrade might require a fresh check out.
- Consequently, no transformation of paths in wc.db is required. With alternative 2 above, local_relpath should be copied to local_relpath_disk.
- No identification of "normalized-name collisions" is required. Normal users should not be bothered with such maintenance tasks.