You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Bert Huijben <be...@qqmail.nl> on 2016/02/20 12:09:02 UTC

RE: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c


> -----Original Message-----
> From: kotkov@apache.org [mailto:kotkov@apache.org]
> Sent: vrijdag 19 februari 2016 23:11
> To: commits@subversion.apache.org
> Subject: svn commit: r1731300 - in /subversion/trunk/subversion:
> include/private/svn_utf_private.h libsvn_repos/dump.c
> libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c
> tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c
> 
> Author: kotkov
> Date: Fri Feb 19 22:11:11 2016
> New Revision: 1731300
> 
> URL: http://svn.apache.org/viewvc?rev=1731300&view=rev
> Log:
> Make svn log --search case-insensitive.
> 
> Use utf8proc to do the normalization and locale-independent case folding
> (UTF8PROC_CASEFOLD) for both the search pattern and the input strings.
> 
> Related discussion is in http://svn.haxx.se/dev/archive-2013-04/0374.shtml
> (Subject: "log --search test failures on trunk and 1.8.x").
> 
> * subversion/include/private/svn_utf_private.h
>   (svn_utf__normalize): Add new boolean argument to perform case folding.

Usually it is far more efficient to perform the comparison on the unnormalized strings using the apis, than to normalize and perform the operation later.  I'm not sure if utf8proc supports this feature though

But I'm wondering why you added this feature to an existing function?

I don't think it is recommended practice to perform the normalization this way and adding a boolean to an existing function makes it easier to do perform things in a not recommended way.



Locale independent case folding is not that well defined... Things like the Turkish 'i' that doesn't fold, so any decision on that makes it locale dependent. (n this case probably by choosing not Turkish, but that doesn't make it 'locale independent'.

Just folding the western European characters is much easier to explain/document.

	Bert 


Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Posted by Stefan Sperling <st...@elego.de>.
On Sun, Feb 21, 2016 at 02:22:45PM +0100, Branko Čibej wrote:
> Personally I'd much prefer the svn_utf__casefold() you propose (i.e.,
> normalize plus casefold) as a separate API. Internally, it can be
> implemented with that extra flag, but even for a private API, I think
> it's better to make each function do one thing.

+1

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Evgeny Kotkov <ev...@visualsvn.com> writes:

> We can drop the `normalize' argument, since keeping denormalized strings
> around is dangerous and unnecessary, but I'd leave the other two and let the
> caller specify the wanted behavior:
>
>     svn_error_t *
>     svn_utf__xfrm(const char **result,
>                   const char *str,
>                   apr_size_t len,
>                   svn_boolean_t case_insensitive,
>                   svn_boolean_t accent_insensitive,
>                   svn_membuf_t *buf);
>
> I attached the patch that does that.  What do you think?

I committed the patch in r1735614.


Regards,
Evgeny Kotkov

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Branko Čibej <br...@apache.org> writes:

> The big question here is what we'll use the API for. Currently we have a
> 'normalize' function that's used by svn_fs_verify (IIRC). Since we're
> talking about a funciton that transforms a UTF-8 string to a shape
> suitable for stuff-insensitive comparison, we could follow the example
> of the standard strxfrm() -> svn_utf__xfrm(); but if that's too ugly, my
> preference is for svn_utf__fold().
>
> However, I'd not add arguments for normalization/case folding/etc; I'd
> just make this function DTRT without any additional flags, because
> otherwise we'll always be second-guessing the correct invocation.

One use case that I keep in mind is doing server-side search or filtering,
where a client tells the server what kind of comparison and matching she
expects to get.

The strxfrm() function doesn't define the transformation in terms of
preserving case or diacritical marks.  Hence, we can't have svn_utf__xfrm()
doing the right thing for svn log --search, as that would mean that a
libsvn_subr function controls the behavior of the command-line client.
And while a private function somewhere around svn.c could be doing that,
hardcoding this kind of behavior in libsvn_subr doesn't sound proper to me.

We can drop the `normalize' argument, since keeping denormalized strings
around is dangerous and unnecessary, but I'd leave the other two and let the
caller specify the wanted behavior:

    svn_error_t *
    svn_utf__xfrm(const char **result,
                  const char *str,
                  apr_size_t len,
                  svn_boolean_t case_insensitive,
                  svn_boolean_t accent_insensitive,
                  svn_membuf_t *buf);

I attached the patch that does that.  What do you think?


Regards,
Evgeny Kotkov

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Posted by Branko Čibej <br...@apache.org>.
On 24.02.2016 14:30, Evgeny Kotkov wrote:
> Branko Čibej <br...@apache.org> writes:
>> Instead of relying on the Unicode spec, I propose a different approach:
>> to treat accented letters as if they don't have diacriticals at all.
>> This should be fairly easy to do with utf8proc: in the intermediate,
>> 32-bit NFD string, remove any character that's in the
>> combining-diacritical group, and then convert the result to NFC UTF-8.
>> I've done this before with fairly good results; it's also much easier to
>> explain this behaviour to users than to tell them, "read the Unicode spec".
> I see that utf8proc has UTF8PROC_STRIPMARK flag that does something
> similar to what you describe.  The difference is that this option strips the
> codepoints that fall into either Mn (Nonspacing_Mark), Mc (Spacing_Mark) or
> Me (Enclosing_Mark) categories [1].
>
> Although that's more than just removing the characters that are marked as
> Combining Diacritical Marks [2,3,4,5], I am thinking that we could just use
> this flag.  How does this cope with what you propose?

This is probably even better than just removing combining diacriticals,
because it should work well with non-latin/cyrillic scripts, too.

> Another question is about exposing this ability in the API.  I'd say that we
> could do something like this:
>
>   svn_utf__transform(svn_boolean_t normalize,
>                      svn_boolean_t casefold,
>                      svn_boolean_t remove_diacritics)
>
>   (or maybe svn_utf__map / svn_utf__alter / svn_utf__fold?)
>
> Do you have an opinion or suggestions about that?

The big question here is what we'll use the API for. Currently we have a
'normalize' function that's used by svn_fs_verify (IIRC). Since we're
talking about a funciton that transforms a UTF-8 string to a shape
suitable for stuff-insensitive comparison, we could follow the example
of the standard strxfrm() -> svn_utf__xfrm(); but if that's too ugly, my
preference is for svn_utf__fold().

However, I'd not add arguments for normalization/case folding/etc; I'd
just make this function DTRT without any additional flags, because
otherwise we'll always be second-guessing the correct invocation.

If there's a use case for case-folding vs. non-case folding, then make
two functions: svn_utf__xfrm and svn_utf__xfrm_casefold.

(Again, obviously, all of these -- including svn_utf__normalize -- need
only one private impltmentation in the source.)

-- Brane

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Branko Čibej <br...@apache.org> writes:

> Personally I'd much prefer the svn_utf__casefold() you propose (i.e.,
> normalize plus casefold) as a separate API. Internally, it can be
> implemented with that extra flag, but even for a private API, I think
> it's better to make each function do one thing.

After giving it more thought, I agree that a separate API is a better choice
here.  For now, I added svn_utf__casefold() in r1732152.

> Instead of relying on the Unicode spec, I propose a different approach:
> to treat accented letters as if they don't have diacriticals at all.
> This should be fairly easy to do with utf8proc: in the intermediate,
> 32-bit NFD string, remove any character that's in the
> combining-diacritical group, and then convert the result to NFC UTF-8.
> I've done this before with fairly good results; it's also much easier to
> explain this behaviour to users than to tell them, "read the Unicode spec".

I see that utf8proc has UTF8PROC_STRIPMARK flag that does something
similar to what you describe.  The difference is that this option strips the
codepoints that fall into either Mn (Nonspacing_Mark), Mc (Spacing_Mark) or
Me (Enclosing_Mark) categories [1].

Although that's more than just removing the characters that are marked as
Combining Diacritical Marks [2,3,4,5], I am thinking that we could just use
this flag.  How does this cope with what you propose?

Another question is about exposing this ability in the API.  I'd say that we
could do something like this:

  svn_utf__transform(svn_boolean_t normalize,
                     svn_boolean_t casefold,
                     svn_boolean_t remove_diacritics)

  (or maybe svn_utf__map / svn_utf__alter / svn_utf__fold?)

Do you have an opinion or suggestions about that?

[1] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
[2] http://www.unicode.org/charts/PDF/U0300.pdf
[3] http://www.unicode.org/charts/PDF/U1AB0.pdf
[4] http://www.unicode.org/charts/PDF/U1DC0.pdf
[5] http://www.unicode.org/charts/PDF/U20D0.pdf


Regards,
Evgeny Kotkov

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Posted by Branko Čibej <br...@apache.org>.
On 20.02.2016 22:16, Evgeny Kotkov wrote:
> Branko Čibej <br...@apache.org> writes:
>
>> Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS'
>> are not — whereas the latter should be equivalent in German, but I doubt
>> very much that utf8proc does that right. Case-insensitive comparison
>> must *always* be done in the context of a well-defined locale. Anything
>> that calls itself "locale-independent" is likely to be wrong in a really
>> huge number of cases.
> The Unicode Standard (Section 3.13 Default Case Algorithms) is quite clear
> on how case-insensitive matching should be done [1]:
>
>     Default caseless matching is the process of comparing two strings for
>     case-insensitive equality. The definitions of Unicode Default Caseless
>     Matching build on the definitions of Unicode Default Case Folding.
>
>        Default Caseless Matching uses full case folding:
>
>        A string X is a caseless match for a string Y if and only if:
>        toCasefold(X) = toCasefold(Y)
>
>        toCasefold(X): Map each character C in X to Case_Folding(C).
>
>        Case_Folding(C) uses the mappings with the status field value “C” or
>        “F” in the data file CaseFolding.txt in the Unicode Character Database.
>
>     When comparing strings for case-insensitive equality, the strings should
>     also be normalized for most correct results.
>
> The behavior we get with this patch is well-defined and follows the spec,
> since we normalize and fold the case of the strings with utf8proc.  (The
> UTF8PROC_CASEFOLD flag results in full C + F case folding as per [2],
> omitting special case T.)

It may be well-defined and follow the spec, but the important question
IMO is whether the behaviour makes sense for our users. Bert mentioned
Turkish (i ∼ İ and I ∼ ı), I mentioned German, where ß ∼ SS, and French,
where ò ∼ O but that doesn't hold in Italian, where ò ∼ Ò ... these are
just off the top of my head, there must be tens or even hundreds of
examples where locale-independent case folding can give unexpected results.

Instead of relying on the Unicode spec, I propose a different approach:
to treat accented letters as if they don't have diacriticals at all.
This should be fairly easy to do with utf8proc: in the intermediate,
32-bit NFD string, remove any character that's in the
combining-diacritical group, and then convert the result to NFC UTF-8.
I've done this before with fairly good results; it's also much easier to
explain this behaviour to users than to tell them, "read the Unicode spec".

>>> But I'm wondering why you added this feature to an existing function?
>>>
>>> I don't think it is recommended practice to perform the normalization this
>>> way and adding a boolean to an existing function makes it easier to do
>>> perform things in a not recommended way.
>> Adding flags that drastically change the semantics of a function is just
>> broken API design, period.
> I don't think that we expose this functionality in a broken way.  There aren't
> that many options to choose from, since we need to perform the normalization
> and the case folding in a single call to utf8proc, with appropriate flags set.
> We could add an svn_utf__casefold() function that does both, but I'd rather
> prefer what we have now.
>
> After all, the maintainers of utf8proc expose its features in a quite similar
> fashion [3] — with a normalize_string(..., casefold=true/false) function.

Heh, quoting bad examples doesn't make the API change any better. :)

Personally I'd much prefer the svn_utf__casefold() you propose (i.e.,
normalize plus casefold) as a separate API. Internally, it can be
implemented with that extra flag, but even for a private API, I think
it's better to make each function do one thing.

-- Brane

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Branko Čibej <br...@apache.org> writes:

> Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS'
> are not — whereas the latter should be equivalent in German, but I doubt
> very much that utf8proc does that right. Case-insensitive comparison
> must *always* be done in the context of a well-defined locale. Anything
> that calls itself "locale-independent" is likely to be wrong in a really
> huge number of cases.

The Unicode Standard (Section 3.13 Default Case Algorithms) is quite clear
on how case-insensitive matching should be done [1]:

    Default caseless matching is the process of comparing two strings for
    case-insensitive equality. The definitions of Unicode Default Caseless
    Matching build on the definitions of Unicode Default Case Folding.

       Default Caseless Matching uses full case folding:

       A string X is a caseless match for a string Y if and only if:
       toCasefold(X) = toCasefold(Y)

       toCasefold(X): Map each character C in X to Case_Folding(C).

       Case_Folding(C) uses the mappings with the status field value “C” or
       “F” in the data file CaseFolding.txt in the Unicode Character Database.

    When comparing strings for case-insensitive equality, the strings should
    also be normalized for most correct results.

The behavior we get with this patch is well-defined and follows the spec,
since we normalize and fold the case of the strings with utf8proc.  (The
UTF8PROC_CASEFOLD flag results in full C + F case folding as per [2],
omitting special case T.)

>> But I'm wondering why you added this feature to an existing function?
>>
>> I don't think it is recommended practice to perform the normalization this
>> way and adding a boolean to an existing function makes it easier to do
>> perform things in a not recommended way.
>
> Adding flags that drastically change the semantics of a function is just
> broken API design, period.

I don't think that we expose this functionality in a broken way.  There aren't
that many options to choose from, since we need to perform the normalization
and the case folding in a single call to utf8proc, with appropriate flags set.
We could add an svn_utf__casefold() function that does both, but I'd rather
prefer what we have now.

After all, the maintainers of utf8proc expose its features in a quite similar
fashion [3] — with a normalize_string(..., casefold=true/false) function.

[1] http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf
[2] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
[3] https://julia.readthedocs.org/en/latest/stdlib/strings/#Base.normalize_string


Regards,
Evgeny Kotkov

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Posted by Branko Čibej <br...@apache.org>.
On 20.02.2016 12:09, Bert Huijben wrote:
>
>> -----Original Message-----
>> From: kotkov@apache.org [mailto:kotkov@apache.org]
>> Sent: vrijdag 19 februari 2016 23:11
>> To: commits@subversion.apache.org
>> Subject: svn commit: r1731300 - in /subversion/trunk/subversion:
>> include/private/svn_utf_private.h libsvn_repos/dump.c
>> libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c
>> tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c
>>
>> Author: kotkov
>> Date: Fri Feb 19 22:11:11 2016
>> New Revision: 1731300
>>
>> URL: http://svn.apache.org/viewvc?rev=1731300&view=rev
>> Log:
>> Make svn log --search case-insensitive.
>>
>> Use utf8proc to do the normalization and locale-independent case folding
>> (UTF8PROC_CASEFOLD) for both the search pattern and the input strings.
>>
>> Related discussion is in http://svn.haxx.se/dev/archive-2013-04/0374.shtml
>> (Subject: "log --search test failures on trunk and 1.8.x").
>>
>> * subversion/include/private/svn_utf_private.h
>>   (svn_utf__normalize): Add new boolean argument to perform case folding.
> Usually it is far more efficient to perform the comparison on the unnormalized strings using the apis, than to normalize and perform the operation later.  I'm not sure if utf8proc supports this feature though
>
> But I'm wondering why you added this feature to an existing function?
>
> I don't think it is recommended practice to perform the normalization this way and adding a boolean to an existing function makes it easier to do perform things in a not recommended way.

Adding flags that drastically change the semantics of a function is just
broken API design, period.

> Locale independent case folding is not that well defined... Things like the Turkish 'i' that doesn't fold, so any decision on that makes it locale dependent. (n this case probably by choosing not Turkish, but that doesn't make it 'locale independent'.
>
> Just folding the western European characters is much easier to explain/document.

Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS'
are not — whereas the latter should be equivalent in German, but I doubt
very much that utf8proc does that right. Case-insensitive comparison
must *always* be done in the context of a well-defined locale. Anything
that calls itself "locale-independent" is likely to be wrong in a really
huge number of cases.

Furthermore, this fix is not in any way related to the discussion that
the log message points to ... it's pretty clear that the bug is in
apr_fnmatch and should be fixed there.

In light of the above, I'd like to see some more discussion about a
realistic solution to the problem. So-called "locale-independent" case
folding isn't locale-independent, and I don't see how it can solve the
original problem without introducing a bunch of new bugs. It's scary
enough that the test cases don't try anything other than accented latin
characters.

-- Brane