You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@subversion.apache.org by Daniel Shahaf <d....@daniel.shahaf.name> on 2016/03/15 01:08:38 UTC

'svn log --search': forcing case sensitivity? (was: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c)

kotkov@apache.org wrote on Fri, Feb 19, 2016 at 22:11:11 -0000:
> Author: kotkov
> Date: Fri Feb 19 22:11:11 2016
> New Revision: 1731300
> 
> URL: http://svn.apache.org/viewvc?rev=1731300&view=rev
> Log:
> Make svn log --search case-insensitive.
> 
> Use utf8proc to do the normalization and locale-independent case folding
> (UTF8PROC_CASEFOLD) for both the search pattern and the input strings.
> 
> Related discussion is in http://svn.haxx.se/dev/archive-2013-04/0374.shtml
> (Subject: "log --search test failures on trunk and 1.8.x").
> 
> +++ subversion/trunk/subversion/svn/log-cmd.c Fri Feb 19 22:11:11 2016
> @@ -38,6 +38,7 @@
> @@ -110,6 +111,24 @@
> +/* Return TRUE if STR matches PATTERN. Else, return FALSE. Assumes that
> + * PATTERN is a UTF-8 string normalized to form C with case folding
> + * applied. Use BUF for temporary allocations. */
> +static svn_boolean_t
> +match(const char *pattern, const char *str, svn_membuf_t *buf)
> +{
> +  svn_error_t *err;
> +
> +  err = svn_utf__normalize(&str, str, strlen(str), TRUE /* casefold */, buf);
> +  if (err)
> +    {
> +      /* Can't match invalid data. */
> +      svn_error_clear(err);
> +      return FALSE;
> +    }
> +
> +  return apr_fnmatch(pattern, str, 0) == APR_SUCCESS;

Should there be a command-line flag to disable casefolding?

E.g., to allow users to grep for identifiers (function/variable/file
names) using their exact case?  Do people who use 'log --search' need it
to be case-sensitive?  (I don't use 'log --search' often.)

Even if casefolding is disabled, we should still apply Unicode
normalization to form C.

Cheers,

Daniel

P.S. This patch introduces a minor behaviour change: before this patch,
the search pattern «foo[A-z]bar» would match the log message «foo_bar»,
whereas after this change it would not.  (This is because the pattern is
now casefolded between being passed to APR, and '_' is between 'A'
and 'z' but not between 'A' and 'Z', when compared as C chars.)  I doubt
anyone will notice this behaviour change; I'm just mentioning it for
completeness.

> +}

Re: 'svn log --search': forcing case sensitivity?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Branko Čibej wrote on Tue, Mar 15, 2016 at 06:13:28 +0100:
> On 15.03.2016 01:08, Daniel Shahaf wrote:
> > kotkov@apache.org wrote on Fri, Feb 19, 2016 at 22:11:11 -0000:
> >> +/* Return TRUE if STR matches PATTERN. Else, return FALSE. Assumes that
> >> + * PATTERN is a UTF-8 string normalized to form C with case folding
> >> + * applied. Use BUF for temporary allocations. */
> >> +static svn_boolean_t
> >> +match(const char *pattern, const char *str, svn_membuf_t *buf)
> >> +{
> >> +  svn_error_t *err;
> >> +
> >> +  err = svn_utf__normalize(&str, str, strlen(str), TRUE /* casefold */, buf);
> >> +  if (err)
> >> +    {
> >> +      /* Can't match invalid data. */
> >> +      svn_error_clear(err);
> >> +      return FALSE;
> >> +    }
> >> +
> >> +  return apr_fnmatch(pattern, str, 0) == APR_SUCCESS;
> > Should there be a command-line flag to disable casefolding?
> >
> > E.g., to allow users to grep for identifiers (function/variable/file
> > names) using their exact case?  Do people who use 'log --search' need it
> > to be case-sensitive?  (I don't use 'log --search' often.)
> 
> I'd prefer to keep things simple.
>

Fair enough.  I was concerned that users might perceive removing
case-sensitive search as a regression.

(I don't like having new knobs any more than you do.)

> And as I recall, this whole discussion began because apr_fnmatch
> doesn't like non-ASCII characters?

s/non-ASCII/multibyte/, but yes.

> > Even if casefolding is disabled, we should still apply Unicode
> > normalization to form C.
> 
> There's no particular reason it has to be form C, as long as both the
> pattern and the string are normalized to the same form.

Agreed, that's what I meant to say.

> Using form D is possibly even a bit faster, since that's the internal
> 32-bit representation used by utf8proc. It's a pity we don't have
> a 32-bit-char fnmatch implementation.

> Still, as you note below, normalizing a glob pattern isn't entirely
> trivial to do correctly.
> 
> > P.S. This patch introduces a minor behaviour change: before this patch,
> > the search pattern «foo[A-z]bar» would match the log message «foo_bar»,
> > whereas after this change it would not.  (This is because the pattern is
> > now casefolded between being passed to APR, and '_' is between 'A'
> > and 'z' but not between 'A' and 'Z', when compared as C chars.)  I doubt
> > anyone will notice this behaviour change; I'm just mentioning it for
> > completeness.
> 
> Mmhh ... this is what comes of 'obviously trivial' solutions. :)

The important thing is that there is no _other_ apr_fnmatch() syntax
that changes meaning through case-folding the pattern, at least in
apr-1.5 with flags==0.

Cheers,

Daniel

Re: 'svn log --search': forcing case sensitivity?

Posted by Branko Čibej <br...@apache.org>.
On 15.03.2016 01:08, Daniel Shahaf wrote:
> kotkov@apache.org wrote on Fri, Feb 19, 2016 at 22:11:11 -0000:
>> Author: kotkov
>> Date: Fri Feb 19 22:11:11 2016
>> New Revision: 1731300
>>
>> URL: http://svn.apache.org/viewvc?rev=1731300&view=rev
>> Log:
>> Make svn log --search case-insensitive.
>>
>> Use utf8proc to do the normalization and locale-independent case folding
>> (UTF8PROC_CASEFOLD) for both the search pattern and the input strings.
>>
>> Related discussion is in http://svn.haxx.se/dev/archive-2013-04/0374.shtml
>> (Subject: "log --search test failures on trunk and 1.8.x").
>>
>> +++ subversion/trunk/subversion/svn/log-cmd.c Fri Feb 19 22:11:11 2016
>> @@ -38,6 +38,7 @@
>> @@ -110,6 +111,24 @@
>> +/* Return TRUE if STR matches PATTERN. Else, return FALSE. Assumes that
>> + * PATTERN is a UTF-8 string normalized to form C with case folding
>> + * applied. Use BUF for temporary allocations. */
>> +static svn_boolean_t
>> +match(const char *pattern, const char *str, svn_membuf_t *buf)
>> +{
>> +  svn_error_t *err;
>> +
>> +  err = svn_utf__normalize(&str, str, strlen(str), TRUE /* casefold */, buf);
>> +  if (err)
>> +    {
>> +      /* Can't match invalid data. */
>> +      svn_error_clear(err);
>> +      return FALSE;
>> +    }
>> +
>> +  return apr_fnmatch(pattern, str, 0) == APR_SUCCESS;
> Should there be a command-line flag to disable casefolding?
>
> E.g., to allow users to grep for identifiers (function/variable/file
> names) using their exact case?  Do people who use 'log --search' need it
> to be case-sensitive?  (I don't use 'log --search' often.)

I'd prefer to keep things simple. And as I recall, this whole discussion
began because apr_fnmatch doesn't like non-ASCII characters?

> Even if casefolding is disabled, we should still apply Unicode
> normalization to form C.

There's no particular reason it has to be form C, as long as both the
pattern and the string are normalized to the same form. Using form D is
possibly even a bit faster, since that's the internal 32-bit
representation used by utf8proc. It's a pity we don't have a 32-bit-char
fnmatch implementation.

Still, as you note below, normalizing a glob pattern isn't entirely
trivial to do correctly.

> P.S. This patch introduces a minor behaviour change: before this patch,
> the search pattern «foo[A-z]bar» would match the log message «foo_bar»,
> whereas after this change it would not.  (This is because the pattern is
> now casefolded between being passed to APR, and '_' is between 'A'
> and 'z' but not between 'A' and 'Z', when compared as C chars.)  I doubt
> anyone will notice this behaviour change; I'm just mentioning it for
> completeness.

Mmhh ... this is what comes of 'obviously trivial' solutions. :)

-- Brane

Re: 'svn log --search': forcing case sensitivity?

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Julian Foad <ju...@gmail.com> writes:

>> Should there be a command-line flag to disable casefolding?
>
> No. Not unless we also want to provide all the other kinds of control for
> specialist searching, such as regexes, specifying which of the metadata
> fields to look in, boolean operators, etc. -- and I think it would be silly
> for the project to go that way. Like Brane said, keep it simple.

+1

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> P.S. This patch introduces a minor behaviour change: before this patch,
> the search pattern «foo[A-z]bar» would match the log message «foo_bar»,
> whereas after this change it would not.  (This is because the pattern is
> now casefolded between being passed to APR, and '_' is between 'A'
> and 'z' but not between 'A' and 'Z', when compared as C chars.)  I doubt
> anyone will notice this behaviour change; I'm just mentioning it for
> completeness.

That's a good point.  However, I am not sure that there is an easy way
to solve this (IMHO, edge) case.


Regards,
Evgeny Kotkov

Re: 'svn log --search': forcing case sensitivity?

Posted by Julian Foad <ju...@gmail.com>.
Daniel Shahaf wrote:
> Should there be a command-line flag to disable casefolding?

No. Not unless we also want to provide all the other kinds of control 
for specialist searching, such as regexes, specifying which of the 
metadata fields to look in, boolean operators, etc. -- and I think it 
would be silly for the project to go that way. Like Brane said, keep it 
simple.

- Julian