Posted to dev@impala.apache.org by Jin Chul Kim <ji...@gmail.com> on 2017/12/19 02:12:24 UTC

[DISCUSS: IMPALA-3282] Character escapes in regular expressions

Hi,

I would like to discuss some issues before taking the ticket, which asks for
a new built-in function (e.g. string regex_escape(string_pattern)). The
purpose of the function is to escape a set of special characters by
replacing the string pattern with their escaped characters.

1. Define candidates of escaped characters
When I researched how other languages handle escaping, I found some
interesting differences between them.

We should decide on our own set of escaped characters. Here is a summary of
the discussion linked below:

- Perl: Escapes every character that is not a word character (i.e. not in
[A-Za-z_0-9]).
- PHP: Escapes the following special characters: . \ + * ? [ ^ ] $ ( ) { }
= ! < > | : -
- Python: Same as Perl's approach, but the character underscore is no
longer escaped since version 3.3.
- Ruby: Escapes the following special characters: [ ] { } ( ) | - * . \ ? +
^ $ #
Ruby escapes the comment character (#), but does not escape the
context-sensitive characters (: <).
- Java: A different approach. Java's Pattern.quote wraps the string in "\Q"
and "\E" so that it is treated "as if it were a literal pattern".
- C#: Escapes the following special characters: \ * + ? | { [ ( ) ^ $ . #
whitespace
C# does not escape ] and }.

See the discussion if you want to see more details:
https://github.com/benjamingr/RegExp.escape/blob/master/data/other_languages/discussions.md
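
To make the difference between the two main strategies concrete, here is a
minimal Python sketch. The helper names escape_non_word and escape_fixed_set
are hypothetical (they are not Impala or library functions); the first
follows the Perl-style rule above and the second uses the PHP character list
above.

```python
import re

# Hypothetical helpers for illustration only; not Impala functions.
# PHP-style list from the summary above: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
PHP_STYLE_SET = set(r".\+*?[^]$(){}=!<>|:-")


def escape_non_word(s):
    """Perl-style: backslash every character outside [A-Za-z_0-9]."""
    return re.sub(r"([^A-Za-z_0-9])", r"\\\1", s)


def escape_fixed_set(s, specials=PHP_STYLE_SET):
    """PHP-style: backslash only the characters in a fixed set."""
    return "".join("\\" + ch if ch in specials else ch for ch in s)


sample = "a.b c_d"
print(escape_non_word(sample))   # a\.b\ c_d  (the space is escaped too)
print(escape_fixed_set(sample))  # a\.b c_d   (the space is left alone)
```

The Perl-style rule is simpler to state but escapes characters such as the
space that have no special meaning in most contexts; the fixed-set rule
keeps the output readable but has to track the engine's syntax.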

2. Built-in function name
The reporter proposed "regex_escape". I think the function name is
intuitive and self-explanatory. Please suggest a better name if you have one.

3. Signature of the built-in function
Do we have to extend the function signature? I guess a user may want to
pass a set of customized characters.

regex_escape(string_pattern, [delimiter])

delimiter
  := "[^A-Za-z0-9]"
  | "[.\?\[^()\]{}=!<>|:-]"

"[^A-Za-z0-9]" means "escape non-alphanumeric characters"
"[.\?\[^()\]{}=!<>|:-]" means "escape the specified characters"
Within delimiter, the characters [ and ] must themselves be escaped.
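
As a rough illustration of the two-argument form, here is a Python sketch.
It assumes that delimiter is a regular-expression character class and that
the first form is meant as the negated class [^A-Za-z0-9]. The default value
and the use of Python's re module are assumptions for this sketch, not part
of the proposal.

```python
import re


def regex_escape(pattern, delimiter="[^A-Za-z0-9]"):
    """Backslash-escape each character of `pattern` matched by `delimiter`.

    `delimiter` is assumed to be a character class such as "[^A-Za-z0-9]".
    """
    # A function replacement avoids backslash handling in a template string.
    return re.sub(delimiter, lambda m: "\\" + m.group(0), pattern)


print(regex_escape("1+1=2?"))
# 1\+1\=2\?
print(regex_escape("1+1=2?", r"[.\?\[^()\]{}=!<>|:-]"))
# 1+1\=2\?   ('+' is not in the second class, so it is left alone)
```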

Best regards,
Jinchul

Re: [DISCUSS: IMPALA-3282] Character escapes in regular expressions

Posted by Jim Apple <jb...@cloudera.com>.
> The
> purpose of the function is to escape a set of special characters by
> replacing the string pattern with their escaped characters.

I'm not 100% clear on what this means - what is the end goal? Is it
that the result of regex_escape contains no special regex semantics
and therefore matches only strings that contain it exactly?

If so, it seems like this forces our hand - we should escape exactly
the regex characters present in our engine, which I think is RE2.
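
The property described above (the escaped pattern matches only the exact
input string) can be written down as a small check. This sketch uses
Python's re.escape and re.fullmatch as a stand-in engine; RE2 syntax differs
in places, so it only illustrates the goal. The helper name
matches_only_itself is hypothetical.

```python
import re


def matches_only_itself(literal, others):
    """True if the escaped literal matches itself and none of `others`."""
    escaped = re.escape(literal)
    if re.fullmatch(escaped, literal) is None:
        return False
    return all(re.fullmatch(escaped, o) is None for o in others if o != literal)


literal = "a.c(1)*"
# Unescaped, the string acts as a pattern and matches other inputs:
print(re.fullmatch(literal, "abc") is not None)                  # True
# Escaped, it matches only the exact string:
print(matches_only_itself(literal, ["abc", "a.c", "a.c11"]))     # True
```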

> Do we have to extend the function signature? I guess a user may want to
> pass a set of customized characters.
>
> regex_escape(string_pattern, [delimiter])

I'm not so sure that users will want that. It seems pretty specific.

Re: [DISCUSS: IMPALA-3282] Character escapes in regular expressions

Posted by Jin Chul Kim <ji...@gmail.com>.
Hi,

I've pushed an initial change: https://gerrit.cloudera.org/#/c/8900/
The change contains only the essential features:
- Function name: regexp_escape
- Takes a string as an input parameter and returns the escaped string.
- Escapes the following special characters: ".*\\+?^[](){}$!=:-#\n\r\t\v "
(The double quotes are not part of the set; they are used only so that the
trailing space is visible.)
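
For readers who want to see the behavior, here is a Python sketch of
escaping with exactly the set listed above. It is not the code from the
Gerrit change. It assumes the list is read as a C-style string literal (so
\\ is one backslash and \n, \r, \t, \v are single control characters) and
that each listed character is escaped by prefixing a backslash. The name
regexp_escape_sketch is hypothetical.

```python
# Assumption: the list from the message, read as a C-style string literal.
SPECIALS = ".*\\+?^[](){}$!=:-#\n\r\t\v "


def regexp_escape_sketch(s):
    """Prefix a backslash to every character of `s` that is in SPECIALS."""
    out = []
    for ch in s:
        if ch in SPECIALS:
            out.append("\\")
        out.append(ch)
    return "".join(out)


print(regexp_escape_sketch("a.b c"))  # a\.b\ c
print(regexp_escape_sketch("x|y"))    # x|y
```

As the second call shows, the list as written does not contain |, so this
sketch passes it through unchanged.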

Best regards,
Jinchul
