Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2016/09/23 21:13:20 UTC

[jira] [Commented] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly

    [ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517589#comment-15517589 ] 

Josh Rosen commented on SPARK-17647:
------------------------------------

I think that the first case is clearly a bug (and I have a fix), but I'm not so sure about the second case. Consider:

{code}
scala> ".*\\\\\\\\.*".r.findFirstIn("\\\\")
res8: Option[String] = Some(\\)
{code}

In a regular expression, two backslashes denote an escaped backslash. Setting Java strings aside for a moment, consider writing by hand a regex that matches a single backslash character: within a regex, a backslash acts as the escape character, so you need two consecutive backslashes. When we encode that handwritten two-backslash regex as a Java string literal, we need an additional layer of escaping because the backslash is also the escape character in Java string literals, which yields four consecutive backslashes.
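
To make the two layers concrete, here is a minimal sketch in plain Scala (no Spark involved). The object and value names are made up for illustration. It only shows how many characters each literal holds at runtime and which forms {{java.util.regex}} accepts as a pattern for one backslash.

```scala
import java.util.regex.Pattern

// Illustrative names only; nothing here comes from the Spark code base.
object BackslashLayers {
  // Input to match against: a single backslash character.
  // The source literal needs two backslashes because of string escaping.
  val oneBackslash: String = "\\"

  // Regex that matches one backslash. On paper it is two characters;
  // as a normal string literal each of them is escaped again, giving four.
  val regexSource: String = "\\\\"

  // A triple-quoted literal skips the string-escaping layer, so it shows
  // the regex exactly as it would be written on paper.
  val rawRegexSource: String = """\\"""

  // Pattern.quote builds a pattern that matches its argument literally,
  // which avoids writing the regex-level escape by hand.
  val quotedSource: String = Pattern.quote(oneBackslash)

  val matchesEscaped: Boolean = oneBackslash.matches(regexSource)
  val matchesQuoted: Boolean = oneBackslash.matches(quotedSource)
}
```

Here {{regexSource}} and {{rawRegexSource}} are the same two-character string, and both {{matchesEscaped}} and {{matchesQuoted}} evaluate to true.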

One illustration of this is the fact that the Java string literal {code}"\\"{code} is not considered a valid regex:

{code}
scala> "\\".r
java.util.regex.PatternSyntaxException: Unexpected internal error near index 1
\
 ^
  at java.util.regex.Pattern.error(Pattern.java:1955)
  at java.util.regex.Pattern.compile(Pattern.java:1702)
  at java.util.regex.Pattern.<init>(Pattern.java:1351)
  at java.util.regex.Pattern.compile(Pattern.java:1028)
  at scala.util.matching.Regex.<init>(Regex.scala:191)
  at scala.collection.immutable.StringLike$class.r(StringLike.scala:284)
  at scala.collection.immutable.StringOps.r(StringOps.scala:29)
  at scala.collection.immutable.StringLike$class.r(StringLike.scala:273)
  at scala.collection.immutable.StringOps.r(StringOps.scala:29)
  ... 28 elided
{code}

The second example returns {{true}} on MySQL.

On MySQL, running {code}select '\\' rlike '\\'{code} will fail with a syntax error because the pattern, once the SQL string literal is unescaped, is a lone backslash that is interpreted as a trailing escape character rather than as a backslash literal, while {code}select '\\' rlike '\\\\'{code} will return true.
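
The MySQL behavior can be approximated with a small Scala model. This is only a sketch under stated assumptions: {{unescapeSqlLiteral}} and {{rlike}} are hypothetical helpers, the unescaping rule is deliberately simplified (a backslash followed by any character collapses to that character), and {{java.util.regex}} stands in for MySQL's regex engine. It is not how MySQL or Spark implement this.

```scala
import java.util.regex.Pattern
import scala.util.Try

// Hypothetical model of "SQL literal unescaping, then regex compilation".
object SqlEscapeLayers {
  // Simplified assumption: inside a quoted SQL literal, a backslash followed
  // by any character stands for that character. Real SQL dialects also treat
  // sequences such as \n specially; that is ignored here.
  def unescapeSqlLiteral(body: String): String =
    body.replaceAll("""\\(.)""", "$1")

  // Both arguments are the text between the single quotes of a SQL literal.
  // The result is a Failure if the unescaped pattern is not a valid regex.
  def rlike(subjectLiteral: String, patternLiteral: String): Try[Boolean] = Try {
    val subject = unescapeSqlLiteral(subjectLiteral)
    val pattern = unescapeSqlLiteral(patternLiteral)
    Pattern.compile(pattern).matcher(subject).find()
  }

  // select '\\' rlike '\\': the pattern unescapes to a lone backslash,
  // which is not a valid regex, so this is a Failure.
  val trailingEscape: Try[Boolean] = rlike("""\\""", """\\""")

  // select '\\' rlike '\\\\': the pattern unescapes to two backslashes,
  // i.e. a regex matching one backslash, so this is Success(true).
  val escapedBackslash: Try[Boolean] = rlike("""\\""", """\\\\""")
}
```

Under the same model, the second query from the issue description (a four-backslash subject and a pattern with eight backslashes between the two {{.*}}) also yields Success(true), since the pattern unescapes to a regex matching two backslashes and the subject unescapes to two backslashes.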

> SQL LIKE/RLIKE do not handle backslashes correctly
> --------------------------------------------------
>
>                 Key: SPARK-17647
>                 URL: https://issues.apache.org/jira/browse/SPARK-17647
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Xiangrui Meng
>              Labels: correctness
>
> Try the following in SQL shell:
> {code}
> select '\\\\' like '%\\%';
> select '\\\\' rlike '.*\\\\\\\\.*';
> {code}
> The first returned false and the second returned true. Both are wrong.
> cc: [~yhuai] [~joshrosen]


