Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:15:12 UTC

[jira] [Resolved] (SPARK-22183) Inconsistency in LIKE escaping between literal values and column-based ones

     [ https://issues.apache.org/jira/browse/SPARK-22183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-22183.
----------------------------------
    Resolution: Incomplete

> Inconsistency in LIKE escaping between literal values and column-based ones
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-22183
>                 URL: https://issues.apache.org/jira/browse/SPARK-22183
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Adrien Lavoillotte
>            Priority: Minor
>              Labels: bulk-closed
>
> I'm trying to implement auto-escaping for {{LIKE}} expressions, in order to have filters & join conditions like:
> * Column A's value contains column B's
> * Column A's value contains some literal string
> So I need to escape the {{LIKE}}-significant characters {{%}} and {{_}}. Since Spark SQL does not support {{LIKE expr ESCAPE char}}, I need to escape using \, and presumably also escape \ itself (twice in the case of literals, since '\\' represents a single \).
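For reference, the escaping step itself can be sketched as a small Scala helper (hypothetical, not part of any Spark API): it backslash-escapes {{%}}, {{_}} and {{\}} so that the result matches the input literally under {{LIKE}}.

```scala
// Hypothetical helper: escape the LIKE metacharacters (% and _) plus the
// escape character itself, so a value matches itself literally under LIKE.
def escapeLike(s: String): String = s.flatMap {
  case c @ ('\\' | '%' | '_') => Seq('\\', c) // prefix with a backslash
  case c                      => Seq(c)       // everything else unchanged
}
```

For literal patterns, the result must then additionally go through the SQL string-literal escaping, which is where the inconsistency reported here appears.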
> But it seems that a string literal in a {{LIKE}} expression does not get quite the same escaping as other string literals or as non-literal {{LIKE}} patterns, depending on whether the left-hand side and/or right-hand side is a literal or a column.
> Note: I'm using triple-quoted strings below to avoid Scala-level \ escaping.
> On Spark 2.2.0:
> {code}
> // both LHS & RHS literals
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+---------+
> |  \|\ LIKE \\|
> +---+---------+
> |  \|     true|
> +---+---------+
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> org.apache.spark.sql.AnalysisException: the pattern '\' is invalid, it is not allowed to end with the escape character;
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:53)
>   at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.compile(regexpExpressions.scala:50)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.pattern(regexpExpressions.scala:53)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.nullSafeEval(regexpExpressions.scala:56)
>   at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
>   ...
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+-------------+
> |a\b|a\b LIKE a\\b|
> +---+-------------+
> |a\b|         true|
> +---+-------------+
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> org.apache.spark.sql.AnalysisException: the pattern 'a\b' is invalid, the escape character is not allowed to precede 'b';
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
>   ...
> // test data
> spark.sql("""SELECT * FROM test""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> |  Ok|  ok|
> | a_b| a_b|
> | aab| a_b|
> | c%d| c%d|
> |caad| c%d|
> |e\nf|e\nf|
> | e
> f|e\nf|
> +----+----+
> // both column-based
> // not escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.sql.AnalysisException: the pattern 'e\nf' is invalid, the escape character is not allowed to precede 'n';
> 	at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
> 	at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
> 	at org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
> 	...
> // escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 2.2:
> * Escaping \ in the RHS is mandatory when the RHS is not a literal; without it we get an error. So far so good.
> * The same applies when the LHS is also a literal.
> * If the RHS is a literal and the LHS is a column, it seems that the string-literal escaping of \ interferes with the {{LIKE}} escaping, since re-escaping \ no longer matches. I would expect to need \\\\ between quotes to match a single \ (one level of escaping for the string literal, one for the {{LIKE}} escaping).
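The strict 2.2 behaviour can be approximated with a simplified pattern-to-regex translation, loosely modelled on {{StringUtils.escapeLikeRegex}} (a sketch for illustration, not Spark's actual code): a backslash may only escape {{\}}, {{%}} or {{_}}, and anything else is rejected with the errors seen above.

```scala
import java.util.regex.Pattern

// Simplified sketch of the Spark 2.2-style strict LIKE-to-regex translation;
// loosely modelled on StringUtils.escapeLikeRegex, not the actual code.
def escapeLikeRegexStrict(pattern: String): String = {
  val sb = new StringBuilder("(?s)")
  def loop(cs: List[Char]): Unit = cs match {
    case '\\' :: Nil =>
      sys.error(s"the pattern '$pattern' is invalid, it is not allowed to end with the escape character")
    case '\\' :: c :: rest if c == '\\' || c == '%' || c == '_' =>
      sb.append(Pattern.quote(c.toString)); loop(rest)   // escaped metacharacter
    case '\\' :: c :: _ =>
      sys.error(s"the pattern '$pattern' is invalid, the escape character is not allowed to precede '$c'")
    case '%' :: rest => sb.append(".*"); loop(rest)      // % matches any sequence
    case '_' :: rest => sb.append("."); loop(rest)       // _ matches any single char
    case c :: rest   => sb.append(Pattern.quote(c.toString)); loop(rest)
    case Nil         => ()
  }
  loop(pattern.toList)
  sb.toString
}
```

Under this scheme the pattern string a\\b matches a\b, while e\nf and a pattern ending in a lone \ raise the two errors shown in the transcript above.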
> On Spark 1.6 (and it seems until Spark 2.2.0):
> {code}
> // both LHS & RHS literals
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+-----+
> |_c0|  _c1|
> +---+-----+
> |  \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> +---+-----+
> |_c0|  _c1|
> +---+-----+
> |  \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> // Same test data as for 2.2
> // both column-based
> // not escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 1.6:
> * A literal string consisting of a single \ cannot be matched at all.
> * Matching a string (literal or column) containing \ works whether you escape \ or not, which makes me think there is room for ambiguity (would \\n match \n?).
> * If RHS is a literal and LHS is a column, same issue as Spark 2.2
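The ambiguity noted above can be made concrete with two plausible lenient readings of the same pattern string (both hypothetical; neither claims to be the actual 1.6 implementation):

```scala
import java.util.regex.Pattern

// Two hypothetical lenient LIKE-to-regex readings; neither is Spark 1.6's
// actual implementation, they only illustrate why leniency is ambiguous.

// Reading A: a backslash escapes whatever follows it (a trailing \ is dropped).
def readingA(p: String): String = {
  val sb = new StringBuilder("(?s)")
  def loop(cs: List[Char]): Unit = cs match {
    case '\\' :: c :: rest => sb.append(Pattern.quote(c.toString)); loop(rest)
    case '\\' :: Nil       => ()
    case '%' :: rest       => sb.append(".*"); loop(rest)
    case '_' :: rest       => sb.append("."); loop(rest)
    case c :: rest         => sb.append(Pattern.quote(c.toString)); loop(rest)
    case Nil               => ()
  }
  loop(p.toList); sb.toString
}

// Reading B: a backslash only escapes % and _, and is otherwise literal.
def readingB(p: String): String = {
  val sb = new StringBuilder("(?s)")
  def loop(cs: List[Char]): Unit = cs match {
    case '\\' :: c :: rest if c == '%' || c == '_' =>
      sb.append(Pattern.quote(c.toString)); loop(rest)
    case '%' :: rest => sb.append(".*"); loop(rest)
    case '_' :: rest => sb.append("."); loop(rest)
    case c :: rest   => sb.append(Pattern.quote(c.toString)); loop(rest)
    case Nil         => ()
  }
  loop(p.toList); sb.toString
}
```

Under reading A the three-character pattern \\n matches the two-character string \n; under reading B it matches the three-character string \\n. Without an error (as in 2.2) or an explicit ESCAPE clause, the user cannot tell which reading applies.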



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
