You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/11 23:10:28 UTC

[GitHub] [spark] vitaliili-db opened a new pull request, #37483: [WIP] TO_BINARY implementation with error handling

vitaliili-db opened a new pull request, #37483:
URL: https://github.com/apache/spark/pull/37483

### What changes were proposed in this pull request?

Implementation for `TO_BINARY` with errors on boundary condition for Base64 and Hex

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1217563689

   @cloud-fan done. thank you for catching this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954146929


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -70,6 +70,11 @@
       "Another instance of this query was just started by a concurrent session."
     ]
   },
+  "CONVERSION_INVALID_INPUT" : {

Review Comment:
   I think it will be helpful to generalize this for other conversions, e.g. as mentioned below - `encode`, `unhex`, `unbase64`, and vice versa.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1554409912

   IMO we need to partially backport this patch to branch-3.3.
   
   The base64 function behavior changed since SPARK-37820 (3.3.0), causes some queries, e.g. `select unbase64("abcs==")`, which can execute succeeded in pre-3.3 but failed in 3.3.x
   
   ```
   spark-sql> select unbase64("abcs==");
   23/05/19 16:21:07 ERROR SparkSQLDriver: Failed in [select unbase64("abcs==")]
   java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r951072891


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -169,6 +169,39 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
       summary = getSummary(context))
   }
 
+  def invalidInputInConversionError(
+      to: DataType,
+      s: UTF8String,
+      fmt: UTF8String,
+      hint: String = "",
+      context: SQLQueryContext): SparkIllegalArgumentException = {
+    val alternative = if (hint.nonEmpty) {
+      s" Use '$hint' to tolerate malformed input and return NULL instead."
+    } else ""
+    new SparkIllegalArgumentException(

Review Comment:
   I opened the JIRA for this https://issues.apache.org/jira/browse/SPARK-40174



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r952846091


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -70,6 +70,11 @@
       "Another instance of this query was just started by a concurrent session."
     ]
   },
+  "CONVERSION_INVALID_INPUT" : {

Review Comment:
   I think they are semantically different enough to require separate error class. `TO_BINARY` is string encoding to type conversion, while `CAST` is type to type conversion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r953733474


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala:
##########
@@ -452,6 +452,90 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     // scalastyle:on
   }
 
+  test("to_binary") {
+    // Base64 - Positive cases
+    val base64Literal = Literal("base64")
+    Seq(
+      ("", ""),
+      ("  \r\n  \t", ""),
+      ("ag", "ag=="),
+      ("abc", "abc="),
+      ("abcd", "abcd"),
+      ("ab\r\n\t cd", "abcd"),
+      ("ab cd ef", "abcdeQ=="),
+      ("+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/",
+        "+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/"),
+      ("b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==",
+       "b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==")
+    ).foreach {
+      case (input: String, expected: String) =>
+        checkEvaluation(Base64(ToBinary(Literal(input), base64Literal)), expected)
+    }
+    checkEvaluation(ToBinary(Literal.create(null, StringType), base64Literal), null)
+
+    // Base64 - Negative cases
+    Seq(
+      "a", // Too short
+      "abc==", // Invalid padding
+      "abcdef==a", // String should end with padding
+      "myemail@gmail.com" // Invalid character
+    ).foreach { input =>
+      val errMsg = "[CONVERSION_INVALID_INPUT]"
+      checkExceptionInExpression[Exception](ToBinary(Literal(input), base64Literal), errMsg)
+    }

Review Comment:
   > so I would take switching as separate task, wdyt?
   
   Sure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954014236


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2487,59 +2538,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))
-    val value = f.eval()
-    if (value == null) {
-      Literal(null, BinaryType)
-    } else {
-      value.asInstanceOf[UTF8String].toString.toLowerCase(Locale.ROOT) match {
-        case "hex" => Unhex(expr)
-        case "utf-8" => Encode(expr, Literal("UTF-8"))
-        case "base64" => UnBase64(expr)

Review Comment:
   I know this PR is for fixing `to_binary`, but I'm wondering if these 3 functions have wrong behaviors as well?
   
   As a Spark user, I'd expect `to_binary(input, 'hex')` should have the same behavior as `hex(input)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r956457712


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2487,59 +2538,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))
-    val value = f.eval()
-    if (value == null) {
-      Literal(null, BinaryType)
-    } else {
-      value.asInstanceOf[UTF8String].toString.toLowerCase(Locale.ROOT) match {
-        case "hex" => Unhex(expr)
-        case "utf-8" => Encode(expr, Literal("UTF-8"))
-        case "base64" => UnBase64(expr)

Review Comment:
   Yes, I will make changes to utilize logic from `Unhex`, `Unbase64` and `Encode`. However, I am not sure we can keep it `RuntimeReplaceable` unless format is always constant (which it is not?).



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2487,59 +2538,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))
-    val value = f.eval()
-    if (value == null) {
-      Literal(null, BinaryType)
-    } else {
-      value.asInstanceOf[UTF8String].toString.toLowerCase(Locale.ROOT) match {
-        case "hex" => Unhex(expr)
-        case "utf-8" => Encode(expr, Literal("UTF-8"))
-        case "base64" => UnBase64(expr)
-        case _ if nullOnInvalidFormat => Literal(null, BinaryType)
-        case other => throw QueryCompilationErrors.invalidStringLiteralParameter(
-          "to_binary", "format", other,
-          Some("The value has to be a case-insensitive string literal of " +
-            "'hex', 'utf-8', or 'base64'."))
-      }
-    }
-  }.getOrElse(Unhex(expr))
+case class ToBinary(left: Expression, right: Expression)
+    extends BinaryExpression
+    with ImplicitCastInputTypes with NullIntolerant with SupportQueryContext {
 
-  def this(expr: Expression) = this(expr, None, false)
+  def this(left: Expression) = this(left, Literal("hex"))
 
-  def this(expr: Expression, format: Expression) = this(expr, Some({
-      // We perform this check in the constructor to make it eager and not go through type coercion.
-      if (format.foldable && (format.dataType == StringType || format.dataType == NullType)) {
-        format
-      } else {
-        throw QueryCompilationErrors.requireLiteralParameter("to_binary", "format", "string")
-      }
-    }),
-    false
-    )
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
 
-  override def prettyName: String = "to_binary"
+  override def dataType: DataType = BinaryType
 
-  override def children: Seq[Expression] = expr +: format.toSeq
+  override def nullable: Boolean = true
 
-  override def inputTypes: Seq[AbstractDataType] = children.map(_ => StringType)
+  override def prettyName: String = "to_binary"
 
   override protected def withNewChildrenInternal(
-      newChildren: IndexedSeq[Expression]): Expression = {
-    if (format.isDefined) {
-      copy(expr = newChildren.head, format = Some(newChildren.last))
-    } else {
-      copy(expr = newChildren.head)
+      newLeft: Expression,
+      newRight: Expression): ToBinary = copy(left = newLeft, right = newRight)
+
+  override def initQueryContext(): Option[SQLQueryContext] = Option(origin.context)
+
+  override protected def nullSafeEval(input: Any, format: Any): Any = {
+    val fmtString = format.asInstanceOf[UTF8String]
+    val srcString = input.asInstanceOf[UTF8String]
+    fmtString.toString.toLowerCase(Locale.ROOT) match {
+      case "hex" =>

Review Comment:
   Why do we assume format string is constant? It could be a conditional expression or column reference, isn't it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function
URL: https://github.com/apache/spark/pull/37483


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r950732296


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala:
##########
@@ -452,6 +452,90 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     // scalastyle:on
   }
 
+  test("to_binary") {
+    // Base64 - Positive cases
+    val base64Literal = Literal("base64")
+    Seq(
+      ("", ""),
+      ("  \r\n  \t", ""),
+      ("ag", "ag=="),
+      ("abc", "abc="),
+      ("abcd", "abcd"),
+      ("ab\r\n\t cd", "abcd"),
+      ("ab cd ef", "abcdeQ=="),
+      ("+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/",
+        "+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/"),
+      ("b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==",
+       "b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==")
+    ).foreach {
+      case (input: String, expected: String) =>
+        checkEvaluation(Base64(ToBinary(Literal(input), base64Literal)), expected)
+    }
+    checkEvaluation(ToBinary(Literal.create(null, StringType), base64Literal), null)
+
+    // Base64 - Negative cases
+    Seq(
+      "a", // Too short
+      "abc==", // Invalid padding
+      "abcdef==a", // String should end with padding
+      "myemail@gmail.com" // Invalid character
+    ).foreach { input =>
+      val errMsg = "[CONVERSION_INVALID_INPUT]"
+      checkExceptionInExpression[Exception](ToBinary(Literal(input), base64Literal), errMsg)
+    }
+
+    // UTF-8/UTF8
+    val utf8Literal = Literal("Utf8")
+    val utf_8Literal = Literal("uTf-8")
+    // scalastyle:off
+    Seq(
+      "∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β)",
+      "大千世界",
+      ""
+    ).foreach {
+      input =>
+        checkEvaluation(StringDecode(ToBinary(Literal(input), utf8Literal), utf_8Literal), input)
+        checkEvaluation(StringDecode(ToBinary(Literal(input), utf_8Literal), utf_8Literal), input)
+    }
+    // scalastyle:on
+    checkEvaluation(ToBinary(Literal.create(null, StringType), utf8Literal), null)
+    checkEvaluation(ToBinary(Literal.create(null, StringType), utf_8Literal), null)
+
+    // HEX - Positive cases
+    val hexLiteral = Literal("hEx")
+    Seq(
+      ("737472696E67", "737472696E67"),
+      ("", ""),
+      ("F", "0F"),
+      ("ff", "FF"),
+      ("00", "00")
+    ).foreach {
+      case (input, expected) =>
+        checkEvaluation(Hex(ToBinary(Literal(input), hexLiteral)), expected)
+    }
+    checkEvaluation(ToBinary(Literal.create(null, StringType), hexLiteral), null)
+
+    // HEX - Negative cases
+    Seq(
+      "GG",
+      "FF 01"
+    ).foreach { input =>
+      val errMsg = "[CONVERSION_INVALID_INPUT]"
+      checkExceptionInExpression[Exception](ToBinary(Literal(input), hexLiteral), errMsg)
+    }
+
+    // Invalid format
+    val errMsg = "[CONVERSION_INVALID_FORMAT]"

Review Comment:
   Could you transform this to a test in `QueryExecutionErrorsSuite` and use `checkError()` to check the error class and other fields.



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala:
##########
@@ -452,6 +452,90 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     // scalastyle:on
   }
 
+  test("to_binary") {
+    // Base64 - Positive cases
+    val base64Literal = Literal("base64")
+    Seq(
+      ("", ""),
+      ("  \r\n  \t", ""),
+      ("ag", "ag=="),
+      ("abc", "abc="),
+      ("abcd", "abcd"),
+      ("ab\r\n\t cd", "abcd"),
+      ("ab cd ef", "abcdeQ=="),
+      ("+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/",
+        "+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/"),
+      ("b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==",
+       "b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==")
+    ).foreach {
+      case (input: String, expected: String) =>
+        checkEvaluation(Base64(ToBinary(Literal(input), base64Literal)), expected)
+    }
+    checkEvaluation(ToBinary(Literal.create(null, StringType), base64Literal), null)
+
+    // Base64 - Negative cases
+    Seq(
+      "a", // Too short
+      "abc==", // Invalid padding
+      "abcdef==a", // String should end with padding
+      "myemail@gmail.com" // Invalid character
+    ).foreach { input =>
+      val errMsg = "[CONVERSION_INVALID_INPUT]"
+      checkExceptionInExpression[Exception](ToBinary(Literal(input), base64Literal), errMsg)
+    }

Review Comment:
   Please, use `checkError` to check the error classes.



##########
core/src/main/resources/error/error-classes.json:
##########
@@ -70,6 +70,16 @@
       "Another instance of this query was just started by a concurrent session."
     ]
   },
+  "CONVERSION_INVALID_FORMAT" : {

Review Comment:
   Can't we use the error class `INVALID_PARAMETER_VALUE` instead of it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r949757048


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -169,6 +169,39 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
       summary = getSummary(context))
   }
 
+  def invalidInputInConversionError(
+      to: DataType,
+      s: UTF8String,
+      fmt: UTF8String,
+      hint: String = "",
+      context: SQLQueryContext): SparkIllegalArgumentException = {
+    val alternative = if (hint.nonEmpty) {
+      s" Use '$hint' to tolerate malformed input and return NULL instead."
+    } else ""
+    new SparkIllegalArgumentException(

Review Comment:
   Note: with error class, we don't need different java exception classes and can always use `SparkException`. We keep these special exception classes only for backward compatibility.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1233655860

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954455653


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2487,59 +2538,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))
-    val value = f.eval()
-    if (value == null) {
-      Literal(null, BinaryType)
-    } else {
-      value.asInstanceOf[UTF8String].toString.toLowerCase(Locale.ROOT) match {
-        case "hex" => Unhex(expr)
-        case "utf-8" => Encode(expr, Literal("UTF-8"))
-        case "base64" => UnBase64(expr)
-        case _ if nullOnInvalidFormat => Literal(null, BinaryType)
-        case other => throw QueryCompilationErrors.invalidStringLiteralParameter(
-          "to_binary", "format", other,
-          Some("The value has to be a case-insensitive string literal of " +
-            "'hex', 'utf-8', or 'base64'."))
-      }
-    }
-  }.getOrElse(Unhex(expr))
+case class ToBinary(left: Expression, right: Expression)
+    extends BinaryExpression
+    with ImplicitCastInputTypes with NullIntolerant with SupportQueryContext {
 
-  def this(expr: Expression) = this(expr, None, false)
+  def this(left: Expression) = this(left, Literal("hex"))
 
-  def this(expr: Expression, format: Expression) = this(expr, Some({
-      // We perform this check in the constructor to make it eager and not go through type coercion.
-      if (format.foldable && (format.dataType == StringType || format.dataType == NullType)) {
-        format
-      } else {
-        throw QueryCompilationErrors.requireLiteralParameter("to_binary", "format", "string")
-      }
-    }),
-    false
-    )
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
 
-  override def prettyName: String = "to_binary"
+  override def dataType: DataType = BinaryType
 
-  override def children: Seq[Expression] = expr +: format.toSeq
+  override def nullable: Boolean = true
 
-  override def inputTypes: Seq[AbstractDataType] = children.map(_ => StringType)
+  override def prettyName: String = "to_binary"
 
   override protected def withNewChildrenInternal(
-      newChildren: IndexedSeq[Expression]): Expression = {
-    if (format.isDefined) {
-      copy(expr = newChildren.head, format = Some(newChildren.last))
-    } else {
-      copy(expr = newChildren.head)
+      newLeft: Expression,
+      newRight: Expression): ToBinary = copy(left = newLeft, right = newRight)
+
+  override def initQueryContext(): Option[SQLQueryContext] = Option(origin.context)
+
+  override protected def nullSafeEval(input: Any, format: Any): Any = {
+    val fmtString = format.asInstanceOf[UTF8String]
+    val srcString = input.asInstanceOf[UTF8String]
+    fmtString.toString.toLowerCase(Locale.ROOT) match {
+      case "hex" =>

Review Comment:
   this is worse than before. We assume the format string is a constant, so we decide the "to binary" behavior at the compile time. Now we are checking the format string for each input record at runtime.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1225885017

   @MaxGekk resolved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954004766


##########
core/src/main/scala/org/apache/spark/SparkException.scala:
##########
@@ -287,14 +287,18 @@ private[spark] class SparkNoSuchMethodException(
 private[spark] class SparkIllegalArgumentException(
     errorClass: String,
     errorSubClass: Option[String] = None,
-    messageParameters: Array[String])
+    messageParameters: Array[String],
+    context: Array[QueryContext] = Array.empty,

Review Comment:
   instead of adding query context to `SparkIllegalArgumentException`, can we updating `SparkException` and use it instead?



##########
core/src/main/scala/org/apache/spark/SparkException.scala:
##########
@@ -287,14 +287,18 @@ private[spark] class SparkNoSuchMethodException(
 private[spark] class SparkIllegalArgumentException(
     errorClass: String,
     errorSubClass: Option[String] = None,
-    messageParameters: Array[String])
+    messageParameters: Array[String],
+    context: Array[QueryContext] = Array.empty,

Review Comment:
   instead of adding query context to `SparkIllegalArgumentException`, can we update `SparkException` and use it instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r952361982


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -70,6 +70,11 @@
       "Another instance of this query was just started by a concurrent session."
     ]
   },
+  "CONVERSION_INVALID_INPUT" : {

Review Comment:
   This is similar to `CAST_INVALID_INPUT`, it seems. Can you re-use the existing one?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r956973920


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2488,59 +2539,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))

Review Comment:
   @vitaliili-db check this out. We only support foldable format parameter before. If we want to relax it, please do it in a separate PR and justify it clearly (e.g. user requests for supporting attr/expr as the format parameter).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r959122998


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala:
##########
@@ -1120,28 +1120,58 @@ case class Hex(child: Expression)
   """,
   since = "1.5.0",
   group = "math_funcs")
-case class Unhex(child: Expression)
+case class Unhex(child: Expression, failOnError: Boolean = false)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  def this(expr: Expression) = this(expr, false)
+
   override def inputTypes: Seq[AbstractDataType] = Seq(StringType)
 
   override def nullable: Boolean = true
   override def dataType: DataType = BinaryType
 
-  protected override def nullSafeEval(num: Any): Any =
-    Hex.unhex(num.asInstanceOf[UTF8String].getBytes)
+  protected override def nullSafeEval(num: Any): Any = {
+    val result = Hex.unhex(num.asInstanceOf[UTF8String].getBytes)
+    if (failOnError && result == null) {
+      // The failOnError is set only from `ToBinary` function - hence we might safely set `hint`
+      // parameter to `try_to_binary`.

Review Comment:
   why not set the hint parameter to `to_binary`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r959830811


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala:
##########
@@ -1120,28 +1120,58 @@ case class Hex(child: Expression)
   """,
   since = "1.5.0",
   group = "math_funcs")
-case class Unhex(child: Expression)
+case class Unhex(child: Expression, failOnError: Boolean = false)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  def this(expr: Expression) = this(expr, false)
+
   override def inputTypes: Seq[AbstractDataType] = Seq(StringType)
 
   override def nullable: Boolean = true
   override def dataType: DataType = BinaryType
 
-  protected override def nullSafeEval(num: Any): Any =
-    Hex.unhex(num.asInstanceOf[UTF8String].getBytes)
+  protected override def nullSafeEval(num: Any): Any = {
+    val result = Hex.unhex(num.asInstanceOf[UTF8String].getBytes)
+    if (failOnError && result == null) {
+      // The failOnError is set only from `ToBinary` function - hence we might safely set `hint`
+      // parameter to `try_to_binary`.

Review Comment:
   hint is error free alternative function. So instead of `to_binary` => `try_to_binary`. Maybe there is a cleaner way to pass hint function for runtimereplaceable, e.g. tags?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r949732312


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,65 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    for (c: Char <- srcString.toString) {
+      c match {
+        case a
+            if (a >= '0' && a <= '9')
+            || (a >= 'A' && a <= 'Z')
+            || (a >= 'a' && a <= 'z')
+            || a == '/' || a == '+' =>
+          if (padSize != 0) return false // Padding symbols should conclude the string.
+          position += 1
+        case pad if pad == '=' =>

Review Comment:
   ```suggestion
           case '=' =>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1221215906

   @Yikun Hi Yikun, for some reason `Base image build` failed several times until I added `write` permission to GITHUB_TOKEN, is it expected?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] Kimahriman commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by "Kimahriman (via GitHub)" <gi...@apache.org>.

Kimahriman commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1563548495

   Minor fix for the codegen from this: https://github.com/apache/spark/pull/41317


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r948643760


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,10 @@ license: |
   - Since Spark 3.4, v1 database, table, permanent view and function identifier will include 'spark_catalog' as the catalog name if database is defined, e.g. a table identifier will be: `spark_catalog.default.t`. To restore the legacy behavior, set `spark.sql.legacy.v1IdentifierNoCatalog` to `true`.
   - Since Spark 3.4, `INSERT INTO` commands will now support user-specified column lists comprising fewer columns than present in the target table (for example, `INSERT INTO t (a, b) VALUES (1, 2)` where table `t` has three columns). In this case, Spark will insert `NULL` into the remaining columns in the row, or the explicit `DEFAULT` value if assigned to the column. To revert to the previous behavior, please set `spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns` to false.
   - Since Spark 3.4, when ANSI SQL mode(configuration `spark.sql.ansi.enabled`) is on, Spark SQL always returns NULL result on getting a map value with a non-existing key. In Spark 3.3 or earlier, there will be an error.
+  - Since Spark 3.4, the `to_binary` function throws error for a malformed `str` input. Use `try_to_binary` to tolerate malformed input and return NULL instead.
+    - Valid Base64 string should include symbols from in base64 alphabet (A-Za-z0-9+/), optional padding (`=`), and optional whitespaces. Whitespaces are skipped in conversion except when they are preceded by padding symbol(s). If padding is present it should conclude the string and follow rules described in RFC 4648 § 4.
+    - Valid hexadecimal strings should include only allowed symbols (0-9A-Fa-f).
+    - Valid values for `fmt` are case-insensitive `hex`, `base64`, `utf-8`, `utf8`.

Review Comment:
   this is not a breaking change I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r947512752


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,76 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    var invalid = false // The string is detected to be invalid.
+    val validation = srcString.toString.map {
+      case c
+        if !invalid &&
+          ((c >= '0' && c <= '9')
+            || (c >= 'A' && c <= 'Z')
+            || (c >= 'a' && c <= 'z')
+            || c == '/' || c == '+') =>
+        position += 1
+        if (padSize != 0) { // Valid character after padding.
+          invalid = true
+          false
+        } else { // A group should have at least 2 valid symbols.
+          position % 4 != 1

Review Comment:
   should we update `invalid` in this branch if the `srcString` is invalid?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r950463225


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,65 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    for (c: Char <- srcString.toString) {
+      c match {
+        case a
+            if (a >= '0' && a <= '9')
+            || (a >= 'A' && a <= 'Z')
+            || (a >= 'a' && a <= 'z')
+            || a == '/' || a == '+' =>
+          if (padSize != 0) return false // Padding symbols should conclude the string.
+          position += 1
+        case pad if pad == '=' =>

Review Comment:
   done, thank you!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r959836768


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2488,11 +2568,10 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
 case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable

Review Comment:
   done.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2534,7 +2615,7 @@ case class ToBinary(
   override def inputTypes: Seq[AbstractDataType] = children.map(_ => StringType)
 
   override protected def withNewChildrenInternal(
-      newChildren: IndexedSeq[Expression]): Expression = {

Review Comment:
   done.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##########
@@ -800,6 +800,19 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
       s"The '$argName' parameter of function '$funcName' needs to be a $requiredType literal.")
   }
 
+  def invalidFormatInConversion(
+    argName: String,

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1559306005

   @cloud-fan, sorry my previous comment is not correct.
   
   After SPARK-37820, `select unbase64("abcs==")`(malformed input) always throws an exception, this PR does not help in that case, it only improves the error message for `to_binary()`.
   
   So, `unbase64()`'s behavior for malformed input changes silently after SPARK-37820:
   - before: return a malformed decode value (maybe a best-effort parsed result, not sure)
   - after: throw an exception
   
   And there is no way to restore the previous behavior. To tolerate the malformed input, the user should migrate "unbase64(<input>)" to `try_to_binary(<input>, 'base64')` to get NULL instead of interrupting by exception. 
   
   This is kind of an awkward situation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1559382923

   If the new behavior is already released in 3.3, I don't think we can change it back as it will be another breaking change. We can update the migration guide, or provide some solutions for easy migration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954014236


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2487,59 +2538,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))
-    val value = f.eval()
-    if (value == null) {
-      Literal(null, BinaryType)
-    } else {
-      value.asInstanceOf[UTF8String].toString.toLowerCase(Locale.ROOT) match {
-        case "hex" => Unhex(expr)
-        case "utf-8" => Encode(expr, Literal("UTF-8"))
-        case "base64" => UnBase64(expr)

Review Comment:
   I know this PR is for fixing `to_binary`, but I'm wondering if these 3 functions have wrong behaviors as well?
   
   As a Spark user, I'd expect `to_binary(input, 'hex')` should have the same behavior as `unhex(input)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954150734


##########
core/src/main/scala/org/apache/spark/SparkException.scala:
##########
@@ -287,14 +287,18 @@ private[spark] class SparkNoSuchMethodException(
 private[spark] class SparkIllegalArgumentException(
     errorClass: String,
     errorSubClass: Option[String] = None,
-    messageParameters: Array[String])
+    messageParameters: Array[String],
+    context: Array[QueryContext] = Array.empty,

Review Comment:
   @MaxGekk has a separate ticket to fix this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r953732921


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -70,6 +70,11 @@
       "Another instance of this query was just started by a concurrent session."
     ]
   },
+  "CONVERSION_INVALID_INPUT" : {

Review Comment:
   ok, I see.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r959122627


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala:
##########
@@ -1120,28 +1120,58 @@ case class Hex(child: Expression)
   """,
   since = "1.5.0",
   group = "math_funcs")
-case class Unhex(child: Expression)
+case class Unhex(child: Expression, failOnError: Boolean = false)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  def this(expr: Expression) = this(expr, false)
+
   override def inputTypes: Seq[AbstractDataType] = Seq(StringType)
 
   override def nullable: Boolean = true
   override def dataType: DataType = BinaryType
 
-  protected override def nullSafeEval(num: Any): Any =
-    Hex.unhex(num.asInstanceOf[UTF8String].getBytes)
+  protected override def nullSafeEval(num: Any): Any = {
+    val result = Hex.unhex(num.asInstanceOf[UTF8String].getBytes)
+    if (failOnError && result == null) {
+      // The failOnError is set only from `ToBinary` function - hence we might safely set `hint`
+      // parameter to `try_to_binary`.
+      throw QueryExecutionErrors.invalidInputInConversionError(
+        BinaryType,
+        num.asInstanceOf[UTF8String],
+        UTF8String.fromString("HEX"),
+        "try_to_binary")
+    }
+    result
+  }
 
   override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (c) => {
+    nullSafeCodeGen(ctx, ev, c => {
       val hex = Hex.getClass.getName.stripSuffix("$")
+      val maybeFailOnErrorCode = if (failOnError) {
+        val format = UTF8String.fromString("BASE64");
+        val binaryType = ctx.addReferenceObj("to", BinaryType, BinaryType.getClass.getName)
+        s"""
+           |if (${ev.value} == null) {
+           |  throw QueryExecutionErrors.invalidInputInConversionError(
+           |    $binaryType,
+           |    $c,
+           |    $format,
+           |    "try_to_binary");

Review Comment:
   how can we know it's `try_to_binary` or `to_binary`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r949609910


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,10 @@ license: |
   - Since Spark 3.4, v1 database, table, permanent view and function identifier will include 'spark_catalog' as the catalog name if database is defined, e.g. a table identifier will be: `spark_catalog.default.t`. To restore the legacy behavior, set `spark.sql.legacy.v1IdentifierNoCatalog` to `true`.
   - Since Spark 3.4, `INSERT INTO` commands will now support user-specified column lists comprising fewer columns than present in the target table (for example, `INSERT INTO t (a, b) VALUES (1, 2)` where table `t` has three columns). In this case, Spark will insert `NULL` into the remaining columns in the row, or the explicit `DEFAULT` value if assigned to the column. To revert to the previous behavior, please set `spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns` to false.
   - Since Spark 3.4, when ANSI SQL mode(configuration `spark.sql.ansi.enabled`) is on, Spark SQL always returns NULL result on getting a map value with a non-existing key. In Spark 3.3 or earlier, there will be an error.
+  - Since Spark 3.4, the `to_binary` function throws error for a malformed `str` input. Use `try_to_binary` to tolerate malformed input and return NULL instead.
+    - Valid Base64 string should include symbols from in base64 alphabet (A-Za-z0-9+/), optional padding (`=`), and optional whitespaces. Whitespaces are skipped in conversion except when they are preceded by padding symbol(s). If padding is present it should conclude the string and follow rules described in RFC 4648 § 4.
+    - Valid hexadecimal strings should include only allowed symbols (0-9A-Fa-f).
+    - Valid values for `fmt` are case-insensitive `hex`, `base64`, `utf-8`, `utf8`.

Review Comment:
   It might break users who rely on `NULL` result for malformed input? Similar to ansi mode.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r949610332


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,76 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    var invalid = false // The string is detected to be invalid.
+    val validation = srcString.toString.map {
+      case c
+        if !invalid &&
+          ((c >= '0' && c <= '9')
+            || (c >= 'A' && c <= 'Z')
+            || (c >= 'a' && c <= 'z')
+            || c == '/' || c == '+') =>
+        position += 1
+        if (padSize != 0) { // Valid character after padding.
+          invalid = true
+          false
+        } else { // A group should have at least 2 valid symbols.
+          position % 4 != 1

Review Comment:
   yesyes, tried scala way, rewrote to java :).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #37483: [WIP] TO_BINARY implementation with error handling

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1214180307

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1217534540

   can we update `migratiob-guide.md` and mention the breaking changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r949753203


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,65 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    for (c: Char <- srcString.toString) {
+      c match {
+        case a
+            if (a >= '0' && a <= '9')
+            || (a >= 'A' && a <= 'Z')
+            || (a >= 'a' && a <= 'z')
+            || a == '/' || a == '+' =>
+          if (padSize != 0) return false // Padding symbols should conclude the string.
+          position += 1
+        case pad if pad == '=' =>
+          padSize += 1
+          // Last group preceding padding should have 2 or more symbols. Padding size should be 1 or
+          // less.
+          if (padSize > 2 || position % 4 < 2) {
+            return false
+          }
+        case ws if Character.isWhitespace(ws) =>
+          if (padSize != 0) { // Padding symbols should conclude the string.
+            return false
+          }
+        case _ => return false
+      }
+    }
+    if (padSize > 0) { // When padding is present last group should have exactly 4 symbols.
+      (position + padSize) % 4 == 0
+    } else { // When padding is absent last group should include 2 or more symbols.
+      position % 4 != 1
+    }

Review Comment:
   java style is much easier to read :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1554421469

   cc @yaooqinn @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1559005111

   @pan3793 can you help to open the 3.3 backport PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r951966342


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -70,6 +70,16 @@
       "Another instance of this query was just started by a concurrent session."
     ]
   },
+  "CONVERSION_INVALID_FORMAT" : {

Review Comment:
   Good point, fixed.



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala:
##########
@@ -452,6 +452,90 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     // scalastyle:on
   }
 
+  test("to_binary") {
+    // Base64 - Positive cases
+    val base64Literal = Literal("base64")
+    Seq(
+      ("", ""),
+      ("  \r\n  \t", ""),
+      ("ag", "ag=="),
+      ("abc", "abc="),
+      ("abcd", "abcd"),
+      ("ab\r\n\t cd", "abcd"),
+      ("ab cd ef", "abcdeQ=="),
+      ("+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/",
+        "+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/"),
+      ("b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==",
+       "b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==")
+    ).foreach {
+      case (input: String, expected: String) =>
+        checkEvaluation(Base64(ToBinary(Literal(input), base64Literal)), expected)
+    }
+    checkEvaluation(ToBinary(Literal.create(null, StringType), base64Literal), null)
+
+    // Base64 - Negative cases
+    Seq(
+      "a", // Too short
+      "abc==", // Invalid padding
+      "abcdef==a", // String should end with padding
+      "myemail@gmail.com" // Invalid character
+    ).foreach { input =>
+      val errMsg = "[CONVERSION_INVALID_INPUT]"
+      checkExceptionInExpression[Exception](ToBinary(Literal(input), base64Literal), errMsg)
+    }
+
+    // UTF-8/UTF8
+    val utf8Literal = Literal("Utf8")
+    val utf_8Literal = Literal("uTf-8")
+    // scalastyle:off
+    Seq(
+      "∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β)",
+      "大千世界",
+      ""
+    ).foreach {
+      input =>
+        checkEvaluation(StringDecode(ToBinary(Literal(input), utf8Literal), utf_8Literal), input)
+        checkEvaluation(StringDecode(ToBinary(Literal(input), utf_8Literal), utf_8Literal), input)
+    }
+    // scalastyle:on
+    checkEvaluation(ToBinary(Literal.create(null, StringType), utf8Literal), null)
+    checkEvaluation(ToBinary(Literal.create(null, StringType), utf_8Literal), null)
+
+    // HEX - Positive cases
+    val hexLiteral = Literal("hEx")
+    Seq(
+      ("737472696E67", "737472696E67"),
+      ("", ""),
+      ("F", "0F"),
+      ("ff", "FF"),
+      ("00", "00")
+    ).foreach {
+      case (input, expected) =>
+        checkEvaluation(Hex(ToBinary(Literal(input), hexLiteral)), expected)
+    }
+    checkEvaluation(ToBinary(Literal.create(null, StringType), hexLiteral), null)
+
+    // HEX - Negative cases
+    Seq(
+      "GG",
+      "FF 01"
+    ).foreach { input =>
+      val errMsg = "[CONVERSION_INVALID_INPUT]"
+      checkExceptionInExpression[Exception](ToBinary(Literal(input), hexLiteral), errMsg)
+    }
+
+    // Invalid format
+    val errMsg = "[CONVERSION_INVALID_FORMAT]"

Review Comment:
   Done, thank you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1217352825

   @cloud-fan @MaxGekk ptal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r948645396


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,76 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    var invalid = false // The string is detected to be invalid.
+    val validation = srcString.toString.map {
+      case c
+        if !invalid &&
+          ((c >= '0' && c <= '9')
+            || (c >= 'A' && c <= 'Z')
+            || (c >= 'a' && c <= 'z')
+            || c == '/' || c == '+') =>
+        position += 1
+        if (padSize != 0) { // Valid character after padding.
+          invalid = true
+          false
+        } else { // A group should have at least 2 valid symbols.
+          position % 4 != 1

Review Comment:
   I don't quite understand why we need to use a list to keep the validation results. Can't we use a bool variable to hold the final validation result?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1563711884

   > If the new behavior is already released in 3.3, I don't think we can change it back as it will be another breaking change. We can update the migration guide, or provide some solutions for easy migration.
   
   @cloud-fan opened https://github.com/apache/spark/pull/41280 to mention the behavior change in migration guide


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r950463020


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -169,6 +169,39 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
       summary = getSummary(context))
   }
 
+  def invalidInputInConversionError(
+      to: DataType,
+      s: UTF8String,
+      fmt: UTF8String,
+      hint: String = "",
+      context: SQLQueryContext): SparkIllegalArgumentException = {
+    val alternative = if (hint.nonEmpty) {
+      s" Use '$hint' to tolerate malformed input and return NULL instead."
+    } else ""
+    new SparkIllegalArgumentException(

Review Comment:
   `SparkException` does not support `QueryContext` yet, will need a separate PR to fix that



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,65 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    for (c: Char <- srcString.toString) {
+      c match {
+        case a
+            if (a >= '0' && a <= '9')
+            || (a >= 'A' && a <= 'Z')
+            || (a >= 'a' && a <= 'z')
+            || a == '/' || a == '+' =>
+          if (padSize != 0) return false // Padding symbols should conclude the string.
+          position += 1
+        case pad if pad == '=' =>
+          padSize += 1
+          // Last group preceding padding should have 2 or more symbols. Padding size should be 1 or
+          // less.
+          if (padSize > 2 || position % 4 < 2) {
+            return false
+          }
+        case ws if Character.isWhitespace(ws) =>
+          if (padSize != 0) { // Padding symbols should conclude the string.
+            return false
+          }
+        case _ => return false
+      }
+    }
+    if (padSize > 0) { // When padding is present last group should have exactly 4 symbols.
+      (position + padSize) % 4 == 0
+    } else { // When padding is absent last group should include 2 or more symbols.
+      position % 4 != 1
+    }

Review Comment:
   agree :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] Yikun commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

Yikun commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1221229222

   @vitaliili-db **By default**, write permission already included to your `GITHUB_TOKEN`, but if you set it manually (see also [1], [2]), it will failed, current CI will first build infra and push to your ghcr so write permission is required (see also [3]).
   
   [1] https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/
   [2] https://github.com/users/vitaliili-db/packages/container/apache-spark-ci-image/settings
   [3] https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954003900


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -70,6 +70,11 @@
       "Another instance of this query was just started by a concurrent session."
     ]
   },
+  "CONVERSION_INVALID_INPUT" : {

Review Comment:
   shall we make the name more explicit? `TO_BINARY_INVALID_INPUT`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r959086655


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2488,59 +2539,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))

Review Comment:
   @cloud-fan Here is my attempt to revert to `RuntimeReplaceable`. It works for the most part except for errors. I am not sure if it is serialization to executors or something else but `QueryContext` is pretty much empty (not `None`, just all fields are not populated). <- @MaxGekk is there a way to fix this? 
   
   Another alternative way to enforce `format` being foldable/literal is to check it in `checkInputDataTypes`. This will revert my last commit but error messages will be cleaner, what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r959123241


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2488,11 +2568,10 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
 case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable

Review Comment:
   nit: the previous 4 spaces indentation is correct



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r959123514


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##########
@@ -800,6 +800,19 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
       s"The '$argName' parameter of function '$funcName' needs to be a $requiredType literal.")
   }
 
+  def invalidFormatInConversion(
+    argName: String,

Review Comment:
   nit: 4 spaces indentation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r947523542


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,76 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    var invalid = false // The string is detected to be invalid.
+    val validation = srcString.toString.map {
+      case c
+        if !invalid &&
+          ((c >= '0' && c <= '9')
+            || (c >= 'A' && c <= 'Z')
+            || (c >= 'a' && c <= 'z')
+            || c == '/' || c == '+') =>
+        position += 1
+        if (padSize != 0) { // Valid character after padding.
+          invalid = true
+          false
+        } else { // A group should have at least 2 valid symbols.
+          position % 4 != 1

Review Comment:
   No, we set `invalid` only if we know for sure that the string is invalid. While we iterating the string and `invalid` is not set we are populating sequence of bools. Intermediate state could be marked as non-valid but if there is subsequent characters the string will be valid, e.g. `abcdef==` => `{false, true, true, true, false, true, false, true}`. For the first character the position will be 1 so if we don't encounter valid characters the result will be invalid string (could be whitespace or something). But next valid symbol makes entire string valid. Similarly for character at position 5 and 7.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r951968563


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala:
##########
@@ -452,6 +452,90 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     // scalastyle:on
   }
 
+  test("to_binary") {
+    // Base64 - Positive cases
+    val base64Literal = Literal("base64")
+    Seq(
+      ("", ""),
+      ("  \r\n  \t", ""),
+      ("ag", "ag=="),
+      ("abc", "abc="),
+      ("abcd", "abcd"),
+      ("ab\r\n\t cd", "abcd"),
+      ("ab cd ef", "abcdeQ=="),
+      ("+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/",
+        "+ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/"),
+      ("b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==",
+       "b25lIHR3byB0aHJlZSBmb3VyIGZpdmUgc2l4IHNldmVuIGVpZ2h0IG5pbmUgdGVuIGVsZXZlbiB0\r\n" +
+        "d2VsdmUgdGhpcnRlZW4gZm91cnRlZW4gZml2dGVlbiBzaXh0ZWVuIHNldmVudGVlbiBlaWdodGVl\r\n" +
+        "biBuaW5ldGVlbiB0d2VudHkgdHdlbnR5LW9uZQ==")
+    ).foreach {
+      case (input: String, expected: String) =>
+        checkEvaluation(Base64(ToBinary(Literal(input), base64Literal)), expected)
+    }
+    checkEvaluation(ToBinary(Literal.create(null, StringType), base64Literal), null)
+
+    // Base64 - Negative cases
+    Seq(
+      "a", // Too short
+      "abc==", // Invalid padding
+      "abcdef==a", // String should end with padding
+      "myemail@gmail.com" // Invalid character
+    ).foreach { input =>
+      val errMsg = "[CONVERSION_INVALID_INPUT]"
+      checkExceptionInExpression[Exception](ToBinary(Literal(input), base64Literal), errMsg)
+    }

Review Comment:
   I think it is a bit out of scope of this PR. Right now `checkExceptionInExpression` covers more scenarios than `checkError` so I would take switching as separate task, wdyt?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r951040536


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -169,6 +169,39 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
       summary = getSummary(context))
   }
 
+  def invalidInputInConversionError(
+      to: DataType,
+      s: UTF8String,
+      fmt: UTF8String,
+      hint: String = "",
+      context: SQLQueryContext): SparkIllegalArgumentException = {
+    val alternative = if (hint.nonEmpty) {
+      s" Use '$hint' to tolerate malformed input and return NULL instead."
+    } else ""
+    new SparkIllegalArgumentException(

Review Comment:
   @MaxGekk can you help to fix it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r948644710


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,76 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    var invalid = false // The string is detected to be invalid.
+    val validation = srcString.toString.map {

Review Comment:
   I think we can quite the loop earlier if `invalid == true`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r949610097


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,76 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    var invalid = false // The string is detected to be invalid.
+    val validation = srcString.toString.map {

Review Comment:
   rewrote method to iterative approach.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954454856


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2487,59 +2538,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))
-    val value = f.eval()
-    if (value == null) {
-      Literal(null, BinaryType)
-    } else {
-      value.asInstanceOf[UTF8String].toString.toLowerCase(Locale.ROOT) match {
-        case "hex" => Unhex(expr)
-        case "utf-8" => Encode(expr, Literal("UTF-8"))
-        case "base64" => UnBase64(expr)

Review Comment:
   I'd like to keep `ToBinary` as a `RuntimeReplaceable`. How about we add a `failOnError` bool flag to `Unhex` and `UnBase64`? It's only set to true by `ToBinary` so no behavior change for these 2 functions. It's basically moving your code changes to `Unhex` and `UnBase64`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r956457712


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2487,59 +2538,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))
-    val value = f.eval()
-    if (value == null) {
-      Literal(null, BinaryType)
-    } else {
-      value.asInstanceOf[UTF8String].toString.toLowerCase(Locale.ROOT) match {
-        case "hex" => Unhex(expr)
-        case "utf-8" => Encode(expr, Literal("UTF-8"))
-        case "base64" => UnBase64(expr)

Review Comment:
   However, I am not sure we can keep it `RuntimeReplaceable` unless format is always constant (which it is not?). In which case it is probably better to internalize logic?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r956974990


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2488,59 +2539,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))

Review Comment:
   Note that, I'm asking for justification because this is not free. We changed some logic from compile time to runtime, which complicates the implementation and may also affect performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on PR #37483:
URL: https://github.com/apache/spark/pull/37483#issuecomment-1225656234

   @vitaliili-db Could you resolve conflicts, please.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r954153222


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2487,59 +2538,117 @@ case class Encode(value: Expression, charset: Expression)
   """,
   since = "3.3.0",
   group = "string_funcs")
-// scalastyle:on line.size.limit
-case class ToBinary(
-    expr: Expression,
-    format: Option[Expression],
-    nullOnInvalidFormat: Boolean = false) extends RuntimeReplaceable
-  with ImplicitCastInputTypes {
-
-  override lazy val replacement: Expression = format.map { f =>
-    assert(f.foldable && (f.dataType == StringType || f.dataType == NullType))
-    val value = f.eval()
-    if (value == null) {
-      Literal(null, BinaryType)
-    } else {
-      value.asInstanceOf[UTF8String].toString.toLowerCase(Locale.ROOT) match {
-        case "hex" => Unhex(expr)
-        case "utf-8" => Encode(expr, Literal("UTF-8"))
-        case "base64" => UnBase64(expr)

Review Comment:
   We might want to consider changing behavior of `unhex` and `unbase64` as well in a separate ticket, e.g. under ansi flag? Since these were available for a long time I prefer not to expand scope of this ticket.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r959123405


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2534,7 +2615,7 @@ case class ToBinary(
   override def inputTypes: Seq[AbstractDataType] = children.map(_ => StringType)
 
   override protected def withNewChildrenInternal(
-      newChildren: IndexedSeq[Expression]): Expression = {

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org