You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/17 07:02:35 UTC

[GitHub] [spark] vitaliili-db commented on a diff in pull request #37483: [SPARK-40112][SQL] Improve the TO_BINARY() function

vitaliili-db commented on code in PR #37483:
URL: https://github.com/apache/spark/pull/37483#discussion_r947523542


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2469,14 +2469,76 @@ case class Encode(value: Expression, charset: Expression)
     newLeft: Expression, newRight: Expression): Encode = copy(value = newLeft, charset = newRight)
 }
 
+object ToBinary {
+  def isValidBase64(srcString: UTF8String) : Boolean = {
+    // We use RFC4648. The valid base64 string should contain zero or more groups of 4 symbols plus
+    // last group consisting of 2-4 valid symbols and optional padding.
+    // Last group should contain at least 2 valid symbols and up to 2 padding characters `=`.
+    // Valid symbols include - (A-Za-z0-9+/). Each group might contain arbitrary number of
+    // whitespaces which are ignored.
+    // If padding is present - last group should include exactly 4 symbols.
+    // Examples:
+    //    "abcd"      - Valid, single group of 4 valid symbols
+    //    "abc d"     - Valid, single group of 4 valid symbols, whitespace is skipped
+    //    "abc?"      - Invalid, group contains invalid symbol `?`
+    //    "abcdA"     - Invalid, last group should contain at least 2 valid symbols
+    //    "abcdAE"    - Valid, a group of 4 valid symbols and a group of 2 valid symbols
+    //    "abcdAE=="  - Valid, last group includes 2 padding symbols and total number of symbols
+    //                  in a group is 4.
+    //    "abcdAE="   - Invalid, last group include padding symbols, therefore it should have
+    //                  exactly 4 symbols but contains only 3.
+    //    "ab==tm+1"  - Invalid, nothing should be after padding.
+    var position = 0
+    var padSize = 0
+    var invalid = false // The string is detected to be invalid.
+    val validation = srcString.toString.map {
+      case c
+        if !invalid &&
+          ((c >= '0' && c <= '9')
+            || (c >= 'A' && c <= 'Z')
+            || (c >= 'a' && c <= 'z')
+            || c == '/' || c == '+') =>
+        position += 1
+        if (padSize != 0) { // Valid character after padding.
+          invalid = true
+          false
+        } else { // A group should have at least 2 valid symbols.
+          position % 4 != 1

Review Comment:
   No, we set `invalid` only if we know for sure that the string is invalid. While we iterating the string and `invalid` is not set we are populating sequence of bools. Intermediate state could be marked as non-valid but if there is subsequent characters the string will be valid, e.g. `abcdef==` => `{false, true, true, true, false, true, false, true}`. For the first character the position will be 1 so if we don't encounter valid characters the result will be invalid string (could be whitespace or something). But next valid symbol makes entire string valid. Similarly for character at position 5 and 7.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org