Posted to reviews@spark.apache.org by "ted-jenks (via GitHub)" <gi...@apache.org> on 2024/03/06 18:27:16 UTC

[PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

ted-jenks opened a new pull request, #45408:
URL: https://github.com/apache/spark/pull/45408

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'common/utils/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   In https://github.com/apache/spark/pull/35110, it was incorrectly asserted that:
   > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt
   
   This is not accurate: the previous code called
   ```java
   public static byte[] encodeBase64(byte[] binaryData)
   ```
   which is documented as:
   > Encodes binary data using the base64 algorithm but does not chunk the output.
   
   The RFC 2045 (MIME) base64 encoder, by contrast, does chunk its output by default. As a result, base64 strings produced by Spark can no longer be decoded by decoders that do not implement RFC 2045, while the docs state RFC 4648.
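   The difference is easy to reproduce with the JDK encoders alone (a minimal illustrative sketch, not code from this PR):

```java
import java.util.Base64;

public class ChunkingDemo {
    public static void main(String[] args) {
        // 80 bytes of input encodes to more than 76 base64 characters,
        // so the MIME (RFC 2045) encoder inserts a CRLF line break.
        byte[] data = new byte[80];

        String mime = Base64.getMimeEncoder().encodeToString(data);   // chunked
        String basic = Base64.getEncoder().encodeToString(data);      // RFC 4648, unchunked

        System.out.println(mime.contains("\r\n"));   // true: output is chunked
        System.out.println(basic.contains("\r\n"));  // false
        // Stripping the line breaks recovers the unchunked form.
        System.out.println(mime.replace("\r\n", "").equals(basic));   // true
    }
}
```

   An RFC 4648 decoder rejects the CRLF bytes in the chunked form, which is the incompatibility described above.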
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   Existing test suite.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1518435595


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3318,6 +3318,13 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")

Review Comment:
   This part, `val BASE_64_CHUNKING`, should also be revised accordingly.





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1518435187


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3318,6 +3318,13 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")

Review Comment:
   We are very careful about the configuration namespace. In this case, it's a little too much to add a new namespace, `spark.sql.base64.*` for this single configuration. Maybe, (1) `spark.sql.chunkBase64String` or (2) `spark.sql.chunkBase64String.enabled`? I prefer (1).





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "ted-jenks (via GitHub)" <gi...@apache.org>.
ted-jenks commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1985424230

   I think making this configurable makes the most sense. People processing data for external systems with Spark can then choose whether to chunk the data based on the use case.




Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1984929745

   +1 for the direction if we need to support both.




Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-2098956928

   Gentle ping, @ted-jenks .




Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1981829489

   Hi, @ted-jenks . Could you elaborate your correctness situation a little more? It sounds like you have other systems to read Spark's data.




Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "ted-jenks (via GitHub)" <gi...@apache.org>.
ted-jenks commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1981542843

   @dongjoon-hyun could you please take a look? This caused a major data correctness issue for us.




Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1984433848

   Thank you for the confirmation, @ted-jenks . Well, in this case, it's too late to change the behavior again. Apache Spark 3.3 has been EOL since last year, and I don't think we need to change the behavior for Apache Spark 3.4.3 and 3.5.2 because the Apache Spark community didn't have such an official contract before. It would have been great if you had participated in the community's Apache Spark 3.3.0 RC votes at the time.
   
   > > It sounds like you have other systems to read Spark's data.
   >
   > Correct. The issue was that from 3.2 to 3.3 there was a behavior change in the base64 encodings used in Spark. Previously, they did not chunk; now, they do. Chunked base64 cannot be read by non-MIME-compatible base64 decoders, causing the data output by Spark to appear corrupt to systems following the plain base64 standard.
   > 
   > I think the best path forward is to use MIME encoding/decoding without chunking, as this is the most fault tolerant: existing use cases will not break, and the pre-3.3 base64 behavior is upheld.
   
   However, I understand and agree with @ted-jenks 's point as a nice-to-have for Apache Spark 4+ officially. In other words, if we want to merge this PR, we need to make it official from Apache Spark 4.0.0 onward and protect it as a kind of developer interface for all future releases. Do you think that's okay, @ted-jenks ?
   
   BTW, how do you think about this proposal, @yaooqinn (the original author of #35110) and @cloud-fan and @HyukjinKwon ?




Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1984926315

   Thank you @dongjoon-hyun. 
   
   In such circumstances, I guess we can add a configuration for base64 classes to avoid breaking things again. AFAIK, Apache Hive also uses the JDK version, and I think the majority of Spark users talk to Hive heavily using Spark SQL.




Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "ted-jenks (via GitHub)" <gi...@apache.org>.
ted-jenks commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1982836579

   @dongjoon-hyun 
   > It sounds like you have other systems to read Spark's data.
   
   Correct. The issue was that from 3.2 to 3.3 there was a behavior change in the base64 encodings used in Spark. Previously, they did not chunk; now, they do. Chunked base64 cannot be read by non-MIME-compatible base64 decoders, causing the data output by Spark to appear corrupt to systems following the plain base64 standard.
   
   I think the best path forward is to use MIME encoding/decoding without chunking, as this is the most fault tolerant: existing use cases will not break, and the pre-3.3 base64 behavior is upheld.
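   The unchunked-MIME approach can be sketched with the JDK API (illustrative only; it mirrors the `getMimeEncoder(-1, Array())` call used in this PR):

```java
import java.util.Base64;

public class UnchunkedMimeDemo {
    public static void main(String[] args) {
        byte[] data = new byte[80];

        // A MIME encoder with a non-positive line length never chunks...
        Base64.Encoder unchunkedMime = Base64.getMimeEncoder(-1, new byte[0]);

        // ...so its output matches the plain RFC 4648 encoder, restoring
        // the pre-3.3 behavior while the MIME *decoder* stays tolerant.
        String viaMime = unchunkedMime.encodeToString(data);
        String viaBasic = Base64.getEncoder().encodeToString(data);
        System.out.println(viaMime.equals(viaBasic));  // true
    }
}
```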




Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1988397848

   thank you for the explanation @ted-jenks 




Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1522057774


##########
sql/core/src/test/resources/sql-tests/results/charvarchar.sql.out:
##########
@@ -839,17 +839,26 @@ NULL	NULL	NULL	NULL
 -- !query
 select base64(c7), base64(c8), base64(v), ascii(s) from char_tbl4
 -- !query schema
-struct<base64(c7):string,base64(c8):string,base64(v):string,ascii(s):int>
+struct<>
 -- !query output
-NULL	NULL	NULL	NULL
-NULL	NULL	Uw==	NULL
-TiAgICAgIA==	TiAgICAgICA=	TiA=	78
-TmUgICAgIA==	TmUgICAgICA=	U3A=	78
-TmV0ICAgIA==	TmV0ICAgICA=	U3BhICA=	78
-TmV0RSAgIA==	TmV0RSAgICA=	U3Bhcg==	78
-TmV0RWEgIA==	TmV0RWEgICA=	U3Bhcmsg	78
-TmV0RWFzIA==	TmV0RWFzICA=	U3Bhcms=	78
-TmV0RWFzZQ==	TmV0RWFzZSA=	U3Bhcmst	78
+org.apache.spark.sql.AnalysisException

Review Comment:
   Ditto. This seems to be an incorrect regeneration caused by the improper implementation.





Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-2008724931

   Gentle ping @ted-jenks, any updates for the test failures?




Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "ted-jenks (via GitHub)" <gi...@apache.org>.
ted-jenks commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1988639946

   I am having trouble getting the failing test to pass:
   ```13:27:04.051 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite```
   Giving:
   ```
   java.sql.SQLException
   [info]   org.apache.hive.service.cli.HiveSQLException: Error running query: [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] org.apache.spark.sql.AnalysisException: [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `base64` requires 0 parameters but the actual number is 1. Please, refer to 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix. SQLSTATE: 42605; line 1 pos 7
   [info]   	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
   [info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation...
   ```
   Any idea why I would get this?




Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "ted-jenks (via GitHub)" <gi...@apache.org>.
ted-jenks commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1517931178


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2426,21 +2426,34 @@ case class Chr(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Base64(child: Expression)
+case class Base64(child: Expression, chunkBase64: Boolean = SQLConf.get.base64Chunking)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  lazy val encoder: JBase64.Encoder = if (chunkBase64) {
+    JBase64.getMimeEncoder
+  } else {
+    JBase64.getMimeEncoder(-1, Array())
+  }
+
   override def dataType: DataType = StringType
   override def inputTypes: Seq[DataType] = Seq(BinaryType)
 
   protected override def nullSafeEval(bytes: Any): Any = {
-    UTF8String.fromBytes(JBase64.getMimeEncoder.encode(bytes.asInstanceOf[Array[Byte]]))
+    UTF8String.fromBytes(encoder.encode(bytes.asInstanceOf[Array[Byte]]))
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     nullSafeCodeGen(ctx, ev, (child) => {
-      s"""${ev.value} = UTF8String.fromBytes(
-            ${classOf[JBase64].getName}.getMimeEncoder().encode($child));
-       """})
+      s"""
+        if ($chunkBase64) {

Review Comment:
   nice





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1517826436


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2426,21 +2426,34 @@ case class Chr(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Base64(child: Expression)
+case class Base64(child: Expression, chunkBase64: Boolean = SQLConf.get.base64Chunking)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  lazy val encoder: JBase64.Encoder = if (chunkBase64) {
+    JBase64.getMimeEncoder
+  } else {
+    JBase64.getMimeEncoder(-1, Array())
+  }
+
   override def dataType: DataType = StringType
   override def inputTypes: Seq[DataType] = Seq(BinaryType)
 
   protected override def nullSafeEval(bytes: Any): Any = {
-    UTF8String.fromBytes(JBase64.getMimeEncoder.encode(bytes.asInstanceOf[Array[Byte]]))
+    UTF8String.fromBytes(encoder.encode(bytes.asInstanceOf[Array[Byte]]))
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     nullSafeCodeGen(ctx, ev, (child) => {
-      s"""${ev.value} = UTF8String.fromBytes(
-            ${classOf[JBase64].getName}.getMimeEncoder().encode($child));
-       """})
+      s"""
+        if ($chunkBase64) {

Review Comment:
   We know the value of `chunkBase64` before generating the java code, so we can do better
   ```
   if (chunkBase64) {
     s""" ...
   } else {
     s""" ...
   }
   ```
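   A hypothetical Java analogue of this suggestion (the helper name is illustrative, not Spark's actual codegen API): because the flag is a constant at code-generation time, the generated-code fragment can be chosen up front instead of emitting a runtime branch into every generated class.

```java
public class CodegenSketch {
    // Hypothetical helper: select the code string at codegen time,
    // so the generated class contains no dead branch.
    static String encoderCode(boolean chunkBase64) {
        return chunkBase64
            ? "java.util.Base64.getMimeEncoder().encode(input)"
            : "java.util.Base64.getMimeEncoder(-1, new byte[0]).encode(input)";
    }

    public static void main(String[] args) {
        System.out.println(encoderCode(true));
        System.out.println(encoderCode(false));
    }
}
```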





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1517664214


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2426,21 +2426,27 @@ case class Chr(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Base64(child: Expression)
+case class Base64(child: Expression, chunkBase64: Boolean)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  def this(child: Expression) = this(child, SQLConf.get.base64Chunking)
+
+  lazy val encoder: JBase64.Encoder = if (chunkBase64) {
+    JBase64.getMimeEncoder
+  } else {
+    JBase64.getMimeEncoder(-1, Array())
+  }
   override def dataType: DataType = StringType
   override def inputTypes: Seq[DataType] = Seq(BinaryType)
 
   protected override def nullSafeEval(bytes: Any): Any = {
-    UTF8String.fromBytes(JBase64.getMimeEncoder.encode(bytes.asInstanceOf[Array[Byte]]))
+    UTF8String.fromBytes(encoder.encode(bytes.asInstanceOf[Array[Byte]]))
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     nullSafeCodeGen(ctx, ev, (child) => {
-      s"""${ev.value} = UTF8String.fromBytes(
-            ${classOf[JBase64].getName}.getMimeEncoder().encode($child));
-       """})
+      s"${ev.value} = UTF8String.fromBytes($encoder}.encode($child));"

Review Comment:
   Does this really work? `$encoder` is `JBase64.Encoder#toString`, not the java code that returns `JBase64.Encoder`
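   The pitfall is easy to demonstrate: interpolating an `Encoder` object into a string yields its default `toString`, which is not a compilable Java expression (a minimal sketch):

```java
import java.util.Base64;

public class ToStringPitfall {
    public static void main(String[] args) {
        Base64.Encoder encoder = Base64.getMimeEncoder();
        // Interpolation would embed something like
        // "java.util.Base64$Encoder@1b6d3586" into the generated source,
        // which does not compile as Java.
        String interpolated = String.valueOf(encoder);
        System.out.println(interpolated);
        System.out.println(interpolated.contains("Base64$Encoder"));  // true
    }
}
```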





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1518433859


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3318,6 +3318,13 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")
+    .doc("When true, base64 strings generated by the base64 function are chunked into lines of " +
+      "at most 76 characters. When false, the base64 strings are not chunked.")
+    .version("3.5.2")

Review Comment:
   This should be 4.0.0.





Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1988293451

   Do we need to revise unbase64 accordingly?




Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "ted-jenks (via GitHub)" <gi...@apache.org>.
ted-jenks commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1988371956

   > Do we need to revise unbase64 accordingly?
   
   Unbase64 uses the MIME decoder, which tolerates both chunked and unchunked data.
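   That tolerance can be checked directly with the JDK MIME decoder (a minimal illustrative sketch):

```java
import java.util.Arrays;
import java.util.Base64;

public class MimeDecoderTolerance {
    public static void main(String[] args) {
        byte[] data = new byte[80];
        String chunked = Base64.getMimeEncoder().encodeToString(data);
        String unchunked = Base64.getEncoder().encodeToString(data);

        // The MIME decoder skips line separators, so it round-trips both forms.
        byte[] fromChunked = Base64.getMimeDecoder().decode(chunked);
        byte[] fromUnchunked = Base64.getMimeDecoder().decode(unchunked);
        System.out.println(Arrays.equals(fromChunked, data));    // true
        System.out.println(Arrays.equals(fromUnchunked, data));  // true
    }
}
```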
   




Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1989116494

   Could you do the following to re-generate the golden files, @ted-jenks ?
   ```
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
   ```




Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1518435444


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3318,6 +3318,13 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")
+    .doc("When true, base64 strings generated by the base64 function are chunked into lines of " +
+      "at most 76 characters. When false, the base64 strings are not chunked.")
+    .version("3.5.2")
+    .booleanConf
+    .createWithDefault(true)

Review Comment:
   Since we decided to keep the existing behavior as the default, could you revise the PR title and description, @ted-jenks?
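For reference, the behavioral difference this config toggles can be shown with the JDK's two encoders (a standalone sketch; the class name `ChunkingDemo` is illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ChunkingDemo {
    public static void main(String[] args) {
        // 80 bytes encode to 108 base64 characters, which exceeds one 76-char line.
        byte[] payload = "0123456789".repeat(8).getBytes(StandardCharsets.UTF_8);

        String chunked = Base64.getMimeEncoder().encodeToString(payload);  // RFC 2045
        String unchunked = Base64.getEncoder().encodeToString(payload);    // RFC 4648

        // RFC 2045 splits the output into lines of at most 76 characters joined by CRLF.
        for (String line : chunked.split("\r\n")) {
            if (line.length() > 76) throw new AssertionError("line too long");
        }
        // RFC 4648 emits one unbroken line; stripping the CRLFs from the
        // chunked form recovers exactly the same encoding.
        if (unchunked.contains("\r\n")
                || !chunked.replace("\r\n", "").equals(unchunked)) {
            throw new AssertionError("encodings differ beyond line breaks");
        }
        System.out.println("chunked lines: " + chunked.split("\r\n").length); // prints 2
    }
}
```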





Re: [PR] [SPARK-47307][SQL] Add a config to optionally chunk base64 strings [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1522057496


##########
sql/core/src/test/resources/sql-tests/analyzer-results/charvarchar.sql.out:
##########
@@ -458,10 +458,24 @@ Project [ascii(c7#x) AS ascii(c7)#x, ascii(c8#x) AS ascii(c8)#x, ascii(v#x) AS a
 -- !query
 select base64(c7), base64(c8), base64(v), ascii(s) from char_tbl4
 -- !query analysis
-Project [base64(cast(c7#x as binary)) AS base64(c7)#x, base64(cast(c8#x as binary)) AS base64(c8)#x, base64(cast(v#x as binary)) AS base64(v)#x, ascii(s#x) AS ascii(s)#x]
-+- SubqueryAlias spark_catalog.default.char_tbl4
-   +- Project [staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, c7#x, 7, true, false, true) AS c7#x, staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, c8#x, 8, true, false, true) AS c8#x, v#x, s#x]
-      +- Relation spark_catalog.default.char_tbl4[c7#x,c8#x,v#x,s#x] parquet
+org.apache.spark.sql.AnalysisException

Review Comment:
   Hmm, this looks like an incorrect regeneration caused by an improper implementation.





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1984930529

   Since the Spark community didn't receive any issue reports during the v3.3.0 - v3.5.1 releases, I think this is a corner case. Maybe we can make the config internal.
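Making the config internal would be a one-line change to the builder chain from the diff earlier in this thread; a sketch assuming Spark's `ConfigBuilder.internal()` API:

```scala
val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")
  .internal()
  .doc("When true, base64 strings generated by the base64 function are chunked into lines of " +
    "at most 76 characters. When false, the base64 strings are not chunked.")
  .version("3.5.2")
  .booleanConf
  .createWithDefault(true)
```

Internal configs are hidden from the generated documentation but remain settable, which matches the "corner case" escape hatch being proposed.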

