Posted to reviews@spark.apache.org by "justaparth (via GitHub)" <gi...@apache.org> on 2023/05/06 21:50:06 UTC

[GitHub] [spark] justaparth opened a new pull request, #41075: spark protobuf: allow serde with enum as ints

justaparth opened a new pull request, #41075:
URL: https://github.com/apache/spark/pull/41075

   Adds a new option, `enums.as.ints`, which allows deserializing enum fields as their integer values. Symmetrically, the serialization logic is extended so that structs using integer values for enum fields can be written back via `to_protobuf`.
   
   
   ### What changes were proposed in this pull request?
   When deserializing protobuf enum fields, the spark-protobuf library deserializes them as string values based on the enum name in the proto definition. For example, given:
   
   ```
   message Person {
     enum Job {
       NOTHING = 0;
       ENGINEER = 1;
       DOCTOR = 2;
     }
     Job job = 1;
   }
   ```
   and a message instance like
   
   ```
   Person(job = ENGINEER)
   ```
   
   Then the deserialized value will be:
   
   ```
   {"job": "ENGINEER"}
   ```
   
   However, it can be useful to deserialize the enum's integer value rather than its name, namely:
   
   ```
   {"job": 1}
   ```
   
   This commit adds that functionality to spark-protobuf via the option `enums.as.ints`. In addition, it allows serializing such fields back into protobuf via `to_protobuf`.
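The behavior can be illustrated with a self-contained sketch (plain Python, not Spark code; `JOB` and `deserialize_job` are hypothetical stand-ins mirroring the `Person.Job` enum above):

```python
# Enum numbers from the Person.Job definition above.
JOB = {"NOTHING": 0, "ENGINEER": 1, "DOCTOR": 2}

def deserialize_job(name, enums_as_ints=False):
    """Return the value a deserialized row would contain for the `job` field."""
    if enums_as_ints:
        return JOB[name]  # e.g. {"job": 1}
    return name           # default: {"job": "ENGINEER"}
```

With `enums_as_ints=True` the resulting struct column is an int rather than a string.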
   
   
   Examples in other libraries:
   
   - protobuf-java-util JsonFormat: https://javadoc.io/doc/com.google.protobuf/protobuf-java-util/3.10.0/com/google/protobuf/util/JsonFormat.Printer.html#printingEnumsAsInts--
   - golang/protobuf jsonpb marshaler https://pkg.go.dev/github.com/golang/protobuf/jsonpb#Marshaler
   
   
   ### Why are the changes needed?
   Adds new functionality that exists in other libraries.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, adds a new option to from_protobuf.
   
   ### How was this patch tested?
   Added unit tests confirming this behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41075:
URL: https://github.com/apache/spark/pull/41075#issuecomment-1539511840

   @rangadi is it good to go?




[GitHub] [spark] HyukjinKwon closed pull request #41075: [SPARK-43361][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #41075: [SPARK-43361][PROTOBUF] spark-protobuf: allow serde with enum as ints
URL: https://github.com/apache/spark/pull/41075




[GitHub] [spark] amaliujia commented on pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "amaliujia (via GitHub)" <gi...@apache.org>.
amaliujia commented on PR #41075:
URL: https://github.com/apache/spark/pull/41075#issuecomment-1541197793

   Just wondering why this is relevant to Spark Connect (seeing `[CONNECT]` in the PR title)?




[GitHub] [spark] rangadi commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186785845


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -260,6 +260,11 @@ private[sql] class ProtobufDeserializer(
       case (ENUM, StringType) =>
         (updater, ordinal, value) => updater.set(ordinal, UTF8String.fromString(value.toString))
 
+      case (ENUM, IntegerType) =>
+        (updater, ordinal, value) => {
+          updater.set(ordinal, protoType.getEnumType.findValueByName(value.toString).getNumber)

Review Comment:
   `value` itself would be of type `ProtocolMessageEnum`, I think. Can do `value.asInstanceOf[ProtocolMessageEnum].getNumber()`. No need to convert to a string and then search.
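The suggested change, direct number access instead of a name lookup, can be illustrated with Python's `IntEnum` (illustrative only; the actual patch works with protobuf `EnumValueDescriptor` objects):

```python
from enum import IntEnum

class Job(IntEnum):
    NOTHING = 0
    ENGINEER = 1
    DOCTOR = 2

value = Job.ENGINEER

# Direct number access, as suggested in the review:
direct = value.value

# Name round-trip, as in the original patch: stringify the value, then search by name.
roundtrip = Job[value.name].value

assert direct == roundtrip == 1
```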



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufSerializer.scala:
##########
@@ -110,6 +110,19 @@ private[sql] class ProtobufSerializer(
               enumSymbols.mkString("\"", "\", \"", "\""))
           }
           fieldDescriptor.getEnumType.findValueByName(data)
+      case (IntegerType, ENUM) =>
+        val enumValues: Set[Int] =
+          fieldDescriptor.getEnumType.getValues.asScala.map(e => e.getNumber).toSet
+        (getter, ordinal) =>
+          val data = getter.getInt(ordinal)
+          if (!enumValues.contains(data)) {
+            throw QueryCompilationErrors.cannotConvertCatalystTypeToProtobufEnumTypeError(

Review Comment:
   This is not a type conversion error. Also, it occurs at runtime, not at compile time. Check the other errors to see if there is a better-suited one.
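The check in the `(IntegerType, ENUM)` hunk above reduces to set membership against the enum's declared numbers before writing; a minimal sketch, in plain Python rather than the Scala of the patch, with hypothetical names:

```python
# Mirrors fieldDescriptor.getEnumType.getValues mapped to their numbers.
ENUM_NUMBERS = {0, 1, 2}

def to_enum_number(value: int) -> int:
    """Validate an int column value against the declared enum numbers."""
    if value not in ENUM_NUMBERS:
        raise ValueError(
            f"cannot convert {value}: not one of {sorted(ENUM_NUMBERS)}")
    return value
```

Building the set once up front (as the patch does) keeps the per-row check O(1).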





[GitHub] [spark] rangadi commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1188968513


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -361,15 +361,23 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
 
     val df = Seq(dynamicMessage.toByteArray).toDF("value")
 
+    // Test that roundtrip serde works correctly both with and without enums as ints.
     checkWithFileAndClassName("SimpleMessageEnum") {
       case (name, descFilePathOpt) =>
-        val fromProtoDF = df.select(
-          from_protobuf_wrapper($"value", name, descFilePathOpt).as("value_from"))
-        val toProtoDF = fromProtoDF.select(
-          to_protobuf_wrapper($"value_from", name, descFilePathOpt).as("value_to"))
-        val toFromProtoDF = toProtoDF.select(
-          from_protobuf_wrapper($"value_to", name, descFilePathOpt).as("value_to_from"))
-        checkAnswer(fromProtoDF.select($"value_from.*"), toFromProtoDF.select($"value_to_from.*"))
+        List(
+          Map.empty[String, String],
+          Map("enums.as.ints" -> "false"),
+          Map("enums.as.ints" -> "true"))
+          .foreach(opts => {
+            val fromProtoDF = df.select(
+              from_protobuf_wrapper($"value", name, descFilePathOpt, opts).as("value_from"))
+            val toProtoDF = fromProtoDF.select(
+              to_protobuf_wrapper($"value_from", name, descFilePathOpt).as("value_to"))

Review Comment:
   Agree. Let's go with the current version (int will be used if the input schema is int).
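The roundtrip property the test exercises can be sketched abstractly; `from_proto`/`to_proto` below are hypothetical pure-Python stand-ins for `from_protobuf`/`to_protobuf`, keyed to the `Job` enum from the PR description:

```python
JOB_BY_NAME = {"NOTHING": 0, "ENGINEER": 1, "DOCTOR": 2}
JOB_BY_NUMBER = {number: name for name, number in JOB_BY_NAME.items()}

def from_proto(number: int, enums_as_ints: bool):
    """Deserialize an enum field: int with the option set, name otherwise."""
    return number if enums_as_ints else JOB_BY_NUMBER[number]

def to_proto(value) -> int:
    """Serialize back: int columns pass through, strings are looked up by name."""
    return value if isinstance(value, int) else JOB_BY_NAME[value]

# Roundtrip holds with and without the option, per the agreed behavior.
for enums_as_ints in (False, True):
    wire = 1  # serialized enum number for ENGINEER
    assert to_proto(from_proto(wire, enums_as_ints)) == wire
```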





[GitHub] [spark] justaparth commented on pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #41075:
URL: https://github.com/apache/spark/pull/41075#issuecomment-1537245401

   cc @rangadi @HyukjinKwon 




[GitHub] [spark] rangadi commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1188970475


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -141,6 +141,33 @@ private[sql] class ProtobufOptions(
   //      what information is available in a serialized proto.
   val emitDefaultValues: Boolean =
     parameters.getOrElse("emit.default.values", false.toString).toBoolean
+
+  // Whether to render enum fields as their integer values.
+  //
+  // As an example, consider the following message type:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   enum Job {
+  //     NONE = 0;
+  //     ENGINEER = 1;
+  //     DOCTOR = 2;
+  //   }
+  //   Job job = 1;
+  // }
+  // ```
+  //
+  // If we have an instance of this message like `Person(job = ENGINEER)`, then the
+  // default deserialization will be:
+  // `{"job": "ENGINEER"}`
+  //
+  // But with this option set the deserialization will be:
+  // `{"job": 1}`
+  //
+  // Please note the output struct type will now contain an int column
+  // instead of string, so use caution if changing existing parsing logic.
+  val enumsAsInts: Boolean =
+    parameters.getOrElse("enums.as.ints", false.toString).toBoolean

Review Comment:
   Good point @HyukjinKwon . 
   Thanks @justaparth for the ticket. Would you like to send a PR for the doc update :)? Essentially we would copy the Scala documentation to markdown.
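The option-parsing idiom in the hunk above (`getOrElse` followed by `toBoolean`) can be approximated as follows; this is a plain-Python stand-in, not spark-protobuf code (Scala's `toBoolean` accepts "true"/"false" case-insensitively and throws otherwise):

```python
def get_bool_option(parameters: dict, key: str, default: bool = False) -> bool:
    # Approximates parameters.getOrElse(key, default.toString).toBoolean
    raw = parameters.get(key, str(default)).strip().lower()
    if raw not in ("true", "false"):
        raise ValueError(f"invalid boolean for option {key!r}: {raw!r}")
    return raw == "true"
```

So `enums.as.ints` defaults to false when absent and is parsed leniently with respect to case.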





[GitHub] [spark] justaparth commented on a diff in pull request #41075: [SPARK-43361][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1195341377


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -2812,4 +2813,18 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
         "location" -> toSQLValue(location.toString, StringType),
         "identifier" -> toSQLId(tableId.nameParts)))
   }
+
+  def cannotConvertCatalystValueToProtobufEnumTypeError(
+      sqlColumn: Seq[String],
+      protobufColumn: String,
+      data: String,
+      enumString: String): Throwable = {
+    new AnalysisException(
+      errorClass = "CANNOT_CONVERT_SQL_VALUE_TO_PROTOBUF_ENUM_TYPE",

Review Comment:
   ah thanks for letting me know. i've added here: https://github.com/apache/spark/pull/41188





[GitHub] [spark] rangadi commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186853012


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -361,15 +361,23 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
 
     val df = Seq(dynamicMessage.toByteArray).toDF("value")
 
+    // Test that roundtrip serde works correctly both with and without enums as ints.
     checkWithFileAndClassName("SimpleMessageEnum") {
       case (name, descFilePathOpt) =>
-        val fromProtoDF = df.select(
-          from_protobuf_wrapper($"value", name, descFilePathOpt).as("value_from"))
-        val toProtoDF = fromProtoDF.select(
-          to_protobuf_wrapper($"value_from", name, descFilePathOpt).as("value_to"))
-        val toFromProtoDF = toProtoDF.select(
-          from_protobuf_wrapper($"value_to", name, descFilePathOpt).as("value_to_from"))
-        checkAnswer(fromProtoDF.select($"value_from.*"), toFromProtoDF.select($"value_to_from.*"))
+        List(
+          Map.empty[String, String],
+          Map("enums.as.ints" -> "false"),
+          Map("enums.as.ints" -> "true"))
+          .foreach(opts => {
+            val fromProtoDF = df.select(
+              from_protobuf_wrapper($"value", name, descFilePathOpt, opts).as("value_from"))
+            val toProtoDF = fromProtoDF.select(
+              to_protobuf_wrapper($"value_from", name, descFilePathOpt).as("value_to"))

Review Comment:
   For `to_protobuf()` should we enforce `enum.as.ints` option is set while converting int to Enum?





[GitHub] [spark] justaparth commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186842201


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -141,6 +141,30 @@ private[sql] class ProtobufOptions(
   //      what information is available in a serialized proto.
   val emitDefaultValues: Boolean =
     parameters.getOrElse("emit.default.values", false.toString).toBoolean
+
+  // Whether to render enum fields as their integer values.
+  //
+  // As an example, consider the following message type:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   enum Job {
+  //     NONE = 0;
+  //     ENGINEER = 1;
+  //     DOCTOR = 2;
+  //   }
+  //   Job job = 1;
+  // }
+  // ```
+  //
+  // If we have an instance of this message like `Person(job = ENGINEER)`, then the
+  // default deserialization will be:
+  // `{"job": "ENGINEER"}`
+  //
+  // But with this option set the deserialization will be:
+  // `{"job": 1}`
+  val enumsAsInts: Boolean =

Review Comment:
   ah good catch! added + modified the test to assert it as well





[GitHub] [spark] justaparth commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186842041


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -260,6 +260,11 @@ private[sql] class ProtobufDeserializer(
       case (ENUM, StringType) =>
         (updater, ordinal, value) => updater.set(ordinal, UTF8String.fromString(value.toString))
 
+      case (ENUM, IntegerType) =>
+        (updater, ordinal, value) => {
+          updater.set(ordinal, protoType.getEnumType.findValueByName(value.toString).getNumber)

Review Comment:
   ah thanks! i found it was of type `EnumValueDescriptor`, so i've updated this code + the code above! 





[GitHub] [spark] rangadi commented on a diff in pull request #41075: [SPARK-43361][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1194138082


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -2812,4 +2813,18 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
         "location" -> toSQLValue(location.toString, StringType),
         "identifier" -> toSQLId(tableId.nameParts)))
   }
+
+  def cannotConvertCatalystValueToProtobufEnumTypeError(
+      sqlColumn: Seq[String],
+      protobufColumn: String,
+      data: String,
+      enumString: String): Throwable = {
+    new AnalysisException(
+      errorClass = "CANNOT_CONVERT_SQL_VALUE_TO_PROTOBUF_ENUM_TYPE",

Review Comment:
   @justaparth we forgot to update core/src/main/resources/error/error-classes.json and docs/sql-error-conditions.md in this PR.
   Don't know how this got merged without failing.
   Could you send a follow-up PR updating those files? Also, it would be better if we verify this error with `checkError()` in unit tests (see similar examples).





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186958551


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -141,6 +141,33 @@ private[sql] class ProtobufOptions(
   //      what information is available in a serialized proto.
   val emitDefaultValues: Boolean =
     parameters.getOrElse("emit.default.values", false.toString).toBoolean
+
+  // Whether to render enum fields as their integer values.
+  //
+  // As an example, consider the following message type:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   enum Job {
+  //     NONE = 0;
+  //     ENGINEER = 1;
+  //     DOCTOR = 2;
+  //   }
+  //   Job job = 1;
+  // }
+  // ```
+  //
+  // If we have an instance of this message like `Person(job = ENGINEER)`, then the
+  // default deserialization will be:
+  // `{"job": "ENGINEER"}`
+  //
+  // But with this option set the deserialization will be:
+  // `{"job": 1}`
+  //
+  // Please note the output struct type will now contain an int column
+  // instead of string, so use caution if changing existing parsing logic.
+  val enumsAsInts: Boolean =
+    parameters.getOrElse("enums.as.ints", false.toString).toBoolean

Review Comment:
   Actually those options should better be documented at https://github.com/apache/spark/blob/master/docs/sql-data-sources-protobuf.md#deploying.
   
   Since they are already undocumented there, I am fine without doing that in this PR though.





[GitHub] [spark] HyukjinKwon commented on pull request #41075: [SPARK-43361][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41075:
URL: https://github.com/apache/spark/pull/41075#issuecomment-1541267600

   Merged to master.




[GitHub] [spark] justaparth commented on pull request #41075: [SPARK-43361][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #41075:
URL: https://github.com/apache/spark/pull/41075#issuecomment-1541425683

   > Just wondering why this is relevant to Spark Connect (seeing `[CONNECT]` in the PR title)?
   
   ah sorry about that. i think i conflated this being under the `connector/` directory with the `CONNECT` tag i saw in some other PR titles. in the future i'll add only `PROTOBUF` 👍 




[GitHub] [spark] justaparth commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1187413841


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -141,6 +141,33 @@ private[sql] class ProtobufOptions(
   //      what information is available in a serialized proto.
   val emitDefaultValues: Boolean =
     parameters.getOrElse("emit.default.values", false.toString).toBoolean
+
+  // Whether to render enum fields as their integer values.
+  //
+  // As an example, consider the following message type:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   enum Job {
+  //     NONE = 0;
+  //     ENGINEER = 1;
+  //     DOCTOR = 2;
+  //   }
+  //   Job job = 1;
+  // }
+  // ```
+  //
+  // If we have an instance of this message like `Person(job = ENGINEER)`, then the
+  // default deserialization will be:
+  // `{"job": "ENGINEER"}`
+  //
+  // But with this option set the deserialization will be:
+  // `{"job": 1}`
+  //
+  // Please note the output struct type will now contain an int column
+  // instead of string, so use caution if changing existing parsing logic.
+  val enumsAsInts: Boolean =
+    parameters.getOrElse("enums.as.ints", false.toString).toBoolean

Review Comment:
   @HyukjinKwon went ahead and made a ticket for this https://issues.apache.org/jira/browse/SPARK-43409 !





[GitHub] [spark] justaparth commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186842132


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufSerializer.scala:
##########
@@ -110,6 +110,19 @@ private[sql] class ProtobufSerializer(
               enumSymbols.mkString("\"", "\", \"", "\""))
           }
           fieldDescriptor.getEnumType.findValueByName(data)
+      case (IntegerType, ENUM) =>
+        val enumValues: Set[Int] =
+          fieldDescriptor.getEnumType.getValues.asScala.map(e => e.getNumber).toSet
+        (getter, ordinal) =>
+          val data = getter.getInt(ordinal)
+          if (!enumValues.contains(data)) {
+            throw QueryCompilationErrors.cannotConvertCatalystTypeToProtobufEnumTypeError(

Review Comment:
   yeah i agree, i copy/pasted the error from the existing logic above.
   
   i've moved the error from `QueryCompilationErrors` to `QueryExecutionErrors` and slightly renamed it, wdyt? happy to change more or use a different error if that makes more sense though







[GitHub] [spark] justaparth commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186875033


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -361,15 +361,23 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
 
     val df = Seq(dynamicMessage.toByteArray).toDF("value")
 
+    // Test that roundtrip serde works correctly both with and without enums as ints.
     checkWithFileAndClassName("SimpleMessageEnum") {
       case (name, descFilePathOpt) =>
-        val fromProtoDF = df.select(
-          from_protobuf_wrapper($"value", name, descFilePathOpt).as("value_from"))
-        val toProtoDF = fromProtoDF.select(
-          to_protobuf_wrapper($"value_from", name, descFilePathOpt).as("value_to"))
-        val toFromProtoDF = toProtoDF.select(
-          from_protobuf_wrapper($"value_to", name, descFilePathOpt).as("value_to_from"))
-        checkAnswer(fromProtoDF.select($"value_from.*"), toFromProtoDF.select($"value_to_from.*"))
+        List(
+          Map.empty[String, String],
+          Map("enums.as.ints" -> "false"),
+          Map("enums.as.ints" -> "true"))
+          .foreach(opts => {
+            val fromProtoDF = df.select(
+              from_protobuf_wrapper($"value", name, descFilePathOpt, opts).as("value_from"))
+            val toProtoDF = fromProtoDF.select(
+              to_protobuf_wrapper($"value_from", name, descFilePathOpt).as("value_to"))

Review Comment:
   yeah, i'm not sure either and also don't have a very strong opinion.
   
   It felt safe to allow it by default because it's more permissive, and it was also hard for me to imagine how it could go wrong (if you have int values in your struct and they're valid in the enum when you serialize, then _probably_ that's what you were intending to do).
   
   And this isn't _necessarily_ a good argument for what we should do, but it appears that `JsonFormat`'s parsing side, for example, does this by default as well: https://github.com/protocolbuffers/protobuf/blob/main/java/util/src/main/java/com/google/protobuf/util/JsonFormat.java#L1920C33-L1946
   
   So I'm inclined to leave it available by default, but yeah I don't really have a strong opinion either way!
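   The serializer-side behaviour being discussed here — accept an int column, but only if the value is a defined enum number — can be sketched roughly as follows (illustrative Python, not the actual Spark code; the set of valid numbers is taken from the Person.Job example, and in the real code it comes from `fieldDescriptor.getEnumType.getValues`):
   
   ```python
   # Sketch of the IntegerType -> ENUM check in ProtobufSerializer:
   # an int is accepted only if it is a defined enum number.
   valid_numbers = {0, 1, 2}  # e.g. NOTHING, ENGINEER, DOCTOR
   
   def check_enum_number(data: int) -> int:
       if data not in valid_numbers:
           raise ValueError(
               f"cannot convert {data} to protobuf enum: "
               f"valid values are {sorted(valid_numbers)}")
       return data
   
   print(check_enum_number(1))  # prints 1
   try:
       check_enum_number(7)     # not a defined enum number
   except ValueError as e:
       print(e)
   ```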





[GitHub] [spark] justaparth commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186980765


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -141,6 +141,33 @@ private[sql] class ProtobufOptions(
   //      what information is available in a serialized proto.
   val emitDefaultValues: Boolean =
     parameters.getOrElse("emit.default.values", false.toString).toBoolean
+
+  // Whether to render enum fields as their integer values.
+  //
+  // As an example, consider the following message type:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   enum Job {
+  //     NONE = 0;
+  //     ENGINEER = 1;
+  //     DOCTOR = 2;
+  //   }
+  //   Job job = 1;
+  // }
+  // ```
+  //
+  // If we have an instance of this message like `Person(job = ENGINEER)`, then the
+  // default deserialization will be:
+  // `{"job": "ENGINEER"}`
+  //
+  // But with this option set the deserialization will be:
+  // `{"job": 1}`
+  //
+  // Please note the output struct type will now contain an int column
+  // instead of string, so use caution if changing existing parsing logic.
+  val enumsAsInts: Boolean =
+    parameters.getOrElse("enums.as.ints", false.toString).toBoolean

Review Comment:
   yeah, that's a great point...
   
   > Since they are already undocumented there, I am fine without doing that in this PR though.
   
   yeah, maybe it makes sense in a new PR (which i'm happy to do) to add documentation related to all of the options available? since like you said, none of them seem to be documented





[GitHub] [spark] rangadi commented on pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #41075:
URL: https://github.com/apache/spark/pull/41075#issuecomment-1540649572

   @HyukjinKwon Yep, this looks good to go.




[GitHub] [spark] rangadi commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186786447


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -141,6 +141,30 @@ private[sql] class ProtobufOptions(
   //      what information is available in a serialized proto.
   val emitDefaultValues: Boolean =
     parameters.getOrElse("emit.default.values", false.toString).toBoolean
+
+  // Whether to render enum fields as their integer values.
+  //
+  // As an example, consider the following message type:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   enum Job {
+  //     NONE = 0;
+  //     ENGINEER = 1;
+  //     DOCTOR = 2;
+  //   }
+  //   Job job = 1;
+  // }
+  // ```
+  //
+  // If we have an instance of this message like `Person(job = ENGINEER)`, then the
+  // default deserialization will be:
+  // `{"job": "ENGINEER"}`
+  //
+  // But with this option set the deserialization will be:

Review Comment:
   Should we add a warning for users to be careful while changing it? It will change the schema and might be incompatible with downstream consumers. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -141,6 +141,30 @@ private[sql] class ProtobufOptions(
   //      what information is available in a serialized proto.
   val emitDefaultValues: Boolean =
     parameters.getOrElse("emit.default.values", false.toString).toBoolean
+
+  // Whether to render enum fields as their integer values.
+  //
+  // As an example, consider the following message type:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   enum Job {
+  //     NONE = 0;
+  //     ENGINEER = 1;
+  //     DOCTOR = 2;
+  //   }
+  //   Job job = 1;
+  // }
+  // ```
+  //
+  // If we have an instance of this message like `Person(job = ENGINEER)`, then the
+  // default deserialization will be:
+  // `{"job": "ENGINEER"}`
+  //
+  // But with this option set the deserialization will be:
+  // `{"job": 1}`
+  val enumsAsInts: Boolean =

Review Comment:
   Better to set this option for `jsonPrinter` in `ProtobufDeserializer` so that `Any` values deserialized as JSON strings will have ints too. 






[GitHub] [spark] justaparth commented on a diff in pull request #41075: [SPARK-43361][CONNECT][PROTOBUF] spark-protobuf: allow serde with enum as ints

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41075:
URL: https://github.com/apache/spark/pull/41075#discussion_r1186842132


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufSerializer.scala:
##########
@@ -110,6 +110,19 @@ private[sql] class ProtobufSerializer(
               enumSymbols.mkString("\"", "\", \"", "\""))
           }
           fieldDescriptor.getEnumType.findValueByName(data)
+      case (IntegerType, ENUM) =>
+        val enumValues: Set[Int] =
+          fieldDescriptor.getEnumType.getValues.asScala.map(e => e.getNumber).toSet
+        (getter, ordinal) =>
+          val data = getter.getInt(ordinal)
+          if (!enumValues.contains(data)) {
+            throw QueryCompilationErrors.cannotConvertCatalystTypeToProtobufEnumTypeError(

Review Comment:
   yeah i agree, i copy/pasted the error from the existing logic above.
   
   i've moved the error from `QueryCompilationErrors` to `QueryExecutionErrors` and slightly renamed it, what do you think? happy to change more or use a different error if that makes more sense though




