Posted to reviews@spark.apache.org by "SandishKumarHN (via GitHub)" <gi...@apache.org> on 2023/04/16 16:33:21 UTC

[GitHub] [spark] SandishKumarHN commented on a diff in pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions

SandishKumarHN commented on code in PR #39550:
URL: https://github.com/apache/spark/pull/39550#discussion_r1167973045


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/functions.scala:
##########
@@ -72,51 +69,129 @@ object functions {
   }
 
   /**
-   * Converts a binary column of Protobuf format into its corresponding catalyst value. The
-   * specified Protobuf class must match the data, otherwise the behavior is
-   * undefined: it may fail or return arbitrary result. The jar containing Java class should be
+   * Converts a binary column of Protobuf format into its corresponding catalyst value.
+   * `messageClassName` points to Protobuf Java class. The jar containing Java class should be
    * shaded. Specifically, `com.google.protobuf.*` should be shaded to
    * `org.sparkproject.spark-protobuf.protobuf.*`.
+   * https://github.com/rangadi/shaded-protobuf-classes is useful to create shaded jar from
+   * Protobuf files.
    *
    * @param data
    *   the binary column.
-   * @param shadedMessageClassName
-   *   The Protobuf class name. E.g. <code>org.spark.examples.protobuf.ExampleEvent</code>.
+   * @param messageClassName
+   *   The full name for Protobuf Java class. E.g. <code>com.example.protos.ExampleEvent</code>.
    *   The jar with these classes needs to be shaded as described above.
    * @since 3.4.0
    */
   @Experimental
-  def from_protobuf(data: Column, shadedMessageClassName: String): Column = {
-    new Column(ProtobufDataToCatalyst(data.expr, shadedMessageClassName))
+  def from_protobuf(data: Column, messageClassName: String): Column = {
+    new Column(ProtobufDataToCatalyst(data.expr, messageClassName))
   }
 
   /**
-   * Converts a column into binary of protobuf format.
+   * Converts a binary column of Protobuf format into its corresponding catalyst value.
+   * `messageClassName` points to Protobuf Java class. The jar containing Java class should be
+   * shaded. Specifically, `com.google.protobuf.*` should be shaded to
+   * `org.sparkproject.spark-protobuf.protobuf.*`.
+   * https://github.com/rangadi/shaded-protobuf-classes is useful to create shaded jar from
+   * Protobuf files.
+   *
+   * @param data
+   *   the binary column.
+   * @param messageClassName
+   *   The full name for Protobuf Java class. E.g. <code>com.example.protos.ExampleEvent</code>.
+   *   The jar with these classes needs to be shaded as described above.
+   * @param options
+   * @since 3.4.0
+   */
+  @Experimental

Review Comment:
   @ericsun95 Since the Protobuf encoding format is binary, you can use Spark's `binaryFile` data source and then use the `from_protobuf` function to convert the data into a Spark DataFrame. I don't see the need for a separate Protobuf data source.
   
   `spark.read.format("binaryFile").option("pathGlobFilter", "*.proto").load("/path/to/data")`
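   
   To sketch the full flow being suggested, the `content` column produced by the `binaryFile` source could be fed into `from_protobuf`; the path, file glob, and message class below are placeholders, and an active SparkSession `spark` is assumed:
   
   ```scala
   import org.apache.spark.sql.functions.col
   import org.apache.spark.sql.protobuf.functions.from_protobuf
   
   // binaryFile yields the columns: path, modificationTime, length, content (the raw bytes).
   val raw = spark.read.format("binaryFile")
     .option("pathGlobFilter", "*.bin")   // placeholder glob for serialized messages
     .load("/path/to/data")
   
   // Decode each file's bytes with the shaded Protobuf class (placeholder class name).
   val events = raw.select(
     from_protobuf(col("content"), "com.example.protos.ExampleEvent").as("event"))
   ```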
   
   That said, a separate Protobuf data source could offer benefits in performance, type safety, and schema evolution by storing the Protobuf schema along with the Protobuf files on disk.
   
   @rangadi 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org