You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "rangadi (via GitHub)" <gi...@apache.org> on 2023/04/17 20:51:17 UTC

[GitHub] [spark] rangadi commented on a diff in pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions

rangadi commented on code in PR #39550:
URL: https://github.com/apache/spark/pull/39550#discussion_r1169254778


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/functions.scala:
##########
@@ -72,51 +69,129 @@ object functions {
   }
 
   /**
-   * Converts a binary column of Protobuf format into its corresponding catalyst value. The
-   * specified Protobuf class must match the data, otherwise the behavior is
-   * undefined: it may fail or return arbitrary result. The jar containing Java class should be
+   * Converts a binary column of Protobuf format into its corresponding catalyst value.
+   * `messageClassName` points to Protobuf Java class. The jar containing Java class should be
    * shaded. Specifically, `com.google.protobuf.*` should be shaded to
    * `org.sparkproject.spark-protobuf.protobuf.*`.
+   * https://github.com/rangadi/shaded-protobuf-classes is useful to create shaded jar from
+   * Protobuf files.
    *
    * @param data
    *   the binary column.
-   * @param shadedMessageClassName
-   *   The Protobuf class name. E.g. <code>org.spark.examples.protobuf.ExampleEvent</code>.
+   * @param messageClassName
+   *   The full name for Protobuf Java class. E.g. <code>com.example.protos.ExampleEvent</code>.
    *   The jar with these classes needs to be shaded as described above.
    * @since 3.4.0
    */
   @Experimental
-  def from_protobuf(data: Column, shadedMessageClassName: String): Column = {
-    new Column(ProtobufDataToCatalyst(data.expr, shadedMessageClassName))
+  def from_protobuf(data: Column, messageClassName: String): Column = {
+    new Column(ProtobufDataToCatalyst(data.expr, messageClassName))
   }
 
   /**
-   * Converts a column into binary of protobuf format.
+   * Converts a binary column of Protobuf format into its corresponding catalyst value.
+   * `messageClassName` points to Protobuf Java class. The jar containing Java class should be
+   * shaded. Specifically, `com.google.protobuf.*` should be shaded to
+   * `org.sparkproject.spark-protobuf.protobuf.*`.
+   * https://github.com/rangadi/shaded-protobuf-classes is useful to create shaded jar from
+   * Protobuf files.
+   *
+   * @param data
+   *   the binary column.
+   * @param messageClassName
+   *   The full name for Protobuf Java class. E.g. <code>com.example.protos.ExampleEvent</code>.
+   *   The jar with these classes needs to be shaded as described above.
+   * @param options
+   * @since 3.4.0
+   */
+  @Experimental

Review Comment:
   There is no "protobuf file format" (unlike Avro, which has its own format).
   One common way I have seen are "length delimited files" where the record length is written followed by the record.
   We can use "binaryFile" to handle such files. But there some disadvantages: 
     - It loads the whole file in to memory 
     - User needs to parse the length and the extract the bytes.
   
   It will be good to add a new format "delimitedFile" that handles such file and return one binary record at a time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org