You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "advancedxy (via GitHub)" <gi...@apache.org> on 2023/05/18 02:41:20 UTC

[GitHub] [spark] advancedxy commented on a diff in pull request #41192: [SPARK-43530][PROTOBUF] Read descriptor file only once

advancedxy commented on code in PR #41192:
URL: https://github.com/apache/spark/pull/41192#discussion_r1197287073


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala:
##########
@@ -26,14 +26,14 @@ import org.apache.spark.sql.types.{BinaryType, DataType}
 private[protobuf] case class CatalystDataToProtobuf(
     child: Expression,
     messageName: String,
-    descFilePath: Option[String] = None,
+    fileDescriptorSetBytesOpt: Option[Array[Byte]] = None,

Review Comment:
   > most of the time size would be in kilobytes
   
   Yes. Most of the time, we don't need to bother the size of protobuf messages. But there are certain cases, huge message and huge protobuf repo is used, such as ML featuring engineering: they could define thousands, or even tens of thousands fields in one message.
   
   > Can we do that as an implementation detail (so that users don't need be aware of it).
   
   Of course, we should do that as an implementation details, users should be no aware how the bytes are distributed.  And we can be adaptive here, only the size exceeds a threshold, say 512K, we should broadcast instead of serializing.
   
   I just checked the related spark connect impl, seems functions are just placeholders on the client side, it will work fine with spark connect.
   
   > I will add a comment. Do you think we can do that an enhancement later?
   
   A comment is sufficient for now. Since we can improve it via broadcast, this enhancement could be addressed in later prs.
   
   > We also need to make sure it is cleaned up.
   
   P.S. if we are using it correctly, it should be cleaned up automatically. A lot of places uses broadcast.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org