You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "rangadi (via GitHub)" <gi...@apache.org> on 2023/05/27 16:27:24 UTC

[GitHub] [spark] rangadi opened a new pull request, #41343: [SPARK-43799] Add descriptor binary option to Pyspark Protobuf API

rangadi opened a new pull request, #41343:
URL: https://github.com/apache/spark/pull/41343

   ### What changes were proposed in this pull request?
   This updated Protobuf Pyspark API to allow passing binary FileDescriptorSet rather than a file name. This is a Python follow up to feature implemented in Scala in #41192. 
   
   ### Why are the changes needed?
     - This allows flexibility for Pyspark users to provide binary descriptor set directly.
     - Even if users are using file path, Pyspark avoids passing file name to Scala and reads the descriptor file in Python. This avoids having to read the file in Scala. 
   
   ### Does this PR introduce _any_ user-facing change?
   
     - This adds extra arg to `from_protobuf()` and `to_protobuf()` API. 
   
   ### How was this patch tested?
     - Doc tests
     - Manual tests
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on a diff in pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #41343:
URL: https://github.com/apache/spark/pull/41343#discussion_r1208774517


##########
python/pyspark/sql/protobuf/functions.py:
##########
@@ -61,9 +65,11 @@ def from_protobuf(
         The Protobuf class name when descFilePath parameter is not set.
         E.g. `com.example.protos.ExampleEvent`.
     descFilePath : str, optional
-        The protobuf descriptor file.
+        The Protobuf descriptor file.
     options : dict, optional
         options to control how the protobuf record is parsed.
+    binaryDescriptorSet: bytes, optional
+        The Protobuf `FileDescriptorSet` serialized as binary.

Review Comment:
   Good call. Updated.



##########
python/pyspark/sql/protobuf/functions.py:
##########
@@ -232,6 +274,12 @@ def to_protobuf(
     return Column(jc)
 
 
+def _read_descriptor_set_file(filePath: str) -> bytes:
+    # TODO: Throw structured errors like "PROTOBUF_DESCRIPTOR_FILE_NOT_FOUND" etc.

Review Comment:
   Agree. Added jira ticket and updated the comment.
   Can we throw structured error now in Python? In that case, I can actually fix it in this PR itself. 
   Please point to an example. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41343:
URL: https://github.com/apache/spark/pull/41343#issuecomment-1567605823

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41343:
URL: https://github.com/apache/spark/pull/41343#discussion_r1208713243


##########
python/pyspark/sql/protobuf/functions.py:
##########
@@ -61,9 +65,11 @@ def from_protobuf(
         The Protobuf class name when descFilePath parameter is not set.
         E.g. `com.example.protos.ExampleEvent`.
     descFilePath : str, optional
-        The protobuf descriptor file.
+        The Protobuf descriptor file.
     options : dict, optional
         options to control how the protobuf record is parsed.
+    binaryDescriptorSet: bytes, optional
+        The Protobuf `FileDescriptorSet` serialized as binary.

Review Comment:
   Actually we should add `versionchanged:: 3.5.0` here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #41343:
URL: https://github.com/apache/spark/pull/41343#issuecomment-1567218021

   @LuciferYang Please merge this when you get a chance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on pull request #41343: [SPARK-43799] Add descriptor binary option to Pyspark Protobuf API

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #41343:
URL: https://github.com/apache/spark/pull/41343#issuecomment-1565581446

   cc: @LuciferYang, @pengzhon-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #41343:
URL: https://github.com/apache/spark/pull/41343#issuecomment-1567677248

   Thanks @HyukjinKwon! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #41343:
URL: https://github.com/apache/spark/pull/41343#issuecomment-1566512210

   The tests should pass in an hour or so. Please merge it you are around. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41343:
URL: https://github.com/apache/spark/pull/41343#discussion_r1208714143


##########
python/pyspark/sql/protobuf/functions.py:
##########
@@ -232,6 +274,12 @@ def to_protobuf(
     return Column(jc)
 
 
+def _read_descriptor_set_file(filePath: str) -> bytes:
+    # TODO: Throw structured errors like "PROTOBUF_DESCRIPTOR_FILE_NOT_FOUND" etc.

Review Comment:
   and should file TODO'ed JIRA such as `TODO(SPARK-XXXXX)`. It's a bit trouble some to file for every todo but we found that some of todos are just left out, forgotten, and nobody remembers what it was :-).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #41343: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API
URL: https://github.com/apache/spark/pull/41343


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org