Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/27 06:00:44 UTC

[GitHub] [spark] grundprinzip commented on a diff in pull request #38803: [SPARK-41114] [CONNECT] [PYTHON] [FOLLOW-UP] Python Client support for local data

grundprinzip commented on code in PR #38803:
URL: https://github.com/apache/spark/pull/38803#discussion_r1032871984


##########
python/pyspark/sql/connect/session.py:
##########
@@ -205,6 +207,31 @@ def __init__(self, connectionString: str, userId: Optional[str] = None):
         # Create the reader
         self.read = DataFrameReader(self)
 
+    def createDataFrame(self, data: "pd.DataFrame") -> "DataFrame":

Review Comment:
It is impossible to match the PySpark implementation exactly, because in PySpark, parallelize already performs a first serialization to pass the input DataFrame to the executors.
   
In our case, we have to serialize the data just to send it to Spark in the first place.
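
For context, a minimal sketch of that serialization step, assuming Arrow IPC as the wire format for the local data (pyarrow only; the names are illustrative, not the actual client code paths):

```python
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Convert the pandas DataFrame to an Arrow table and write it out as a
# single Arrow IPC stream; bytes like these are what would travel inside
# the proto message sent to the server.
table = pa.Table.from_pandas(pdf)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
payload = sink.getvalue().to_pybytes()
```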
   
That said, you're right that this currently does not support streaming local data from the client. But the limit is not 4 KB; it is probably whatever the maximum gRPC message size is, so in the megabytes.
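
For reference, gRPC's default receive limit is 4 MB per message; a client that needed more headroom for a single unary call would have to raise the channel limits explicitly. A rough sketch (the endpoint and the 128 MB cap are illustrative, not what the client ships with):

```python
import grpc

MAX_MESSAGE_SIZE = 128 * 1024 * 1024  # illustrative 128 MB cap

# Both send and receive limits are raised via channel options; the server
# would need a matching configuration for this to actually help.
channel = grpc.insecure_channel(
    "localhost:15002",
    options=[
        ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
        ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
    ],
)
```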
   
   
I think we need to add the client-side streaming APIs at some point, but I'd like to defer that for a bit.
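
When we get to it, the likely shape is chunking the Arrow data into bounded record batches and feeding them to a client-streaming RPC. A sketch of the chunking side only (the RPC is hypothetical and not part of the current proto):

```python
from typing import Iterator

import pyarrow as pa

def chunked_batches(
    table: pa.Table, max_chunksize: int = 10_000
) -> Iterator[pa.RecordBatch]:
    """Yield the table as bounded record batches so each message stays small."""
    yield from table.to_batches(max_chunksize=max_chunksize)

# A client-streaming stub would then consume the iterator, e.g.
# stub.UploadLocalRelation(requests) -- 'UploadLocalRelation' is a
# hypothetical RPC name used only for illustration.
```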



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

