
Posted to reviews@spark.apache.org by "samkumar (via GitHub)" <gi...@apache.org> on 2023/11/20 08:09:14 UTC

Re: [PR] [SPARK-32846][SQL][PYTHON] Support createDataFrame from an RDD of pd.DataFrames [spark]

samkumar commented on PR #29719:
URL: https://github.com/apache/spark/pull/29719#issuecomment-1818420929

   Have there been any updates on adding this kind of functionality since this pull request? Being able to take an RDD of pyarrow RecordBatches or pandas DataFrames and turn it into a Spark DataFrame would be very useful for bringing a dataset that is already distributed across workers outside of Spark into Spark for analysis.
   
   Even if an API like this hasn't been added, is there any guidance on achieving this (building a Spark DataFrame from an RDD of pyarrow RecordBatches or pandas DataFrames) in Spark 3.4/3.5? As far as I can tell, the code in this pull request no longer works on the latest versions of Spark because `toDataFrame` now accepts an iterator as its argument, not an RDD.
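
   For concreteness, here is a minimal sketch of one possible workaround using only public APIs. It is not the approach taken in this pull request. It assumes a classic (non-Connect) `SparkSession`, that every pandas DataFrame in the RDD matches the target `schema`, and that each pickled frame is small enough to travel as a single binary value. The names `unpack_batches`, `pandas_rdd_to_spark_df`, and the `payload` column are hypothetical.

```python
import pickle


def unpack_batches(batches):
    """Function for DataFrame.mapInPandas.

    Each incoming batch has one binary column, "payload" (a hypothetical
    name), whose values are pickled pandas DataFrames.
    """
    for batch in batches:
        for blob in batch["payload"]:
            # Each blob unpickles to one of the original pandas DataFrames.
            yield pickle.loads(blob)


def pandas_rdd_to_spark_df(spark, rdd, schema):
    """Turn an RDD of pandas DataFrames into a Spark DataFrame.

    Assumption: every element of `rdd` is a pandas DataFrame whose columns
    match `schema` (a StructType or DDL string).
    """
    # Wrap each pickled frame in a 1-tuple so it becomes a one-column row.
    rows = rdd.map(lambda pdf: (pickle.dumps(pdf),))
    carrier = spark.createDataFrame(rows, "payload binary")
    # mapInPandas unpickles the frames on the executors and emits their rows.
    return carrier.mapInPandas(unpack_batches, schema=schema)
```

   Usage would look something like `pandas_rdd_to_spark_df(spark, rdd, "id long, value double")`. The extra pickle round trip is the cost of avoiding Spark internals, so this is likely slower than a native Arrow path.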


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org