Posted to user@spark.apache.org by Tanveer Ahmad - EWI <T....@tudelft.nl> on 2020/06/25 03:35:01 UTC

Arrow RecordBatches to Spark Dataframe

Hi all,

I have a small question and would appreciate your help.

In this code snippet<https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5>, Jether converts an RDD of pd.DataFrame objects to Arrow RecordBatches (slices) and finally to a Spark DataFrame. Similarly, this Scala code<https://github.com/apache/spark/blob/65a189c7a1ddceb8ab482ccc60af5350b8da5ea5/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L192-L206> converts a JavaRDD to a Spark DataFrame.

If I already have an RDD of pa.RecordBatch (Arrow RecordBatch) objects, how can I convert it directly to a Spark DataFrame in PySpark, without going through Pandas? Thanks.


Regards,
Tanveer Ahmad