Posted to user@spark.apache.org by Bryan Cutler <cu...@gmail.com> on 2019/09/10 19:17:25 UTC
Re: question about pyarrow.Table to pyspark.DataFrame conversion
Hi Artem,
I don't believe this is currently possible, but it could be a great
addition to PySpark since this would offer a convenient and efficient way
to parallelize nested column data. I created the JIRA
https://issues.apache.org/jira/browse/SPARK-29040 for this.
On Tue, Aug 27, 2019 at 7:55 PM Artem Kozhevnikov <
kozhevnikov.artem@gmail.com> wrote:
> I wonder if there's some recommended method to convert an in-memory
> pyarrow.Table (or pyarrow.RecordBatch) to a pyspark.DataFrame without going
> through pandas?
> My motivation is converting nested data (like List[int]) that has an
> efficient representation in pyarrow but not in pandas (I don't want to go
> through a Python list of int ...).
>
> Thanks in advance !
> Artem
Re: question about pyarrow.Table to pyspark.DataFrame conversion
Posted by shouheng <sh...@gmail.com>.
Hi Bryan,
I came across SPARK-29040
<https://issues.apache.org/jira/browse/SPARK-29040> and I'm very excited
that others are looking for such a feature as well. It would be tremendously
useful to have this implemented.
Currently, my workaround is to serialize the `pyarrow.Table` to a parquet
file, then let Spark read that parquet file. Like Artem mentioned above, I
avoided going through `pd.DataFrame`.
Do you think this ticket has a chance of being prioritized?
Thank you very much.
Best,
Shouheng
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org