Posted to dev@arrow.apache.org by Michael Shtelma <ms...@gmail.com> on 2018/02/05 16:30:06 UTC

Spark DataFrame <--> Arrow Roundtrip

Hi all,

I would like to make some changes (updates) to the data stored in
Spark data frames, which I get as the result of different queries.
Afterwards, I would like to work with these modified data frames as
with normal data frames in Spark, e.g. use them for further
transformations.

I would like to use Apache Arrow as an intermediate representation of
the data I am going to update. My idea is to call ds.toArrowPayload()
and then operate on the resulting RDD<ArrowPayload>: get the batch for
each payload and perform the update operation on the batch. Question:
can I update individual values in a column vector, or is it better to
rewrite the whole column?

The final question is how to get all the batches back into Spark,
i.e. create a data frame from them.
Can I use the method ArrowConverters.toDataFrame(arrowRDD, ds.schema(),
...) for that?

Is it going to work? Does anybody have any better ideas?
Any assistance would be greatly appreciated!

Best,
Michael

Re: Spark DataFrame <--> Arrow Roundtrip

Posted by Li Jin <ic...@gmail.com>.
Hi Michael,

Please see my comments inline.

Keep in mind these are all pretty internal APIs, so they might change
in the future.

On Mon, Feb 5, 2018 at 11:30 AM, Michael Shtelma <ms...@gmail.com> wrote:

> Hi all,
>
> I would like to make some changes (updates) to the data stored in
> Spark data frames, which I get as the result of different queries.
> Afterwards, I would like to work with these modified data frames as
> with normal data frames in Spark, e.g. use them for further
> transformations.
>
> I would like to use Apache Arrow as an intermediate representation of
> the data I am going to update. My idea is to call ds.toArrowPayload()
> and then operate on the resulting RDD<ArrowPayload>: get the batch for
> each payload and perform the update operation on the batch. Question:
> can I update individual values in a column vector, or is it better to
> rewrite the whole column?
>

Yes, you can update individual values in a vector.
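As a minimal sketch of what that looks like with the Arrow Java API
(class and method names here are from the standard Arrow Java vector
API, e.g. IntVector; the column name "col" is just an illustration):

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class MutateVector {
    public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("col", allocator)) {
            // Build a small vector: [10, 20, 30]
            vector.allocateNew(3);
            vector.set(0, 10);
            vector.set(1, 20);
            vector.set(2, 30);
            vector.setValueCount(3);

            // Update a single value in place; there is no need to
            // rewrite the whole column.
            vector.set(1, 99);

            System.out.println(vector.get(1)); // prints 99
        }
    }
}
```

Note that this mutates the backing buffer directly, so it should only
be done on batches you own, before handing them back to Spark.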


> The final question is how to get all the batches back into Spark,
> i.e. create a data frame from them.
> Can I use the method ArrowConverters.toDataFrame(arrowRDD, ds.schema(),
> ...) for that?
>
You probably want to use the columnar vector API for that:
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java#L134
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java#L31
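A rough sketch of wrapping an Arrow vector for Spark with those two
classes (these are Spark's internal vectorized classes, so the exact
constructors may differ between versions; the column name "id" is just
an illustration):

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ArrowToBatch {
    public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("id", allocator)) {
            // Populate an Arrow vector: [1, 2]
            vector.allocateNew(2);
            vector.set(0, 1);
            vector.set(1, 2);
            vector.setValueCount(2);

            // Wrap the Arrow vector so Spark can read it as a
            // ColumnVector, then assemble a ColumnarBatch from it.
            ColumnVector column = new ArrowColumnVector(vector);
            ColumnarBatch batch = new ColumnarBatch(new ColumnVector[]{column});
            batch.setNumRows(2);

            // Rows come back as InternalRow views over the Arrow data.
            batch.rowIterator().forEachRemaining(row ->
                System.out.println(row.getInt(0)));
        }
    }
}
```

The wrapping itself copies nothing: the ColumnarBatch reads straight
out of the Arrow buffers.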


> Is it going to work? Does anybody have any better ideas?
> Any assistance would be greatly appreciated!
>
> Best,
> Michael
>