Posted to user@spark.apache.org by Subash Prabakar <su...@gmail.com> on 2020/02/17 07:41:05 UTC

Apache Arrow support for Apache Spark

Hi Team,

I have two questions regarding Arrow and Spark integration,

1. I am joining two huge tables (1 PB each) - will there be a significant
performance gain if I use the Arrow format before shuffling? Will the
serialization/deserialization cost improve noticeably?

2. Can we store the final data in Arrow format on HDFS and read it back
in another Spark application? If so, how can I do that?
Note: the dataset is transient - the split into separate applications is
only for easier management and for resiliency within Spark; the two
applications also use different languages (Java and Python in our case).

Thanks,
Subash

Re: Apache Arrow support for Apache Spark

Posted by Chris Teoh <ch...@gmail.com>.
1. I'd also consider how you're structuring the data before applying the
join. Doing the join naively could be expensive, so a bit of data
preparation (see the sketch after point 2) may be necessary to improve
join performance. Try to get a baseline as well. Arrow would help improve
it.

2. Try storing it back as Parquet, but in a way the next application can
take advantage of predicate pushdown - for example, partitioned on the
columns it filters on. A sketch of both points follows.
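
A minimal PySpark sketch of both points, assuming hypothetical paths,
column names ("id", "event_date") and a partition count of 2000 - adjust
all of these to your data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-and-write").getOrCreate()

# Hypothetical input locations.
left = spark.read.parquet("hdfs:///data/left")
right = spark.read.parquet("hdfs:///data/right")

# Point 1: prepare the data before the join - repartition both sides on
# the join key so matching rows land in the same partitions.
joined = (
    left.repartition(2000, "id")
        .join(right.repartition(2000, "id"), "id")
)

# Point 2: write the result as Parquet, partitioned on a column the next
# application filters on, so it benefits from partition pruning and from
# predicate pushdown on the remaining columns.
joined.write.partitionBy("event_date").parquet("hdfs:///data/joined")

# In the downstream application (Java or Python), a filter on that column
# only touches the relevant files:
result = (spark.read.parquet("hdfs:///data/joined")
          .filter("event_date = '2020-02-17'"))

The partition count and column names above are placeholders; the main
idea is that the layout of the written data, rather than the
serialization format, is what the second application can exploit.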
