Posted to dev@arrow.apache.org by Subash Prabakar <su...@gmail.com> on 2020/02/16 13:49:53 UTC

Basic question on Apache Arrow

Hi all,

I understand the value of Arrow in our projects for interoperability as
well as faster data access. I have a couple of questions about whether
Arrow is a good fit for the following use cases:

1. Will Spark execution be faster when I use joins on DataFrames with
Arrow compared to the normal Parquet format? Is the shuffle cost lower
because less serialization and deserialization is needed?


2. If I have a use case of running aggregate queries on a very large
table (say 10 TB) containing a few dimensions and very few metrics, is
Arrow a good fit as an intermediate caching layer for interactive,
low-latency queries? A rough sketch of what I have in mind follows.
Note: Dremio provides this by default. Should I explore it, or Impala or
Drill, for this use case?
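
To illustrate what I mean by a caching layer, here is a rough sketch
(assuming pyarrow; the paths and column names are made up):

import pyarrow as pa
import pyarrow.parquet as pq

# One-time (or periodic) pass over the large Parquet table, keeping
# only the hot columns.
table = pq.read_table("/data/big_table", columns=["dim1", "dim2", "metric"])

# Persist the hot slice in Arrow's IPC file format.
with pa.OSFile("/cache/hot_slice.arrow", "wb") as sink:
    writer = pa.RecordBatchFileWriter(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Interactive queries memory-map the cache; the columnar data is
# usable in place, with no deserialization step.
with pa.memory_map("/cache/hot_slice.arrow", "r") as source:
    cached = pa.ipc.open_file(source).read_all()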


Thanks,
Subash

Re: Basic question on Apache Arrow

Posted by Wes McKinney <we...@gmail.com>.
hi Subash,

I'm only familiar with Question 1. Spark only makes use of Arrow for
accelerating Python and R UDF evaluation and for sending data to and
from those language APIs (see our blog posts for some discussion of
this). So I would guess that for what you're describing there aren't
any speedups, unless there's new development I haven't heard of. There
is some internal columnar processing in Spark, but I don't know whether
Arrow is being used (there was some discussion of this, but I'm not
sure where things currently stand).
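
To make the Arrow-accelerated path concrete, here is a minimal sketch
of a pandas UDF (the column name and UDF are made up for illustration;
the config key shown is the Spark 2.x one):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = (SparkSession.builder
         .appName("arrow-udf-sketch")
         # Arrow-based transfer between the JVM and the Python worker
         # (Spark 3.x renames this to
         # spark.sql.execution.arrow.pyspark.enabled)
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

df = spark.range(1000000).withColumnRenamed("id", "x")

# Scalar pandas UDF: Spark ships row batches to Python as Arrow record
# batches, so the data lands in pandas with minimal copying.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(x):
    return (x + 1).astype("float64")

df.select(plus_one("x")).show(5)

# toPandas() is the other Arrow-accelerated path.
pdf = df.limit(10).toPandas()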

On the second question I would have to defer to others who know better.

Thanks
Wes
