Posted to user@arrow.apache.org by Tao Wang <ta...@gmail.com> on 2021/12/20 06:01:53 UTC

A general question about Arrow

Hi,

I looked through Arrow's docs about its formats and APIs.

But I am still somewhat confused about the typical use cases of Arrow.

As I understand it, the goal of Arrow is to eliminate (de)serialization costs between different data analytics systems, since they share a common format.

But it still needs some data conversion between the Arrow format and a language-native format, right? For example, you have to convert Arrow's columnar format to a C++ row-based format. Or are there use cases where data analysis is conducted directly on Arrow's format?

Best,
Tao

Re: A general question about Arrow

Posted by Benson Muite <be...@emailplus.org>.

Andrew Lamb has a good overview talk, see:
http://andrew.nerdnetworks.org/

We need to put this in written form so that it is easier to digest and
adoption of the format can grow.

On 12/21/21 8:56 AM, Tao Wang wrote:
> Just a follow-up question.
> 
> I looked through the Python cookbook. So basically, Arrow provides an 
> API to manipulate data in its own format.
> 
> And I get the idea that a columnar layout benefits data analytics performance.
> 
> But it is not strictly necessary for applications to build their analytic 
> functions on the Arrow data format, right? For example,
> Spark may have its own Spark-native format rather than the Arrow format.
Yes, that is correct.

Re: A general question about Arrow

Posted by Tao Wang <ta...@gmail.com>.

Just a follow-up question.

I looked through the Python cookbook. So basically, Arrow provides an API
to manipulate data in its own format.

And I get the idea that a columnar layout benefits data analytics performance.

But it is not strictly necessary for applications to build their analytic
functions on the Arrow data format, right? For example,
Spark may have its own Spark-native format rather than the Arrow format.

Best,
Tao


Re: A general question about Arrow

Posted by Tao Wang <ta...@gmail.com>.

Thanks for the references and explanation!

Best,
Tao


Re: A general question about Arrow

Posted by Sasha Krassovsky <kr...@gmail.com>.

Hi Tao,
You’re right that there is a cost to converting Arrow’s columnar format to a row format, but data analytics is generally done on columnar arrays for performance reasons. Most modern data analytics engines use some form of “structure of arrays” (i.e. one array per column, rather than one struct per row).
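
To make the “structure of arrays” idea concrete, here is a minimal Python sketch. It uses the standard library's `array` module as a stand-in for a contiguous column buffer; it is not Arrow's API, and the names `rows`, `columns`, and `total_b` are made up for illustration.

```python
from array import array

# Row-oriented ("array of structs"): one record per row.
rows = [
    {"A": 1, "B": 10.0},
    {"A": 2, "B": 20.0},
    {"A": 1, "B": 30.0},
]

# Column-oriented ("structure of arrays"): one contiguous array per column.
columns = {
    "A": array("q", (r["A"] for r in rows)),  # signed 64-bit integers
    "B": array("d", (r["B"] for r in rows)),  # double-precision floats
}

# Summing B scans a single buffer and never reads any value from A.
total_b = sum(columns["B"])
```

In the row-oriented form, reading every B also means walking past every A; in the columnar form the two are stored separately, so a scan over one column leaves the other alone.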

One example is grouped aggregation, where you group by a key column (let’s call it A) and sum another column (call it B). This can be broken into two steps: group ID mapping and summation. The first step touches only column A, as you find which row indices have the same values and assign them to groups. The summation step then touches only column B. As you can see, each part of the analysis touches only one column at a time. If we used a row-oriented format instead, each step would also load the other column into cache without using it; with two equally wide columns, that effectively halves our cache size. With a columnar format, we minimize cache misses by loading more useful data at a time.
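
The two steps described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: Python lists stand in for columns, and the helper names `group_ids` and `grouped_sum` are hypothetical, not functions from Arrow or any other library.

```python
def group_ids(keys):
    """Step 1: assign each row a dense group id, reading only the key column."""
    ids, seen = [], {}
    for k in keys:
        # First occurrence of a key gets the next unused id.
        ids.append(seen.setdefault(k, len(seen)))
    return ids, list(seen)  # list(seen)[i] is the key for group id i


def grouped_sum(ids, values, num_groups):
    """Step 2: add each value to its group's sum, reading the group ids
    and the value column but never the key column."""
    sums = [0] * num_groups
    for gid, v in zip(ids, values):
        sums[gid] += v
    return sums


col_a = ["x", "y", "x", "z", "y"]  # key column A
col_b = [1, 2, 3, 4, 5]            # value column B

ids, keys = group_ids(col_a)
sums = grouped_sum(ids, col_b, len(keys))
result = dict(zip(keys, sums))     # {'x': 4, 'y': 7, 'z': 4}
```

Neither function ever sees a whole row: the first works from column A alone, and the second works from column B plus the group ids produced by the first.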

I hope this example helps illustrate why a columnar format is better for analytical workloads, which is what Arrow is targeting. 

Sasha 


Re: A general question about Arrow

Posted by Benson Muite <be...@emailplus.org>.

On 12/20/21 9:01 AM, Tao Wang wrote:
> Hi,
> 
> I looked through Arrow's docs about its formats and APIs.
> 
> But I am still somewhat confused about the typical use cases of Arrow.
> 
> As I understand it, the goal of Arrow is to eliminate (de)serialization costs between different data analytics systems, since they share a common format.
> 
> But it still needs some data conversion between the Arrow format and a language-native format, right? For example, you have to convert Arrow's columnar format to a C++ row-based format. Or are there use cases where data analysis is conducted directly on Arrow's format?
Conversion may be required, but the hope is that for many data analytics
applications, if the data can be described by the Arrow format, then
no conversion is needed and data processing can run efficiently on it directly.
Please see the examples [1] and cookbook [2] for analytics demonstrations.
> 
> Best,
> Tao
> 
Hi Tao,
The documentation is still being updated. For an end user, Python 
documentation [1][2] and Ballista[3] documentation are probably of most 
interest. The original motivation for Arrow was to develop more 
efficient data frames that allow for interoperability[4].
Regards,
Benson

[1] https://arrow.apache.org/docs/python/index.html
[2] https://arrow.apache.org/cookbook/py/
[3] https://arrow.apache.org/blog/2021/04/12/ballista-donation/
[4] https://wesmckinney.com/blog/apache-arrow-pandas-internals/