Posted to dev@arrow.apache.org by Dongxiao Song <so...@hashdata.cn> on 2022/05/06 02:27:53 UTC

[Array][C++]Whether batch with constant-type array will be supported in Arrow?

Hello,

I’m using Arrow C++ as the storage and compute layer of my own project,
a database based on PostgreSQL.

However, when computing on a batch that contains a constant-value column, the
constant has to be expanded into a full array before it can be stored in the
batch, which wastes time and memory.

An arrow::Scalar can be passed as an argument to Arrow functions, but it cannot
represent a column in a batch. So to compute on a batch containing a
constant-value column, expanding the value is unavoidable.

This occurs mainly before batch serialization, and in functions like FilterBatch.

A constant-type array might solve this problem. It would look like an Arrow array
but store only a single constant value and the number of rows. In functions like
arrow::compute::Sum, the result could even be computed by a single multiplication.
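
The idea can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: ConstantColumn, Sum, and Expand are hypothetical names, not part of the Arrow API, and the column is assumed to hold 64-bit integers with no nulls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration; NOT part of the Arrow API.
// Represents a column in which every row holds the same value.
struct ConstantColumn {
  std::int64_t value;   // the single stored value
  std::int64_t length;  // number of logical rows
};

// Sum without materializing the column: value repeated `length` times.
// Overflow handling is omitted in this sketch.
inline std::int64_t Sum(const ConstantColumn& col) {
  assert(col.length >= 0);
  return col.value * col.length;
}

// Expand into a full buffer. This is the O(length) time and memory cost
// that a constant-type array would avoid.
inline std::vector<std::int64_t> Expand(const ConstantColumn& col) {
  assert(col.length >= 0);
  return std::vector<std::int64_t>(static_cast<std::size_t>(col.length),
                                   col.value);
}
```

With value 7 and length 50, Sum needs one multiplication and two stored integers, while Expand allocates and fills 50 elements.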

Another solution would be to allow a batch to contain an arrow::Scalar.

This is just a suggestion from an Arrow user. I’m not sure whether it is helpful
for the Arrow project.

Thanks,
Song

Re: [Array][C++]Whether batch with constant-type array will be supported in Arrow?

Posted by Dongxiao Song <so...@hashdata.cn>.
Thanks a lot for your reply. I will work around the lack of a constant array
for now, and hope to use constant arrays in the future.

Song

> On May 7, 2022, at 2:30 AM, Weston Pace <we...@gmail.com> wrote:
> 
> Hi Song,
> 
> Wes proposed a couple of different array types a few months ago in
> [1].  These were documented in [2].  In this proposal a constant array
> type was suggested in addition to a run-length encoded array type.
> During the discussion it was suggested that a constant array might
> just be a special case of a run-length encoded array.  So there has
> been some discussion about adding support for this.  However, these
> ideas have not been implemented yet and I'm not aware of any PRs so it
> can be difficult to know if/when something may happen.
> 
> For now, you might be able to bypass this problem with
> arrow::compute::ExecBatch, which is what we use in the streaming
> execution engine.  An ExecBatch is a vector of
> Datums, so each column can be either a scalar or an array.  The
> batch itself has a length so if a batch with length 50 has a scalar
> column then that implies a constant array of 50 items.  However, this
> does add complication to the logic (constantly needing to check if a
> column is a scalar or an array) and I do hope the RLE array is added
> as it can simplify a lot of this.
> 
> -Weston
> 
> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> [2] https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#heading=h.j2x776n0ymmp
> 


Re: [Array][C++]Whether batch with constant-type array will be supported in Arrow?

Posted by Weston Pace <we...@gmail.com>.
Hi Song,

Wes proposed a couple of different array types a few months ago in
[1].  These were documented in [2].  In this proposal a constant array
type was suggested in addition to a run-length encoded array type.
During the discussion it was suggested that a constant array might
just be a special case of a run-length encoded array.  So there has
been some discussion about adding support for this.  However, these
ideas have not been implemented yet and I'm not aware of any PRs so it
can be difficult to know if/when something may happen.

For now, you might be able to bypass this problem with
arrow::compute::ExecBatch, which is what we use in the streaming
execution engine.  An ExecBatch is a vector of
Datums, so each column can be either a scalar or an array.  The
batch itself has a length so if a batch with length 50 has a scalar
column then that implies a constant array of 50 items.  However, this
does add complication to the logic (constantly needing to check if a
column is a scalar or an array) and I do hope the RLE array is added
as it can simplify a lot of this.
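
The scalar-or-array branching described above can be sketched with standard-library stand-ins. Everything here is an assumption for illustration: Column, Batch, ValueAt, and SumColumn are hypothetical names that only mimic the shape of an ExecBatch (a vector of scalar-or-array columns plus a length); they are not Arrow's types. In Arrow itself the equivalent check would be something like Datum::is_scalar().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

// Stand-ins for illustration only; these are NOT Arrow's types.
using ScalarValue = std::int64_t;               // one value for every row
using ArrayValues = std::vector<std::int64_t>;  // one value per row
using Column = std::variant<ScalarValue, ArrayValues>;

struct Batch {
  std::vector<Column> columns;
  std::int64_t length;  // row count; gives scalar columns their implied length
};

// Read one row of one column, branching on scalar vs. array.
inline std::int64_t ValueAt(const Batch& batch, std::size_t col,
                            std::int64_t row) {
  assert(row >= 0 && row < batch.length);
  const Column& c = batch.columns.at(col);
  if (const auto* scalar = std::get_if<ScalarValue>(&c)) {
    return *scalar;  // the same value for every row
  }
  return std::get<ArrayValues>(c).at(static_cast<std::size_t>(row));
}

// Sum a column; the scalar case uses the batch length instead of a loop.
inline std::int64_t SumColumn(const Batch& batch, std::size_t col) {
  const Column& c = batch.columns.at(col);
  if (const auto* scalar = std::get_if<ScalarValue>(&c)) {
    return *scalar * batch.length;
  }
  std::int64_t total = 0;
  for (std::int64_t v : std::get<ArrayValues>(c)) total += v;
  return total;
}
```

Both helpers repeat the same scalar-or-array check, which is the added complication mentioned above; a dedicated constant or RLE array type would let callers treat every column uniformly.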

-Weston

[1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[2] https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#heading=h.j2x776n0ymmp
