Posted to dev@arrow.apache.org by Fan Liya <li...@gmail.com> on 2019/08/28 07:01:52 UTC

[DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

Dear all,

In the discussion of this PR (https://github.com/apache/arrow/pull/5073),
we are faced with a problem:

Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is
supposed to take no space in the data buffer. In particular, for a null
value, we have

start index == end index

where start index and end index are the start/end positions of the value in
the data buffer. The same question applies to ListVector.
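
To make the start/end convention concrete, here is a minimal sketch in plain
Java. It models the offset buffer as an int[] instead of using the Arrow Java
API, and the class and method names (OffsetSketch, occupiedBytes, ...) are
hypothetical, chosen only for illustration.

```java
// Simplified model of a variable-width vector: value i occupies
// data[offsets[i] .. offsets[i + 1]). Names are illustrative only and
// are not part of the Arrow Java API.
final class OffsetSketch {
    /** Start position of value i in the data buffer. */
    static int startIndex(int[] offsets, int i) {
        return offsets[i];
    }

    /** End position (exclusive) of value i in the data buffer. */
    static int endIndex(int[] offsets, int i) {
        return offsets[i + 1];
    }

    /** Number of data-buffer bytes occupied by value i. */
    static int occupiedBytes(int[] offsets, int i) {
        return endIndex(offsets, i) - startIndex(offsets, i);
    }

    public static void main(String[] args) {
        // Three slots: "ab", null, "cde". The null slot has start == end.
        int[] offsets = {0, 2, 2, 5};
        for (int i = 0; i < offsets.length - 1; i++) {
            System.out.println("slot " + i + " occupies "
                + occupiedBytes(offsets, i) + " bytes");
        }
    }
}
```

With offsets {0, 2, 2, 5}, the middle (null) slot occupies zero bytes, which
is the "normal" case described above.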

However, it seems that in some scenarios a null value can take non-empty
space (please see this comment
https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).

Since this is an important issue, we should make it clear in the
specification. Otherwise, some unexpected problems may occur in client code.

It seems we are faced with 3 options:

1. a null value always takes no space.
2. a null value can take non-empty space, and the content of the non-empty
space is always 0.
3. a null value can take non-empty space, and the content of the non-empty
space is undefined.
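
The three options can be read as progressively weaker invariants on the
buffers. The sketch below, in plain Java, checks the first two against a
simplified model (a byte[] data buffer, an int[] offset buffer and a
boolean[] standing in for the validity bitmap). The class and method names
are hypothetical and this is not how the Arrow Java library represents
vectors.

```java
// Checks a simplified variable-width vector against options 1 and 2.
// validity[i] == false means slot i is null. Illustrative model only.
final class NullSpaceOptions {
    /** Option 1: every null slot has start index == end index. */
    static boolean nullsTakeNoSpace(int[] offsets, boolean[] validity) {
        for (int i = 0; i < validity.length; i++) {
            if (!validity[i] && offsets[i] != offsets[i + 1]) {
                return false;
            }
        }
        return true;
    }

    /** Option 2: null slots may occupy space, but every such byte is 0. */
    static boolean nullBytesAreZero(byte[] data, int[] offsets, boolean[] validity) {
        for (int i = 0; i < validity.length; i++) {
            if (validity[i]) {
                continue;
            }
            for (int p = offsets[i]; p < offsets[i + 1]; p++) {
                if (data[p] != 0) {
                    return false;
                }
            }
        }
        return true;
    }
    // Option 3 imposes neither constraint, so there is nothing to check.
}
```

A vector that satisfies option 1 trivially satisfies option 2 as well, since
its null slots contain no bytes at all.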

Option 1 makes the data buffer of a VariableWidthVector a contiguous region
(not interleaved with undefined regions), so optimizations can be applied.
However, it may lead to memory copy/move (as indicated in the above comment
https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).

Option 3 can address the above problem of memory copy/move. However, it
splits memory into non-contiguous regions, so optimizations cannot be
performed. In addition, it may cause unexpected problems in client code.

Option 2 seems like a trade-off between the two. However, it is not
suitable for ListVector.

Please give your valuable feedback.

Best,
Liya Fan

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

Posted by Fan Liya <li...@gmail.com>.
Hi Wes,

Thanks for the effort. I will add clarifications.

Best,
Liya Fan

On Wed, Sep 4, 2019 at 11:06 AM Wes McKinney <we...@gmail.com> wrote:

> I opened https://issues.apache.org/jira/browse/ARROW-6451
>

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

Posted by Wes McKinney <we...@gmail.com>.
I opened https://issues.apache.org/jira/browse/ARROW-6451

On Sun, Sep 1, 2019 at 9:59 PM Fan Liya <li...@gmail.com> wrote:
>
> Hi Wes,
>
> Thanks for the information.
> I agree with you that we had better make this clear in the document, to
> help users avoid unexpected behaviors.
>
> Best,
> Liya Fan
>

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

Posted by Fan Liya <li...@gmail.com>.
Hi Wes,

Thanks for the information.
I agree with you that we had better make this clear in the document, to
help users avoid unexpected behaviors.

Best,
Liya Fan

On Sun, Sep 1, 2019 at 7:17 AM Wes McKinney <we...@gmail.com> wrote:

> Option 3 is what the columnar specification currently intends, for
> the reasons that Jacques cites. In particular, a value can be made
> null only by altering the validity bitmap. It might be helpful to add
> some language to make clear that the contents "underneath" a null can
> be anything. The same is true of other memory layouts also, including
> primitive.
>

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

Posted by Wes McKinney <we...@gmail.com>.
Option 3 is what the columnar specification currently intends, for
the reasons that Jacques cites. In particular, a value can be made
null only by altering the validity bitmap. It might be helpful to add
some language to make clear that the contents "underneath" a null can
be anything. The same is true of other memory layouts also, including
primitive.
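
As an illustration of "made null only by altering the validity bitmap", here
is a minimal sketch in plain Java. It assumes the least-significant-bit-first
bit numbering used by the Arrow columnar format; the class and method names
are hypothetical and do not come from the Arrow Java API.

```java
// Simplified model of a validity bitmap. Bit i lives in byte i / 8 at bit
// position i % 8 (least-significant bit first).
// Class and method names are illustrative, not the Arrow Java API.
final class ValidityBitmapSketch {
    /** Returns true if slot i is valid (non-null). */
    static boolean isValid(byte[] validity, int i) {
        return (validity[i >> 3] & (1 << (i & 7))) != 0;
    }

    /** Marks slot i as null by clearing its bit; no other buffer is touched. */
    static void setNull(byte[] validity, int i) {
        validity[i >> 3] &= (byte) ~(1 << (i & 7));
    }

    public static void main(String[] args) {
        byte[] validity = {0b0000_0111};   // slots 0, 1, 2 are valid
        int[] offsets = {0, 2, 4, 7};      // "ab", "xy", "cde"
        setNull(validity, 1);
        // Slot 1 is now null, yet it still spans offsets[1]..offsets[2];
        // readers must ignore whatever bytes sit there.
        System.out.println(isValid(validity, 1));     // false
        System.out.println(offsets[2] - offsets[1]);  // 2
    }
}
```

The offsets and data are left exactly as they were, so the contents
"underneath" the null slot remain whatever they happened to be.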

On Thu, Aug 29, 2019 at 12:50 AM Fan Liya <li...@gmail.com> wrote:
>
> Hi Jacques and Ravindra,
>
> Thanks for your valuable feedback.
>
> Please let me talk more about contiguous memory:
> For some operations (like memory segment comparison, hash code computation,
> etc.), if we choose option 1 or 2, we can get the result with a single
> call, without any reference to the validity buffer.
>
> With option 3, we need to split the memory into contiguous regions
> separated by undefined regions (based on the validity buffer), then
> calculate the result for each region, and finally combine them. This is less
> efficient.
>
> Ravindra's idea sounds interesting, especially when most values are null or
> non-null.
>
> What do you think?
>
> Best,
> Liya Fan
>

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

Posted by Fan Liya <li...@gmail.com>.
Hi Jacques and Ravindra,

Thanks for your valuable feedback.

Please let me talk more about contiguous memory:
For some operations (like memory segment comparison, hash code computation,
etc.), if we choose option 1 or 2, we can get the result with a single
call, without any reference to the validity buffer.

With option 3, we need to split the memory into contiguous regions
separated by undefined regions (based on the validity buffer), then
calculate the result for each region, and finally combine them. This is less
efficient.
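
The difference can be sketched in plain Java as follows. This is only an
illustration under a simplified model (byte[] data, int[] offsets, boolean[]
validity); the class, the method names and the 31-based hash are assumptions
for the example and are not taken from the Arrow code base.

```java
import java.nio.charset.StandardCharsets;

// Contrasts a single-pass hash over the data buffer with a hash that must
// consult validity. Illustrative model only, not the Arrow Java API.
final class HashSketch {
    /** Options 1 and 2: one pass over the used bytes, validity is not read. */
    static int hashWholeBuffer(byte[] data, int[] offsets) {
        int end = offsets[offsets.length - 1];
        int h = 1;
        for (int p = 0; p < end; p++) {
            h = 31 * h + data[p];
        }
        return h;
    }

    /** Option 3: skip the undefined bytes that lie under null slots. */
    static int hashValidOnly(byte[] data, int[] offsets, boolean[] validity) {
        int h = 1;
        for (int i = 0; i < validity.length; i++) {
            if (!validity[i]) {
                continue;
            }
            for (int p = offsets[i]; p < offsets[i + 1]; p++) {
                h = 31 * h + data[p];
            }
        }
        return h;
    }

    public static void main(String[] args) {
        int[] offsets = {0, 2, 4, 7};
        boolean[] validity = {true, false, true};
        byte[] a = "abXXcde".getBytes(StandardCharsets.US_ASCII);
        byte[] b = "abYYcde".getBytes(StandardCharsets.US_ASCII);
        // Same logical content, different bytes under the null slot.
        System.out.println(hashWholeBuffer(a, offsets) == hashWholeBuffer(b, offsets));
        System.out.println(hashValidOnly(a, offsets, validity)
            == hashValidOnly(b, offsets, validity));
    }
}
```

Note that the whole-buffer hash in this sketch ignores value boundaries and
cannot tell a null from an empty value, so a real implementation would
presumably still need to account for the offsets in some way.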

Ravindra's idea sounds interesting, especially when most values are null or
non-null.

What do you think?

Best,
Liya Fan

On Thu, Aug 29, 2019 at 1:26 PM Ravindra Pindikura <ra...@dremio.com>
wrote:

> We could still apply the optimisation to the contiguous "valid regions".
> E.g., if the entire vector is valid (called an array in C++), then compare
> the data buffers directly. If there are only two null entries in the vector,
> compare the three regions that they separate in the data buffer, and so on.
>
> --
> Thanks and regards,
> Ravindra.
>

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

Posted by Ravindra Pindikura <ra...@dremio.com>.
On Wed, Aug 28, 2019 at 12:32 PM Fan Liya <li...@gmail.com> wrote:

> Dear all,
>
> In the discussion of this PR (https://github.com/apache/arrow/pull/5073),
> we are faced with a problem:
>
> Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is
> supposed to take no space in the data buffer. In particular, for a null
> value, we have
>
> start index == end index
>
> Where start index and end index are the start/end positions of the value in
> the data buffer. This problem is also related to the ListVector.
>
> However, it seems that for some scenarios, a null value can take non-empty
> space (please see this comment
> https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
>
> Since this is an important issue, we should make it clear in the
> specification. Otherwise, some unexpected problems may occur in client
> code.
>
> It seems we are faced with 3 options:
>
> 1. a null value always takes no space.
> 2. a null value can take non-empty space, and the content of the non-empty
> space is always 0.
> 3. a null value can take non-empty space, and the content of the non-empty
> space is undefined.
>
> Option 1 makes the data buffer of a VariableWidthVector a contiguous region
> (not interleaved with undefined regions), so optimizations can be applied.
> However, it may lead to memory copy/move (as indicated in the above comment
> https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
>
> Option 3 can address the above problem of memory copy/move. However, it
> splits memory into non-contiguous regions, so optimizations cannot be
> performed. In addition, it may cause unexpected problems in client code.
>

We could still apply the optimisation to the contiguous "valid regions".
E.g., if the entire vector is valid (called an array in C++), then compare
the data buffers directly. If there are only two null entries in the vector,
compare the three regions that they separate in the data buffer, and so on.
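
A rough sketch of this idea in plain Java, assuming both vectors have already
been found to share the same validity pattern. The model (byte[] data, int[]
offsets, boolean[] validity) and all names are hypothetical; only
java.util.Arrays.equals with ranges (Java 9+) is a real library call.

```java
import java.util.Arrays;

// Compares two simplified variable-width vectors run by run: each maximal
// run of valid slots is one contiguous byte range in the data buffer.
// Illustrative model only, not the Arrow Java API.
final class ValidRegionCompare {
    /** Assumes both vectors have the same validity (checked by the caller). */
    static boolean equalByValidRegions(byte[] dataA, int[] offA,
                                       byte[] dataB, int[] offB,
                                       boolean[] validity) {
        int n = validity.length;
        int i = 0;
        while (i < n) {
            if (!validity[i]) {
                i++;              // skip null slots, their bytes are undefined
                continue;
            }
            int runStart = i;
            while (i < n && validity[i]) {
                // Each slot must have the same length in both vectors.
                if (offA[i + 1] - offA[i] != offB[i + 1] - offB[i]) {
                    return false;
                }
                i++;
            }
            // One range comparison covers the whole run of valid slots.
            if (!Arrays.equals(dataA, offA[runStart], offA[i],
                               dataB, offB[runStart], offB[i])) {
                return false;
            }
        }
        return true;
    }
}
```

With two null slots in the middle of a five-slot vector, the loop performs
three range comparisons, one per valid region; with no nulls at all it
performs a single comparison over the whole data buffer.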



>
> Option 2 seems like a trade-off between the two. However, it is not
> suitable for ListVector.
>
> Please give your valuable feedback.
>
> Best,
> Liya Fan
>


-- 
Thanks and regards,
Ravindra.

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

Posted by Jacques Nadeau <ja...@apache.org>.
#3 is the correct behavior and how the code was meant to be written. I
don't see any problems with that pattern. It allows someone (if they so
decide) to null a value without having to rewrite the data. #3 is also
consistent with the behavior of all other vectors: null values can use up
space, but their data is undefined.
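
To show what "without having to rewrite the data" saves, here is a small
sketch in plain Java that contrasts the two ways of nulling a slot. The model
(byte[] data, int[] offsets, boolean[] validity) and the names are
hypothetical and are not the Arrow Java implementation.

```java
// Contrasts nulling a slot in place (option 3) with nulling it and
// reclaiming its bytes (what option 1 would require).
// Illustrative model only, not the Arrow Java API.
final class NullingCostSketch {
    /** Option 3: flip one flag; data and offsets stay as they are. */
    static void nullInPlace(boolean[] validity, int i) {
        validity[i] = false;
    }

    /** Option 1: shift later bytes left and rewrite all later offsets. */
    static void nullAndCompact(byte[] data, int[] offsets,
                               boolean[] validity, int i) {
        int start = offsets[i];
        int end = offsets[i + 1];
        int removed = end - start;
        int used = offsets[offsets.length - 1];
        // Move the tail of the data buffer over the bytes of slot i.
        System.arraycopy(data, end, data, start, used - end);
        for (int j = i + 1; j < offsets.length; j++) {
            offsets[j] -= removed;
        }
        validity[i] = false;
    }
}
```

In this sketch the in-place variant is a constant-time update, while the
compacting variant touches every byte and every offset after the nulled slot.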

I don't agree with your comment on noncontiguous memory.

